Blogs are best served static

    Or there and back again.

    Earlier this month I moved to Ghost. I did that because I wanted to have a nice editor, I wanted to be able to write easily from anywhere, and I wanted to spend less time in the terminal. But I knew moving to a dynamic site would have performance penalties.

    When looking at the loading time of a single page in the browser's inspector tools, everything seemed fine: load times around 1.3s, seemingly even faster than my old site. But then I ran some load tests to see how my new blog performs under concurrency. The results were not pretty: my blog fell over with 2-3 concurrent requests sustained for 10 seconds. Ghost couldn't spawn more threads, segfaulted, and then restarted after 20 seconds.

    I am using the smallest DigitalOcean instance, with 1 virtual CPU and 1 GB of RAM. I temporarily resized the instance to 2 and then 3 vCPUs, but the results were still pretty poor: even the 3 vCPU instance couldn't handle more than 10 connections per second for 10 seconds.

    While this would not be a problem with my current audience (around 50-100 page views per day), I have great dreams of my blog making it to the front page of Hacker News and getting ten thousand views in one day. And I would rather not have my site fall over in such cases.

    I knew my static blog could easily sustain 1000 simultaneous connections, so I went back and combined the two, to get the best of both worlds: a nice frontend to write posts and preview them, with the speed of a static site.

    I looked a bit into using tools like Gatsby or Eleventy to generate the static site, but they were quite complicated and would have required maintaining my theme in yet another place. Then I found a much simpler solution: wget. Essentially, I crawl my own website, dump everything to HTML and re-upload it to my website.

    In order to do this, I set up nginx to proxy a subdomain to the Ghost blog. Initially I wanted to set it up as a "folder" under my domain, but Ghost Admin doesn't play nice with folders. I won't link to it here, both because I don't want it widely available and for another reason I'll explain later.
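The proxy setup can be sketched with a server block along these lines; the subdomain is a placeholder (the real one is withheld), and 2368 is Ghost's default listening port:

```nginx
# Sketch: proxy a private subdomain to the local Ghost instance
server {
    listen 80;
    server_name ghost.example.com;  # placeholder; real subdomain withheld

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://127.0.0.1:2368;  # Ghost listens on 2368 by default
    }
}
```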

    Then I used the following bash script:

    # Define source and target URLs (the domains below are placeholders;
    # my real subdomain is deliberately withheld)
    from_url="https://ghost.example.com"
    to_url="https://example.com"
    # Copy blog content
    wget --recursive --no-host-directories --directory-prefix=static --timestamping --reject=jpg,png,jpeg,JPG,JPEG --adjust-extension --timeout=30  ${from_url}/
    # Copy 404 page
    wget --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent --content-on-error --timestamping ${from_url}/404.html
    # Copy sitemaps
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap.xsl
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap.xml
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-pages.xml
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-posts.xml
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-authors.xml
    wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-tags.xml
    # Replace subdomain with real domain
    LC_ALL=C find ./static -type f -not -wholename '*.git*' -exec sed -i -e "s,${from_url},${to_url},g" {} +

    I start by crawling the front page of the blog. I exclude images because they will live on the same server: since Ghost uploads them there, I can have nginx serve them directly from Ghost's upload directory, under the correct URL. The --adjust-extension flag is needed to automatically create .html files, instead of leaving extensionless files around.
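Serving the uploaded images directly could be done with a location block like this sketch; the filesystem path is an assumption about where Ghost is installed:

```nginx
# Sketch: serve Ghost's uploaded images straight from disk
location /content/images/ {
    alias /var/www/ghost/content/images/;  # assumed Ghost install path
    expires 30d;
}
```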

    Then I crawl the 404 page and the sitemap pages, which are all well defined.

    Wget has a --convert-links option, but it's not too smart: it fails badly on image srcsets, mangling the extensions. Because of this I didn't use it, instead opting for good ol' sed to replace every occurrence of the subdomain URL with the normal URL. And because I don't want to add a special exception for this post, I can't include my actual subdomain in the text, or it would get converted too.
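To illustrate the sed step, here is a minimal, self-contained run on a sample file; the domains are placeholders for my real ones, and GNU sed is assumed for the in-place -i flag:

```shell
# Create a sample mirrored page containing the subdomain URL
mkdir -p static
cat > static/index.html <<'EOF'
<a href="https://ghost.example.com/post/">a post</a>
EOF

from_url="https://ghost.example.com"
to_url="https://example.com"

# Same find/sed invocation as the publish script (GNU sed assumed)
LC_ALL=C find ./static -type f -not -wholename '*.git*' \
  -exec sed -i -e "s,${from_url},${to_url},g" {} +

cat static/index.html
# prints: <a href="https://example.com/post/">a post</a>
```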

    For now, I run this locally, but I will set up a script that runs from a cron job on the server.
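As a sketch, the cron job could look like this; the script path, log path and hourly schedule are all hypothetical:

```crontab
# Rebuild and publish the static mirror every hour, logging output
0 * * * * /home/user/bin/publish-blog.sh >> /var/log/publish-blog.log 2>&1
```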

    After downloading all the HTML, I upload the content to my blog with rsync:

    rsync -pruv static/ user@server:/var/www/blog/  # destination shown here is a placeholder

    Downloading takes about 2 minutes, uploading about 5 seconds.

    After doing all this, I redid the tests.

    Left: latency, right: number of clients

    The static site managed to serve up to 1800 clients per second, averaging 1000/s for a minute. That's good enough for now!

    Longer term, I plan to rewrite the crawler to be smarter and faster. For example, wget has no built-in parallelism. And old posts don't change often, so they could be skipped most of the time. But doing that correctly takes more time, so for now, I'll stick with this simple solution!
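One cheap way to get parallelism without rewriting the crawler would be to fan out fetches with xargs -P. In this sketch the URL list is hypothetical and echo stands in for wget, so it runs offline:

```shell
# Fan out up to 4 jobs in parallel; swap echo for a wget invocation in real use
printf '%s\n' /post-1/ /post-2/ /post-3/ /post-4/ \
  | xargs -P 4 -I{} echo "fetch https://example.com{}"
```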