I have just moved my blog to Acrylamid. I had been planning this move for a long time, because Wordpress is way too large and slow, and has too many features I don't need. I wanted to be able to blog using just Markdown, use Git for versioning and publish with rsync.
There are many static blogging platforms, the most popular being Jekyll. That one fell off the list right away because it is written in Ruby. Then I looked at Hakyll, which is written in Haskell, but it is way too complicated and requires too much work just to make a simple blog. The next logical choice was Pelican, the most starred Python option on GitHub. I started working on moving to it, but after a while I noticed that its license is AGPL, which I don't really like, so I moved to Acrylamid, which is licensed under BSD and, as a bonus, does incremental compiling, so if I change a post, it doesn't rebuild the whole site.
Both Pelican and Acrylamid offer some tools to import from other sources, including Wordpress, but they are lacking. Wordpress often generates weird stuff (and stores weird stuff), which trips up the converting tools.
One of the problems was that I used the Jetpack plugin, which rewrote my image URLs so that the images were served from Wordpress.com for speed. I don't want to keep my images there anymore, so I needed to change that, and I started using sed for it. I needed to replace every link that started with their CDN prefix with a relative URL pointing to the images folder on my site.
sed -i "s/http:\/\/i.\.wp\.com\/rolisz\.ro\/wordpress\/wp-content\/uploads/\/images/g" post.md
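Before running a substitution like this in place, it can be tried on a single sample line. The sketch below uses `|` as the delimiter so the slashes don't need escaping. The sample URL is made up for illustration, and the exact paths in a real export may differ.

```shell
# Hypothetical sample line, roughly as an importer might emit it.
sample='![photo](http://i0.wp.com/rolisz.ro/wordpress/wp-content/uploads/2013/01/photo.png)'

# Same substitution, with | as the delimiter instead of /.
# "i." still matches i0, i1, i2, ... because that dot is left unescaped.
rewritten=$(printf '%s\n' "$sample" |
  sed 's|http://i.\.wp\.com/rolisz\.ro/wordpress/wp-content/uploads|/images|g')

echo "$rewritten"
```

Dropping `-i` and piping the text through sed like this leaves the original file untouched until the output looks right.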
Another problem: Wordpress does some interesting things with images you embed in your posts. It wraps them in links to the image itself and presents a smaller version of the image. Because the converting tools are "intelligent", they transformed the link and image into their Markdown equivalent, from which I now needed to extract only the image.
sed -i "s/\[\(!\[.*\]\).*\]/\1/g" post.md
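One caveat: `.*` is greedy, so a line that contains two linked images can be matched as one big chunk. A stricter variant, sketched here on the assumption that alt texts contain no `]` and URLs contain no `)`, matches each piece explicitly. The sample line is hypothetical.

```shell
# Hypothetical line with two linked thumbnails on it.
line='[![a](/images/a-150x150.png)](/images/a.png) [![b](/images/b-150x150.png)](/images/b.png)'

# Match "[![alt](thumbnail)]" and keep only "![alt]"; the "(full-size URL)"
# that follows is left in place, so the image now points at the original.
unwrapped=$(printf '%s\n' "$line" |
  sed 's/\[\(!\[[^]]*\]\)([^)]*)\]/\1/g')

echo "$unwrapped"
```

Each image ends up pointing at the full-size file that the link used to target.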
However, not all images were so lucky. The ones that were in a gallery didn't get converted, because Wordpress stores only the ids of the images in the database and renders them to HTML when serving the page. The export, of course, contains only the original text. So I had to manually copy-paste the HTML of the galleries for several posts, and then process it to prettify the links and remove the extra cruft Wordpress added:
sed "s/<a href='.*'>\(<img src\=\".*\)?resize=150%2C150\".*\" \/><\/a>/\1\" \/>/g" post.md
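The same kind of cleanup can be tested on one gallery entry before touching a post. This is only a sketch: the HTML below is a made-up example of what a copied gallery item might look like, and the pattern uses `[^']*` and `[^>]*` instead of `.*` so that several items on one line are handled separately.

```shell
# Hypothetical gallery item; real Wordpress output has more attributes.
gallery_item=$(cat <<'EOF'
<a href='/images/a.png'><img src="/images/a.png?resize=150%2C150" class="thumb" alt="a" /></a>
EOF
)

# Keep '<img src="...' up to the query string, drop the ?resize=... part,
# the remaining attributes and the surrounding <a> element.
cleaned=$(printf '%s\n' "$gallery_item" |
  sed "s|<a href='[^']*'>\(<img src=\"[^\"?]*\)?resize=150%2C150\"[^>]*/></a>|\1\" />|g")

echo "$cleaned"
```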
The last two things I had to correct are not actually Wordpress's fault, but mine, because I switched from Pelican to Acrylamid, and they have slightly different conventions for representing pages and tags. In Pelican, pages belong to a pages folder, while in Acrylamid they have Type: page at the beginning of the post. In Pelican, tags are just comma separated, while in Acrylamid they also have to be in square brackets. The following two sed commands make these changes:
sed -i '3 a\Type: page' post.md
sed -i 's/Tags: \(.*\)/Tags: [\1]/' post.md
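If the Tags substitution is ever run twice on the same file, it wraps the list in a second pair of brackets. A small guard avoids that. This is a sketch on a made-up header: it anchors on `^Tags: ` so body text is left alone, and skips lines whose tag list already starts with a bracket. The `bracket_tags` helper name is just for illustration.

```shell
# Hypothetical post header in the Pelican style.
header=$(printf '%s\n' 'Title: About' 'Tags: sed, blog')

# Wrap the tag list in [ ] unless it is already wrapped.
bracket_tags() {
  sed '/^Tags: \[/!s/^Tags: \(.*\)$/Tags: [\1]/'
}

once=$(printf '%s\n' "$header" | bracket_tags)
twice=$(printf '%s\n' "$once" | bracket_tags)

echo "$once"
echo "$twice"
```

Running the helper a second time leaves the header unchanged, which makes it safe to rerun over a folder where only some posts were converted.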
Another problem was that Wordpress was trying to be super helpful and generate thumbnails for my images, but it got a bit overzealous and generated 4-5 thumbnails for each image. That's a bit excessive, and for now I didn't want to bother with them, so I decided to save the space and remove all the thumbnails, keeping only the original images. find to the rescue. All the thumbnails have their resolution appended to their name, so it's easy to write a find command that looks for that pattern and deletes the matching files.
find static/images/ -type f -name '*-[0-9]*x[0-9]*.png' -delete
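Since `-delete` is irreversible, it helps to run the same expression with `-print` first and look at the list. The sketch below does that in a throwaway directory with made-up file names. Note that in a glob `[0-9]*` means one digit followed by anything, so the pattern is looser than someone used to regular expressions might expect.

```shell
# Throwaway directory with one original and two hypothetical thumbnails.
workdir=$(mktemp -d)
mkdir -p "$workdir/static/images"
touch "$workdir/static/images/cat.png" \
      "$workdir/static/images/cat-150x150.png" \
      "$workdir/static/images/cat-300x200.png"

# Dry run: list what would be removed, without deleting anything.
to_delete=$(find "$workdir/static/images" -type f -name '*-[0-9]*x[0-9]*.png' -print | sort)
echo "$to_delete"

# Real run, once the list looks right.
find "$workdir/static/images" -type f -name '*-[0-9]*x[0-9]*.png' -delete
remaining=$(ls "$workdir/static/images")
echo "$remaining"

rm -rf "$workdir"
```

The same check is worth repeating with other extensions such as `.jpg` if the uploads are not all PNG files.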
All in all, this was a fun exercise in using sed. I haven't used it much outside of the Operating Systems class, but it was cool to see that it can be quite useful and that it is so powerful.