<![CDATA[ rolisz's site - Technical Posts ]]> https://rolisz.ro https://rolisz.ro/favicon.png rolisz's site https://rolisz.ro Sat, 11 Jul 2020 18:24:32 +0300 60 <![CDATA[ Giving code presentations ]]> https://rolisz.ro/2020/07/04/giving-code-presentations-in-jupyter-notebook/ 5efb28a717253e7fe6dd646b Sat, 04 Jul 2020 23:28:33 +0300 I sometimes give talks at the local tech hub in the city where I live. It's not a big community, but I enjoy giving talks and they often provide a necessary deadline and motivation to finish some projects.

Last week I gave a talk about Rust. Given that there are still some restrictions on how many people can be in one room, the physical audience was only 10 people, but there was a livestream as well.

Until now, I had used Google Slides for my presentations. For talks that don't have a lot of code, it works fine. But when you are presenting lots of code (such as a tutorial for a programming language), I found Slides to be lacking. If you paste in the code directly, you can't have syntax highlighting. You can paste in a screenshot, but then any later modification to the slide means retaking the screenshot and replacing it, so it's more work.

You can present in an IDE, but sometimes you want to have slides with normal text between pieces of code, where you explain some things. Switching between two apps can quickly get annoying. Also, it's hard to prepare just "bite-sized" content in an IDE, but that is needed so that the audience is focused only on what you are explaining right now.

So I decided to try something new for my intro to Rust presentation: I used Jupyter Notebook with several extensions and I think it worked pretty well (except for a bug towards the end of the presentation).

For this I used the RISE extension, which adds live slide show support to Jupyter, using reveal.js. Each cell can be either a new slide, a sub-slide (so to get to it you have to "navigate down", in reveal.js style), a fragment (so it shows up on the same slide, but on a subsequent click), or notes. You can write new code and run it even during slideshow mode, which is very useful: if someone in the audience has a question, you can quickly write and execute code to answer it. RISE is simple to install:

> pip install RISE

Then I used a bunch of extensions that are bundled together in the jupyter_contrib_nbextensions package. By default, you have to enable and configure them by editing JSON files, but there is another plugin, called jupyter_nbextensions_configurator, that adds a dashboard for them. They can be installed with:

> pip install jupyter_contrib_nbextensions
> jupyter contrib nbextension install --user
> pip install jupyter_nbextensions_configurator
> jupyter nbextensions_configurator enable --user

You have to restart the Jupyter process, and then you will see a new tab on the local Jupyter home page, where you can enable and configure all the installed extensions.

I used the "Hide input" extension. Most of my code was organized into two cells. One which didn't contain all the code, just a snippet on which I wanted to focus (for example, I made a small change to a previously defined function), and another one which could be run and showed output. The latter cell was hidden with this extension, so that only the output could be seen.

Initially I also used the "Split cell" extension. This extension gives you a button which can make a cell half width. If two consecutive cells are half width, they align next to each other, making two columns. I wanted to use this to have code in the left column and explanations in the right column. This would have worked if the presentation had been online only, because I wouldn't have had to zoom in too much. But because in the last week before the presentation we found out that it was allowed to hold the presentation in person (with 10 people in the audience), and I had to present on a projector and zoom in, I ended up removing all the split cells because the content wouldn't fit any longer.

Making Rust work with Jupyter

All the above is generic and can be made to work with anything that works in Jupyter. To make Rust work in Jupyter you need a kernel for it. Some guys from Google have made one called evcxr_jupyter.

It's fairly straightforward to install. On Windows you need to first install CMake and then you run:

> cargo install evcxr_jupyter
> evcxr_jupyter --install

After restarting the Jupyter process, you now have the option of using a Rust kernel. To include Cargo dependencies, you can insert the following into a cell:

:dep reqwest = { version = "0.10", features = ["json", "blocking"] }

This downloads reqwest, compiles it and makes it available for use in other cells.
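For example, a subsequent cell can then use the crate directly during the presentation. A minimal sketch, assuming the blocking feature enabled above (the URL is just an illustration):

let body = reqwest::blocking::get("https://rolisz.ro").unwrap().text().unwrap();
println!("{}", &body[..40]);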

The notebook for the presentation that I gave can be found in a GitHub repo and the recording can be found here.

I’m publishing this as part of 100 Days To Offload - Day 27.

]]>
<![CDATA[ My operating system journey ]]> https://rolisz.ro/2020/06/14/operating-system-journey/ 5ec2eed817253e7fe6dd5a49 Sun, 14 Jun 2020 21:46:40 +0300 Getting started with Linux

Ten years ago I was writing about how I was a big fan of Windows (and Android). I would regularly get into friendly debates with Cătălin, who was a staunch supporter of Linux. I kept dipping my toes into Linux, but for a long time I kept getting burned.

At my first internship and then even more so at my first job, I learned more and more about Linux and got comfortable in it. I started dual booting. By 2014, the most used OS on my laptop was Fedora.

When I built a desktop in 2015, I first installed Linux on it, even though I had to try several distributions until I found one that worked. I was in my "command-line" minimalist phase, so I set up i3, tmux, and fish. I was quite happy with it, but eventually I installed Windows 10 on it so that I could play games, run the Swiss tax form application and YNAB, a budgeting app.

Trying out Mac OS

My work laptop at the time was a MacBook. I thought I would like it and I was looking forward to trying out all the cool and hipster apps that were only on Mac OS, such as Alfred. In the end, while working at Google, I used only a browser and a terminal, and I never got around to really working with any other apps, because I didn't need them. The terminal experience on a Mac requires a bit more searching around to get things working: Macs come with old libraries out of the box, you have to update them using copy-pasted shell commands, and I managed to screw things up once with Homebrew. I was not impressed by Mac OS and I didn't want to spend my own money on that crappy keyboard.

Slowly turning back to Windows

But Windows (and its ecosystem) has changed a lot since then. When I bought my new laptop in 2018, it came with Windows and I never bothered installing Linux on it. Why? Windows Subsystem for Linux. You get pretty much all the CLI goodies from Linux and all the other nice stuff from Windows. For example, as far as I know, there's almost no laptop where Linux has battery life comparable to Windows, and that is an important factor for me, because I work remotely.

On my desktop I still had ArchLinux, because running Tensorflow was easier on Linux than on Windows (modulo the Nvidia driver updates). But slowly I got bored of the "command line" minimalism. I tried other desktop environments on Linux, such as KDE and Gnome, but they never stuck. KDE is too bloated, and I find the default theme to be outdated. Gnome looks nice, but I never got around to feeling comfortable in it. The others are too "fringe" for me and I think that it's too hard to find solutions to the problems that inevitably crop up, just because the community is too small.

For the last two months, I have found myself using almost only Windows, even on my desktop. This way, I can watch Netflix at the highest resolution (on Linux, you can watch only in the browser, where it's capped at 720p) and I can play games. Rust works just as well on Windows as on Linux. WSL satisfies my very few needs for Linux-only apps. And I never had problems with Nvidia drivers on Windows (unlike on Linux). The new Terminal app on Windows is pretty sweet. Powershell is pretty cool too, even though I don't know much of the syntax so far.

And honestly, I just like the default UI on Windows more. 10 years ago I had the patience to tinker with themes and to customize my desktop endlessly, but now I don't have the time and energy to deal with that anymore. I see plenty of nice Linux themes on Reddit and I tried to replicate one, but abusing fonts to get some nice "symbols" in i3-bar? Ewww.

Even though many people complain about Windows updates messing things up, that has never happened to me in the last 5 years, even though I am running the insider preview version of Windows 10. On the other hand, I did manage to screw things up with ArchLinux updates, but it was usually my fault, because I didn't read the instructions or I let too much time pass between updates.

Servers

That's the story for my desktops and laptops. On servers, it's Linux all the way. My NAS runs Linux. My VPSs run Linux. And I plan to keep it that way. But there I don't mind that I have to SSH in and do the at most half an hour of work per week from the command line.

The only thing that I didn't try was a variant of BSD. Five years ago I might have given it a shot, but now I don't want to relearn a lot of things, from command line flags to concepts like jails. The strongest argument for BSD would be security, but Linux is secure enough for me, for now.

The future

But who knows what will happen in the future? Maybe in five years I'll get bored again of Windows and I'll try something new. Maybe Fuchsia will become mature by then :D

I’m publishing this as part of 100 Days To Offload - Day 23.

]]>
<![CDATA[ DuckDuckGo ]]> https://rolisz.ro/2020/05/26/duckduckgo/ 5ecd754317253e7fe6dd5d9e Tue, 26 May 2020 23:52:41 +0300 Some of my more astute readers might have noticed the "word" "DuckDuckGo" appearing several times on my blog this year, both as a noun and as a verb. Contrary to how the name sounds, it's not a board game; it's a search engine. It's an alternative to Google.

DuckDuckGo promises to be more privacy friendly. They say they don't track you and they don't show targeted ads, only keyword related ads.

Below are the results if you search for me on DuckDuckGo and on Google:

Google has a slightly fancier presentation of my blog content. But DuckDuckGo actually has only my accounts on the first page, while Google quickly veers off into showing links about other people, such as a LoL gamer.

One of the really cool features of DDG is the bangs. If you enter !g in your search query, DuckDuckGo will redirect you to search on Google (supposedly with fewer cookies). !yt will search on YouTube, !gm on Google Maps and so on. They have several thousand bangs already. This is actually a time saving feature, because you can search directly from the address bar, without having to go to that other site first. And when DDG search results are not good enough, you can just add a bang and search on Google.

Another cool feature for programmers: DDG has much better StackOverflow integration. When you search for something programming related, they often show a snippet on the side with the top answer from StackOverflow. It makes copy pasting sooo much easier.

If you think it's funny that an ex-Googler uses DuckDuckGo, you should have seen the faces of my colleagues at Google when they saw that I was using DDG at Google. And even funnier was when my manager noticed this and then we had a discussion about why he doesn't use it: local queries (searching for nearby restaurants for example) were not working very well in DuckDuckGo and he used them very often. Well Suman, if you are reading this, DuckDuckGo now has you covered on that front too.  

If you are interested in alternatives to the big tech companies, I highly recommend using DuckDuckGo.

I’m publishing this as part of 100 Days To Offload - Day 16.

]]>
<![CDATA[ Bridging networks with a Synology NAS ]]> https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/ 5ec433a417253e7fe6dd5a50 Tue, 19 May 2020 23:25:07 +0300 Warning: potentially bad idea ahead

For various reasons, I had to reorganize my working setup at home, including moving my home office to the room where my Synology NAS is. In this room, I had only one Ethernet outlet, but I needed two connections, one for the NAS and one for my desktop. I was too lazy to go to the nearest electronics shop to buy a switch this evening and I didn't want to unplug the NAS, but then I had an idea:

My NAS has two Ethernet ports. What if I use it as a sort of router, connecting the second Ethernet port directly to my desktop?

Let's give it a shot. I connected the desktop to the NAS. Both of them detect that something is connected, but no IP addresses are assigned.

I tried to fool around in the Control Panel of the Synology to enable a DHCP server for the second LAN interface. Eventually I got an IP on the desktop and I could load the Synology server, but I couldn't access the internet.

After some DuckDuckGo-ing and wading through all the posts saying that this is a bad idea and it's not how a Synology should be used, I found a Github repo showing that the two networks of a Synology can be bridged. The script there is a bit of overkill for what I needed, so here is the gist of what I needed to get things working:

First, enable vSwitch:

Where to find the vSwitch Settings

Then SSH into the NAS and run the following two commands:

> sudo ovs-vsctl del-br ovs_eth1
> sudo ovs-vsctl add-port ovs_eth0 eth1
> sudo ovs-vsctl show 
    Bridge "ovs_eth0"
        Port "eth1"
            Interface "eth1"
        Port "eth0"
            Interface "eth0"
        Port "ovs_eth0"
            Interface "ovs_eth0"
                type: internal

If the output of the final command shows a single bridge with both eth0 and eth1 as ports, you're good to go and browse the Internet!

I don't actually intend to keep this as a long term solution. A NAS is not meant to function as a router or switch, so it's not optimized for this. A real switch is probably faster, but a Speedtest shows that I have 250 Mb/s download, so it's pretty good for now, until I get around to buying a switch.

I’m publishing this as part of 100 Days To Offload - Day 13.

]]>
<![CDATA[ An unexpected error in Rust ]]> https://rolisz.ro/2020/05/18/an-unexpected-error-in-rust/ 5ec2e34517253e7fe6dd59e2 Mon, 18 May 2020 23:04:11 +0300 As I continue my journey to learn Rust, I occasionally run into various hardships and difficulties. Most of them are related to lifetimes and ownership, but they lead to a pleasant moment of enlightenment when I figure it out. But for the last couple of days, I struggled with a much dumber error, one which I should have figured out much faster.

I was trying to load some word vector embeddings from the finalfusion package. Word vectors have grown up since I used them (back in the word2vec days): these ones are 3.85 Gb.

I tried to load them up and to play with the API on my Linux desktop. The loading time was about 10 seconds, but then it was fast. And it worked.

Fast forward a month, during which I worked on other projects, and I get around again to working with the word vectors, this time from my Windows desktop. The rest of the project runs fine, but when it comes to loading the word vectors, it errors out with a menacing stack trace:

thread 'main' panicked at 'capacity overflow', src\liballoc\raw_vec.rs:750:5
...

I look at the library's repo; there was a new release in the meantime. They warn about a breaking change: maybe the old code can't read the newer vectors? I update the library; I change the code; I still get the same error.

Maybe it's my Rust version that's old and something broke. I update Rust. Nope, not it.

I try to DuckDuckGo the error, but I don't find anything relevant. So I open an issue on the GitHub repo of the library and I ask about this. I get an answer about it in 5 minutes (thank you for the prompt answer Daniel!): am I using the 32-bit or the 64-bit toolchain?

I facepalm hard, because I realize that's probably the problem: the word vectors are right around the size that can be loaded into memory on a 32-bit system, and with some extra allocations done while loading, it goes overboard.
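To make the limit concrete, here is a tiny illustration (not the finalfusion loading code, just the kind of limit it runs into): a Vec's capacity in bytes must fit into an isize, which on a 32-bit target tops out around 2 GiB, so asking for roughly 4 GB worth of f32s panics with exactly this "capacity overflow" message.

fn main() {
    // ~4 GB worth of f32 elements: beyond isize::MAX bytes on a 32-bit target,
    // so Vec panics there with "capacity overflow"; on 64-bit it merely
    // attempts a (large) allocation.
    let elements = 4_000_000_000usize / std::mem::size_of::<f32>();
    let v: Vec<f32> = Vec::with_capacity(elements);
    println!("capacity: {}", v.capacity());
}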

I check with rustup what toolchain I have:

> rustup toolchain list
stable-i686-pc-windows-msvc (default)

That's the 32 bit toolchain, my friends. So I install the x86_64 toolchain and set it as default:

> rustup toolchain install stable-x86_64-pc-windows-msvc
> rustup default stable-x86_64-pc-windows-msvc
> rustup toolchain list
stable-i686-pc-windows-msvc
stable-x86_64-pc-windows-msvc (default)

And lo and behold, the word vectors are now successfully loaded and I can start playing around more seriously with them.

Why is the 32-bit toolchain the default one on Windows in 2020?

I’m publishing this as part of 100 Days To Offload - Day 12.

]]>
<![CDATA[ Context Variables in Python ]]> https://rolisz.ro/2020/05/15/context-variables-in-python/ 5eba763817253e7fe6dd56d8 Fri, 15 May 2020 22:59:02 +0300 Recently I had to do some parallel processing of nested items in Python. I would read some objects from a 3rd party API, then for each object I would have to get its child elements and do some processing on them as well. A very simplified sketch of this code is below:

import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep

# records_to_process is a list of records fetched from the 3rd party API,
# each one shaped like this example:
example_record = {'id': 0, 'children':
        [{'id': 'child01', 'prop': 100}]
}

executor = ThreadPoolExecutor(20)

def fetch_main_item(i):
    obj = records_to_process[i]
    return process_main_item(obj)

def process_main_item(obj):
    results = executor.map(process_child_item, obj['children'])
    return sum(results)

def process_child_item(child):
    sleep(random.random()*2)
    return child['prop']

results = executor.map(fetch_main_item, range(4))

for r in results:
    print(r)

The code ran just fine, but we wanted to have some visibility into how the processing is going, so we needed to add some logging. Just sprinkling some log statements here and there is easy, but we wanted all the logs to contain the index of the main record, even when processing the child records, which otherwise don't have a pointer to the parent record.

The easy and straightforward way would be to add the index to all our functions and always pass it along. But that would mean changing the signatures of all our functions, of which there were many more than shown here, because there could be several different kinds of child objects, each processed in a different way.

A much more elegant way would be to use contextvars, which were added in Python 3.7. These context variables act like a global variable, but per thread. If you set a certain value in one thread, every time you read it again in the same thread, you'll get back that value, but if you read it from another thread, it will be different.

A minimal usage example:

import contextvars
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

ctx = contextvars.ContextVar('ctx', default=10)
pool = ThreadPoolExecutor(2)

def show_context():
    sleep(1)
    print("Background thread:", ctx.get())

pool.submit(show_context)
ctx.set(15)
print("Main thread", ctx.get())

The output is:

Main thread 15
Background thread: 10

Even though the background thread prints the value after it has been set to 15 in the main thread, the value of the ContextVar is still the default value in that thread.

This means that if we add the index to a context variable in the first function, it will be available in all other functions that run in the same thread.

import contextvars

context = contextvars.ContextVar('log_data', default=None)

def fetch_main_item(i):
    print(f"Fetching main item {i}")
    obj = records_to_process[i]
    context.set(i)
    result = process_main_item(obj)

    return result

def process_main_item(obj):
    ctx = context.get()
    results = executor.map(process_child_item, obj['children'])
    s = sum(results)
    print(f"Processing main item with {obj['id']} children at position {ctx}")
    return s
    
def process_child_item(child):
    sleep(random.random()*2)
    ctx = context.get()
    print(f"Processing child item {child['id']} of main item at position {ctx}")
    return child['prop']

What we changed was that in the fetch_main_item we set the context variable to the index of the record we process, and in the other two functions we get the context.

And it works as we expect in the process_main_item function, but not in the process_child_item function. In this simplified example, the id of each main record is the same as its index, and the first digit in the id of a child record is its parent's id.

Fetching main item 0
Fetching main item 1
Fetching main item 2
Fetching main item 3
Processing child item child11 None
Processing child item child01 None
Processing child item child02 None
Processing child item child31 None
Processing child item child32 None
Processing main item with id 3 with 3
Processing child item child21 None
Processing child item child22 3
Processing child item child03 3
Processing main item with id 0 with 0
Processing child item child12 3
Processing main item with id 1 with 1
Processing child item child23 None
Processing main item with id 2 with 2

What is going on in child processing function? Why is the context sometimes None and sometimes 3?

Well, it's because we didn't set the context on the new thread. When we spawn a bunch of new tasks in the thread pool to process the child records, sometimes they get scheduled on threads that have never been used before. In that case, the context variable hasn't been set, so it's None. In other cases, after one of the main records is finished processing, some of the child tasks are scheduled on the thread on which the main record with id 3 was processed, so the context variable has kept that value.

The fix for this is simple. We have to propagate the context to the child tasks:

def process_main_item(obj):
    ctx = context.get()
    results = executor.map(wrap_with_context(process_child_item, ctx), obj['children'])
    s = sum(results)
    print(f"Processing main item with id {obj['id']} with {ctx}")
    return s

def wrap_with_context(func, ctx):
    def wrapper(*args):
        token = context.set(ctx)
        result = func(*args)
        context.reset(token)
        return result
    return wrapper

When calling map, we have to wrap our function in another one which sets the context to the one we pass in manually, calls our function, resets the context and then returns the result of the function. This ensures that the functions called in a background thread have the same context:

Fetching main item 0
Fetching main item 1
Fetching main item 2
Fetching main item 3
Processing child item child11 1
Processing child item child12 1
Processing main item with id 1 with 1
Processing child item child02 0
Processing child item child01 0
Processing child item child03 0
Processing main item with id 0 with 0
Processing child item child32 3
Processing child item child31 3
Processing main item with id 3 with 3
Processing child item child22 2
Processing child item child23 2
Processing child item child21 2
Processing main item with id 2 with 2

And indeed, all the indexes are now matched up correctly.

Context variables are a very nice mechanism to pass along some information, but in a sense they are global variables, so all the caveats that apply to global variables apply here too. It's easy to abuse them and to make it hard to track how the values in the context variable change. But, in some cases, they solve a real problem. For example, distributed tracing libraries, such as Jaeger, use them to be able to track how requests flow inside the program and to be able to build the call graph correctly.

Kudos to my colleague Gheorghe with whom I worked on this.

I’m publishing this as part of 100 Days To Offload - Day 10.

]]>
<![CDATA[ Adding search to a static Ghost blog ]]> https://rolisz.ro/2020/05/13/adding-search-to-static-ghost/ 5eba4dec17253e7fe6dd56d4 Wed, 13 May 2020 22:00:27 +0300 When I switched over to Ghost, my new theme didn't have a search function. Due to a combination of popular demand (read: one friend who asked about it) and me running across a plugin to enable searches for Ghost, I decided to do it today.

It was actually surprisingly simple, taking less than 2 hours, again, with most of the time being spent on CSS.

I used the SearchinGhost plugin, which does the indexing and searching entirely in the browser. It pretty much works out of the box; I just had to add an API key and change some of the templates a bit.

Normally, the plugin connects to the Ghost API to retrieve all the posts, but it does so via a GET request, so if I save a file in the right folder hierarchy, containing the JSON response, I can get the plugin to work on my statically served site.
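A rough sketch of how such a file can be produced, in the same spirit as the Rust crawler I use for the static site (the endpoint, query string and output path below are assumptions; check the exact GET request SearchinGhost makes in the browser's network tab and mirror that path):

use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholders: your Ghost host and a Content API key go here.
    let url = "https://ghost.example.com/ghost/api/v3/content/posts/?key=CONTENT_API_KEY&limit=all";
    let body = reqwest::blocking::get(url)?.text()?;
    // Save the JSON where the static site will serve it; a static file server
    // ignores the query string, so the plugin's GET request finds this file.
    fs::create_dir_all("static/ghost/api/v3/content/posts")?;
    fs::write("static/ghost/api/v3/content/posts/index.json", body)?;
    Ok(())
}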

The posts are loaded when clicking in the search bar. It takes a bit, because I have written 1.9Mb of content by now. But after it's sent over the network, searching is blazing fast.

Happy searching!

I’m publishing this as part of 100 Days To Offload - Day 8.

]]>
<![CDATA[ Productivity Tips: Pomodoros ]]> https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/ 5ebaf1e317253e7fe6dd56f2 Tue, 12 May 2020 22:20:29 +0300 I sometimes have periods when I find it very hard to concentrate. When I was working in an office, this was made worse if there were many conversations going on around me. With the popularity of open offices, even if two people are talking 20 meters away, you can still hear them pretty well. But sometimes it happens even now that I am working from home, where I have my own office, with minimal distractions.

Sometimes the lack of focus happens because the task at hand is particularly boring or frustrating, with long wait times, such as configuring a build pipeline in Azure Pipelines, where you have to make a separate commit to test anything out and the documentation is not always clear. Sometimes the task I'm working on is unclear and I'm not sure how to proceed or I don't know how to solve the problem. At other times, my head is just so full of ideas that I constantly jump from one thing to another and I can't get anything done. And of course, sometimes I'm just too tempted by Hacker News/Reddit/Facebook/Twitter/RSS feeds.

One thing that I have found to help in cases like these is to use a timer for 20 minutes of work and then a 5 minute break. There's plenty of apps for all platforms to help you schedule these timers. The rule is simple: during the work time, I either do something directly related to my task (usually coding) or I stare blankly out the window. When the timer rings, I can browse whatever I want, chat, watch YouTube videos. But during work time, I either do something useful, or I do nothing.

This part of doing nothing "guilt free" is important. If I do spend a full 20 minutes doing nothing, it means something is wrong with the task I'm working on and maybe I need to redefine it or get some clarifications. Or maybe it's something that I really don't think should be done. But most of the time I don't spend all of it staring into space: I get something done, and that makes it easier to get into a more productive mood and keep working. Sometimes after I do 2-3 rounds of these Pomodoros[1], I even stop the timer, because I don't need it anymore.

What are your favorite productivity tips?

I’m publishing this as part of 100 Days To Offload - Day 7.


  1. The name comes from Italian, where it means tomato, because the guy who invented it used a kitchen timer shaped like a tomato. ↩︎

]]>
<![CDATA[ On text editors ]]> https://rolisz.ro/2020/05/08/on-text-editors/ 5eb50bbf81fef554fabb42bc Fri, 08 May 2020 22:33:35 +0300 Or why I switched away from Vim.

Getting into Vim

At my first internship, my mentor scolded me for using RubyMine to edit a Ruby on Rails project, claiming it was too enterprisey and that Real Programmers™️️ use Vim and live in the terminal, but the lesson stopped there. I slowly tried to use Linux more and more, but as can be seen on this blog, my first several attempts didn't go well. But I did learn some things and I did learn some basic things about Vim (at least I knew how to exit it 🙃).

Then at my first job, my team lead, Sever, knew Vim really well and pretty much lived only in the terminal. Because I pair programmed a lot with him, I learned a lot. He took the time to explain to me some useful settings and plugins. He even showed me Gundo (to visualize the undo trees), but I have to admit I never used it since.

Under Sever's tutoring, I got hooked on Vim. I started customizing it, adding plugins to it. I memorized a lot of commands and I became pretty proficient. I read blog posts about it and looked up to Tim Pope, who had written lots of great plugins for Vim. I even started evangelizing it to other colleagues, saying how awesome it is, how great the action composition model is and so on. I had YouCompleteMe for autocompletion, vim-vinegar for file browsing, and about 20 other plugins. My .vimrc file eventually grew to over 250 lines.

Then I went to work at Google. At first I was working in Python and later in Go, and I kept using Vim. Because there were some Google-specific vim plugins, I had to split up my configuration, to have some parts which are common and some parts which are home/work specific only. This meant that keeping things in sync was more difficult.


Vim falls into disuse

But then I switched to another team and I started coding only in Java. I discovered that Java can be pretty nice, if the code is not overengineered like I learned in college. But still, Java has a lot of boilerplate code, so using an IDE is a must. At least back then (4 years ago), there were many features in IntelliJ that you couldn't replicate in Vim, such as accurate and always up to date "click to go to definition" or semantic renaming of variables.

During this time my .vimrc didn't get much love. I would still use Vim whenever I would code something at home, but that was not too often. And every now and then, something would break. Usually YCM, because the gcc/clang would get updated and YCM would fall behind. Some things I could fix, some I couldn't figure out in less than an hour, after which I would give up.

After I had left Google, I started working in Python again, but this time I went pretty quickly for PyCharm, which offers almost exactly the same features as IntelliJ, but for Python. Very similar UI, the same keyboard shortcuts, the same workflow.

But I often had to SSH into other machines, some of which were random production machines, and I would have to edit some files there. Of course, I would reach for Vim. But I wouldn't have my configuration with me. So I would try to use some shortcuts and they wouldn't work. The one that still bites me to this very day is using - to open the file browser if in a file, or to go to the parent folder if in the file browser. I think this command is enabled by vim-vinegar. So I would become frustrated that I couldn't edit using my regular workflow.


Vim today

Today the Vim ecosystem has improved a lot. There's Neovim which adds lots of fancy new features. There are new UI wrappers for Vim. There's a much more powerful plugin system which enables things you couldn't do before. If I really wanted to, I could probably replicate everything I need from PyCharm in Vim. But is that the best use of my time?

I don't think so. First of all, my experience lately has shown me that I often have to work with default Vim instances, so I would have to remember two sets of commands, one for my customized setup and one for default instances. Theoretically I could set up a script that would copy my .rc files to the remote computer every time I SSH, but I don't want to leave a mess behind me on every random cloud VM that I log in to.

Second of all, I don't think spending that much time on customizing Vim and memorizing keyboard shortcuts is going to make me that much more productive. I find that as a team lead, I don't spend as much time writing code. Luckily I don't spend too much time in meetings, but almost every day I do have at least one 1:1 call with someone from my team to discuss how to proceed on some feature. I also spend a lot of time designing microservices and thinking about scaling and consistency issues. And when coding, I don't find that my ability to churn out code is the bottleneck; rather, it's my ability to think about the interactions of the various microservices involved and how the data flow can go wrong in one way or another. So I think investing in better tools for thought has a much better ROI for me.

Do I still like Vim? Of course. Do I still use it when I have to quickly open up some files in the terminal? Absolutely. Will I invest hours into learning more about it and customizing it more? Nope.

I’m publishing this as part of 100 Days To Offload - Day 4.

]]>
<![CDATA[ Splitting up my blog ]]> https://rolisz.ro/2020/05/07/splitting-up-my-blog/ 5eb4605481fef554fabb423e Thu, 07 May 2020 23:08:21 +0300

Considering that I plan to write one post every day for the next 97 days, this will mean a lot of spam for my readers. Most of my personal friends will be bored by my technical posts and most of the people who reached my blog from Google/Hacker News/Reddit and subscribed will probably be uninterested in my hikes (well, from when I used to hike), lockdown musings and board game reviews. I guess this is tolerable if I write only one or two posts a month, but with 30 a month, it can get pretty annoying.

Because of this I am "splitting" up the blog into two sections: one with technical posts and one with personal (or rather non-technical) posts. There will be a separate RSS feed for both categories[1]. And my MailChimp subscribers will have the option to select which posts they want to receive by email. Existing subscribers will have to click the "Update preferences" link at the bottom of any email to update... their preferences. If they don't do anything, they will still receive emails with all posts. New subscribers can choose when subscribing.

On the technical realization: going forward all technical posts will have a tech tag on them[2]. I created two new routes in YAML, one pointing to tech/rss and one to personal/rss. I created two almost identical RSS templates, the only difference being that in one I retrieved only posts which have the tech tag and in the other one only posts that don't have it.

Initially I thought about having two marker tags, tech and personal, but that would mean that if I forget to put one of them on a post, that post would not show up anywhere. This way, by default a post shows up in the personal RSS feed, so it's still shown to people, even if maybe to the wrong audience.

In MailChimp I created a new interest group, which contains two segments, each subscribed to one of the two RSS feeds. Most of the time on this project was spent fiddling with the subscription forms, aligning the checkboxes for selecting which emails to receive.

That's it folks. If anyone notices any bugs in this, either in the RSS feeds or in the emails, please leave a comment.

I’m publishing this as part of 100 Days To Offload - Day 3.


  1. Well, there's also a third RSS feed with all the posts, because I couldn't figure out in 10 minutes how to remove the default Ghost RSS feed. Oh well, if you use RSS, you can figure out which one you want to follow. ↩︎

  2. I wanted to use #tech, but for some reason Ghost didn't allow filtering based on private tags. ↩︎

]]>
<![CDATA[ Connecting to Azure Python package repositories ]]> https://rolisz.ro/2020/05/06/connecting-to-azure-pypi-repositories/ 5eb27c2781fef554fabb414d Wed, 06 May 2020 21:52:19 +0300 Lately I've been using the Azure DevOps platform for code repositories, issue tracking, wikis, CI/CD and artifact hosting. It's quite a nice platform: the various tools are very well integrated with each other and they have nice integration with VS Code, for example.

But I have one problem that keeps recurring: connecting to the private Python package repository, to which we push some of our internal libraries.

I want to have access to those packages locally, on my development machine, while I'm working on other projects that depend on those packages. The recommended way of authenticating is to use the artifacts-keyring package, which should initiate a web browser based authentication flow. For some reason, it doesn't work for me. It prompts for a username and a password in the CLI and then it says authentication failed. To be fair, it's still in preview and has been for quite some time now, so some bugs are to be expected.  

However, there is an alternate, older authentication mechanism, which has its own downsides. For this you have to create a Personal Access Token (PAT), by clicking on the circled icon on any DevOps page and then going to the Personal Access Token page.

On this page, click the New Token button. Fill out the fields and make sure you check one of the options under Packaging. If you only need read access (so you just want to download packages from the repository), then check only the Read box; otherwise, if you want to upload packages using Twine from your local machine, then check one of the other boxes.

Be careful, the PAT is shown only once and then you can't see it again in the UI, so make sure to copy it right away.

Now that you have the PAT, go to the Artifacts page in Azure DevOps and open the Python feed you want to connect to. Click on the "Connect to feed" button and then on Pip and you'll find the URL of the custom repository.

Copy that URL into a pip.ini (on Windows) or pip.conf (on Linux) at the root of your virtual environment folder. These files don't exist unless you've already customized something about your virtual environment, so most likely you will have to create them.

Now we have to modify that URL as follows, adding the feed name and the PAT as credentials:

extra-index-url=https://<feed_name>:<PAT>@pkgs.dev.azure.com/<organization>/_packaging/<feed>/pypi/simple/

Now it's time to test it out by trying to install something using pip. Pip should now show that it's searching the new index, in addition to the public PyPI.

Keep in mind that these PATs have a limited duration. By default it expires after 30 days, but this can be extended up to 90 days from the dropdown form and 1 year from the date picker. Because I was lazy, I always picked the 90 days. After 90 days, when I would need a new version of one of our libraries, pip installs would start failing. Because three months had passed, I had usually forgotten how I set up everything, so I would have to go and figure it out again. I went through this 3 times already, the last time being today, so now I've decided to document it for myself somewhere I can find it easily.

I’m publishing this as part of 100 Days To Offload - Day 2.

]]>
<![CDATA[ Moving away from GMail ]]> https://rolisz.ro/2020/04/11/moving-away-from-gmail/ 5e8a359ee6cfb84549ab35bb Sat, 11 Apr 2020 19:59:33 +0300 tl;dr: I like PurelyMail and I will slowly switch all my online accounts to use it. I've been using it for almost three months and I didn't have any issues (except with some Hotmail/Outlook addresses).

I have had a GMail account since 2006 (that's the date of the first email in my inbox), but I started actually using it more in 2009. I have around 120,000 emails in my mailbox, about 17 Gb in total. I even worked as an SRE in the team responsible for the storage of GMail. So there's quite a bit of history between me and GMail, but I've started to move on.

It's been a long dream of mine to have an email address at my own domain. Technically, I had that ever since I started hosting my blog on a VPS from DigitalOcean, but I didn't set it up properly, so I never used it. I didn't want to use it only for forwarding emails to GMail; I actually wanted to send, receive and access emails from my own domain.

Why leave GMail?

First and foremost, because it's a single point of failure, which can fail due to many unrelated problems. Because my GMail address is my Google account basically, if for whatever reason I lose my GMail address, I lose all the things I have in my Google account: Google Drive, Google Photos, Play Store purchases, YouTube uploads and history, Google Calendar data and so on. While some of these can be backed up, such as Drive, Photos and YouTube, some cannot be, for example all the apps and movies I have purchased over the years.

But my GMail address can be lost for many reasons, many unrelated to GMail itself. I guess you can get the address suspended for spamming or doing other things, but I'm not particularly worried about that. However, ToS violations of any kind, across other Google products, can lead to a ban on your account and, implicitly, on your GMail address. There are many examples: reselling Pixel phones, writing too many emojis in YouTube chat, or publishing "repetitive content" in the Play Store. If you search on DuckDuckGo, I'm sure you can find many other examples.

Did the previous examples break the ToS? Some yes, some no. But the current ToS is 15 pages long and it changed 3 times in the last 4 years. And that's just the general ToS; the Play Store has a separate ToS and so does YouTube. On top of this, they are quite vaguely formulated and many bans are unclear about the exact reason for which they were instated.

The appeals process used to be completely opaque. Rumor has it that this has changed since they added the Google One service/platform/rebrand, but I haven't used the customer support from there myself, so I can't be sure. But even if now it works better and you can convince Google that your account was wrongfully suspended, it would be a couple of very stressful days.

Technically, I could use GMail with my own domain, but only by signing up for a business GSuite account. That comes with its own problems: not all features are available for business accounts, and you can't move content that you've bought from the old, consumer account to the new, business one.

Another reason I'm leaving GMail is that it's kinda slow. A full load with an empty cache takes about 20 seconds on my desktop (good specs, wired Internet connection). Loading a folder with 184 emails takes 2.5 seconds. I miss computers being snappy.

My requirements for an email provider

  1. Custom domain.
  2. Decent amount of storage. I accrued 17 Gb in more than 10 years of usage. So even 5  Gb would do the trick for 1-2 years.
  3. Server side based full text search. I want to be able to search for random emails, based on some word from the body of the email  that I vaguely remember.
  4. Ability to use multiple aliases. I want to have site_name@rolisz.ro, besides the main address I will give out, but still have everything come in to my main inbox.
  5. Ability to use 3rd party apps to read my email. This means I want to read my email with any standards (mostly IMAP) compliant app.

Privacy stuff is not a part of my requirements. It's a nice feature to have, but it's not a blocking criterion. From the state mass surveillance perspective, I'm not going to actively implement many of the opsec measures because they are inconvenient. From the advertisement surveillance perspective, 90% of the emails I receive are automated, so there is a marketer on the other side who is probably selling my preferences anyway. If I want to keep something really private, I don't send it through email.

I had been looking at many email providers since last year, but not many of them fit the bill. And those that meet all the requirements cost at least 50$/year. I signed up for the trial version of several, but none of them stuck.

Many providers boast that they are located in country X with strong privacy laws, but that is marketing only. Even Switzerland, which is famous for strong privacy laws, gave additional power to their intelligence agencies so that they could monitor the population more closely. The German intelligence agency, BND, cooperated with NSA: BND would spy on Americans, NSA would spy on Germans and they would exchange intel. So there is not much advantage in hosting your email in a Swiss underground bunker.

Some of the more notable email providers:

  • Tutanota
  • Protonmail
  • Fastmail
  • Mailbox

The first two are very security and privacy oriented. If that's your main concern, go for it. But it comes with some inconveniences: no server side search and you can't use 3rd party apps to read your email, just the apps they provide. They have some workarounds, but they just move the problem onto your machine (they offer a bridge that you install locally and which translates their custom protocol to IMAP and does indexing).

Fastmail was my top choice, from a usability perspective and I read many good reviews of it. It ticks all the boxes, at a cost of 50$ per year. I evaluated it a year ago, but because the construction of my house started, I put the email migration project on hold.

PurelyMail

Good thing I did that, because this year, in January, I read on Jan-Lukas Else's blog about PurelyMail. It immediately stood out to me because it's cheap: around 10$/year, so 5 times cheaper than Fastmail. At such a low price, I signed up for it right away.

PurelyMail is a one man show and it offers purely email, nothing else. It meets all my requirements and then some. Email data is encrypted on disk, but it also offers full text search, with the caveat that some information can be extracted from the search index, which seems like a very reasonable trade off to me. Scott Johnson, the owner, is very friendly and answered the support channel very quickly when I had some questions.

Besides the low price, PurelyMail has one more awesome thing: you pay for storage, not for users. So if you have a custom domain, you can create as many users as you want and you won't pay extra. All you will pay for is the total storage (and number of emails sent/received). All the other providers I saw charged extra for different addresses. They even had limitations on the number of aliases you could create for one user. But with PurelyMail, I can create an account for my wife, an account to use for sending automated mails from my servers (for monitoring) and accounts for many other things and not pay anything extra. ❤️

Setting up PurelyMail with my own domain was very easy. They have one page which shows all the information needed to set up MX, SPF, DKIM and DMARC records and validation happened almost instantly. All this is needed so that other email providers know that the emails really came from someone somehow related to the domain and it's not just spammers spoofing your domain. I tested this by sending out emails to several other addresses I own and they all got the mail, except for my Hotmail address. There, I first had to send an email from the Hotmail account to my own domain email address and then replies were received by the Hotmail account too.

PurelyMail default web interface

The web interface is based on RoundCube with a custom skin. UX wise it's decent and load times are faster than GMail :) One thing that I miss is the fact that in GMail folders are actually labels and you can have multiple labels for an email. However, I believe this is more of a limitation of the IMAP protocol, so it's not purely PurelyMail's fault.

The situation on Android is a bit worse. The recommended FOSS email client is FairEmail, whose UX sucks. Sometimes I still can't figure out how to check the details for an email and finding the forward button takes too long. So I cheat and have both FairEmail and the GMail app connected to my PurelyMail account. If I can't do something in FairEmail, I open the account in GMail.

Reliability wise, during the last three months I noticed only one outage (which happened while I was writing this post :) ), during which load times increased 10x. Otherwise the service has been stable.

Incremental switching

In the 11 years that GMail has been my main email account I have signed up for many online services with it. Logging in to all of them and changing the email address would take a long time. So I am taking an incremental approach: when I get a new email on my old account, I try to switch it to the new account.

The advantage of this method is that I don't have to spend a lot of time at once updating accounts and I might actually never waste time trying to change websites that don't send me emails.

During this process, I also review many of my accounts; I've decided to delete some of them and I've unsubscribed from many newsletters that were irrelevant. But I also discovered some oddities about some sites: for example, Withings has one email address for the user account and another one for the user profile. I updated the first one, but then I still received weekly updates on my old email address. Similarly, The Atlantic confirmed my change of email address for my account, but I still receive their newsletter on the old address, because that's a separate thing.

Conclusion

I will still be checking out my GMail account for a while, but I'm glad that I've successfully convinced my main personal contact to send me emails to my domain: dad. The rest will follow soon.

]]>
<![CDATA[ Web crawler in Rust ]]> https://rolisz.ro/2020/03/01/web-crawler-in-rust/ 5e51262ae6cfb84549ab3291 Sun, 01 Mar 2020 19:56:35 +0300 I have heard many good things about Rust for several years now. A couple of months ago, I finally decided to start learning Rust. I skimmed through the Book and did the exercises from rustlings. While they helped me get started, I learn best by doing some projects. So I decided to replace the crawler that I used for my Ghost blog, which had been written in bash with wget, with something written in Rust.

And I was pleasantly surprised. I am by no means very knowledgeable in Rust: I still have to look up most of the operations on the Option and Result types, and I have to DuckDuckGo how to make HTTP requests, read and write files and so on. But I was still able to write a minimal crawler in about 2-3 hours, and after about 10 hours of total work I had something that was both faster and had fewer bugs than the wget script.

So let's start writing a simple crawler that downloads all the HTML pages from a blog.

Initializing a Rust project

After installing Rust, let's create a project somewhere:

 > cargo new rust_crawler

This initializes a Hello World program, which we can verify runs using:

> cargo run
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 9.31s
     Running `target\debug\rust_crawler.exe`
Hello, world!

Making HTTP requests

Let's make our first HTTP request. For this, we will use the reqwest library. It has both blocking and asynchronous APIs for making HTTP calls. We'll start off with the blocking API, because it's easier.

use std::io::Read;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
    println!("HTML: {}", &body[0..40]);
}
> cargo run
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 2.30s
     Running `target\debug\rust_crawler.exe`
Status for https://rolisz.ro/: 200 OK
HTML: <!DOCTYPE html>
<html lang="en">
<head>

We create a new reqwest blocking client, create a GET request and we send it. The send call normally returns a Result, which we just unwrap for now. We print out the status code, to make sure the request returned ok and then we copy the content of the request into a mutable variable and we print it out. So far so good.

Now let's parse the HTML and extract all the links we find. For this we will use the select crate, which can parse HTML and allows us to search through the nodes.

use std::io::Read;
use select::document::Document;
use select::predicate::Name;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();
   
    Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .for_each(|x| println!("{}", x));
}
> cargo run --color=always --package rust_crawler --bin rust_crawler
   Compiling rust_crawler v0.1.0 (D:\Programming\rust_crawler)
    Finished dev [unoptimized + debuginfo] target(s) in 2.65s
     Running `target\debug\rust_crawler.exe`
Status for https://rolisz.ro/: 200 OK
https://rolisz.ro
https://rolisz.ro
https://rolisz.ro/projects/
https://rolisz.ro/about-me/
https://rolisz.ro/uses/
https://rolisz.ro/tag/trips/
https://rolisz.ro/tag/reviews/
#subscribe
/2020/02/13/lost-in-space/
/2020/02/13/lost-in-space/
/author/rolisz/
/author/rolisz/
...
/2020/02/07/interview-about-wfh/
/2020/02/07/interview-about-wfh/
/2019/01/30/nas-outage-1/
/2019/01/30/nas-outage-1/
/author/rolisz/
/author/rolisz/
https://rolisz.ro
https://rolisz.ro
https://www.facebook.com/rolisz
https://twitter.com/rolisz
https://ghost.org
javascript:;
#

We search for all the anchor tags, filter only those that have a valid href attribute and we print the value of those attributes.

We see all the links in the output, but there are some issues. First, some of the links are absolute, some are relative, and some are pseudo-links used for doing Javascript things. Second, the links that point towards posts are duplicated and third, there are links that don't point towards something on my blog.

The duplicate problem is easy to fix: we put everything into a HashSet and then we'll get only a unique collection of URLs.

use std::io::Read;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body  = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .map(str::to_string)
        .collect::<HashSet<String>>();
    println!("URLs: {:#?}", found_urls)
}

First we have to convert the URLs from the str type to String, so we get owned values that don't borrow from the original string which contains the whole HTML. Then we insert all the strings into a hash set, using the collect function from Rust, which can gather an iterator into all kinds of containers, in all kinds of situations.

To solve the other two problems we have to parse the URLs, using methods provided by reqwest.

use std::io::Read;
use select::document::Document;
use select::predicate::{Name, Predicate};
use std::collections::HashSet;
use reqwest::Url;

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

fn normalize_url(url: &str) -> Option<String> {
    let new_url = Url::parse(url);
    match new_url {
        Ok(new_url) => {
            if new_url.has_host() && new_url.host_str().unwrap() == "ghost.rolisz.ro" {
                Some(url.to_string())
            } else {
                None
            }
        },
        Err(_e) => {
            // Relative urls are not parsed by Reqwest
            if url.starts_with('/') {
                Some(format!("https://rolisz.ro{}", url))
            } else {
                None
            }
        }
    }
}

fn main() {
    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";
    let mut res = client.get(origin_url).send().unwrap();
    println!("Status for {}: {}", origin_url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();

    let found_urls = get_links_from_html(&body);
    println!("URLs: {:#?}", found_urls)
}

We moved all the logic to a function get_links_from_html. We apply another filter_map to the links we find, in which we check if we can parse the URL. If we can, we check if there is a host and if it's equal to my blog. Otherwise, if we can't parse, we check if it starts with a /, in which case it's a relative URL. All other cases lead to rejection of the URL.

Now it's time to start going over these links that we get so that we crawl the whole blog. We'll do a breadth first traversal and we'll have to keep track of the visited URLs.

use std::time::Instant;

fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> String {
    let mut res = client.get(url).send().unwrap();
    println!("Status for {}: {}", url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();
    body
}

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        // Fetch every new URL and gather all the links found on those pages.
        let found_urls: HashSet<String> = new_urls.iter().map(|url| {
            let body = fetch_url(&client, url);
            let links = get_links_from_html(&body);
            println!("Visited: {} found {} links", url, links.len());
            links
        }).fold(HashSet::new(), |mut acc, x| {
            acc.extend(x);
            acc
        });
        visited.extend(new_urls);

        // Keep only the links we haven't visited yet.
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());
}

First, we moved the code to fetch a URL to its own function, because we will be using it in two places.

Then the idea is that we have a HashSet containing all the pages we have visited so far. When we visit a new page, we find all the links in that page and we subtract from them all the links that we have previously visited. These will be new links that we will have to visit. We repeat this as long as we have new links to visit.

So we run this and we get the following output:

Status for https://rolisz.ro/favicon.ico: 200 OK
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }', src\libcore\result.rs:1165:5
stack backtrace:
   0: core::fmt::write
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libcore\fmt\mod.rs:1028
   1: std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\io\mod.rs:1412
   2: std::sys_common::backtrace::_print
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\sys_common\backtrace.rs:65
   3: std::sys_common::backtrace::print
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14\/src\libstd\sys_common\backtrace.rs:50
...

The problem is that our crawler tries to download pictures and other binary files as text. A Rust String must be valid UTF-8, so when we read arbitrary bytes into one, sooner or later we hit a sequence that is not valid UTF-8 and we get a panic. We could solve this in two ways: either download each URL as bytes and convert to a string only the responses that we know are HTML, or simply skip the URLs that are not HTML. Because I am interested only in the textual content of my blog, I will implement the latter solution.

use std::path::Path;

// Returns true when the URL has no file extension; anything with an extension
// (images, stylesheets, favicons) is assumed not to be an HTML page.
fn has_no_extension(url: &&str) -> bool {
    Path::new(url).extension().is_none()
}

fn get_links_from_html(html: &str) -> HashSet<String> {
    Document::from(html)
        .find(Name("a").or(Name("link")))
        .filter_map(|n| n.attr("href"))
        .filter(has_no_extension)
        .filter_map(normalize_url)
        .collect::<HashSet<String>>()
}

To decide whether a URL points to an HTML page, we check whether it has a file extension, and we use that check as a filter in the function which retrieves the links from the HTML.
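
For reference, the other approach mentioned earlier, downloading the raw bytes and decoding only responses whose Content-Type header says they are HTML, could look roughly like the sketch below. This uses reqwest's blocking API, but it is only an illustration, not what the crawler above does:

// Sketch only: fetch a URL as bytes and decode it only if it looks like HTML.
fn fetch_html(client: &reqwest::blocking::Client, url: &str) -> Option<String> {
    let res = client.get(url).send().ok()?;
    let is_html = res
        .headers()
        .get(reqwest::header::CONTENT_TYPE)
        .and_then(|value| value.to_str().ok())
        .map(|value| value.contains("text/html"))
        .unwrap_or(false);
    if !is_html {
        return None;
    }
    let bytes = res.bytes().ok()?;
    Some(String::from_utf8_lossy(&bytes).into_owned())
}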

Writing the HTML to disk

We are now getting all the HTML we want, time to start writing it to disk.

use std::fs;

fn write_file(path: &str, content: &str) {
    fs::create_dir_all(format!("static{}", path)).unwrap();
    fs::write(format!("static{}/index.html", path), content).unwrap();
}

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                // Strip the origin, keep the path (including the leading "/").
                write_file(&url[origin_url.len() - 1..], &body);
                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .fold(HashSet::new(), |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());
}

We use the create_dir_all function, which works like mkdir -p on Linux, to create the nested folder structure. We write each HTML page to an index.html file inside a folder structure that mirrors the URL structure. Most web servers will then serve the index.html file when that URL is requested, so the output in the browser will be the same as when Ghost serves the pages dynamically.
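
As an illustration of the mapping, the slice passed to write_file keeps the path portion of the URL, including the leading slash (the post URL here is just an example):

fn main() {
    let origin_url = "https://rolisz.ro/";
    let url = "https://rolisz.ro/2020/01/21/blogs-are-best-served-static/";
    let path = &url[origin_url.len() - 1..];
    assert_eq!(path, "/2020/01/21/blogs-are-best-served-static/");
    // write_file(path, &body) would then create
    // static/2020/01/21/blogs-are-best-served-static/index.html
}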

Speeding it up

Letting this run on my blog takes about 110 seconds. Let's see if we can speed it up by downloading the pages in parallel.

use rayon::prelude::*;

fn main() {
    let now = Instant::now();

    let client = reqwest::blocking::Client::new();
    let origin_url = "https://rolisz.ro/";

    let body = fetch_url(&client, origin_url);

    write_file("", &body);
    let mut visited = HashSet::new();
    visited.insert(origin_url.to_string());
    let found_urls = get_links_from_html(&body);
    let mut new_urls = found_urls
        .difference(&visited)
        .map(|x| x.to_string())
        .collect::<HashSet<String>>();

    while !new_urls.is_empty() {
        let found_urls: HashSet<String> = new_urls
            .par_iter()
            .map(|url| {
                let body = fetch_url(&client, url);
                write_file(&url[origin_url.len() - 1..], &body);

                let links = get_links_from_html(&body);
                println!("Visited: {} found {} links", url, links.len());
                links
            })
            .reduce(HashSet::new, |mut acc, x| {
                acc.extend(x);
                acc
            });
        visited.extend(new_urls);
        new_urls = found_urls
            .difference(&visited)
            .map(|x| x.to_string())
            .collect::<HashSet<String>>();
        println!("New urls: {}", new_urls.len())
    }
    println!("URLs: {:#?}", found_urls);
    println!("{}", now.elapsed().as_secs());
}

In Rust there is an awesome library called Rayon, which provides a very simple primitive for running functions in parallel: par_iter, short for parallel iterator. It's an almost drop-in replacement for iter, which the standard library collections provide, and it runs the given closure in parallel, taking care of boring stuff like thread scheduling. Besides changing iter to par_iter, we have to change the fold to a reduce and provide a closure that returns the "zero" element, so that Rayon can create one for each worker thread.
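
To see the fold-to-reduce change in isolation, here is a minimal, self-contained sketch (assuming rayon is added as a dependency): reduce takes a closure that produces the identity value, so every worker thread can start from its own empty HashSet, and the partial sets are then merged.

use rayon::prelude::*;
use std::collections::HashSet;

fn main() {
    let sets = vec![
        HashSet::from(["a".to_string(), "b".to_string()]),
        HashSet::from(["b".to_string(), "c".to_string()]),
    ];
    // First argument: a closure producing the identity (an empty set);
    // second argument: a closure merging two partial results.
    let merged: HashSet<String> = sets.into_par_iter().reduce(HashSet::new, |mut acc, x| {
        acc.extend(x);
        acc
    });
    assert_eq!(merged.len(), 3); // {"a", "b", "c"}
}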

This reduces the running time to 70 seconds, down from 110 seconds.

Proper error handling

One more thing to fix in our program: error handling. Rust helps us a lot with error handling through its built-in Option and Result types, but so far we've been ignoring them, liberally sprinkling unwrap everywhere. unwrap returns the inner value, or panics if there is an error (for Result) or a None value (for Option). To handle these cases correctly, we should create our own error type.

One appearance of unwrap that we can get rid of easily is in the normalize_url function, in the condition new_url.has_host() && new_url.host_str().unwrap() == "ghost.rolisz.ro". This can't possibly panic, because we check first that the host string exists, but there is a nicer way to express this in Rust:

if let Some("ghost.rolisz.ro") = new_url.host_str() {
	Some(url.to_string())
}

To my Rust newbie eyes, it looked really weird at first glance, but it does make sense eventually.

For the other cases we need to define our own Error type, which will be a wrapper around the other types, providing a uniform interface to all of them:

use std::io::Error as IoErr;

#[derive(Debug)]
enum Error {
    Write { url: String, e: IoErr },
    Fetch { url: String, e: reqwest::Error },
}

type Result<T> = std::result::Result<T, Error>;

impl<S: AsRef<str>> From<(S, IoErr)> for Error {
    fn from((url, e): (S, IoErr)) -> Self {
        Error::Write {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

impl<S: AsRef<str>> From<(S, reqwest::Error)> for Error {
    fn from((url, e): (S, reqwest::Error)) -> Self {
        Error::Fetch {
            url: url.as_ref().to_string(),
            e,
        }
    }
}

We have two kinds of errors in our crawler: IoErr (an alias for std::io::Error) and reqwest::Error. The first is returned when trying to write a file, the second when we try to fetch a URL. Besides the original error, we add some context, such as the URL or path that was being accessed when we got the error. We provide implementations to convert from each library error, paired with that context, into our own error type, and we also define a Result helper type so that we don't always have to spell out our Error type.
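
As a quick illustration (the path here is made up), this is what those From implementations buy us: an Error can be built directly from a (context, error) tuple, which is exactly the conversion the ? operator will perform for us below.

// Illustrative only: manually converting a (path, io::Error) pair into our Error.
let io_err = std::io::Error::new(std::io::ErrorKind::Other, "disk full");
let err: Error = ("static/about/index.html", io_err).into();
println!("{:?}", err); // prints the Write variant, with both the path and the error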

fn fetch_url(client: &reqwest::blocking::Client, url: &str) -> Result<String> {
    let mut res = client.get(url).send().map_err(|e| (url, e))?;
    println!("Status for {}: {}", url, res.status());

    let mut body = String::new();
    res.read_to_string(&mut body).map_err(|e| (url, e))?;
    Ok(body)
}

fn write_file(path: &str, content: &str) -> Result<()> {
    let dir = format!("static{}", path);
    fs::create_dir_all(format!("static{}", path)).map_err(|e| (&dir, e))?;
    let index = format!("static{}/index.html", path);
    fs::write(&index, content).map_err(|e| (&index, e))?;

    Ok(())
}

Our two functions that can produce errors now return our Result type. Every operation that can fail gets a map_err that pairs the underlying error with the URL or path involved, and the ? operator then uses the From implementations above to turn that tuple into our own Error.

let (found_urls, errors): (Vec<Result<HashSet<String>>>, Vec<_>) = new_urls
    .par_iter()
    .map(|url| -> Result<HashSet<String>> {
        let body = fetch_url(&client, url)?;
        write_file(&url[origin_url.len() - 1..], &body)?;

        let links = get_links_from_html(&body);
        println!("Visited: {} found {} links", url, links.len());
        Ok(links)
    })
    .partition(Result::is_ok);

Our main loop that downloads new URLs changes a bit. Our closure now returns either a set of URLs or an error. To separate the two kinds of results, we partition the iterator based on Result::is_ok, resulting in two vectors: one with HashSets and one with Errors, but both still wrapped in Results.

visited.extend(new_urls);
new_urls = found_urls
    .into_par_iter()
    .map(Result::unwrap)
    .reduce(HashSet::new, |mut acc, x| {
        acc.extend(x);
        acc
    })
    .difference(&visited)
    .map(|x| x.to_string())
    .collect::<HashSet<String>>();
println!("New urls: {}", new_urls.len());

We handle each vector separately. For the successful one, we have to unwrap the results and then merge all the HashSets into one.

println!(
   "Errors: {:#?}",
    errors
        .into_iter()
        .map(Result::unwrap_err)
        .collect::<Vec<Error>>()
);

For the Vec containing the Errors, we have to unwrap the errors and then we just print them out.

And with that we have a small and simple web crawler, which runs fairly fast and which handles most (all?) errors correctly. The final version of the code can be found here.

Special thanks to Cedric Hutchings and lights0123 who reviewed my code on Code Review.

]]>
<![CDATA[ Blogs are best served static ]]> https://rolisz.ro/2020/01/21/blogs-are-best-served-static/ 5e2765e5b362a165945d5440 Wed, 22 Jan 2020 00:26:05 +0300 Or there and back again.

Earlier this month I moved to Ghost. I did that because I wanted to have a nice editor, I wanted to be able to write easily from anywhere, and I wanted to spend less time in the terminal. But I knew moving to a dynamic site would have performance penalties.

When looking at the loading time of a single page in the browser's inspector tools, everything seemed fine: load times around 1.3s, seemingly even faster than my old site. But then, I did some load tests using loader.io to see how my new blog performs. The results were not pretty: my blog fell over with 2-3 concurrent requests sustained for 10 seconds. Ghost couldn't spawn more threads, segfaulted and then restarted after 20 seconds.

I am using the smallest DigitalOcean instance, with 1 virtual CPU and 1 GB of RAM. I temporarily resized the instance to 2 and then 3 vCPUs, but the results were still pretty poor: even the 3 vCPU instance couldn't handle more than 10 connections per second for 10 seconds.

While this would not be a problem with my current audience (around 50-100 page views per day), I have great dreams of my blog making it to the front page of Hacker News and getting ten thousand views in one day. And I would rather not have my site fall over in such cases.

I knew my static blog could easily sustain 1000 simultaneous connections, so I went back and combined the two, to get the best of both worlds: a nice frontend to write posts and preview them, with the speed of a static site.

I looked a bit into using some tools like Gatsby or Eleventy to generate the static site, but they were quite complicated and required maintaining my theme in yet another place. But I found a much simpler solution: wget. Basically I wrote a crawler for my own website, dumped everything to HTML and reuploaded it to my website.

In order to do this, I set up nginx to proxy a subdomain to the Ghost blog. Initially I wanted to set it up as a "folder" under my domain, but Ghost Admin doesn't play nice with folders. I won't link to it here, both because I don't want it widely available and for another reason I'll explain later.

Then I used the following bash script:

#!/bin/bash

# Define source and destination URLs
from_url=https://subdomain.rolisz.ro
to_url=https://rolisz.ro

# Copy blog content
wget --recursive --no-host-directories --directory-prefix=static --timestamping --reject=jpg,png,jpeg,JPG,JPEG --adjust-extension --timeout=30  ${from_url}/

# Copy 404 page
wget --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent --content-on-error --timestamping ${from_url}/404.html

# Copy sitemaps
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap.xsl
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap.xml
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-pages.xml
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-posts.xml
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-authors.xml
wget --recursive --no-host-directories --directory-prefix=static --adjust-extension --timeout=30 --no-parent ${from_url}/sitemap-tags.xml

# Replace subdomain with real domain
LC_ALL=C find ./static -type f -not -wholename *.git* -exec sed -i -e "s,${from_url},${to_url},g" {} +;

I start by crawling the front page of the blog. I exclude images because they are already on the same server: I upload them through Ghost, so nginx can serve them directly from where Ghost stores them, at the correct URL. The --adjust-extension option is needed to automatically create .html files, instead of leaving extensionless files around.

Then I crawl the 404 page and the sitemap pages, which are all well defined.

Wget has a --convert-links option, but it's not too smart: it fails badly on image srcsets, screwing up extensions. Because of this I didn't use it, instead opting for good ol' sed to replace all occurrences of the subdomain URL with the normal URL. And because I don't want to add a special exception for this post, I can't include my actual subdomain in the text, or it would get converted too.

For now, I run this locally, but I will set up a script that is called on a cronjob or something.

After downloading all the HTML, I upload the content to my blog with rsync:

rsync -pruv static rolisz.ro:/var/www/static/

Downloading takes about 2 minutes, uploading about 5 seconds.

After doing all this, I redid the loader.io tests.

Left: latency, right: number of clients

The static site managed to serve up to 1800 clients per second, averaging 1000/s for a minute. That's good enough for now!

Longer term, I plan to rewrite the crawler, to be smarter/faster. For example, wget doesn't have any parallelism built in. Old posts don't change often, so they could be skipped most of the time. But doing that correctly takes more time, so for now, I'll stick to this simple solution!

]]>
<![CDATA[ Moving from Acrylamid to Ghost ]]> https://rolisz.ro/2020/01/18/moving-from-acrylamid-to-ghost/ 5e1d798d4bdee83266544133 Sat, 18 Jan 2020 23:09:48 +0300 When I decided to move to Ghost at the beginning of the month, I realized that I needed to act quickly: I had kept postponing this for years, so either I did it during the winter holidays, or it would get put off for who knows how long. So I set up the Ghost instance on DigitalOcean, which was a simple process. I also manually moved the last 20 posts and the 10 most viewed posts, so that there would be some content here, and then I switched the DNS for rolisz.ro to point to the Ghost instance. Moving those posts by hand took me about two days.

But I started blogging 10 years ago. In the meantime, I have written over 400 posts. Some of them have not aged well, and I deleted them. Some pointed to YouTube videos that no longer exist, some were references to my exams from university, and some are posts I am simply too ashamed that I ever wrote. But that still leaves me with about 350 posts that I wanted to keep.

I didn't want to move another 330 posts by hand, so I wrote a tool to export my data from Acrylamid into JSON and then to import them into Ghost.

Ghost uses MobileDoc to store post content. The recommended way of importing posts from external sources is to use the Ghost Admin API to import HTML and then Ghost will do a best effort conversion to MobileDoc. Unfortunately, they say it's a lossy conversion, so some things might not look the same when Ghost renders an HTML from the MobileDoc.

My posts were in Markdown format. The easiest way to hack an exporter together was to piggyback on top of Acrylamid, by modifying the view that generated the search JSON. That view already exported a JSON, but it was stripped of HTML and it didn't contain some metadata, such as the URL. I removed the HTML stripping, enabled all filters, and added the needed metadata. Because I had a custom picture gallery filter, I had to modify it to add <!--kg-card-begin: html--> before the gallery code and <!--kg-card-end: html--> after it. These two comments tell the Ghost importer to put whatever is between them in an HTML card.

The importer uses the recommended Admin API for creating the posts. To use the Admin API, you have to create a new custom integration and get the admin API key from there. To upload HTML formatted posts, you have to append ?source=html to the post creation endpoint.

import jwt  # the PyJWT package
import requests
from datetime import datetime as date

# ADMIN_KEY is the Admin API key of the custom integration, in "id:secret" form.
# Split the key into ID and SECRET
id, secret = ADMIN_KEY.split(':')

def write_post(title, post_date, tags, content=None):
    # Prepare header and payload
    iat = int(date.now().timestamp())

    header = {'alg': 'HS256', 'typ': 'JWT', 'kid': id}
    payload = {
        'iat': iat,
        'exp': iat + 5 * 60,
        'aud': '/v3/admin/'
    }

    # Create the token (including decoding secret)
    token = jwt.encode(payload, bytes.fromhex(secret), algorithm='HS256', headers=header)

    # Make an authenticated request to create a post
    url = 'https://rolisz.ro/ghost/api/v3/admin/posts/?source=html'
    headers = {'Authorization': 'Ghost {}'.format(token.decode())}
    body = {'posts': [{'title': title, 'tags': tags, 'published_at': post_date, 'html': content}]}
    r = requests.post(url, json=body, headers=headers)

    return r
Python function to upload a new post to Ghost

Because I had already manually moved some posts (and because I first ran the importer script on a subset of all the posts), I needed to check whether a post already existed before inserting it, otherwise Ghost would create a duplicate entry. To do this, I used the fact that Ghost creates the same slug from a title as Acrylamid did. This actually failed for about 5 posts (for example, ones which had apostrophes or accented letters in the title), but I cleaned those up manually.

import datetime
import json
from time import sleep

import requests

posts = json.load(open("posts.json"))

for f in posts:
    key = "https://rolisz.ro" + f['url']
    resp = requests.get(key)
    sleep(0.5)
    d = datetime.datetime.strptime(f["date"], "%Y-%m-%dT%H:%M:%S%z")
    if resp.status_code != 200:
        if "/static/images/" in f['content']:
            f['content'] = f['content'].replace("/static/images/", "/content/images/")
        write_post(f['title'], d.isoformat(timespec='milliseconds'),
                   f['tags'], f['content'])
        sleep(1)
Code to prepare posts for upload

Ghost also expected the post publish date to have timezone information, which my exporter didn't add, so I had to do a small conversion here. I also corrected the paths of the images: previously they were in a folder called static, while Ghost stores them in content.

Because my Ghost blog is hosted on a 5$ DigitalOcean instance (referral link), it couldn't handle my Python script hammering it with several posts a second, so I had to add some sleeps, after checking the existence of posts and after uploading them.

After uploading all posts like this, I still had to do some manual changes. For example, Ghost has the concept of featured image and I wanted to use it. In general I want my posts going forward to have at least one image, even if it's a random one from Unsplash. In some cases, I could use an existing image from a post as a featured image, in other cases I had to find a new one. Also, code blocks weren't migrated smoothly through the MobileDoc converter, so most of them needed some adjustment.

Going through all my old posts took me a couple of days (much less, though, than it would have taken without the importer) and it was a fun nostalgia trip through the kind of things that were on my mind 10 years ago. For example, back then I was very much into customizing my Windows install, with all kinds of Visual Styles, desktop gadgets and tools to make you more "productive". I now use only one thing from that list: F.lux. Also, the reviews that I did of books, movies and TV shows were much more bland (at least I hope I write in a more entertaining style now).

]]>