<![CDATA[ rolisz's site - Technical Posts ]]> https://rolisz.ro <![CDATA[ Playing Codenames in Rust with word vectors ]]> https://rolisz.ro/2020/09/26/playing-codenames-in-rust-with-word-vectors/ 5f3adc474f71eb12e0abb8ca Sat, 26 Sep 2020 21:32:01 +0300 In a previous post I implemented the game of Codenames in Rust, allowing a human player to interact with the computer playing randomly. Now let's implement a smarter computer agent, using word vectors.

Word vectors (or word embeddings) are a way of converting words into high dimensional vectors of numbers. This means that each word has a long list of numbers associated with it, and those numbers aren't completely random: related words usually end up close to each other in the vector space. Computing those numbers from raw data takes a long time, but there are many pretrained embeddings on the internet that you can just use, and there are also libraries that help you find the words that are closest to a target word.

Word vectors in Rust

Machine learning has embraced the Python programming language, so most ML tools, libraries and frameworks are written in Python, but some are starting to show up in Rust as well. Rust's focus on performance attracts people, because ML is usually computationally intensive.

There is one library in Rust that does exactly what we want: FinalFusion. It has a set of pretrained word vectors (quite fancy ones, with subword embeddings) and it has a library to load them and to make efficient nearest neighbor queries.

The pretrained embeddings come in a 5 GB file, because they pretty much literally have everything, including the kitchen sink, so the download will take a while. Let's start using the library (after adding it to our Cargo.toml file) to get the nearest neighboring words for "cat":

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader)
            .unwrap();
    println!("{:?}", embeddings.word_similarity("cat", 5).unwrap());
}

After running it we get the following output:

[WordSimilarityResult { similarity: NotNan(0.81729543), word: "cats" },
WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" },
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }]
Sidenote:

Loading a 5GB file from disk will take some time. If you have enough RAM, it should be in the OS's file cache after the first run, so it will load faster. Also, compiling this program with --release (turning on optimizations and removing debug information) will speed it up significantly.
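For reference, the optimized build is just the standard Cargo flag:

> cargo run --release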

One of the rules of Codenames is that hints can't be any of the words on the board or direct derivatives of them (such as plural forms). The finalfusion library has support for masking some words out, but to get plural forms I resorted to another library called inflector, whose to_plural method does exactly what it says on the box.

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::EmbeddingSimilarity;
use std::collections::HashSet;
use inflector::string::pluralize::to_plural;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader)
            .unwrap();
    let word = "cat";
    let embed = embeddings.embedding(word).unwrap();
    let mut skip: HashSet<&str> = HashSet::new();
    skip.insert(&word);
    let pluralized = to_plural(word);
    skip.insert(&pluralized);
    let words = embeddings.embedding_similarity_masked(embed.view(), 5, &skip).unwrap();
    println!("{:?}", words);
}

This is a slightly lower-level interface: we first obtain the embedding of the word, build the set of words to skip, and then search for the words most similar to the vector we pass in. The output is:

[WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" }, 
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }, 
WordSimilarityResult { similarity: NotNan(0.7471396), word: "kittens" }]

It's better. Ideally, we could also somehow remove all composite words based on words from the board, but that's a bit more complicated.
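One rough way to do that (just a sketch, not implemented in the game) would be to reject any candidate hint that contains a board word as a substring; the helper below is hypothetical and would still miss composites that change the stem:

// Hypothetical helper: reject hints such as "catfish" when "cat" is on the board.
fn is_composite_of(hint: &str, board_words: &[&str]) -> bool {
    board_words.iter().any(|w| hint.contains(*w))
}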

This can be wrapped in a function, because it's a common use case:

fn find_similar_words<'a>(word: &str, embeddings: &'a Embedding, limit: usize) -> Vec<WordSimilarityResult<'a>> {
    let embed = embeddings.embedding(&word).unwrap();
    let mut skip: HashSet<&str> = HashSet::new();
    skip.insert(&word);
    let pluralized = to_plural(&word);
    skip.insert(&pluralized);
    embeddings.embedding_similarity_masked(embed.view(), limit, &skip).unwrap()
}
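With this helper, the earlier example boils down to a single call (assuming the embeddings are already loaded as before):

let words = find_similar_words("cat", &embeddings, 5);
println!("{:?}", words);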

Implementing the first spymaster

Let's implement our first spymaster which uses word vectors! First, let's define a type alias for the embedding type, because it's long and we'll use it many times.

type Embedding = Embeddings<VocabWrap, StorageViewWrap>;

Our spymaster will have two fields: the embeddings and the color of the player. The Spymaster trait requires only the give_hint function to be implemented.

pub struct BestWordVectorSpymaster<'a> {
    pub embeddings: &'a Embedding,
    pub color: Color,
}

impl Spymaster for BestWordVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);
        let mut best_sim = NotNan::new(-1f32).unwrap();
        let mut best_word = "";
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, 1);
            let hint = words.get(0).unwrap();
            if hint.similarity > best_sim {
                best_sim = hint.similarity;
                best_word = hint.word;
            }
        }
        return Hint{count: 1, word: best_word.to_string()};
    }
}

This spymaster uses a simple greedy algorithm: it takes each word that has to be guessed and finds the most similar word to it, while keeping track of the similarity. It returns as the hint the word that had the highest similarity to any of the words belonging to the opposite team.

How does it do? I drew some random boards with a fixed seed and ran this spymaster on them. If you hover over the hint, you can see which words it's based on.

We have a problem: the word embeddings we use are a bit too noisy. Word embeddings are usually trained on large text corpora crawled from the internet, such as Wikipedia, the Common Crawl project or the CoNLL 2017 dataset (the one used above). The problem with these large corpora is that they are not perfectly cleaned. For example "-pound" is considered a word. Let's try the CC embeddings:

Unfortunately, the CC embeddings give even worse results.

Cleaning up the embeddings

My fix for this was to write a script that prunes the embeddings down to only "real" words (ones made up entirely of lowercase letters). First, I had to get a set of all these words.

    let words = embeddings.vocab().words();
    let mut total = 0;
    let mut lowercase = 0;
    let mut select = HashSet::new();
    for w in words {
        total += 1;
        if w.chars().all(char::is_lowercase) {
            lowercase +=1;
            select.insert(w.clone());
        }
    }
    println!("{} {}", total,  lowercase);

Then I had to get the embeddings and the norms for each of these words:

    let mut selected_vocab = Vec::new();
    let mut selected_storage = Array2::zeros((select.len(), embeddings.dims()));
    let mut selected_norms = Array1::zeros((select.len(),));

    for (idx, word) in select.into_iter().enumerate() {
        match embeddings.embedding_with_norm(&word) {
            Some(embed_with_norm) => {
                selected_storage
                    .row_mut(idx)
                    .assign(&embed_with_norm.embedding);
                selected_norms[idx] = embed_with_norm.norm;
            }
            None => panic!("Cannot get embedding for: {}", word),
        }

        selected_vocab.push(word);
    }

And finally write the now much smaller embedding file:

    let new_embs = Embeddings::new(
        None,
        SimpleVocab::new(selected_vocab),
        NdArray::from(selected_storage),
        NdNorms::new(selected_norms),
    );
    let f = File::create("resources/smaller_embs.fifu").unwrap();
    let mut writer = BufWriter::new(f);
    new_embs.write_embeddings(&mut writer).unwrap();

On the embeddings trained on the CoNLL dataset the reduction is about 6x: from 1,336,558 words down to 233,453.

Let's give our spymaster another shot, simply by changing the file from which we load the embeddings:

"superheroic" and "marched" look kinda against the rules, being too close to one of the words on the board, but "movie" is a really good one word hint.

Implementing a field operative

Now let's implement the other part of the AI team: the field operative which has to guess which words from the board belong to the enemy, based on the hints the spymaster gave.

pub struct SimpleWordVectorFieldOperative<'a> {
    embeddings: &'a Embedding,
}

impl FieldOperative for SimpleWordVectorFieldOperative<'_> {
    fn choose_words<'a>(&mut self, hint: &Hint, words: &[&'a str]) -> Vec<&'a str> {
        let hint_emb = self.embeddings.embedding(&hint.word).unwrap();
        let hint_embedding = hint_emb.view();
        let mut similarities = vec![];
        for w in words {
            let new_embed = self.embeddings.embedding(&w).unwrap();
            let similarity: f32 = new_embed.view().dot(&hint_embedding);
            similarities.push((w, similarity));
        }
        similarities.iter()
            .sorted_by(|(_, e), (_, e2)| e.partial_cmp(e2).unwrap())
            .rev().take(hint.count).map(|x| *x.0).collect()
    }
}

The field operative is even simpler: we go through all the words still on the board and compute a similarity score between each of them and the hint, sort the words by similarity, and take the top "count" ones.

Let's see how it does on the same maps as before. If you hover over the field operative, you can see the guesses it makes.

It looks like the two AI players are a good fit for each other: the field operative always guesses the word that the spymaster based the hint on. Now, let's try to make the spymaster give hints for multiple words.

Improving the spymaster

My first idea is to generate the top n closest embeddings for all words of a color and see if any of them are in common. n is a tunable parameter: lower values give hints that are closer to the words but match fewer of them; higher values match more words, but the hints are potentially worse.

impl Spymaster for DoubleHintVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);

        let mut sim_words = HashMap::new();
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, self.n);
            for w in words {
                let count = sim_words.entry(w.word).or_insert(0);
                *count +=1;
            }
        }
        let best_word = sim_words.iter()
            .max_by_key(|(_, &count)| count).unwrap();

        return Hint{count: *best_word.1 as usize, word: best_word.0.to_string()};
    }
}

We store how often each suggested word shows up across all the suggestions in a hashmap, with the key being the word and the value being the occurrence count. After we add all the words, we simply read out the maximum by occurrence count. In Rust, iteration order over a HashMap is nondeterministic, so if there are multiple words that occur the same number of times, which one is returned is not guaranteed.

Around n=40, we start seeing hints for multiple words. At n=80 we have hints for two words on all three maps. At n=400 we have triple hints for two of the maps. But starting with n=80, the field operative no longer guesses all of the source words correctly. Sometimes it's because the association between the words is weird, but more often it's because the spymaster only takes into account the words to which it should suggest related hints and ignores the words from which the hint should be dissimilar.

There are several ways to address this issue, from simply seeing if the hint word is too close to a bad word and rejecting it, to more complex approaches such as finding the max-margin plane that separates the good words from the bad words and looking for hint words near it. But this post is already long, so this will come in part 3.
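Just to sketch the simplest of these ideas (this is only an illustration, not necessarily what part 3 will do): given the list of words the operative must not pick, a candidate hint could be rejected if its similarity to any of them exceeds a threshold.

// Hypothetical sketch: `bad_words` would be the words the operative must not pick,
// and `threshold` is a tunable parameter.
fn too_close_to_bad_word(embeddings: &Embedding, hint: &str, bad_words: &[&str], threshold: f32) -> bool {
    let hint_emb = embeddings.embedding(hint).unwrap();
    bad_words.iter().any(|w| {
        let emb = embeddings.embedding(*w).unwrap();
        emb.view().dot(&hint_emb.view()) > threshold
    })
}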

The whole code I have so far can be seen on Github.

]]>
<![CDATA[ Getting rid of vmmem ]]> https://rolisz.ro/2020/07/31/getting-rid-of-vmmem/ 5f23f7dd4f71eb12e0abb780 Fri, 31 Jul 2020 20:52:17 +0300 A while ago I noticed in Task Manager that there is a process called Vmmem running and it's using about 5-15% CPU constantly. A quick duckduckgoing revealed that it's a virtual process that represents the total system usage by VMs.

Alright, so it's not malware. But I was not running any VMs. Where did it come from? My first suspicion was Docker, which runs Linux containers in a VM. But I closed the Docker for Windows application, and Vmmem was still chugging along, burning CPU. Then I suspected that the Windows Subsystem for Linux might be doing something funny, but no, it wasn't that, because I'm still using version 1, not version 2, which is the one that runs in a VM.

Well, after some more searching, it turns out that in some cases, Docker doesn't clean up properly when quitting and it leaves the VM open. To kill it, you must open the application called Hyper-V Manager and turn off the VM there manually.
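If you prefer the command line, the same thing can probably be done from an elevated PowerShell with the Hyper-V module; the VM name below is a placeholder, check what Get-VM actually lists:

> Get-VM
> Stop-VM -Name "DockerDesktopVM"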

To paraphrase Homer Simpson, "Docker, the cause of, and solution to, all our problems".

I’m publishing this as part of 100 Days To Offload - Day 32.

]]>
<![CDATA[ Showing current Kubernetes cluster in Powershell prompt ]]> https://rolisz.ro/2020/07/14/showing-current-kubernetes-cluster-in-powershell-prompt/ 5f0db64751d8dc2b1662a985 Tue, 14 Jul 2020 22:28:49 +0300 After nearly clearing the wrong Kubernetes cluster, I decided to add the name of the currently active cluster to my Powershell prompt. There are plenty of plugins to do this in Bash/Zsh/fish, but not as many for Powershell.

It's not hard to do, but the syntax and tools you use are definitely different from Unix tools.

First, let's get the name of the currently active cluster. We can look through the ~/.kube/config file for the field called current-context. On Linux, I would use grep to extract this; on Windows we use Select-String, which receives a regex to match and outputs the matches as objects. We take the first match, look at its second group (which is the first and only capture group in our regex) and put its value in the $ctx variable. This should output the current cluster name.

$K8sContext=$(Get-Content ~/.kube/config | Select-String -Pattern "current-context: (.*)")
$ctx=$K8sContext.Matches[0].Groups[1].Value
Write-Host $ctx

And now, to edit the prompt, you must modify your PowerShell profile. If you don't have one, create the file C:\Users\<USERNAME>\Documents\WindowsPowerShell\profile.ps1. If you already have a profile (which might be a global one or a per-user one), edit it and add the following:

function _OLD_PROMPT {""}
copy-item function:prompt function:_OLD_PROMPT
function prompt {
	$K8sContext=$(Get-Content ~/.kube/config | Select-String -Pattern "current-context: (.*)")
	If ($K8sContext) {
		$ctx=$K8sContext.Matches[0].Groups[1].Value
		# Set the prompt to include the cluster name
		Write-Host -NoNewline -ForegroundColor Yellow "[$ctx] "
	}
	_OLD_PROMPT
}

The prompt function is called to write out the prompt. First we save a copy of the original prompt into the _OLD_PROMPT function and then we define our new prompt function. In the new function we do the above snippet, with an added check so that we only add something to the prompt if there was a match for our regex. I put the name of the cluster in square brackets to make it visually distinct from the Python virtual environment name, which comes afterward in parentheses.

The result is as follows:

Good luck with not nuking the wrong Kubernetes cluster!

I’m publishing this as part of 100 Days To Offload - Day 29.

]]>
<![CDATA[ Giving code presentations ]]> https://rolisz.ro/2020/07/04/giving-code-presentations-in-jupyter-notebook/ 5efb28a717253e7fe6dd646b Sat, 04 Jul 2020 23:28:33 +0300 I sometimes give talks at the local tech hub in the city where I live. It's not a big community, but I enjoy giving talks and they often provide a necessary deadline and motivation to finish some projects.

Last week I gave a talk about Rust. Given that there are still some restrictions on how many people can be in one room, the physical audience was only 10 people, but there was a livestream as well.

Until now, I had used Google Slides for my presentations. For talks that don't have a lot of code, it works fine. But when you are presenting lots of code (such as a tutorial for a programming language), I found Slides to be lacking. If you paste in the code directly, you can't have syntax highlighting. You can paste in a screenshot, but then any later modifications to the slide mean retaking the screenshot and replacing it, so it's more work.

You can present in an IDE, but sometimes you want to have slides with normal text between pieces of code, where you explain some things. Switching between two apps can quickly get annoying. Also, it's hard to prepare just "bite-sized" content in an IDE, but that is needed so that the audience is focused only on what you are explaining right now.

So I decided to try something new for my intro to Rust presentation: I used Jupyter Notebook with several extensions and I think it worked pretty well (except for a bug towards the end of the presentation).

For this I used the RISE extension, which adds live slide show support to Jupyter, using reveal.js. Each cell can be either a new slide, a sub-slide (so to get to it you have to "navigate down", in reveal.js style), a fragment (so it shows up on the same slide, but on a subsequent click), or notes. You can write new code and run it even during slideshow mode, so it's very useful if someone in the audience has a question, you can quickly write down and execute code to answer them. RISE is simple to install:

> pip install RISE

Then I used a bunch of extensions that are bundled together in the jupyter_contrib_nbextensions package. By default, you have to enable and configure them by editing JSON files, but there is another plugin that adds a dashboard for them, called jupyter_nbextensions_configurator. They can be installed with:

> pip install jupyter_contrib_nbextensions
> jupyter contrib nbextension install --user
> pip install jupyter_nbextensions_configurator
> jupyter nbextensions_configurator enable --user

You have to restart the Jupyter process and now you will see a new tab on the home page of the local Jupyter page, where you can enable and configure all the installed extensions.

I used the "Hide input" extension. Most of my code was organized into two cells. One which didn't contain all the code, just a snippet on which I wanted to focus (for example, I made a small change to a previously defined function), and another one which could be run and showed output. The latter cell was hidden with this extension, so that only the output could be seen.

Initially I also used the "Split cell" extension. This extension gives you a button which can make a cell half width. If two consecutive cells are half width, they align next to each other, making two columns. I wanted to use this to have code in the left column and explanations in the right column. This would have worked if the presentation had been online only, because I wouldn't have had to zoom in too much. But because in the last week before the presentation we found out that we were allowed to hold it in person (with 10 people in the audience) and I had to present on a projector and zoom in, I ended up removing all the split cells because the content wouldn't fit any longer.

Making Rust work with Jupyter

All the above is generic and can be made to work with anything that works in Jupyter. To make Rust work in Jupyter you need a kernel for it. Some guys from Google have made one called evcxr_jupyter.

It's fairly straightforward to install. On Windows you need to first install CMake and then you run:

> cargo install evcxr_jupyter
> evcxr_jupyter --install

After restarting the Jupyter process, you now have the option of using a Rust kernel. To include Cargo dependencies, you can insert the following into a cell:

:dep reqwest = { version = "0.10", features = ["json", "blocking"] }

This downloads reqwest, compiles it and makes it available for use in other cells.
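A later cell can then use the crate directly; something like this (a small sketch, the URL is just an example):

let body = reqwest::blocking::get("https://rolisz.ro").unwrap().text().unwrap();
println!("Fetched {} bytes", body.len());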

The notebook for the presentation that I gave can be found in a Github repo and the recording can be found here.

I’m publishing this as part of 100 Days To Offload - Day 27.

]]>
<![CDATA[ My operating system journey ]]> https://rolisz.ro/2020/06/14/operating-system-journey/ 5ec2eed817253e7fe6dd5a49 Sun, 14 Jun 2020 21:46:40 +0300 Getting started with Linux

Ten years ago I was writing about how I'm a big fan of Windows (and Android). I would regularly get into friendly debates with Cătălin, who was a staunch supporter of Linux. I kept dipping my toes into Linux, but for a long time, I kept getting burned.

At my first internship and then even more so at my first job, I learned more and more about Linux and got comfortable in it. I started dual booting. By 2014, the most used OS on my laptop was Fedora.

When I built a desktop in 2015, I first installed Linux on it, even though I had to try several distributions until I found one that worked. I was in my "command-line" minimalist phase, so I set up i3, tmux, and fish. I was quite happy with it, but eventually I installed Windows 10 on it so that I could play games, run the Swiss tax form application and YNAB, a budgeting app.

Trying out Mac OS

My work laptop at the time was a MacBook. I thought I would like it and I was looking forward to trying out all the cool and hipster apps that were only on Mac OS, such as Alfred. In the end, while working at Google, I used only a browser and a terminal, and I never got around to really working with any other apps, because I didn't need them. The terminal experience on a Mac requires a bit more searching around to get things working. Macs come with old libraries out of the box, you have to update them using copy-pasted shell commands, and I managed to screw things up once with Homebrew. I was not impressed by Mac OS and I didn't want to spend my own money on that crappy keyboard.

Slowly turning back to Windows

But Windows (and its ecosystem) has changed a lot since then. When I bought my new laptop in 2018, it came with Windows and I never bothered installing Linux on it. Why? Windows Subsystem for Linux. You get pretty much all the CLI goodies from Linux and all the other nice stuff from Windows. For example, as far as I know, there's almost no laptop where Linux has battery life comparable to Windows, and that is an important factor for me, because I work remotely.

On my desktop I still had ArchLinux, because running Tensorflow was easier on Linux than on Windows (modulo the Nvidia driver updates). But slowly I got bored of the "command line" minimalism. I tried other desktop environments on Linux, such as KDE and Gnome, but they never stuck. KDE is too bloated, and I find the default theme to be outdated. Gnome looks nice, but I never got around to feeling comfortable in it. The others are too "fringe" for me and I think that it's too hard to find solutions to the problems that inevitably crop up, just because the community is too small.

For the last two months, I have found myself using almost only Windows, even on my desktop. This way, I can watch Netflix at the highest resolution (on Linux, you can watch only in the browser, where it's capped at 720p), and I can play games. Rust works just as well on Windows as on Linux. WSL satisfies my very few needs for Linux-only apps. And I never had problems with Nvidia drivers on Windows (unlike on Linux). The new Terminal app on Windows is pretty sweet. Powershell is pretty cool too, even though I don't know much of the syntax so far.

And honestly, I just like the default UI on Windows more. 10 years ago I had the patience to tinker with themes and to customize my desktop endlessly, but now I don't have the time and energy to deal with that anymore. I see plenty of nice Linux themes on Reddit and I tried to replicate one, but abusing fonts to get some nice "symbols" in i3-bar? Ewww.

Even though many people complain about Windows updates messing things up, that has never happened to me in the last 5 years, even though I am running the insider preview version of Windows 10. On the other hand, I did manage to screw things up with ArchLinux updates, but it was usually my fault, because I didn't read the instructions or I let too much time pass between updates.

Servers

That's the story for my desktops and laptops. On servers, it's Linux all the way. My NAS runs Linux. My VPSs run Linux. And I plan to keep it that way. But there I don't mind SSHing in and doing the at most half an hour of work per week from the command line.

The only thing that I didn't try was a variant of BSD. Five years ago I might have given it a shot, but now I don't want to relearn a lot of things, from command line flags to concepts like jails. The strongest argument for BSD would be security, but Linux is secure enough for me, for now.

The future

But who knows what will happen in the future? Maybe in five years I'll get bored again of Windows and I'll try something new. Maybe Fuchsia will become mature by then :D

I’m publishing this as part of 100 Days To Offload - Day 23.

]]>
<![CDATA[ DuckDuckGo ]]> https://rolisz.ro/2020/05/26/duckduckgo/ 5ecd754317253e7fe6dd5d9e Tue, 26 May 2020 23:52:41 +0300 Some of my more astute readers might have noticed the "word" "DuckDuckGo" appearing several times on my blog this year, both as a noun and as a verb. Contrary to how the name sounds, it's not a board game; it's a search engine. It's an alternative to Google.

DuckDuckGo promises to be more privacy friendly. They say they don't track you and they don't show targeted ads, only keyword related ads.

Below are the results if you search for me on DuckDuckGo and on Google:

Google has a bit fancier organization of my blog content. But DuckDuckGo actually has only my accounts on the first page, while Google quickly veers off to showing links to other people, such as a LoL gamer.

One of the really cool features of DDG are the bangs. If you enter !g in your search query, DuckDuckGo will redirect you to search on Google (supposedly with fewer cookies). !yt will search on YouTube, !gm on Google Maps and so on. They have several thousand bangs already. This is actually a time saving feature, because you can search directly from the address bar, without having to go to that other site first. And when DDG search results are not good enough, you can just add a bang and search on Google.

Another cool feature for programmers: DDG has much better StackOverflow integration. When you search for something programming related, they often show a snippet on the side with the top answer from StackOverflow. It makes copy pasting sooo much easier.

If you think it's funny that an ex-Googler uses DuckDuckGo, you should have seen the faces of my colleagues at Google when they saw that I was using DDG at Google. And even funnier was when my manager noticed this and then we had a discussion about why he doesn't use it: local queries (searching for nearby restaurants for example) were not working very well in DuckDuckGo and he used them very often. Well Suman, if you are reading this, DuckDuckGo now has you covered on that front too.  

If you are interested in alternatives to the big tech companies, I highly recommend using DuckDuckGo.

I’m publishing this as part of 100 Days To Offload - Day 16.

]]>
<![CDATA[ Bridging networks with a Synology NAS ]]> https://rolisz.ro/2020/05/19/bridging-networks-with-a-synology-nas/ 5ec433a417253e7fe6dd5a50 Tue, 19 May 2020 23:25:07 +0300 Warning: potentially bad idea ahead

For various reasons, I had to reorganize my working setup at home, including moving my home office to the room where my Synology NAS is. In this room, I had only one Ethernet outlet, but I needed two connections, one for the NAS and one for my desktop. I was too lazy to go to the nearest electronics shop to buy a switch this evening and I didn't want to unplug the NAS, but then I had an idea:

My NAS has two Ethernet ports. What if I use it as a sort of router, connecting the second Ethernet port directly to my desktop?

Let's give it a shot. I connected the desktop to the NAS. Both of them perceive that there is something connected, but no IP addresses are assigned.

I tried to fool around in the Control Panel of the Synology to enable a DHCP server for the second LAN interface. Eventually I got an IP on the desktop and I could load the Synology server, but I couldn't access the internet.

After some DuckDuckGo-ing and wading through all the posts saying that this is a bad idea and it's not how the Synology should be used, I found a Github repo that said it can bridge the two networks of a Synology. The script there is a bit of overkill for what I needed, so here is the gist of what I needed to get things working:

First, enable vSwitch:

Where to find the vSwitch Settings

Then SSH into the NAS and run the following two commands:

> sudo ovs-vsctl del-br ovs_eth1
> sudo ovs-vsctl add-port ovs_eth0 eth1
> sudo ovs-vsctl show 
    Bridge "ovs_eth0"
        Port "eth1"
            Interface "eth1"
        Port "eth0"
            Interface "eth0"
        Port "ovs_eth0"
            Interface "ovs_eth0"
                type: internal

If the output of the final command shows a bridge and two associated Ports, you're good to go and browse the Internet!

I don't actually intend to keep this as a long term solution. A NAS is not meant to function as a router or switch, so it's not optimized for this. A real switch is probably faster, but a Speedtest shows that I have 250 Mb/s download, so it's pretty good for now, until I get around to buying a switch.

I’m publishing this as part of 100 Days To Offload - Day 13.

]]>
<![CDATA[ An unexpected error in Rust ]]> https://rolisz.ro/2020/05/18/an-unexpected-error-in-rust/ 5ec2e34517253e7fe6dd59e2 Mon, 18 May 2020 23:04:11 +0300 As I continue my journey to learn Rust, I occasionally run into various hardships and difficulties. Most of them are related to lifetimes and ownership, but they lead to a pleasant moment of enlightenment when I figure it out. But for the last couple of days, I struggled with a much dumber error, one which I should have figured out much faster.

I was trying to load some word vector embeddings, from the finalfusion package. Word vectors have grown up since I last used them (back in the word2vec days). These ones are 3.85 GB.

I tried to load them up and to play with the API on my Linux desktop. The loading time was about 10 seconds, but then it was fast. And it worked.

Fast forward a month, during which I worked on other projects, and I get around again to working with the word vectors, this time from my Windows desktop. The rest of the project runs fine, but when it comes to loading the word vectors, it errors out with a menacing stack trace:

thread 'main' panicked at 'capacity overflow', src\liballoc\raw_vec.rs:750:5
...

I look at the library's repo; there was a new release in the meantime. They warn about a breaking change; maybe the old code can't read newer vectors? I update the library; I change the code; I still get the same error.

Maybe it's my Rust version that's old and something broke. I update Rust. Nope, not it.

I try to DuckDuckGo the error, but I don't find anything relevant. So I open an issue on the GitHub repo of the library and I ask about this. I get an answer about it in 5 minutes (thank you for the prompt answer Daniel!): am I using the 32-bit or the 64-bit toolchain?

I facepalm hard, because I realize that's probably the problem: the word vectors are right around the limit of what a 32-bit process can address (at most 4 GB, and typically less on Windows), and there might be some extra allocations done while loading, so it goes overboard.

I check with rustup what toolchain I have:

> rustup toolchain list
stable-i686-pc-windows-msvc (default)

That's the 32 bit toolchain, my friends. So I install the x86_64 toolchain and set it as default:

> rustup toolchain install stable-x86_64-pc-windows-msvc
> rustup default stable-x86_64-pc-windows-msvc
> rustup toolchain list
stable-i686-pc-windows-msvc
stable-x86_64-pc-windows-msvc (default)

And lo and behold, the word vectors are now successfully loaded and I can start playing around more seriously with them.

Why is the 32-bit toolchain the default one on Windows in 2020?

I’m publishing this as part of 100 Days To Offload - Day 12.

]]>
<![CDATA[ Context Variables in Python ]]> https://rolisz.ro/2020/05/15/context-variables-in-python/ 5eba763817253e7fe6dd56d8 Fri, 15 May 2020 22:59:02 +0300 Recently I had to do some parallel processing of nested items in Python. I would read some objects from a 3rd party API, then for each object I would have to get its child elements and do some processing on them as well. A very simplified sketch of this code is below:

import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep

# records_to_process is a list of records, each shaped like this example:
example_record = {'id': 0, 'children':
        [{'id': 'child01', 'prop': 100}]
}

executor = ThreadPoolExecutor(20)

def fetch_main_item(i):
    obj = records_to_process[i]
    return process_main_item(obj)

def process_main_item(obj):
    results = executor.map(process_child_item, obj['children'])
    return sum(results)

def process_child_item(child):
    sleep(random.random()*2)
    return child['prop']

results = executor.map(fetch_main_item, range(4))

for r in results:
    print(r)

The code ran just fine, but we wanted to have some visibility into how the processing was going, so we needed to add some logging. Just sprinkling some log statements here and there is easy, but we wanted all the logs to contain the index of the main record, even when processing the child records, which otherwise don't have a pointer to the parent record.

The easy and straightforward way would be to add the index to all our functions and always pass it along. But that would mean changing the signature of all our functions, of which there were many more than shown here, because there could be several different kinds of child objects, each being processed in a different way.

A much more elegant way would be to use contextvars, which were added in Python 3.7. These context variables act like a global variable, but per thread. If you set a certain value in one thread, every time you read it again in the same thread, you'll get back that value, but if you read it from another thread, it will be different.

A minimal usage example:

import contextvars
from concurrent.futures.thread import ThreadPoolExecutor
from time import sleep

ctx = contextvars.ContextVar('ctx', default=10)
pool = ThreadPoolExecutor(2)

def show_context():
    sleep(1)
    print("Background thread:", ctx.get())

pool.submit(show_context)
ctx.set(15)
print("Main thread", ctx.get())

The output is:

Main thread 15
Background thread: 10

Even though the background thread prints the value after it has been set to 15 in the main thread, the value of the ContextVar is still the default value in that thread.

This means that if we add the index to a context variable in the first function, it will be available in all other functions that run in the same thread.

import contextvars

context = contextvars.ContextVar('log_data', default=None)

def fetch_main_item(i):
    print(f"Fetching main item {i}")
    obj = records_to_process[i]
    context.set(i)
    result = process_main_item(obj)

    return result

def process_main_item(obj):
    ctx = context.get()
    results = executor.map(process_child_item, obj['children'])
    s = sum(results)
    print(f"Processing main item with {obj['id']} children at position {ctx}")
    return s
    
def process_child_item(child):
    sleep(random.random()*2)
    ctx = context.get()
    print(f"Processing child item {child['id']} of main item at position {ctx}")
    return child['prop']

What we changed was that in the fetch_main_item we set the context variable to the index of the record we process, and in the other two functions we get the context.

And it works as we expect in the process_main_item function, but not in the process_child_item function. In this simplified example, the id of each main record is the same as its index, and the first digit of the id of a child record is its parent's id.

Fetching main item 0
Fetching main item 1
Fetching main item 2
Fetching main item 3
Processing child item child11 None
Processing child item child01 None
Processing child item child02 None
Processing child item child31 None
Processing child item child32 None
Processing main item with id 3 with 3
Processing child item child21 None
Processing child item child22 3
Processing child item child03 3
Processing main item with id 0 with 0
Processing child item child12 3
Processing main item with id 1 with 1
Processing child item child23 None
Processing main item with id 2 with 2

What is going on in the child processing function? Why is the context sometimes None and sometimes 3?

Well, it's because we didn't set the context on the new thread. When we spawn a bunch of new tasks in the thread pool to process the child records, sometimes they get scheduled on threads that have never been used before. In that case, the context variable hasn't been set, so it's None. In other cases, after one of the main records is finished processing, some of the child tasks are scheduled on the thread on which the main record with id 3 was scheduled, so the context variable has kept that value.

The fix for this is simple. We have to propagate the context to the child tasks:

def process_main_item(obj):
    ctx = context.get()
    results = executor.map(wrap_with_context(process_child_item, ctx), obj['children'])
    s = sum(results)
    print(f"Processing main item with id {obj['id']} with {ctx}")
    return s

def wrap_with_context(func, ctx):
    def wrapper(*args):
        token = context.set(ctx)
        result = func(*args)
        context.reset(token)
        return result
    return wrapper

When calling map, we have to wrap our function in another one which sets the context to the one we pass in manually, calls our function, resets the context and then returns the result of the function. This ensures that the functions called in a background thread have the same context:

Fetching main item 0
Fetching main item 1
Fetching main item 2
Fetching main item 3
Processing child item child11 1
Processing child item child12 1
Processing main item with id 1 with 1
Processing child item child02 0
Processing child item child01 0
Processing child item child03 0
Processing main item with id 0 with 0
Processing child item child32 3
Processing child item child31 3
Processing main item with id 3 with 3
Processing child item child22 2
Processing child item child23 2
Processing child item child21 2
Processing main item with id 2 with 2

And indeed, all the indexes are now matched up correctly.

Context variables are a very nice mechanism to pass along some information, but in a sense they are global variables, so all the caveats that apply to global variables apply here too. It's easy to abuse them and to make it hard to track how the values in the context variable change. But, in some cases, they solve a real problem. For example, distributed tracing libraries, such as Jaeger, use them to be able to track how requests flow inside the program and to be able to build the call graph correctly.

Kudos to my colleague Gheorghe with whom I worked on this.

I’m publishing this as part of 100 Days To Offload - Day 10.

]]>
<![CDATA[ Adding search to a static Ghost blog ]]> https://rolisz.ro/2020/05/13/adding-search-to-static-ghost/ 5eba4dec17253e7fe6dd56d4 Wed, 13 May 2020 22:00:27 +0300 When I switched over to Ghost, my new theme didn't have a search function. Due to a combination of popular demand (read: one friend who asked about it) and me running across a plugin to enable searches for Ghost, I decided to do it today.

It was actually surprisingly simple, taking less than 2 hours, again, with most of the time being spent on CSS.

I used the SearchinGhost plugin, which does the indexing and searching entirely in the browser. It pretty much works out of the box, I just had to add an API key and I changed some of the templates a bit.

Normally, the plugin connects to the Ghost API to retrieve all the posts, but it does so via a GET request, so if I save a file in the right folder hierarchy, containing the JSON response, I can get the plugin to work on my statically served site.
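In practice that means grabbing the JSON once and saving it under the same path the plugin requests, roughly like this (the exact query string and output location are placeholders; check the request in the browser's network tab and mirror it):

> curl "https://rolisz.ro/ghost/api/v3/content/posts/?key=<CONTENT_API_KEY>&limit=all" --create-dirs -o ghost/api/v3/content/posts/index.json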

The posts are loaded when clicking in the search bar. It takes a bit, because I have written 1.9 MB of content by now. But after it's sent over the network, searching is blazing fast.

Happy searching!

I’m publishing this as part of 100 Days To Offload - Day 8.

]]>
<![CDATA[ Productivity Tips: Pomodoros ]]> https://rolisz.ro/2020/05/12/productivity-tips-pomodoros/ 5ebaf1e317253e7fe6dd56f2 Tue, 12 May 2020 22:20:29 +0300 I sometimes have periods when I find it very hard to concentrate. When I was working in an office, this was made worse if there were many conversations going on around me. With the popularity of open offices, even if two people are talking 20 meters away, you can still hear them pretty well. But sometimes it happens even now that I am working from home, where I have my own office, with minimal distractions.

Sometimes the lack of focus happens because the task at hand is particularly boring or frustrating, with long wait times, such as configuring a build pipeline in Azure Pipelines, where you have to make a separate commit to test anything out and the documentation is not always clear. Sometimes the task I'm working on is unclear and I'm not sure how to proceed or I don't know how to solve the problem. At other times, my head is just so full of ideas that I constantly jump from one thing to another and I can't get anything done. And of course, sometimes I'm just too tempted by Hacker News/Reddit/Facebook/Twitter/RSS feeds.

One thing that I have found to help in cases like these is to use a timer for 20 minutes of work and then a 5-minute break. There's plenty of apps for all platforms to help you schedule these timers. The rule is simple: during the work time, I either do something directly related to my task (usually coding) or I stare blankly out the window. When the timer rings, I can browse whatever I want, chat, watch YouTube videos. But during work time, I either do something useful, or I do nothing.

This part of doing nothing "guilt free" is important. If I do spend a full 20 minutes doing nothing, it means something is wrong with the task I'm working on and maybe I need to redefine it or get some clarifications. Or maybe it's something that I really don't think should be done. But most of the time I don't spend all of it staring into space: I get something done, and this makes it easier to get into a more productive mood and keep working. Sometimes after I do 2-3 rounds of these Pomodoros[1], I even stop the timer, because I don't need it anymore.

What are your favorite productivity tips?

I’m publishing this as part of 100 Days To Offload - Day 7.


  1. The name comes from Italian, where it means tomato, because the guy who invented it used a kitchen timer shaped like a tomato. ↩︎

]]>
<![CDATA[ On text editors ]]> https://rolisz.ro/2020/05/08/on-text-editors/ 5eb50bbf81fef554fabb42bc Fri, 08 May 2020 22:33:35 +0300 Or why I switched away from Vim.

Getting into Vim

At my first internship, my mentor scolded me for using RubyMine to edit a Ruby on Rails project, claiming it was too enterprisey and that Real Programmers™️️ use Vim and live in the terminal, but the lesson stopped there. I slowly tried to use Linux more and more, but as can be seen on this blog, my first several attempts didn't go well. I did pick up a few things though, including some Vim basics (at least I knew how to exit it 🙃).

Then at my first job, my team lead, Sever, knew Vim really well and pretty much lived only in the terminal. Because I pair programmed a lot with him, I learned a lot. He took the time to explain to me some useful settings and plugins. He even showed me Gundo (to visualize the undo trees), but I have to admit I never used it since.

Under Sever's tutoring, I got hooked on Vim. I started customizing it, adding plugins to it. I memorized a lot of commands and I became pretty proficient. I read blog posts about it and looked up to Tim Pope, who had written lots of great plugins for Vim.  I even started evangelizing it to other colleagues, saying how awesome it is, how great the action composition model is and so on. I had YouCompleteMe for autocompletion, vim.vinegar for file browsing, and about 20 other plugins. My .vimrc file eventually grew to over 250 lines.

Then I went to work at Google. At first I was working in Python and later in Go, and I kept using Vim. Because there were some Google-specific Vim plugins, I had to split up my configuration, to have some parts which are common and some parts which are home/work specific only. This meant that keeping things in sync was more difficult.


Vim falls into disuse

But then I switched to another team and I started coding only in Java. I discovered that Java can be pretty nice, if the code is not overengineered like I learned in college. But still, Java has a lot of boilerplate code, so using an IDE is a must. At least back then (4 years ago), there were many features in IntelliJ that you couldn't replicate in Vim, such as accurate and always up to date "click to go to definition" or semantic renaming of variables.

During this time my .vimrc didn't get much love. I would still use Vim whenever I would code something at home, but that was not too often. And every now and then, something would break. Usually YCM, because the gcc/clang would get updated and YCM would fall behind. Some things I could fix, some I couldn't figure out in less than an hour, after which I would give up.

After I had left Google, I started working in Python again, but this time I went pretty quickly for PyCharm, which offers almost exactly the same features as IntelliJ, but for Python. Very similar UI, the same keyboard shortcuts, the same workflow.

But I often had to SSH into other machines, some of which were random production machines, and I would have to edit some files there. Of course, I would reach for Vim. But I wouldn't have my configuration with me. So I would try to use some shortcuts and they wouldn't work. The one that still bites me to this very day is using - to open the file browser if in a file, or to go to the parent folder if in the file browser. I think this command is enabled by vim.vinegar. So I would become frustrated that I couldn't edit using my regular workflow.


Vim today

Today the Vim ecosystem has improved a lot. There's Neovim which adds lots of fancy new features. There are new UI wrappers for Vim. There's a much more powerful plugin system which enables things you couldn't do before. If I really wanted to, I could probably replicate everything I need from PyCharm in Vim. But is that the best use of my time?

I don't think so. First of all, my experience lately has shown me that I often have to work with default Vim instances, so I would have to remember two sets of commands, one for my customized setup and one for default instances. Theoretically I could set up a script that would copy my .rc files to the remote computer every time I SSH, but I don't want to leave a mess behind me on every random cloud VM that I log in to.

Second of all, I don't think spending that much time on customizing Vim and memorizing keyboard shortcuts is going to make me that much more productive. I find that as a team lead, I don't spend as much time writing code. Luckily I don't spend too much time in meetings, but almost every day I do have at least one 1:1 call with someone from my team to discuss how to proceed on some feature. I also spend a lot of time designing microservices and thinking about scaling and consistency issues. And when coding, I don't find that my ability to churn out code is the bottleneck; it's rather my ability to think about the interactions of the various microservices involved and how the data flow can go wrong in one way or another. So I think investing in better tools for thought has a much better ROI for me.

Do I still like Vim? Of course. Do I still use it when I have to quickly open up some files in the terminal? Absolutely. Will I invest hours into learning more about it and customizing it more? Nope.

I’m publishing this as part of 100 Days To Offload - Day 4.

]]>
<![CDATA[ Splitting up my blog ]]> https://rolisz.ro/2020/05/07/splitting-up-my-blog/ 5eb4605481fef554fabb423e Thu, 07 May 2020 23:08:21 +0300

Considering that I plan to write one post every day for the next 97 days, this will mean a lot of spam for my readers. Most of my personal friends will be bored by my technical posts, and most of the people who reached my blog from Google/Hacker News/Reddit and subscribed will probably be uninterested in my hikes (well, back when I used to hike), lockdown musings and board game reviews. I guess this is tolerable if I write only one or two posts a month, but with 30 a month, it can get pretty annoying.

Because of this I am "splitting" up the blog into two sections: one with technical posts and one with personal (or rather non-technical) posts. There will be a separate RSS feed for both categories[1]. And my MailChimp subscribers will have the option to select which posts they want to receive by email. Existing subscribers will have to click the "Update preferences" link at the bottom of any email to update... their preferences. If they don't do anything, they will still receive emails with all posts. New subscribers can choose when subscribing.

On the technical realization: going forward all technical posts will have a tech tag on them[2]. I created two new routes in YAML, one pointing to tech/rss and one to personal/rss. I created two almost identical RSS templates, the only difference being that in one I retrieved only posts which have the tech tag and in the other one only posts that don't have it.
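Roughly, the routes part of the YAML looks like this (a sketch with assumed template names; the rest of routes.yaml stays as it was, and the actual tag filtering happens inside the two templates):

routes:
  /tech/rss/:
    template: rss-tech
    content_type: text/xml
  /personal/rss/:
    template: rss-personal
    content_type: text/xml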

Initially I thought about having two marker tags, tech and personal, but that would mean that if I forget to put one of them on a post, that post would not show up anywhere. This way, by default a post shows up in the personal RSS feed, so it's still shown to people, even if maybe to the wrong audience.

In MailChimp I created a new interest group, which contains two segments, each subscribed to one of the two RSS feeds. Most of the time on this project was spent fiddling with the subscription forms to align the checkboxes for selecting which emails to receive.

That's it folks. If anyone notices any bugs in this, either in the RSS feeds or in the emails, please leave a comment.

I’m publishing this as part of 100 Days To Offload - Day 3.


  1. Well, there's also a third RSS feed with all the posts, because I couldn't figure out in 10 minutes how to remove the default Ghost RSS feed. Oh well, if you use RSS, you can figure out which one you want to follow. ↩︎

  2. I wanted to use #tech, but for some reason Ghost didn't allow filtering based on private tags. ↩︎

]]>
<![CDATA[ Connecting to Azure Python package repositories ]]> https://rolisz.ro/2020/05/06/connecting-to-azure-pypi-repositories/ 5eb27c2781fef554fabb414d Wed, 06 May 2020 21:52:19 +0300 Lately I've been using the Azure DevOps platform for code repositories, issue tracking, wikis, CI/CD and for artifact hosting. It's quite a nice platform, the various tools are very well integrated with each other and they have nice integration with VS Code, for example.

But I have one problem that keeps recurring: connecting to the private Python package repository, to which we push some of our internal libraries.

I want to have access to those packages locally, on my development machine, while I'm working on other projects that depend on those packages. The recommended way of authenticating is to use the artifacts-keyring package, which should initiate a web browser based authentication flow. For some reason, it doesn't work for me. It prompts for a username and a password in the CLI and then it says authentication failed. To be fair, it's still in preview and has been for quite some time now, so some bugs are to be expected.  

However, there is an alternate, older authentication mechanism, which has its own downsides (more on those at the end). For this you have to create a Personal Access Token, by clicking on the circled icon on any DevOps page and then going to the Personal Access Token page.

On this page, click the New Token button. Fill out the fields and make sure you check one of the options under Packaging. If you only need read access (so you just want to download packages from the repository), then check only the Read box; otherwise, if you want to upload packages with Twine from your local machine, check one of the other boxes.

Be careful, the PAT is shown only once and then you can't see it again in the UI, so make sure to copy it right away.

Now that you have the PAT, go to the Artifacts page in Azure DevOps and go to the Python repository you want to connect to. Click on the "Connect to feed" button and then on Pip and you'll find the URL of the custom repository.

Copy that thing and copy it into a pip.ini (on Windows) or pip.conf (on Linux) at the root of your virtual environment folder. These files don't exist unless you've customized something about your virtual environment, so most likely you will have to create them.

Now we have to modify that URL as follows:

extra-index-url=https://<feed_name>:<PAT>@pkgs.dev.azure.com/<organization>/_packaging/<feed>/pypi/simple/
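Note that in pip.ini/pip.conf this line has to live under a section header, otherwise pip won't parse the file; a minimal file looks something like this:

[global]
extra-index-url=https://<feed_name>:<PAT>@pkgs.dev.azure.com/<organization>/_packaging/<feed>/pypi/simple/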

Now it's time to test it out by trying to install something using pip. Pip should show that it's also searching the new index, in addition to the public PyPI.

Keep in mind that these PATs have a limited duration. By default they expire after 30 days, but this can be extended up to 90 days from the dropdown and up to 1 year from the date picker. Because I was lazy, I always picked the 90 days. After 90 days, when I would need a new version of one of our libraries, pip installs would start failing. Because three months had passed, I had usually forgotten how I set everything up, so I would have to go and figure it out again. I've gone through this 3 times already, the last time being today, so now I've decided to document it for myself somewhere I can find it easily.

I’m publishing this as part of 100 Days To Offload - Day 2.

]]>
<![CDATA[ Moving away from GMail ]]> https://rolisz.ro/2020/04/11/moving-away-from-gmail/ 5e8a359ee6cfb84549ab35bb Sat, 11 Apr 2020 19:59:33 +0300 tl;dr: I like PurelyMail and I will slowly switch all my online accounts to use it. I've been using it for almost three months and I didn't have any issues (except with some Hotmail/Outlook addresses).

I have had a GMail account since 2006 (that's the date of the first email in my inbox), but I started actually using it more in 2009. I have around 120,000 emails in my mailbox, about 17 GB in total. I even worked as an SRE on the team responsible for GMail's storage. So there's quite a bit of history between me and GMail, but I've started to move on.

It's been a long-time dream of mine to have an email address at my own domain. Technically, I've had that ever since I started hosting my blog on a VPS from DigitalOcean, but I didn't set it up properly, so I never used it. I didn't want to use it only for forwarding emails to GMail; I actually wanted to be able to send, receive and read email from my own domain.

Why leave GMail?

First and foremost, because it's a single point of failure, one that can fail due to many unrelated problems. Because my GMail address is basically my Google account, if for whatever reason I lose my GMail address, I lose all the things in my Google account: Google Drive, Google Photos, Play Store purchases, YouTube uploads and history, Google Calendar data and so on. While some of these can be backed up, such as Drive, Photos and YouTube, some cannot be, for example all the apps and movies I have purchased over the years.

But my GMail address can be lost for many reasons, many of them unrelated to GMail itself. I suppose you can get the address suspended for spamming or similar abuse, but I'm not particularly worried about that. However, ToS violations of any kind, across other Google products, can lead to a ban on your account and, implicitly, on your GMail address. There are many examples: reselling Pixel phones, writing too many emojis in YouTube chat, publishing "repetitive content" in the Play Store. If you search on DuckDuckGo, I'm sure you can find many other examples.

Did the previous examples break the ToS? Some did, some didn't. But the current ToS is 15 pages long and it has changed 3 times in the last 4 years. And that's just the general ToS; the Play Store has a separate one, and so does YouTube. On top of this, they are quite vaguely formulated and many bans don't make clear the exact reason for which they were imposed.

The appeals process used to be completely opaque. Rumor has it that this has changed since they added the Google One service/platform/rebrand, but I haven't used the customer support from there myself, so I can't be sure. But even if now it works better and you can convince Google that your account was wrongfully suspended, it would be a couple of very stressful days.

Technically, I could use GMail with my own domain, but only by signing up for a business GSuite account. That comes with its own problems, such as the fact that not all features are available for business accounts and that you can't move content you've bought from the old, consumer account to the new, business one.

Another reason I'm leaving GMail is that it's kind of slow. A full load with an empty cache takes about 20 seconds on my desktop (good specs, wired Internet connection). Loading a folder with 184 emails takes 2.5 seconds. I miss computers being snappy.

My requirements for an email provider

  1. Custom domain.
  2. Decent amount of storage. I accrued 17 GB in more than 10 years of usage, so even 5 GB would do the trick for 1-2 years.
  3. Server-side full-text search. I want to be able to search for random emails based on some word from the body of the email that I vaguely remember.
  4. Ability to use multiple aliases. I want to have site_name@rolisz.ro, besides the main address I will give out, but still have everything come in to my main inbox.
  5. Ability to use 3rd party apps to read my email. This means I want to read my email with any standards (mostly IMAP) compliant app.

Privacy stuff is not a part of my requirements. It's a nice feature to have, but it's not a blocking criterion. From the state mass surveillance perspective, I'm not going to actively implement many of the opsec measures because they are inconvenient. From the advertisement surveillance perspective, 90% of the emails I receive are automated, so there is a marketer on the other side who is probably selling my preferences anyway. If I want to keep something really private, I don't send it through email.

I've been looking at email providers since last year, but not many of them fit the bill. And those that meet all the requirements cost at least 50$/year. I signed up for the trial version of several, but none of them stuck.

Many providers boast that they are located in country X with strong privacy laws, but that is marketing only. Even Switzerland, which is famous for strong privacy laws, gave additional power to their intelligence agencies so that they could monitor the population more closely. The German intelligence agency, BND, cooperated with NSA: BND would spy on Americans, NSA would spy on Germans and they would exchange intel. So there is not much advantage in hosting your email in a Swiss underground bunker.

Some of the more notable email providers:

  • Tutanota
  • Protonmail
  • Fastmail
  • Mailbox

The first two are very security and privacy oriented. If that's your main concern, go for it. But it comes with some inconveniences: no server side search and you can't use 3rd party apps to read your email, just the apps they provide. They have some workarounds, but they just move the problem onto your machine (they offer a bridge that you install locally and which translates their custom protocol to IMAP and does indexing).

Fastmail was my top choice, from a usability perspective and I read many good reviews of it. It ticks all the boxes, at a cost of 50$ per year. I evaluated it a year ago, but because the construction of my house started, I put the email migration project on hold.

PurelyMail

Good thing I did that, because this year, in January, I read on Jan-Lukas Else's blog about PurelyMail. It immediately stood out to me because it's cheap: around 10$/year, so 5 times cheaper than Fastmail. At such a low price, I signed up for it right away.

PurelyMail is a one-man show and it offers purely email, nothing else. It meets all my requirements and then some. Email data is encrypted on disk, but it also offers full-text search, with the caveat that some information can be extracted from the search index, which seems like a very reasonable trade-off to me. Scott Johnson, the owner, is very friendly and answered very quickly on the support channel when I had some questions.

Besides the low price, PurelyMail has one more awesome thing: you pay for storage, not for users. So if you have a custom domain, you can create as many users as you want and you won't pay extra. All you will pay for is the total storage (and number of emails sent/received). All the other providers I saw charged extra for different addresses. They even had limitations on the number of aliases you could create for one user. But with PurelyMail, I can create an account for my wife, an account to use for sending automated mails from my servers (for monitoring) and accounts for many other things and not pay anything extra. ❤️

Setting up PurelyMail with my own domain was very easy. They have one page which shows all the information needed to set up MX, SPF, DKIM and DMARC records and validation happened almost instantly. All this is needed so that other email providers know that the emails really came from someone somehow related to the domain and it's not just spammers spoofing your domain. I tested this by sending out emails to several other addresses I own and they all got the mail, except for my Hotmail address. There, I first had to send an email from the Hotmail account to my own domain email address and then replies were received by the Hotmail account too.
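Just to give an idea of what these records look like, here's a rough sketch with made-up placeholder values; the real hostnames and keys are the ones shown on PurelyMail's setup page:

@                      MX    10 mail.<provider>.com.
@                      TXT   "v=spf1 include:spf.<provider>.com ~all"
<selector>._domainkey  TXT   "v=DKIM1; k=rsa; p=<public key from the provider>"
_dmarc                 TXT   "v=DMARC1; p=none"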

PurelyMail default web interface

The web interface is based on RoundCube with a custom skin. UX wise it's decent and load times are faster than GMail :) One thing that I miss is the fact that in GMail folders are actually labels and you can have multiple labels for an email. However, I believe this is more of a limitation of the IMAP protocol, so it's not purely PurelyMail's fault.

The situation on Android is a bit worse. The recommended FOSS email client is FairEmail, whose UX sucks. Sometimes I still can't figure out how to check the details for an email and finding the forward button takes too long. So I cheat and have both FairEmail and the GMail app connected to my PurelyMail account. If I can't do something in FairEmail, I open the account in GMail.

Reliability wise, during the last three months I noticed only one outage (which happened while I was writing this post :) ), during which load times increased 10x. Otherwise the service has been stable.

Incremental switching

In the 11 years that GMail has been my main email account I have signed up for many online services with it. Logging in to all of them and changing the email address would take a long time. So I am taking an incremental approach: when I get a new email on my old account, I try to switch it to the new account.

The advantage of this method is that I don't have to spend a lot of time at once updating accounts and I might actually never waste time trying to change websites that don't send me emails.

During this process, I also get to review many of my accounts: I've decided to delete some of them and I've unsubscribed from many newsletters that had become irrelevant. But I also discovered some oddities about how some sites work: for example, Withings has one email address for the user account and another one for the user profile. I updated the first one, but then I still received weekly updates at my old email address. Similarly, The Atlantic confirmed my change of email address for my account, but I still receive their newsletter at the old address, because that's a separate thing.

Conclusion

I will still be checking out my GMail account for a while, but I'm glad that I've successfully convinced my main personal contact to send me emails to my domain: dad. The rest will follow soon.

]]>