<![CDATA[ rolisz consulting - Technical Posts ]]> https://rolisz.ro https://rolisz.ro/favicon.png rolisz consulting https://rolisz.ro Wed, 20 Jan 2021 18:29:12 +0300 60 <![CDATA[ How to ML - Deploying ]]> https://rolisz.ro/2021/01/20/how-to-ml-deploying/ 60084bc7165bd14e3b33595d Wed, 20 Jan 2021 18:28:54 +0300 So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the model runs reliably in production. We have to answer a few more questions in order to make the right trade-offs.

How important is latency? Is the model making an inference in response to a user action, so it's crucial to have the answer in tens of milliseconds? Then it's time to optimize the model: quantize the weights, distill the knowledge into a smaller model, prune weights and so on. Hopefully, your metrics won't go down due to the optimization.
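The core idea behind weight quantization can be sketched in a few lines of plain Python. This is a toy illustration only, not how TensorFlow Lite or PyTorch actually implement it; real toolkits quantize per layer, often with calibration data:

```python
# Toy post-training quantization: map float weights to int8 codes plus a
# scale factor, then map them back and measure the rounding error introduced.

def quantize(weights):
    """Map floats to int8 codes in [-127, 127] plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [c * scale for c in codes]

weights = [0.81, -0.32, 0.05, -1.27, 0.44]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
# The error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The model shrinks 4x (int8 instead of float32) at the cost of a bounded rounding error per weight, which is why the metrics can dip slightly after optimization.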

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night, does the inference for every user and stores the results in a database. Then when the user makes a request, the recommendations are simply loaded from the database. This is possible only if you have a finite range of predictions to make.
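The precompute pattern can be sketched like this, with a made-up `model_predict` function and an in-memory dict standing in for the real model and database:

```python
# Sketch of the precompute pattern: a nightly batch job scores every user
# once, so the request path is a plain key-value lookup, not an inference.

def model_predict(user_id):
    # Stand-in for an expensive model inference.
    return [f"movie_{(user_id * 7 + k) % 100}" for k in range(3)]

recommendation_db = {}

def nightly_batch_job(user_ids):
    # Runs offline, once per night, for every known user.
    for uid in user_ids:
        recommendation_db[uid] = model_predict(uid)

def handle_request(user_id):
    # Fast path: no model in the loop, just a database lookup.
    return recommendation_db.get(user_id, [])

nightly_batch_job(range(1000))
```

The latency trade-off is explicit: inference cost is paid once per user per night, and unknown users fall back to an empty (or default) recommendation list.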

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don't even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. "Works on my machine" is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library, and the security team won't let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team who can help you set up the CI/CD pipelines.

But you've killed all the dragons; now it's time to keep watch... aka monitoring the model's performance in production.

<![CDATA[ How to ML - Models ]]> https://rolisz.ro/2021/01/18/how-to-ml-models/ 6005e7293e8fc062a027dbe3 Mon, 18 Jan 2021 22:55:44 +0300 So we finally got our data and we can get to machine learning. Without the data, there is no machine learning; there is at best human learning, where somebody tries to write an algorithm by hand to do the task.

This is the part that most people who want to do machine learning are excited about. I read Bishop's and Murphy's textbooks, watched Andrew Ng's online course about ML, learned about the different kinds of ML algorithms, and I couldn't wait to try them out and see which one was best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, Adam, convolutions, attention mechanisms, and finally you get to 99% accuracy.
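The baseline step at the start of that progression is worth a sketch: without it, you can't tell whether a fancier model earns its complexity. In practice you'd use something like sklearn's `DummyClassifier` on a held-out split, but a majority-class baseline is simple enough to write by hand (labels and inputs below are made up):

```python
# A hand-rolled majority-class baseline: predict the most common training
# label for every input, and measure accuracy against it.
from collections import Counter

def majority_class_baseline(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority  # predicts the same class for every input

def accuracy(predict, inputs, labels):
    return sum(predict(x) == y for x, y in zip(inputs, labels)) / len(labels)

train_labels = ["spam", "ham", "ham", "ham", "spam"]
baseline = majority_class_baseline(train_labels)

test_inputs = ["a", "b", "c", "d"]
test_labels = ["ham", "ham", "spam", "ham"]
score = accuracy(baseline, test_inputs, test_labels)
```

Any model you try afterwards has to beat `score` on the same test split, or the extra complexity isn't buying you anything.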

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For most production projects, if it's not in one of the sklearn, Tensorflow or Pytorch libraries, it won't fly. For proof of concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain, trying to find all the dependencies of undocumented code and to make it work.

For the hyperparameter tuning, there are libraries to help you with that, and anyway, the time it takes to finish the training runs is much larger than the time you spend coding it up, for any real-life dataset.

And in practice, you run into many issues with the data. You'll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You'll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.
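These data issues are usually caught with quick audit scripts before training. A stdlib-only sketch of two such checks; a real pipeline would do this with pandas, and the column and label names here are made up:

```python
# Two quick data audits: columns with too many missing values, and labels
# that conflict for identical feature values.

def missing_value_report(rows, threshold=0.5):
    """Return columns whose fraction of missing (None) values exceeds threshold."""
    columns = {k for row in rows for k in row}
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        report[col] = missing / len(rows)
    return {c: frac for c, frac in report.items() if frac > threshold}

def conflicting_labels(rows, features, label):
    """Return feature tuples that appear with more than one label."""
    seen = {}
    conflicts = []
    for row in rows:
        key = tuple(row.get(f) for f in features)
        if key in seen and seen[key] != row[label]:
            conflicts.append(key)
        seen.setdefault(key, row[label])
    return conflicts

rows = [
    {"age": 30, "income": None, "label": "yes"},
    {"age": 30, "income": None, "label": "no"},   # conflicts with the first row
    {"age": 45, "income": 50000, "label": "no"},
]
bad_columns = missing_value_report(rows)
conflicts = conflicting_labels(rows, ["age"], "label")
```

Running checks like these on every pipeline run means the bugs show up before a model is trained on bad data, not after.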

If you do get a model that is good enough, it's time to deploy it, which comes with its own fun...

<![CDATA[ How to ML - Data ]]> https://rolisz.ro/2020/12/29/how-to-ml-data/ 5feb735d2bc2360ef49da332 Tue, 29 Dec 2020 21:22:09 +0300 So we've decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment and there are service logs saying what was replaced for each piece of equipment. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifier that predicts whether a piece of equipment will fail based on the last 10 values measured from its sensors. In practice, we'll find that sensors of the same type that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so they will have to be standardized as well. Or worse, the sensor data is good, but it's kept for only 1 month to save on storage costs, so we have to fix that and wait a couple of months for more training data to accumulate.
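The normalization step for sensors with different ranges might look like this. A minimal min-max sketch with made-up manufacturer names and readings; a real pipeline would do the same per sensor type in Spark over the full dataset:

```python
# Per-manufacturer min-max normalization: sensors of the same type but from
# different makers report in different ranges, so each maker's readings are
# rescaled to [0, 1] before being fed to the classifier.

def normalize_per_group(readings):
    result = {}
    for maker, values in readings.items():
        lo, hi = min(values), max(values)
        result[maker] = [(v - lo) / (hi - lo) for v in values]
    return result

readings = {
    "maker_a": [0.0, 5.0, 10.0],       # reports in a 0-10 range
    "maker_b": [400.0, 600.0, 800.0],  # same sensor type, 400-800 range
}
normalized = normalize_per_group(readings)
```

After normalization, both manufacturers' readings live on the same scale, so the model doesn't learn the manufacturer instead of the failure pattern.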

The next best case is that we don't have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought, for example, if we want to start a new ad network, there are plenty of datasets available online of personal data about everyone, which can be used then to predict the likelihood of clicking on an ad. That's the business model of many startups...

The worst case is that we don't have the data and we can't find it out there. Maybe it's because we have a very specific niche, such as wanting to find defects in the manufacturing process of our specific widgets, so we can't use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming off the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell if a widget is good or defective. There needs to be a QA process in this, because even humans have an error rate, so each image will have to be labeled by at least three humans. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as AWS Mechanical Turk to distribute the tasks to many workers across the world.
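Turning the three labels per image into one final label then comes down to a majority vote. A sketch, with made-up widget IDs and labels:

```python
# Majority-vote label aggregation: each image gets the label that at least
# `min_votes` annotators agree on, or None when there's no agreement.
from collections import Counter

def aggregate_labels(annotations, min_votes=2):
    results = {}
    for image_id, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        results[image_id] = label if count >= min_votes else None
    return results

annotations = {
    "widget_001": ["good", "good", "defective"],
    "widget_002": ["defective", "defective", "defective"],
}
final_labels = aggregate_labels(annotations)
```

Images where the annotators can't agree come back as `None` and get routed to a more experienced reviewer instead of silently entering the training set.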

Once all this is done, we finally have data. Time to start doing the actual ML...

<![CDATA[ How to ML - Metrics ]]> https://rolisz.ro/2020/12/28/how-to-ml/ 5fea21292bc2360ef49da324 Mon, 28 Dec 2020 21:19:07 +0300 We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we're doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That's a regression problem. Are we trying to predict who will win the election? That's a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That's a multi-class classification problem.

Another question that has to be answered is what kind of mistakes are worse. Machine learning is not all-knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better to cry wolf too often and have false positives, rather than miss a tumor. Or maybe it's the opposite: we are trying to implement a facial recognition system. If the system incorrectly identifies someone as a burglar, the wrong person could get sent to jail, which is a very bad consequence for a mistake made by "THE algorithm".
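This trade-off usually comes down to where you set the decision threshold on the model's score. A toy sketch with made-up tumor probabilities:

```python
# Moving the decision threshold trades false positives against false
# negatives: a cautious (low) threshold flags more healthy scans, a strict
# (high) threshold lets a tumor slip through.

def classify(scores, threshold):
    return [s >= threshold for s in scores]

def confusion(preds, truth):
    fp = sum(p and not t for p, t in zip(preds, truth))  # false alarms
    fn = sum(t and not p for p, t in zip(preds, truth))  # missed cases
    return fp, fn

scores = [0.2, 0.4, 0.6, 0.9]       # model's tumor probabilities
truth = [False, True, False, True]  # actual tumors

# Cautious threshold: one false alarm, no missed tumors.
fp_low, fn_low = confusion(classify(scores, 0.3), truth)
# Strict threshold: no false alarms, but one tumor slips through.
fp_high, fn_high = confusion(classify(scores, 0.7), truth)
```

The metric you pick (precision, recall, or a weighted combination) encodes which of the two mistakes the business considers worse.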

These are not just theoretical concerns, but they actually matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn't decide by itself what to do, it merely makes a suggestion which a human will then confirm. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn't have to look at 20 screens at once, but can only look at the footage that was flagged as anomalous.

Tomorrow we'll look at the next step: gathering the data.

<![CDATA[ What is ML? part 3 ]]> https://rolisz.ro/2020/12/24/what-is-ml-3/ 5fe4b0ab2bc2360ef49da308 Thu, 24 Dec 2020 18:18:51 +0300 Yesterday we saw that machine learning is behind some successful products and it does have the potential to bring many more changes to our life.

So what is it?

Well, the textbook definition is that it's the building of algorithms that can perform tasks they were not explicitly programmed to do. In practice, this means that we have algorithms that analyze large quantities of data to learn some patterns in the data, which can then be used to make predictions about new data points.

This is in contrast with the classical way of programming computers, where a programmer would either use their domain knowledge or analyze the data themselves, and then write a program that produces the correct output.

So one of the crucial distinctions is that in machine learning, the machine has to learn from the data. If a human being figures out the pattern and writes a regular expression to find addresses in text, that's human learning, and we all go to school to do that.

Now does that mean that machine learning is a solution for everything? No. In some cases, it's easier or cheaper to have a data analyst or a programmer find the pattern and code it up.

But there are plenty of cases where, despite decades-long efforts by big teams of researchers, humans haven't been able to find an explicit pattern. The simplest example of this would be recognizing dogs in pictures. 99.99% of humans over the age of 5 have no problem recognizing a dog, whether a puppy, a golden retriever or a Saint Bernard, but they have zero insight into how they do it, what makes a bunch of pixels on the screen a dog and not a cat. And this is where machine learning shines: you give it a lot of photos (several thousand at least), pair each photo with a label of what it contains, and the neural network will learn by itself what makes a dog a dog and not a cat.

Machine learning is just one of the many tools at our disposal. It's a very powerful tool and it's one that gets "sharpened" all the time, with lots of research being done all around the world to find better algorithms, to speed up their training and to make them more accurate.

Come back tomorrow to find out how the sausage is made, on a high level.

<![CDATA[ What is ML? part 2 ]]> https://rolisz.ro/2020/12/23/what-is-ml-part-2/ 5fe301b952484d7aadd5c620 Wed, 23 Dec 2020 11:42:31 +0300 Yesterday I wrote about how AI made big promises in the past but failed to deliver, and that now it's different.

What's changed?

Well, now we have several products that work well using machine learning. My favorite examples are Google Photos, Synology Moments and PhotoPrism. They are all photo management applications which use machine learning to automatically recognize all the faces in pictures (easy, we've had this for 15 years), to automatically recognize which pictures are of the same person (hard, but doable by hand if you had too much time) and, more than that, to index photos by all kinds of objects that are found in them, so that you can search by what items appear in your photos (really hard, nobody had time to do that manually).

I have more than 10 years of photos uploaded to my Synology, and one of my favorite party tricks when talking to someone is to whip out my phone and show them all the photos I have of them: since they were kids, or from the last time we met, or of that funny thing that happened to them that I have photographic evidence of. Everyone is amazed by that (and some are horrified and deny that they looked like that when they were children). And there is not one, but at least three options to do this, one of which is open source, so that anyone can run it at home on their computer, for free, so there is demand for such a product.

Other successful examples are in the domain of recommender systems, YouTube being a good one. I have a love/hate relationship with it: on one hand, I have wasted so many hours of my life on the recommendations it makes (which is proof of how good it is at making personalized suggestions); on the other hand, I have found plenty of cool videos with it. This deep learning based recommender system is one of the factors behind the growth of watch time on YouTube, which is basically the key metric behind revenue (more watch time, more ads).

These are just two examples that are available for everyone to use, and which serve as evidence that machine learning based AI now is not just hot air.

But I still haven't answered the question what is ML... tomorrow, I promise.

<![CDATA[ What is ML? ]]> https://rolisz.ro/2020/12/21/what-is-ml/ 5fe0d5bd1012ed2b469ca47f Mon, 21 Dec 2020 20:17:34 +0300 Machine learning is everywhere these days. Mostly in the newspapers, but it's also seeping into many actual, real-life use cases. But what is it, actually?

If you read only articles on TechCrunch, Forbes, Business Insider or even MIT Technology Review, you'd think it's something that will bring the T-800 to life soon, or that it will cure cancer and make radiologists useless, or that it will enable humans to upload their minds to the cloud and live forever, or that it will bring fully self-driving cars by the end of the year (every year, for the last 5 years).

Many companies want to get in on the ML bandwagon. It's understandable: 1) that's where the money is (some 10 billion dollars were invested in it in 2018) and 2) correctly done, applied to the right problems, ML can actually be really valuable, either by automating things that were previously done with manual labor or even by enabling things that were previously unfeasible.

But at the same time, a lot of ML projects make unrealistic promises, eat a lot of money and then deliver something that doesn't work well enough to have a positive ROI. The ML engineers and researchers are happy: they got paid, analyzed the data and played around with building ML models, and maybe even published a paper or two. But the business is not happy, because it is not better off in any way.

This is not a new phenomenon. Artificial Intelligence, of which Machine Learning is a subdomain, has been plagued by similar bubbles ever since the field was founded. AI has already gone through several AI winters, in the 60s, 80s and late 90s. Big promises, few results.

To paraphrase Battlestar Galactica, "All this has happened before, all this will happen again but this time it's different". But why is it different? More about that tomorrow.

<![CDATA[ Machine Learning stories: Misunderstood suggestions ]]> https://rolisz.ro/2020/11/30/machine-learning-stories/ 5fc5247f53c65419dc54f518 Mon, 30 Nov 2020 20:00:59 +0300 A couple of years ago I was working on a calendar application, on the machine learning team, to make it smarter. We had many great ideas, one of them being that once you indicated you wanted to meet with a group of people, the app would automatically suggest a time slot for the meeting.

We worked on it for several months. We couldn't just use simple hand-coded rules, because we wanted to do things like learn every user's working hours, which could vary based on many things. In the end, we implemented this feature using a combination of hand-coded rules (to avoid some bad edge cases) and machine learning. We did lots of testing, both automated and manual, within our team.

Once the UI was ready, we did some user testing, where the new prototype was put in front of real users, unrelated to our team, who were recorded while they tried to use it and then were asked questions about the product. When the reports came in, the whole team banged their heads against the desk: most users thought we were suggesting times when the meeting couldn't take place!

What happened? If you invited many people, or even just one very busy person, there would often be no free slot that worked for everyone. So our algorithm would make three suggestions, noting for each one that there was a different person who might not be able to make the meeting.

In our own testing, it was obvious to us what was happening, so we didn't consider it a big problem. But users who didn't know the system found it confusing and kept going back to the classic grid to manually find a slot for the meeting.

Lesson: machine learning algorithms are never perfect and every project needs to be prepared to deal with mistakes.

How will your machine learning project handle failures? How will you explain to the end users the decisions the algorithm made? If you need help answering these questions, let's talk.

<![CDATA[ Tailscale ]]> https://rolisz.ro/2020/11/15/tailscale/ 5f8aef7aebd40d0556a6e307 Sun, 15 Nov 2020 23:18:28 +0300 I want to tell you about an awesome piece of technology that I started using recently: Tailscale. It's a sort of VPN, meaning it sets up a private network between your devices, with traffic flowing directly between them rather than through a central server.

It does so without you having to open any ports on your router or configure firewalls. It's pretty much powered by JustWorks™️ technology. Each device you register gets an IP address of the form 100.x.y.z. Using this address, you can then connect to that device from anywhere in the world, as long as both devices are connected to the Internet, because Tailscale automatically performs NAT traversal.

My use case for Tailscale is to connect my various devices (desktop, laptop, NAS, Raspberry PI and Android phone) and be able to access them, regardless of where I am.

For my NAS I had set up a DynDNS system, but occasionally it would still glitch and I would lose connectivity. With Tailscale, I now have a connection pretty much always. I still keep the DNS address, because I have everything set up for it, but I know that debugging stuff will be much easier.

Similarly, the Raspberry Pi at my parents place is behind a crappy router which sometimes doesn't do the port forwarding properly. Now I can login via Tailscale and fix the issue.

There is a bit of overhead to using Tailscale. When I set up the initial backup to the RPi, I tried going through it, but the bandwidth was only 3 MB/s, while if I connected directly, it would be 10 MB/s. Because borg encrypts the data, I don't need the additional security provided by Tailscale.

All in all: I can strongly recommend Tailscale. It's a great product if you have any sort of home servers. It's developed by some great guys, including Avery Pennarun, David Crawshaw and Brad Fitzpatrick. I wish more startups did cool stuff like this, which work so well.

<![CDATA[ Backing up 4: The Raspberry Pi ]]> https://rolisz.ro/2020/10/11/raspi-backups/ 5f71897a5ad1bb49f64c71b8 Sun, 11 Oct 2020 23:31:12 +0300 More than a year ago I described how I used Syncthing to backup folders from my NAS to an external harddrive attached to my parents' PC. This was supposed to be my offline backup. Unfortunately, it didn't prove to be a very reliable solution. The PC ran Windows, I had trouble getting SSH to work reliably, and I often had to fix stuff through TeamViewer. Often the PC would not be turned on for days, so I couldn't even do the backups without asking my parents to turn it on. And Syncthing turned out to be finicky and sometimes didn't sync.

Then it finally dawned on me: I have two Raspberry Pi 3s at home that are just collecting dust. How about I put one of them to good use?

So I took one of the Pis, set it up at my parents place and after some fiddling, it works. Here's what I did:

I used the latest Raspbian image. It sits at my parents' home, which has a dynamic IP address. The address usually changes only if the router is restarted, but it can still cause issues. At first I thought I would set up a reverse SSH tunnel from the Raspberry Pi to my NAS, but I couldn't get autossh to work with systemd.

Then I tried another option: I set up a Dynamic DNS entry on a subdomain, with ddclient on the Raspberry Pi updating the IP address regularly. I had to open a port on my parents' router for this. I added public key authentication through SSH, while restricting password-based authentication to LAN addresses only, in /etc/ssh/sshd_config:

PasswordAuthentication no
ChallengeResponseAuthentication no

Match Address,,
    PasswordAuthentication yes

It has worked for two weeks, so that's pretty good.

Now that I have a stable connection to the Pi, it was time to set up the actual backups. I looked around and there are several options. I ended up choosing BorgBackup. It has built-in encryption for the archive, so I don't need to muck around with full disk encryption. It also does deduplication, compression and deltas: after an initial full backup, it only backs up changes, which makes it quite efficient.

BorgBackup is quite simple to use. First you have to initialize a repository, which will contain your backups:

> borg init ssh://user@pi_hostname:port/path/to/backups -e authenticated

This will prompt you for a passphrase. It will also generate a keyfile, which you should export and keep safe on other machines:

> borg key export ssh://user@pi_hostname:port/path/to/backups

Then, to start the actual backup process:

> borg create --stats --progress --exclude "pattern_to_exclude*" ssh://user@pi_hostname:port/path/to/backups::archive_name ./folder1 ./folder2 ./folder3

The archive_name corresponds to one instance when you backed up everything. If the next day you rerun the command with archive_name2, it will compare all the chunks and transmit only the ones that have changed or which are new. Then you will be able to restore both archives, with BorgBackup doing the right thing in the background to show you only the items that were backed up in that archive.

The cool thing about Borg is that if a backup stops while in progress, it can easily resume at any time.

I added the command to a cron job (actually, the Synology Task Scheduler) to run it daily, and now I have daily, efficient backups.

# Archive name schema
DATE=$(date --iso-8601)
echo "Starting backups for $DATE"
export BORG_PASSCOMMAND="cat ~/.borg-passphrase"
/usr/local/bin/borg create --stats --exclude "pattern_to_exclude*" ssh://user@pi_hostname:port/path/to/backups::$DATE ./folder1 ./folder2 ./folder3

The .borg-passphrase file contains my passphrase and has its permissions set to 400 (readable only by my user). Borg runs the command in the BORG_PASSCOMMAND environment variable and uses its output as the passphrase, so no user input is necessary.

Now I get the following report by email every morning:

Duration: 4 minutes 22.54 seconds
Number of files: 281990

			Original size      Compressed size    Deduplicated size
This archive:              656.97 GB            646.90 GB             12.51 MB

Not bad. Borg sweeps 656 GB of data in 4.5 minutes, determines that there is only 13 MB of new data and sends only that over the network.

I feel much more confident about this solution than about the previous one! Here's to not changing it too often!

<![CDATA[ Playing Codenames in Rust with word vectors ]]> https://rolisz.ro/2020/09/26/playing-codenames-in-rust-with-word-vectors/ 5f3adc474f71eb12e0abb8ca Sat, 26 Sep 2020 21:32:01 +0300 In a previous post I implemented the game of Codenames in Rust, allowing a human player to interact with the computer playing randomly. Now let's implement a smarter computer agent, using word vectors.

Word vectors (or word embeddings) are a way of converting words into high-dimensional vectors of numbers. This means that each word will have a long list of numbers associated with it, and those numbers aren't completely random: words that are related usually end up closer to each other in the vector space. Getting those numbers from raw data takes a long time, but there are many pretrained embeddings on the internet you can just use, and there are also libraries that help you find other words that are close to a target word.

Word vectors in Rust

Machine learning has embraced the Python programming language, so most ML tools, libraries and frameworks are in Python, but some are starting to show up in Rust as well. Rust's performance focus attracts people, because ML is usually computationally intensive.

There is one library in Rust that does exactly what we want: FinalFusion. It has a set of pretrained word vectors (quite fancy ones, with subword embeddings) and it has a library to load them and to make efficient nearest neighbor queries.

The pretrained embeddings come in a 5 GB file, because they pretty much literally have everything, including the kitchen sink, so the download will take a while. Let's start using the library (after adding it to our Cargo.toml file) to get the nearest neighboring words for "cat":

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader).unwrap();
    println!("{:?}", embeddings.word_similarity("cat", 5).unwrap());
}

After running it we get the following output:

[WordSimilarityResult { similarity: NotNan(0.81729543), word: "cats" },
WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" },
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }]

Loading a 5GB file from disk will take some time. If you have enough RAM, it should be in the OS's file cache after the first run, so it will load faster. Also, compiling this program with --release (turning on optimizations and removing debug information) will speed it up significantly.

One of the rules of Codenames is that hints can't be any of the words on the board or direct derivatives of them (such as plural forms). The finalfusion library has support for masking some words out, but to get plural forms I resorted to another library called inflector which has a method called to_plural, which does exactly what's written on the box.

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::EmbeddingSimilarity;
use std::collections::HashSet;
use inflector::string::pluralize::to_plural;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader).unwrap();
    let word = "cat";
    let embed = embeddings.embedding(word).unwrap();
    // Mask out the word itself and its plural form.
    let mut skip: HashSet<&str> = HashSet::new();
    skip.insert(word);
    let pluralized = to_plural(word);
    skip.insert(&pluralized);
    let words = embeddings.embedding_similarity_masked(embed.view(), 5, &skip).unwrap();
    println!("{:?}", words);
}

This is a slightly lower level interface, where we first have to obtain the embedding of the word, we build the set of words to skip and then we search for the most similar words to the vector that we give it. The output is:

[WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" }, 
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }, 
WordSimilarityResult { similarity: NotNan(0.7471396), word: "kittens" }]

It's better. Ideally, we could also somehow remove all composite words based on words from the table, but that's a bit more complicated.

This can be wrapped in a function, because it's a common use case:

fn find_similar_words<'a>(word: &str, embeddings: &'a Embedding, limit: usize) -> Vec<WordSimilarityResult<'a>> {
    let embed = embeddings.embedding(word).unwrap();
    // Mask out the word itself and its plural form.
    let mut skip: HashSet<&str> = HashSet::new();
    skip.insert(word);
    let pluralized = to_plural(word);
    skip.insert(&pluralized);
    embeddings.embedding_similarity_masked(embed.view(), limit, &skip).unwrap()
}

Implementing the first spymaster

Let's implement our first spymaster which uses word vectors! First, let's define a type alias for the embedding type, because it's long and we'll use it many times.

type Embedding = Embeddings<VocabWrap, StorageViewWrap>;

Our spymaster will have two fields: the embeddings and the color of the player. The Spymaster trait requires only the give_hint function to be implemented.

pub struct BestWordVectorSpymaster<'a> {
    pub embeddings: &'a Embedding,
    pub color: Color,
}

impl Spymaster for BestWordVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);
        let mut best_sim = NotNan::new(-1f32).unwrap();
        let mut best_word = "";
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, 1);
            let hint = words.get(0).unwrap();
            if hint.similarity > best_sim {
                best_sim = hint.similarity;
                best_word = hint.word;
            }
        }
        return Hint { count: 1, word: best_word.to_string() };
    }
}

This spymaster uses a simple greedy algorithm. It takes each word that has to be guessed and finds the most similar word to it, while keeping track of the similarity. It returns as the hint the word that had the highest similarity to any of the words that belong to the opposite team.

How does it do? I drew some random boards with a fixed seed and ran this spymaster on them. If you hover over the hint, it shows the words it's based on.

We have a problem: the word embeddings we use are a bit too noisy. Word embeddings are usually trained on large text corpora crawled from the internet, such as Wikipedia, the Common Crawl project or the CoNLL 2017 dataset (this is the one used above). The problem with these large corpora is that they are not perfectly cleaned. For example, "-pound" is considered a word. Let's try the CC embeddings:

Unfortunately, the CC embeddings give even worse results.

Cleaning up the embeddings

My fix for this was to write a script to prune down the embeddings to only "real" words (ones made of only letters). First, I had to get a set of all these words.

    let words = embeddings.vocab().words();
    let mut total = 0;
    let mut lowercase = 0;
    let mut select = HashSet::new();
    for w in words {
        total += 1;
        if w.chars().all(char::is_lowercase) {
            lowercase += 1;
            select.insert(w);
        }
    }
    println!("{} {}", total, lowercase);

Then I had to get the embeddings and the norms for each of these words:

    let mut selected_vocab = Vec::new();
    let mut selected_storage = Array2::zeros((select.len(), embeddings.dims()));
    let mut selected_norms = Array1::zeros((select.len(),));

    for (idx, word) in select.into_iter().enumerate() {
        match embeddings.embedding_with_norm(&word) {
            Some(embed_with_norm) => {
                selected_storage.row_mut(idx).assign(&embed_with_norm.embedding);
                selected_norms[idx] = embed_with_norm.norm;
                selected_vocab.push(word);
            }
            None => panic!("Cannot get embedding for: {}", word),
        }
    }


And finally write the now much smaller embedding file:

    let new_embs = Embeddings::new(
        None,
        SimpleVocab::new(selected_vocab),
        NdArray::from(selected_storage),
        NdNorms::new(selected_norms),
    );
    let f = File::create("resources/smaller_embs.fifu").unwrap();
    let mut writer = BufWriter::new(f);
    new_embs.write_embeddings(&mut writer).unwrap();

On the embeddings trained on the CoNLL dataset, the reduction is about 6x: from 1,336,558 words to 233,453.
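That factor is easy to double-check with a throwaway sketch (not part of the project code, just the arithmetic):

```rust
// Back-of-the-envelope check of the vocabulary reduction factor.
fn reduction_factor(total: f64, kept: f64) -> f64 {
    total / kept
}

fn main() {
    // 1,336,558 words before pruning, 233,453 after
    println!("{:.1}x", reduction_factor(1_336_558.0, 233_453.0)); // prints "5.7x"
}
```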

Let's give our Spymaster another shot with these embeddings, simply by changing the file from which we load the embeddings:

"superheroic" and "marched" look kinda against the rules, being too close to one of the words on the board, but "movie" is a really good one word hint.

Implementing a field operative

Now let's implement the other part of the AI team: the field operative which has to guess which words from the board belong to the enemy, based on the hints the spymaster gave.

pub struct SimpleWordVectorFieldOperative<'a> {
    embeddings: &'a Embedding,
}

impl FieldOperative for SimpleWordVectorFieldOperative<'_> {
    fn choose_words<'a>(&mut self, hint: &Hint, words: &[&'a str]) -> Vec<&'a str> {
        let hint_emb = self.embeddings.embedding(&hint.word).unwrap();
        let hint_embedding = hint_emb.view();
        let mut similarities = vec![];
        for w in words {
            let new_embed = self.embeddings.embedding(&w).unwrap();
            let similarity: f32 = new_embed.view().dot(&hint_embedding);
            similarities.push((w, similarity));
        }
        similarities.into_iter()
            .sorted_by(|(_, e), (_, e2)| e.partial_cmp(e2).unwrap())
            .rev().take(hint.count).map(|x| *x.0).collect()
    }
}

The field operative is even simpler: we go through all the words that are still on the board and compute a similarity score between each of them and the hint, then sort the words by similarity and take the top "count" ones.
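Because the embeddings are L2-normalized (that's why we stored the norms separately earlier), the dot product above is exactly the cosine similarity. A small dependency-free sketch of what that dot product computes — the `dot` and `normalize` helpers here are illustrative, not from the project:

```rust
// Cosine similarity between two already-normalized vectors reduces to a dot product.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Divide a vector by its L2 norm, as finalfusion does when storing embeddings.
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn main() {
    let a = normalize(&[1.0, 2.0, 3.0]);
    let b = normalize(&[2.0, 4.0, 6.0]); // same direction as a
    let c = normalize(&[-3.0, 0.0, 1.0]); // orthogonal to a
    println!("{:.2} {:.2}", dot(&a, &b), dot(&a, &c));
}
```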

Let's see how it does on the same maps as before. If you hover over the field operative, you can see the guesses it makes.

It looks like the two AI players are a good fit for each other: the field operative always guesses the word that the spymaster based the hint on. Now, let's try to make the spymaster give hints for multiple words.

Improving the spymaster

My first idea was to generate the top n closest embeddings for all words of a color and see if any of them are in common. n is a tunable parameter: lower values give hints that are closer to the words, but they match fewer words; higher values match more words, but with potentially worse hints.

pub struct DoubleHintVectorSpymaster<'a> {
    pub embeddings: &'a Embedding,
    pub color: Color,
    pub n: usize,
}

impl Spymaster for DoubleHintVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);

        let mut sim_words = HashMap::new();
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, self.n);
            for w in words {
                let count = sim_words.entry(w.word).or_insert(0);
                *count += 1;
            }
        }
        let best_word = sim_words.iter()
                .max_by_key(|(_, &count)| count).unwrap();

        return Hint{count: *best_word.1 as usize, word: best_word.0.to_string()};
    }
}

We store how often each suggested word appears across all the suggestions in a hashmap, with the word as the key and the occurrence count as the value. After adding all the words, we simply read out the maximum by occurrence count. In Rust, HashMap iteration order is nondeterministic, so if there are multiple words that occur the same number of times, which one will be returned is not guaranteed.
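If determinism matters (for example, to make the boards reproducible between runs), one option is to break ties on the word itself. A small sketch, with a hypothetical `best_hint` helper standing in for the `max_by_key` call above:

```rust
use std::collections::HashMap;

// Using (count, word) as the key breaks ties deterministically:
// when counts are equal, the lexicographically larger word wins.
fn best_hint<'a>(counts: &HashMap<&'a str, usize>) -> (&'a str, usize) {
    counts
        .iter()
        .max_by_key(|(word, count)| (**count, **word))
        .map(|(word, count)| (*word, *count))
        .unwrap()
}

fn main() {
    let mut sim_words = HashMap::new();
    for w in ["film", "movie", "film", "actor"] {
        *sim_words.entry(w).or_insert(0) += 1;
    }
    println!("{:?}", best_hint(&sim_words)); // prints ("film", 2)
}
```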

Around n=40, we start seeing hints for multiple words. At n=80 we have hints for two words on all three maps. At n=400 we have triple hints for two of the maps. But starting with n=80, the field operative no longer guesses all of the source words correctly. Sometimes it's because the association between the words is weird, but more often it's because the spymaster only takes into account the words it should suggest related hints for, and not the words from which the hint should be dissimilar.

There are several ways to address this issue, from simply seeing if the hint word is too close to a bad word and rejecting it, to more complex approaches such as finding the max-margin plane that separates the good words from the bad words and looking for hint words near it. But this post is already long, so this will come in part 3.
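Just to illustrate the simplest of those approaches, here is a sketch of such a rejection filter. The `similarity` function is a toy stand-in for the embedding dot product used elsewhere, and the names and the 0.5 threshold are illustrative assumptions:

```rust
// Toy stand-in for embedding similarity; real code would compare word vectors.
fn similarity(a: &str, b: &str) -> f32 {
    if a == b { 1.0 } else { 0.0 }
}

// A candidate hint is acceptable only if it is not too similar to any bad word
// (our own agents, neutral words, or the assassin).
fn acceptable_hint(hint: &str, bad_words: &[&str], threshold: f32) -> bool {
    bad_words.iter().all(|bad| similarity(hint, bad) < threshold)
}

fn main() {
    println!("{}", acceptable_hint("movie", &["assassin", "berlin"], 0.5)); // prints "true"
}
```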

The whole code I have so far can be seen on Github.

<![CDATA[ Getting rid of vmmem ]]> https://rolisz.ro/2020/07/31/getting-rid-of-vmmem/ 5f23f7dd4f71eb12e0abb780 Fri, 31 Jul 2020 20:52:17 +0300 A while ago I noticed in Task Manager that there is a process called Vmmem running and it's using about 5-15% CPU constantly. A quick duckduckgoing revealed that it's a virtual process that represents the total system usage by VMs.

Alright, so it's not malware. But I was not running any VMs. Where did it come from? My first suspect was Docker, which runs Linux containers in a VM. But I closed the Docker for Windows application and Vmmem was still chugging along, burning CPU. Then I suspected that the Windows Subsystem for Linux might be doing something funny, but no, it wasn't that, because I'm still using version 1, not version 2, which is the one that runs in a VM.

Well, after some more searching, it turns out that in some cases, Docker doesn't clean up properly when quitting and it leaves the VM open. To kill it, you must open the application called Hyper-V Manager and turn off the VM there manually.

To paraphrase Homer Simpson, "Docker, the cause of, and solution to, all our problems".

I’m publishing this as part of 100 Days To Offload - Day 32.

<![CDATA[ Showing current Kubernetes cluster in Powershell prompt ]]> https://rolisz.ro/2020/07/14/showing-current-kubernetes-cluster-in-powershell-prompt/ 5f0db64751d8dc2b1662a985 Tue, 14 Jul 2020 22:28:49 +0300 After nearly clearing the wrong Kubernetes cluster, I decided to add the name of the currently active cluster to my Powershell prompt. There are plenty of plugins to do this in Bash/Zsh/fish, but not as many for Powershell.

It's not hard to do, but the syntax and tools you use are definitely different from Unix tools.

First, let's get the name of the currently active cluster. We can look through the ~/.kube/config file for the field called current-context. On Linux, I would use grep to extract this; on Windows we use Select-String, which receives a regex to match and outputs the matches as objects. We take the first match and its second group (which is the first and only capture group in our regex) and put its value in the $ctx variable. This should output the current cluster name.

$ctx = (Get-Content ~/.kube/config | Select-String -Pattern "current-context: (.*)").Matches[0].Groups[1].Value
Write-Host $ctx

And now, to edit the prompt, you must modify your PowerShell profile. If you don't have one, create a file at C:\Users\<USERNAME>\Documents\WindowsPowerShell\profile.ps1. If you already have a profile (which might be a global one or a per-user one), edit it and add the following:

function _OLD_PROMPT {""}
Copy-Item function:prompt function:_OLD_PROMPT

function prompt {
	$ctx = (Get-Content ~/.kube/config | Select-String -Pattern "current-context: (.*)").Matches[0].Groups[1].Value
	If ($ctx) {
		# Prefix the original prompt with the cluster name
		Write-Host -NoNewline -ForegroundColor Yellow "[$ctx] "
	}
	_OLD_PROMPT
}

The prompt function is called to write out the prompt. First we save a copy of the original prompt into the _OLD_PROMPT function and then we define our new prompt function. In the new function we do the above snippet, with an added check so that we only add something to the prompt if there was a match for our regex. I put the name of the cluster in square brackets to make it visually distinct from the Python virtual environment name, which comes afterwards in parentheses.

The result is as follows:

Good luck with not nuking the wrong Kubernetes cluster!

I’m publishing this as part of 100 Days To Offload - Day 29.

<![CDATA[ Giving code presentations ]]> https://rolisz.ro/2020/07/04/giving-code-presentations-in-jupyter-notebook/ 5efb28a717253e7fe6dd646b Sat, 04 Jul 2020 23:28:33 +0300 I sometimes give talks at the local tech hub in the city where I live. It's not a big community, but I enjoy giving talks and they often provide a necessary deadline and motivation to finish some projects.

Last week I gave a talk about Rust. Given that there are still some restrictions on how many people can be in one room, the physical audience was only 10 people, but there was a livestream as well.

Until now, I had used Google Slides for my presentation. For talks that don't have a lot of code, it works fine. But when you are presenting lots of code (such as a tutorial for a programming language), I found Slides to be lacking. If you paste in the code directly, you can't have syntax highlighting. You can paste in a screenshot, but then any later modifications to the slide mean retaking the screenshot and replacing it, so it's more work.

You can present in an IDE, but sometimes you want to have slides with normal text between pieces of code, where you explain some things. Switching between two apps can quickly get annoying. Also, it's hard to prepare just "bite-sized" content in an IDE, but that is needed so that the audience is focused only on what you are explaining right now.

So I decided to try something new for my intro to Rust presentation: I used Jupyter Notebook with several extensions and I think it worked pretty well (except for a bug towards the end of the presentation).

For this I used the RISE extension, which adds live slide show support to Jupyter, using reveal.js. Each cell can be either a new slide, a sub-slide (so to get to it you have to "navigate down", in reveal.js style), a fragment (so it shows up on the same slide, but on a subsequent click), or notes. You can write new code and run it even during slideshow mode, which is very useful: if someone in the audience has a question, you can quickly write and execute code to answer it. RISE is simple to install:

> pip install RISE

Then I used a bunch of extensions that are bundled together in the jupyter_contrib_nbextensions package. By default, you have to enable and configure them by editing JSON files, but there is another plugin that adds a dashboard for them, called jupyter_nbextensions_configurator. They can be installed with:

> pip install jupyter_contrib_nbextensions
> jupyter contrib nbextension install --user
> pip install jupyter_nbextensions_configurator
> jupyter nbextensions_configurator enable --user

You have to restart the Jupyter process and now you will see a new tab on the home page of the local Jupyter page, where you can enable and configure all the installed extensions.

I used the "Hide input" extension. Most of my code was organized into two cells. One which didn't contain all the code, just a snippet on which I wanted to focus (for example, I made a small change to a previously defined function), and another one which could be run and showed output. The latter cell was hidden with this extension, so that only the output could be seen.

Initially I also used the "Split cell" extension. This extension gives you a button which can make a cell half width. If two consecutive cells are half width, they align next to each other, making two columns. I wanted to use this to have code in the left column and explanations in the right column. This would have worked if the presentation had been online only, because I wouldn't have had to zoom in too much. But in the last week before the presentation we found out that we were allowed to hold it in person (with 10 people in the audience), so I had to present on a projector and zoom in, and I ended up removing all the split cells because the content no longer fit.

Making Rust work with Jupyter

All the above is generic and can be made to work with anything that works in Jupyter. To make Rust work in Jupyter you need a kernel for it. Some guys from Google have made one called evcxr_jupyter.

It's fairly straightforward to install. On Windows you need to first install CMake and then you run:

> cargo install evcxr_jupyter
> evcxr_jupyter --install

After restarting the Jupyter process, you now have the option of using a Rust kernel. To include Cargo dependencies, you can insert the following into a cell:

:dep reqwest = { version = "0.10", features = ["json", "blocking"] }

This downloads reqwest, compiles it and makes it available for use in other cells.

The notebook for the presentation I gave can be found in a Github repo and the recording can be found here.

I’m publishing this as part of 100 Days To Offload - Day 27.

<![CDATA[ My operating system journey ]]> https://rolisz.ro/2020/06/14/operating-system-journey/ 5ec2eed817253e7fe6dd5a49 Sun, 14 Jun 2020 21:46:40 +0300 Getting started with Linux

Ten years ago I was writing about how I was a big fan of Windows (and Android). I would regularly get into friendly debates with Cătălin, who was a staunch supporter of Linux. I kept dipping my toes into Linux, but for a long time I kept getting burned.

At my first internship and then even more so at my first job, I learned more and more about Linux and got comfortable in it. I started dual booting. By 2014, the most used OS on my laptop was Fedora.

When I built a desktop in 2015, I first installed Linux on it, even though I had to try several distributions until I found one that worked. I was in my "command-line" minimalist phase, so I set up i3, tmux, and fish. I was quite happy with it, but eventually I installed Windows 10 on it so that I could play games, run the Swiss tax form application and YNAB, a budgeting app.

Trying out Mac OS

My work laptop at the time was a MacBook. I thought I would like it and I was looking forward to trying out all the cool and hipster apps that were only on Mac OS, such as Alfred. In the end, while working at Google, I used only a browser and a terminal, and I never got around to really working with any other apps, because I didn't need them. The terminal experience on a Mac requires a bit more searching around to get things working. Macs come with old libraries out of the box, you have to update them using copy-pasted shell commands, and I managed to screw things up once with Homebrew. I was not impressed by Mac OS and I didn't want to spend my own money on that crappy keyboard.

Slowly turning back to Windows

But Windows (and its ecosystem) has changed a lot since then. When I bought my new laptop in 2018, it came with Windows and I never bothered installing Linux on it. Why? Windows Subsystem for Linux. You get pretty much all the CLI goodies from Linux and all the other nice stuff from Windows. For example, as far as I know, there's almost no laptop where Linux has comparable battery life with Windows, and that is an important factor for me, because I work remotely.

On my desktop I still had ArchLinux, because running Tensorflow was easier on Linux than on Windows (modulo the Nvidia driver updates). But slowly I got bored of the "command line" minimalism. I tried other desktop environments on Linux, such as KDE and Gnome, but they never stuck. KDE is too bloated, and I find the default theme to be outdated. Gnome looks nice, but I never got around to feeling comfortable in it. The others are too "fringe" for me and I think that it's too hard to find solutions to the problems that inevitably crop up, just because the community is too small.

For the last two months, I have found myself using almost only Windows, even on my desktop. This way, I can watch Netflix at the highest resolution (on Linux, you can only watch it in the browser, where it's capped at 720p) and I can play games. Rust works just as well on Windows as on Linux. WSL satisfies my very few needs for Linux-only apps. And I never had problems with Nvidia drivers on Windows (unlike on Linux). The new Terminal app on Windows is pretty sweet. Powershell is pretty cool too, even though I don't know much of the syntax so far.

And honestly, I just like the default UI on Windows more. 10 years ago I had the patience to tinker with themes and to customize my desktop endlessly, but now I don't have the time and energy to deal with that anymore. I see plenty of nice Linux themes on Reddit and I tried to replicate one, but abusing fonts to get some nice "symbols" in i3-bar? Ewww.

Even though many people complain about Windows updates messing things up, that has never happened to me in the last 5 years, even though I am running on the insider preview version of Windows 10. On the other hand, I did manage to screw things up with ArchLinux updates, but it was my fault usually, because I didn't read the instructions or I let too much time pass between updates.


That's the story for my desktops and laptops. On servers, it's Linux all the way. My NAS runs Linux. My VPSs run Linux. And I plan to keep it that way. There it doesn't bother me, because I just SSH in and spend at most half an hour a week at the command line.

The only thing that I didn't try was a variant of BSD. Five years ago I might have given it a shot, but now I don't want to relearn a lot of things, from command line flags to concepts like jails. The strongest argument for BSD would be security, but Linux is secure enough for me, for now.

The future

But who knows what will happen in the future? Maybe in five years I'll get bored again of Windows and I'll try something new. Maybe Fuchsia will become mature by then :D

I’m publishing this as part of 100 Days To Offload - Day 23.