<![CDATA[ rolisz's blog - Technical Posts ]]> https://rolisz.ro https://rolisz.ro/favicon.png rolisz's blog https://rolisz.ro Thu, 23 Sep 2021 22:12:02 +0300 60 <![CDATA[ Opening Jupyter Notebooks in the right browser from WSL ]]> https://rolisz.ro/2021/09/23/opening-jupyter-notebooks-in-the-right-browser-from-wsl/ 614c740d56c059246e90e2f3 Thu, 23 Sep 2021 22:10:57 +0300 I mentioned last year that I've slowly moved back to using Windows more and more. In the mean time, my transition is almost completely done. This year I've pretty much booted into ArchLinux to update it, about once a month and that's it. I am otherwise very happy with WSL1 for when I need to run Linux only tools, such as auto-sklearn.

There was one small hickup: when opening a Jupyter Notebook from WSL, it would try to open the notebooks in the Linux environment, which is a CLI environment, so it opened them in Lynx, not in the Firefox instance that runs on the Windows side of things. While Lynx is cute, it's not the most useful interface for a Jupyter Notebook.

Jupyter Notebook opening in Lynx

I could quit Lynx by pressing q and then I would CTRL-Click on the link showed in the terminal and Jupyter would open in Firefox. But hey, I'm a programmer and I don't want to do extra clicks. Today I learned how to fix this problem.

First, we need to tell WSL to use the browser from Linux. This can be done by setting the BROWSER environment variable to point to the location of Firefox in Windows, but with the path as seen by WSL:

 export BROWSER=/mnt/c/Program\ Files/Mozilla\ Firefox/firefox.exe

Running jupyter notebook after this will correctly open a window in Firefox, but it will open it with a Linux path towards a redirect file that does the authentication for Jupyter. Because Firefox runs in Windows, it can't access the path on the Linux side.

But there is a way to tell Jupyter to open the normal localhost links, not the ones that point to a local redirect file. For this, you have to create a Jupyter config (unless you already have one):

> jupyter notebook --generate-config
Writing default config to: /home/rolisz/.jupyter/jupyter_notebook_config.py

Then edit this file and change the use_redirect_file parameter to be true (and uncomment it if needed):

c.NotebookApp.use_redirect_file = True

From now, running jupyter notebook in WSL will open properly!

<![CDATA[ Working across multiple machines ]]> https://rolisz.ro/2021/05/19/working-across-multiple-machines/ 60a401e6a3c3ed7839d6f28e Wed, 19 May 2021 13:00:11 +0300 Until this year, I usually had a laptop from my employer, on which I did work stuff and I had a personal desktop and laptop. The two personal devices got far too little usage coding wise, so I didn't really have a need to make sure I have access to the same files on both places.

But since becoming self-employed at the beginning of this year, I find myself using both the desktop and the laptop a lot more and I need to sync files between them. I go to work from a co-working space 2-3 days a week. Sometimes I go to have a meeting with a client at their office. My desktop has a GPU and is much more powerful, so when at home I strongly prefer to work from it, instead of from a laptop that gets thermal throttling pretty fast.

I could transfer code using Github, I'd rather not have to do a WIP commit every time I get up from the desk. But I also need to sync things like business files (PDFs) and machine learning models.  The most common solution for this is to use Dropbox, OneDrive or something similar, but I would like to avoid sending all my files to a centralized service run by a big company.

Trying Syncthing again

I've tried using Syncthing in the past for backups, but it didn't work out at the time. Probably because it's not meant for backups. But it is meant for syncing files between devices!

I've been using Syncthing for this purpose for 3 months now and it just works™️. It does NAT punching really well and syncing is super speedy. I've had problems with files not showing up right away on my laptop only once and I'm pretty sure it was because my laptop's Wifi sometimes acts weird.

My setup

I have three devices talking to each other on Syncthing: my desktop, my laptop and my NAS. The NAS is there to be the always-on replica of my data and it makes it easier to backup things. The desktop has the address of the NAS hardcoded because they are in the same LAN, but all the other devices uses dynamic IP discovery to talk to each other.

I have several folders set up for syncing. Some of them go to all three devices, some of them are only between the desktop and the NAS.

For the programming folders I use ignore patterns generously: I don't sync virtual env folders or node_modules folders, because they usually don't play nice if they end up on a different device with different paths (or worse, different OS). Because of this, I set up my environment on each device separately and I only sync  requirements.txt and then run pip install -r requirements.txt.

What do you use for syncronizing your workspace across devices? Do you have anything better than Syncthing?

<![CDATA[ How to ML - Monitoring ]]> https://rolisz.ro/2021/01/22/how-to-ml-monitoring/ 600b34dcf896ad697fe0fdf3 Fri, 22 Jan 2021 23:29:24 +0300 As much as machine learning developers like to think that once they've got a good enough model, the job is done, it's not quite so.

The first couple of weeks after deployment are critical. Is the model really as good as offline tests said they are? Maybe something is different in production then in all your test data. Maybe the data you collected for offline predictions includes pieces of data that are not available at inference time. For example, if trying to predict click through rates for items in a list and use that to rank the items, when building the training dataset it's easy to include the rank of the item in the data, but the model won't have that when making predictions, because it's what you're trying to infer. Surprise, the model will perform very poorly in production.

Or maybe simply A/B testing reveals that the fancy ML model doesn't really perform better in production than the old rules written with lots of elbow grease by lots of developers and business analysts, using lots of domain knowledge and years of experience.

But even if the model does well at the beginning, will it continue to do so? Maybe there will be an external change in user behavior and they will start searching for other kinds of queries, which your model was not developed for. Or maybe your model will introduce a "positive" feedback loop: it suggests some items, users click on them, so those items get suggested more often, so more users click on them. This leads to a "rich get richer" kind of situation, but the algorithm is actually not making better and better suggestions.

Maybe you are on top of this and you keep retraining your model weekly to keep it in step with user behavior. But then you need to have a staggered release of the model, to make sure that the new one is really performing better across all relevant dimensions. Is inference speed still good enough? Are predictions relatively stable, meaning we don't recommend only action movies one week and then only comedies next week? Are models even comparable from one week to another or is there a significant random component to them which makes it really hard to see how they improved? For example, how are the clusters from the user post data built up? K-means starts with random centroids and clusters from one run have only passing similarity to the ones from another run. How will you deal with that?

<![CDATA[ GPT-3 and AGI ]]> https://rolisz.ro/2021/01/21/gpt3-agi/ 5f3517c94f71eb12e0abb8bf Thu, 21 Jan 2021 23:13:00 +0300 One of the most impressive/controversial papers from 2020 was GPT-3 from OpenAI. It's nothing particularly new, it's mostly a bigger version of GPT-2, which came out in 2019. It's a much bigger version, being by far the largest machine learning model at the time it was release, with 175 billion parameters.

It's a fairly simple algorithm: it's learning to predict the next word in a text[1]. It learns to do this by training on several hundred gigabytes of text gathered from the Internet. Then to use it, you give it a prompt (a starting sequence of words) and then it will start generating more words and eventually it will decide to finish the text by emitting a stop token.

Using this seemingly stupid approach, GPT-3 is capable of generating a wide variety of interesting texts: it can write poems (not prize winning, but still), write news articles, imitate other well know authors, make jokes, argue for it's self awareness, do basic math and, shockingly to programmers all over the world, who are now afraid the robots will take their jobs, it can code simple programs.

That's amazing for such a simple approach. The internet was divided upon seeing these results. Some were welcoming our GPT-3 AI overlords, while others were skeptical, calling it just fancy parroting, without a real understanding of what it says.

I think both sides have a grain of truth. On one hand, it's easy to find failure cases, make it say things like "a horse has five legs" and so on, where it shows it doesn't really know what a horse is. But are humans that different? Think of a small child who is being taught by his parents to say "Please" before his requests. I remember being amused by a small child saying "But I said please" when he was refused by his parents. The kid probably thought that "Please" is a magic word that can unlock anything. Well, not really, in real life we just use it because society likes polite people, but saying please when wishing for a unicorn won't make it any more likely to happen.

And it's not just little humans who do that. Sometimes even grownups parrot stuff without thinking about it, because that's what they heard all their life and they never questioned it. It actually takes a lot of effort to think, to ensure consistency in your thoughts and to produce novel ideas. In this sense, expecting an artificial intelligence that is around human level might be a disappointment.

On the other hand, I believe there is a reason why this amazing result happened in the field of natural language processing and not say, computer vision. It has been long recognized that language is a powerful tool, there is even a saying about it: "The pen is mightier than the sword". Human language is so powerful that we can encode everything that there is in this universe into it, and then some (think of all the sci-fi and fantasy books). More than that, we use language to get others to do our bidding, to motivate them, to cooperate with them and to change their inner state, making them happy or inciting them to anger.

While there is a common ground in the physical world, often times that is not very relevant to the point we are making: "A rose by any other name would smell as sweet". Does it matter what a rose is when the rallying call is to get more roses? As long as the message gets across and is understood in the same way by all listeners, no, it doesn’t. Similarly, if GPTx can affect the desired change in it's readers, it might be good enough, even if doesn't have a mythical understanding of what those words mean.

Technically, the next byte pair encoded token ↩︎

<![CDATA[ How to ML - Deploying ]]> https://rolisz.ro/2021/01/20/how-to-ml-deploying/ 60084bc7165bd14e3b33595d Wed, 20 Jan 2021 18:28:54 +0300 So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the models run reliably in production. We have to answer some more questions, in order to make some trade offs.

How important is latency? Is the model making an inference in response to a user action, so it's crucial to have the answer in tens of milliseconds? Then it's time to optimize the model: quantize weights, distill knowledge to a smaller model, weight pruning and so on. Hopefully, your metrics won't go down due to the optimization.

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night that does the inference for every user and stores them in a database. Then when the user makes a request, they are simply quickly loaded from the database. This is possible only if you have finite range of predictions to make.

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don't even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. "Works on my machine" is all to often a problem. Maybe production runs a different version of Linux, which has a different BLAS library and the security team won't let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team to help you out with setting up the CI/CD pipelines.

But you've killed all the dragons, now it's time to keep watch... aka monitoring the models performance in production.

<![CDATA[ How to ML - Models ]]> https://rolisz.ro/2021/01/18/how-to-ml-models/ 6005e7293e8fc062a027dbe3 Mon, 18 Jan 2021 22:55:44 +0300 So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who want to do machine learning are excited about. I read Bishop's and Murphy's textbooks, watched Andrew Ng's online course about ML and learned about different kinds of ML algorithms and I couldn't wait to try them out and to see which one is the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, ADAM, convolutions, attention mechanism and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For most production projects, if it's not in one of the sklearn, Tensorflow or Pytorch libraries, it won't fly. For proof of concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain, trying to find all the dependencies of undocumented code and to make it work.

For the hyperparameter tuning, there are libraries to help you with that, and anyway, the time it takes to finish the training runs is much larger than the time you spend coding it up, for any real life datasets.

And in practice, you run into many issues with the data. You'll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You'll find conflicting or invalid labels. And that means going back to the data pipelines and fixing that bugs that occur there.

If you do get a model that is good enough, it's time to deploy it, which comes with it's own fun...

<![CDATA[ How to ML - Data ]]> https://rolisz.ro/2020/12/29/how-to-ml-data/ 5feb735d2bc2360ef49da332 Tue, 29 Dec 2020 21:22:09 +0300 So we've decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment and there are service logs saying what was replaced for each equipment. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifiers that predicts if an equipment will fail based on the last 10 values measures from its sensors. In practice, we'll find that sensors of the same time that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so that will have to be standardized as well. Or worse, the sensor data is good, but it's kept only for 1 month to save on storage costs so we have to fix that and wait a couple of months for more training data to accumulate.

The next best case is that we don't have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought, for example, if we want to start a new ad network, there are plenty of datasets available online of personal data about everyone, which can be used then to predict the likelihood of clicking on an ad. That's the business model of many startups...

The worst case is that we don't have data and we can't find it out there. Maybe it's because we have a very specific niche, such as we want to find defects in the manufacturing process of our specific widgets, so we can't use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming of the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell if a widget is good or defective. There needs to be a Q&A process in this, because even humans have an error rate, so each image will have to be labeled by at least three humans. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as AWS Mechanical Turk to distribute the tasks to many workers across the world.

All this is done, we finally have data. Time to start doing the actual ML...

<![CDATA[ How to ML - Metrics ]]> https://rolisz.ro/2020/12/28/how-to-ml/ 5fea21292bc2360ef49da324 Mon, 28 Dec 2020 21:19:07 +0300 We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we're doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That's a regression problem. Are we trying to predict who will win the election? That's a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That's a multi class classification problem.

Another question that has to be answered is what kind of mistakes are worse. Machine learning is not all knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better that we call wolf too often and have false positives, rather than missing out on a tumor. Or maybe it's the opposite: we are trying to implement a facial recognition system. If the system recognizes a burglar incorrectly, then the wrong person will get sent to jail, which is a very bad consequence for a mistake made by "THE algorithm".

These are not just theoretical concerns, but they actually matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn't decide by itself what to do, it merely makes a suggestion which a human will then confirm. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn't have to look at 20 screens at once, but can only look at the footage that was flagged as anomalous.

Tomorrow we'll look at the next step: gathering the data.

<![CDATA[ What is ML? part 3 ]]> https://rolisz.ro/2020/12/24/what-is-ml-3/ 5fe4b0ab2bc2360ef49da308 Thu, 24 Dec 2020 18:18:51 +0300 Yesterday we saw that machine learning is behind some successful products and it does have the potential to bring many more changes to our life.

So what is it?

Well, the textbook definition is that it's the building of algorithms that can perform tasks they were not explicitly programmed to do. In practice, this means that we have algorithms that analyze large quantities of data to learn some patterns in the data, which can then be used to make predictions about new data points.

This is in contrast with the classical way of programming computers, where a programmer would use either their domain knowledge or they would analyze the data themselves and then write the program that has the correct output.

So one of the crucial distinctions is that in machine learning, the machine has to learn from the data. If a human being figures out the pattern and writes a regular expression to find addresses in text, that's human learning, and we all go to school to do that.

Now does that mean that machine learning is a solution for everything? No. In some cases, it's easier or cheaper to have a data analyst or a programmer find the pattern and code it up.

But there are plenty of cases where despite decades long efforts of big teams of researchers, humans haven't been able to find an explicit pattern. The simplest example of this would be recognizing dogs in pictures. 99.99% of humans over the age of 5 have no problem recognizing a dog, whether a puppy, a golden retriever or a Saint Bernard, but they have zero insight into how they do it, what makes a bunch of pixels on the screen a dog and not a cat. And this is where machine learning shines: you give it a lot of photos (several thousands at least), pair each photo with a label of what it contains and the neural network will learn by itself what makes a dog a dog and not a cat.

Machine learning is just one tool that is available at our disposal, among many other tool. It's a very powerful tool and it's one that gets "sharpened" all the time, with lots of research being done all around the world to find better algorithms, to speed up their training and to make them more accurate.

Come back tomorrow to find out how the sausage is made, on a high level.

<![CDATA[ What is ML? part 2 ]]> https://rolisz.ro/2020/12/23/what-is-ml-part-2/ 5fe301b952484d7aadd5c620 Wed, 23 Dec 2020 11:42:31 +0300 Yesterday I wrote how AI made big promises in the past but it failed to deliver, but that now it's different.

What's changed?

Well, now we have several products that work well with machine learning. My favorite example is Google Photos, Synology Moments and PhotoPrism. They are all photo management applications which use machine learning to automatically recognize all faces in pictures (easy, we had this for 15 years), recognize automatically which pictures are of the same person (hard, but doable by hand if you had too much time) and more than that, index photos by all kinds of objects that are found in them, so that you can search by what items appear in your photos (really hard, nobody had time to do that manually).

I have more than 10 years of photos uploaded to my Synology and one of my favorite party tricks when talking to someone is to whip out my phone and show them all the photos I have of them, since they were kids, or the last time that we met, or that funny thing that happened to them and I have photographic evidence of. Everyone is amazed by that (and some are horrified and deny that they looked like that when they were children). And there is not one, but at least three options to do this, one of which is open source, so that anyone can run in at home on their computer, for free, so there is demand for such a product.

Other successful examples are in the domain of recommender systems, YouTube being a good example. I have a love/hate relationship with it: on one hand, I wasted so many hours of my life to the recommendations it makes (which is proof of how good it is at making personalized suggestions), on the other hand, I found plenty of cool videos with it. This deep learning based recommender system is one of the factors behind the growth of the watch time on YouTube, which is basically the key metric behind revenue (more watch time, more ads).

These are just two examples that are available for everyone to use, and which serve as evidence that machine learning based AI now is not just hot air.

But I still haven't answered the question what is ML... tomorrow, I promise.

<![CDATA[ What is ML? ]]> https://rolisz.ro/2020/12/21/what-is-ml/ 5fe0d5bd1012ed2b469ca47f Mon, 21 Dec 2020 20:17:34 +0300 Machine learning is everywhere these days. Mostly in newspapers, but it's seeping into many real life, actual use cases. But what is it actually?

If you read only articles on TechCrunch, Forbes, Business Insider or even MIT Technology Review, you'd think it's something that brings Model T800 to life soon, or that it will cure cancer and make radiologists useless, or that it will enable humans to upload their minds to the cloud and live forever, or that it will bring fully self driving cars by the end of the year (every year for the last 5 years).

Many companies want to get in on the ML bandwagon. It's understandable: 1) that's where the money is (some 10 billion dollars were invested in it in 2018) and 2) correctly done, applied to the right problems, ML can actually be really valuable, either by automating things that were previously done with manual labor or even by enabling things that were previously unfeasible.

But at the same time, a lot of ML projects make unrealistic promises, eat a lot of money and then deliver something that doesn't work well enough to have a positive ROI. The ML engineers and researchers are happy, they got payed, analyzed the data and played around with building ML models, and maybe even published a paper or two. But the business is not happy, because they are not better off in any way.

This is not a new phenomenon. Artificial Intelligence, of which Machine Learning is a subdomain of, has been plagued by similar bubbles ever since it was founded. AI has gone through several AI winters already, in the 60s, 80s and late 90s. Big promises, few results.

To paraphrase Battlestar Galactica, "All this has happened before, all this will happen again but this time it's different". But why is it different? More about that tomorrow.

<![CDATA[ Machine Learning stories: Misunderstood suggestions ]]> https://rolisz.ro/2020/11/30/machine-learning-stories/ 5fc5247f53c65419dc54f518 Mon, 30 Nov 2020 20:00:59 +0300 A couple of years ago I was working on a calendar application, on the machine learning team, to make it smarter. We had many great ideas, one of them being that once you indicated you wanted to meet with a group of people, the app would automatically suggest you a time slot for the meeting.

We worked on it for several months. We couldn't just use simple hand-coded rules, because we wanted to do things like learn every users working hours, which could vary based on many things. In the end, we implemented this feature using a combination of both hand coded rules (to avoid some bad edge cases) and machine learning. We did lots of testing, both automated and manually in our team.

Once the UI was ready, we did some user testing, where the new prototype was put in front of real users, unrelated to our team, who were recorded while they tried to use it and then were asked questions about the product. When the reports came in, the whole team banged their heads against the desk: most users thought we were suggesting times when the meeting couldn't take place!

What happened? If you included either many people or even only one very busy person, there will be no empty slot which is good for everyone. So our algorithm would make three suggestions, saying that for each there would be a different person who might not be able to make the meeting.

In our own testing, it was obvious to us what was happening, so we didn't consider it a big problem. But users who didn't know the system, found it confusing and kept going to the classic grid to manually find a slot for the meeting.

Lesson: machine learning algorithms are never perfect and every project needs to be prepared to deal with mistakes.

How will your machine learning project handle failures? How will you explain to the end users the decisions the algorithm made? If you need help answering these questions, let's talk.

<![CDATA[ Tailscale ]]> https://rolisz.ro/2020/11/15/tailscale/ 5f8aef7aebd40d0556a6e307 Sun, 15 Nov 2020 23:18:28 +0300 I want to share about an awesome piece of technology that I started using recently: Tailscale. It's a sort of VPN, meaning it sets up a private network for you, but only between your devices, without a central server.

It does so without you having to open any ports on your router or configure firewalls. It's pretty much powered by JustWorks™️ technology. Each device you register gets an IP address of the form 100.x.y.z. Using this address than you can connect to that device from anywhere in the world, as long as both are connected to the Internet, because Tailscale automatically performs NAT traversal.

My use case for Tailscale is to connect my various devices (desktop, laptop, NAS, Raspberry PI and Android phone) and be able to access them, regardless of where I am.

For my NAS I had setup a DynDNS system, but occasionally it would still glitch and I would lose connectivity. With Tailscale, now I have connection pretty much always. I still keep the DNS address, because I have everything set up for it, but I know that debugging stuff will be much easier.

Similarly, the Raspberry Pi at my parents place is behind a crappy router which sometimes doesn't do the port forwarding properly. Now I can login via Tailscale and fix the issue.

There is a bit of overhead with using Tailscale. When I set up the initial backup to the RPI, I tried going through it, but the bandwith was only 3MB/s, while if I connected directly, it would be 10 MB/s. Because borg encrypts the data, I don't need the additional security provided by Tailscale.

All in all: I can strongly recommend Tailscale. It's a great product if you have any sort of home servers. It's developed by some great guys, including Avery Pennarun, David Crawshaw and Brad Fitzpatrick. I wish more startups did cool stuff like this, which work so well.

<![CDATA[ Backing up 4: The Raspberry Pi ]]> https://rolisz.ro/2020/10/11/raspi-backups/ 5f71897a5ad1bb49f64c71b8 Sun, 11 Oct 2020 23:31:12 +0300 More than a year ago I described how I used Syncthing to backup folders from my NAS to an external harddrive attached to my parents PC. This was supposed to be my offline backup. Unfortunately, it didn't prove to be a very reliable solution. The PC ran Windows, I had trouble getting SSH to work reliably, I would often had to fix stuff through Teamviewer. Often the PC would not be turned on for days, so I couldn't even do the backups without asking them to turn it on. And Syncthing turned out to be finicky and sometimes didn't sync.

Then it finally dawned on me: I have two Raspberry Pi 3s at home that are just collecting dust. How about I put one of them to good use?

So I took one of the Pis, set it up at my parents place and after some fiddling, it works. Here's what I did:

I used the latest Raspbian image. It sits at my parent's home, which has a dynamic IP address. The address usually changes only if the router is restarted, but it can still cause issues. At first I thought I would set up a reverse SSH tunnel from the Raspberry Pi to my NAS, but I couldn't get autossh to work with systemd.

Then I tried another option: I set up a Dynamic DNS entry on a subdomain, with ddclient on the Raspberry Pi to update the IP address regularly. I had to open a port on my parents router for this. I added public key authentication through SSH, while restricting password based authentication to LAN networks only, in the /etc/ssh/sshd_config:

PasswordAuthentication no
ChallengeResponseAuthentication no

Match Address,,
    PasswordAuthentication yes

It has worked for two weeks, so that's pretty good.

Now that I have a stable connection to the Pi, it was time to set up the actual backups. I looked around and there are several options. I ended up choosing BorgBackup. It has builtin encryption for the archive, so I don't need to muck around with full disk encryption. It also does deduplication, compression and deltas, so after an initial full backup, it only backs up changes, so it's quite efficient.

BorgBackup is quite simple to use. First you have to initialize a repository, which will contain your backups:

 > borg init ssh://user@pi_hostname:port/path/to/backups -e authenticated

This will prompt you for a passphrase. It will also generate a keyfile, which you should export and keep safe on other machines:

> borg key export ssh://user@pi_hostname:port/path/to/backups

Then, to start the actual backup process:

> borg create --stats --progress --exclude "pattern_to_exclude*" ssh://user@pi_hostname:port/path/to/backups::archive_name ./folder1 ./folder2 ./folder3

The archive_name corresponds to one instance when you backed up everything. If the next day you rerun the command with archive_name2, it will compare all the chunks and transmit only the ones that have changed or which are new. Then you will be able to restore both archives, with BorgBackup doing the right thing in the background to show you only the items that were backed up in that archive.

The cool thing about Borg is that if a backup stops while in progress, it can easily resume at any time.

I added command to a cron job (actually, the Synology Task Scheduler) to run it daily and now I have daily, efficient backups.

# Archive name schema
DATE=$(date --iso-8601)
echo "Starting backups for $DATE"
export BORG_PASSCOMMAND="cat ~/.borg-passphrase"
/usr/local/bin/borg create --stats --exclude "pattern_to_exclude*" ssh://user@pi_hostname:port/path/to/backups::$DATE ./folder1 ./folder2 ./folder3

The .borg-passphrase file contains my passphrase and has the permission set to 400 (read only by my user). Borg then reads the passphrase from that environment variable, so no user input is necessary.

Now I get the following report by email every morning:

Duration: 4 minutes 22.54 seconds
Number of files: 281990

			Original size      Compressed size    Deduplicated size
This archive:              656.97 GB            646.90 GB             12.51 MB

Not bad. Borg sweeps 656 GB of data in 4.5 minutes, determines that there is only 13 MB of new data and sends only that over the network.

I feel much more confident about this solution than about the previous one! Here's to not changing it too often!

<![CDATA[ Playing Codenames in Rust with word vectors ]]> https://rolisz.ro/2020/09/26/playing-codenames-in-rust-with-word-vectors/ 5f3adc474f71eb12e0abb8ca Sat, 26 Sep 2020 21:32:01 +0300 In a previous post I implemented the game of Codenames in Rust, allowing a human player to interact with the computer playing randomly. Now let's implement a smarter computer agent, using word vectors.

Word vectors (or word embeddings) are a way of converting words into a high dimensional vector of numbers. This means that each word will have a long list of numbers associated with it and those numbers aren't completely random. Words that are related usually have values closer to each other in the vector space as well. Getting those numbers from raw data takes a long time, but there are many pretrained embeddings on the internet you can just use and there are also libraries that help you find other words that are close to a target word.

Word vectors in Rust

Machine learning has embraced the Python programming language, so most ML tools, libraries and frameworks are in Python, but some are starting to show up in Rust as well. Rust's performance focus attracts people, because ML is usually computationally intensive.

There is one library in Rust that does exactly what we want: FinalFusion. It has a set of pretrained word vectors (quite fancy ones, with subword embeddings) and it has a library to load them and to make efficient nearest neighbor queries.

The pretrained embeddings come in a 5 GB file, because they pretty much literally have everything, including the kitchen sink, so the download will take a while. Let's start using the library (after adding it to our Cargo.toml file) to get the nearest neighboring words for "cat":

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader)
    println!("{:?}", embeddings.word_similarity("cat", 5).unwrap());

After running it we get the following output:

[WordSimilarityResult { similarity: NotNan(0.81729543), word: "cats" },
WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" },
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }]

Loading a 5GB file from disk will take some time. If you have enough RAM, it should be in the OS's file cache after the first run, so it will load faster. Also, compiling this program with --release (turning on optimizations and removing debug information) will speed it up significantly.

One of the rules of Codenames is that hints can't be any of the words on the board or direct derivatives of them (such as plural forms). The finalfusion library has support for masking some words out, but to get plural forms I resorted to another library called inflector which has a method called to_plural, which does exactly what's written on the box.

use std::io::BufReader;
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::EmbeddingSimilarity;
use std::collections::HashSet;
use inflector::string::pluralize::to_plural;

fn main() {
    let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());

    // Read the embeddings.
    let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
        Embeddings::read_embeddings(&mut reader)
    let word = "cat";
    let embed = embeddings.embedding(word).unwrap();
    let mut skip: HashSet<&str> = HashSet::new();
    let pluralized = to_plural(word);
    let words = embeddings.embedding_similarity_masked(embed.view(), 5, &skip).unwrap();
    println!("{:?}", words);

This is a slightly lower level interface, where we first have to obtain the embedding of the word, we build the set of words to skip and then we search for the most similar words to the vector that we give it. The output is:

[WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" }, 
WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }, 
WordSimilarityResult { similarity: NotNan(0.7471396), word: "kittens" }]

It's better. Ideally, we could also somehow remove all composite words based on words from the table, but that's a bit more complicated.

This can be wrapped in a function, because it's a common use case:

fn find_similar_words<'a>(word: &str, embeddings: &'a Embedding, limit: usize) -> Vec<WordSimilarityResult<'a>> {
    let embed = embeddings.embedding(&word).unwrap();
    let mut skip: HashSet<&str> = HashSet::new();
    let pluralized = to_plural(&word);
    embeddings.embedding_similarity_masked(embed.view(), limit, &skip).unwrap()

Implementing the first spymaster

Let's implement our first spymaster which uses word vectors! First, let's define a type alias for the embedding type, because it's long and we'll use it many times.

type Embedding = Embeddings<VocabWrap, StorageViewWrap>;

Our spymaster will have two fields: the embeddings and the color of the player. The Spymaster trait requires only the give_hint function to be implemented.

pub struct BestWordVectorSpymaster<'a> {
    pub embeddings: &'a Embedding,
    pub color: Color,

impl Spymaster for BestWordVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);
        let mut best_sim = NotNan::new(-1f32).unwrap();
        let mut best_word = "";
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, 1);
            let hint = words.get(0).unwrap();
            if hint.similarity > best_sim {
                best_sim = hint.similarity;
                best_word = hint.word;
        return Hint{count: 1, word: best_word.to_string()};

This spymaster uses a simple greedy algorithm. It takes each word that has to be guessed and find's the most similar word to it, while keeping track of the similarity. It returns as hint the word that had the highest similarity to any of the words that belong to the opposite team.

How does it do? I drew some random boards with a fixed seed and ran this spymaster on them. If you hover over the hint, it shows what are the words it's based on.

We have a problem: the word embeddings we use are a bit too noisy. The word embeddings are usually trained on large text corpora crawled from the internet, such as Wikipedia, the Common Crawl project or from CoNLL 2017 dataset (this is the one used above). The problem with these large corpuses is that they are not perfectly cleaned. For example "-pound" is considered a word. Let's try the CC embeddings:

Unfortunately, the CC embeddings give even worse results.

Cleaning up the embeddings

My fix for this was to write a script to prune down the embeddings to only "real" words (ones made of only letters). First, I had to get a set of all these words.

    let words = embeddings.vocab().words();
    let mut total = 0;
    let mut lowercase = 0;
    let mut select = HashSet::new();
    for w in words {
        total += 1;
        if w.chars().all(char::is_lowercase) {
            lowercase +=1;
    println!("{} {}", total,  lowercase);

Then I had to get the embeddings and the norms for each of these words:

    let mut selected_vocab = Vec::new();
    let mut selected_storage = Array2::zeros((select.len(), embeddings.dims()));
    let mut selected_norms = Array1::zeros((select.len(),));

    for (idx, word) in select.into_iter().enumerate() {
        match embeddings.embedding_with_norm(&word) {
            Some(embed_with_norm) => {
                selected_norms[idx] = embed_with_norm.norm;
            None => panic!("Cannot get embedding for: {}", word),


And finally write the now much smaller embedding file:

    let new_embs = Embeddings::new(
    let f = File::create("resources/smaller_embs.fifu").unwrap();
    let mut reader = BufWriter::new(f);
    new_embs.write_embeddings(&mut reader);

On the embeddings trained on the CoNLL dataset the reduction is about 6x: from 1336558 to 233453.

Let's give our Spymaster another shot with these embeddings, simply by changing the file from which we load the embeddings:

"superheroic" and "marched" look kinda against the rules, being too close to one of the words on the board, but "movie" is a really good one word hint.

Implementing a field operative

Now let's implement the other part of the AI team: the field operative which has to guess which words from the board belong to the enemy, based on the hints the spymaster gave.

pub struct SimpleWordVectorFieldOperative<'a> {
    embeddings: &'a Embedding,

impl FieldOperative for SimpleWordVectorFieldOperative<'_> {
    fn choose_words<'a>(&mut self, hint: &Hint, words: &[&'a str]) -> Vec<&'a str> {
        let hint_emb = self.embeddings.embedding(&hint.word).unwrap();
        let hint_embedding = hint_emb.view();
        let mut similarities = vec![];
        for w in words {
            let new_embed = self.embeddings.embedding(&w).unwrap();
            let similarity: f32 = new_embed.view().dot(&hint_embedding);
            similarities.push((w, similarity));
            .sorted_by(|(_, e), (_, e2)| e.partial_cmp(e2).unwrap())
            .rev().take(hint.count).map(|x| *x.0).collect()

The field operative is even simpler: we go through all the words that are still on the board and get a similarity score between them and the hint. Sort the words by the similarity and take the top "count" ones.

Let's see how it does on the same maps as before. If you hover over the field operative, you can see the guesses it makes.

It looks like the two AI players are a good fit for each other: the field operative always guesses the word that the spymaster based the hint on. Now, let's try to make the spymaster give hints for multiple words.

Improving the spymaster

My first idea would be to generate the top n closest embeddings for all words of a color and see if there are any which are in common. n will be a tuneable parameter: lower values will give hints that are closer to the words, but they will match fewer words, higher values will match more words, potentially worse.

impl Spymaster for DoubleHintVectorSpymaster<'_> {
    fn give_hint(&mut self, map: &Map) -> Hint {
        let enemy_color = opposite_player(self.color);
        let remaining_words = map.remaining_words_of_color(enemy_color);

        let mut sim_words = HashMap::new();
        for word in remaining_words {
            let words = find_similar_words(&word, self.embeddings, self.n);
            for w in words {
                let count = sim_words.entry(w.word).or_insert(0);
                *count +=1;
        let mut best_word = sim_words.iter()
        		.max_by_key(|(&x, &y)| y).unwrap();

        return Hint{count: *best_word.1 as usize, word: best_word.0.to_string()};

We store how often each suggested word is found in all suggestions in a hashmap, with the key being the word and the value being the occurence count. After we add all words there, we simply read out the maximum by occurence count. In Rust, this sort is nondeterministic, so if there are multiple words that occur the same number of times, which one will be returned is not guaranteed.

Around n=40, we start seeing hints for multiple words. At n=80 we have hints for two words for all three maps. At n=400 we have triple hints for two of the maps. But starting with n=80, the field agent no longer guesses correctly all of source words. Sometimes it's because the associations in the words is weird, but more often it's because the spy master only takes into account the words to which it should suggest related hints and it doesn't take into account the words from which the hint should be dissimilar.

There are several ways to address this issue, from simply seeing if the hint word is too close to a bad word and rejecting it, to more complex approaches such as finding the max-margin plane that separates the good words from the bad words and looking for hint words near it. But this post is already long, so this will come in part 3.

The whole code I have so far can be seen on Github.