rolisz's blog - Technical Posts

Using custom SSH keys with Git

Roland Szabo — Mon, 25 Sep 2023 17:59:02 +0300

As a freelancer, I work for many clients, who have their source code in many different places, often self-hosted. I generally create new SSH keys for each separate platform. Unfortunately, Git doesn't provide an option for which key to use, so you have to configure this in the ~/.ssh/config file:

Host private.git.server.com
    PreferredAuthentications publickey
    IdentityFile ~/.ssh/key_file_for_this

Per folder custom titles for Jupyter notebooks

Roland Szabo — Wed, 05 Jul 2023 17:39:12 +0300

I work on many different projects and I use ActivityWatch for automatic time tracking. AW has a feature where you can use regexes to match window titles and assign them automatically to different projects. But, unfortunately, Jupyter Notebooks have a "{filename} - Jupyter Notebook" title in the browser, so it's hard to match on them. I'd rather not name all the notebooks with a project prefix, so I went looking for a different solution.

I quickly found a way to customize the title shown in the browser, but not in a dynamic way (aka dependent on the folder I'm in).

To get dynamic custom titles, I had to first update in the config file used by Jupyter (~/.jupyter/jupyter_notebook_config.py) the following two values:

c.NotebookApp.extra_template_paths = ["~/.jupyter/templates/"]

import os
from pathlib import Path

cur_dir = os.getcwd()

projects = {
        'folder_name1': 'Project 1',
        'folder_name2': "Project 2",
}

cur_env = Path(cur_dir).name
for p in projects:
    if p in cur_dir:
        cur_env = projects[p]
    
c.NotebookApp.jinja_template_vars = {"env_name": cur_env}

The first line adds a new path where Jupyter looks for HTML templates. Then I get the current working directory. I look for some folder names in its path and if one matches, I use the corresponding project name; otherwise I use the name of the current working directory as the project name. On the last line I inject the env_name variable into every Jinja2 template used by Jupyter.
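The lookup logic can be pulled into a small function to try it outside of Jupyter. This is only a sketch: project_name_for is a hypothetical helper name (not part of Jupyter), and the folder names are the same placeholders as in the config.

```python
from pathlib import PurePosixPath


def project_name_for(cur_dir, projects):
    """Return the project name for a directory.

    Falls back to the directory's own name when no key matches.
    """
    name = PurePosixPath(cur_dir).name
    for folder, project in projects.items():
        # Same substring test as in the config: the key can match
        # anywhere in the path, not only the last component.
        if folder in cur_dir:
            name = project
    return name


projects = {"folder_name1": "Project 1", "folder_name2": "Project 2"}
print(project_name_for("/home/me/folder_name1/notebooks", projects))  # Project 1
print(project_name_for("/home/me/scratch", projects))                 # scratch
```

Because it's a substring match, starting Jupyter from any subfolder of a project still picks up the project name.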

Then I copied into ~/.jupyter/templates/page.html the template file from the notebook project and at the end of the last script block, I added the following:

{% if env_name %}
window.title += " - {{ env_name }}";
document.__defineSetter__('title', function(val) {
    document.querySelector('title').childNodes[0].nodeValue = val + " - {{ env_name }}";
});
{% endif %}

First I check if env_name is set. If it is, I add Javascript code which will add the value of it to the window title, and also will update the window title whenever it changes (such as when you rename the file).

This is a bit hackish and when the notebook templates update, I should update my own copy as well. Luckily, it doesn't change too often, there being only 5 commits since 2020.

Telegram notifications from Jupyter Notebooks

Roland Szabo — Tue, 23 May 2023 11:41:51 +0300

When running long-running code in Jupyter, I want to get notified when it finishes so that I can get back to it. There is an extension to do that with browser notifications, but there are times when I leave the computer while waiting for an ML training run to finish.

For long-running CLI commands there is ntfy, a command line tool that allows you to send notifications through a lot of channels.

So I hacked the two together to get some code that automatically messages me on Telegram when a cell finishes more than 60 seconds after it started. This extension is registered automatically on the startup of any IPython and Jupyter notebook (even if they are installed in random virtual environments). Why Telegram? Because I already have it installed and it seemed like the easiest integration to set up.

The code has to be placed in ~\.ipython\profile_default\startup\01notification.py. You can place multiple files in this folder and they are loaded in lexicographic order, so you should prepend a number if you care about order. First, a couple of magic imports:

import time
import subprocess

from IPython.core.getipython import get_ipython
from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring
from IPython.core.magic import (register_line_magic)

To send the notification using ntfy, I'm simply calling the program using subprocess. The path resolution used by subprocess is not clear to me, so I had to use the full path to the executable.

def display_notification(message):
    subprocess.run([r"C:\Users\Roland\AppData\Local\Programs\Python\Python310\Scripts\ntfy.exe", "-b", "telegram", "send", message])
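One way to avoid hardcoding the path might be to ask Python where the executable lives. shutil.which searches the directories listed in PATH and returns None when it finds nothing. This is a sketch, assuming ntfy is on the PATH of the process that starts IPython; find_executable is a hypothetical helper and the fallback path is a placeholder.

```python
import shutil


def find_executable(name, fallback=None):
    """Return the full path to `name` if it is on PATH, else `fallback`."""
    path = shutil.which(name)
    return path if path is not None else fallback


# Hypothetical usage: fall back to a hardcoded path if the lookup fails.
ntfy_path = find_executable("ntfy", fallback=r"C:\path\to\ntfy.exe")
print(ntfy_path)
```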

Then we define some variables, one for the threshold for notification and the other one for remembering the start of the execution:

autonotify_after = 60
run_start_time = None

And now we define the magic function (that's what these things are called in IPython). It has two arguments, one to override the duration of the threshold for notifications and the other for the default message. I copy pasted the decorators as you see them. After parsing the arguments (which come as a string), we register two event handlers: one to run before a cell is executed and one after a cell is executed.

@magic_arguments()
@argument(
    "-a", "--after", default=None, type=int,
    help="Send notification if cell execution is longer than x seconds"
)
@argument(
    "-m",
    "--message",
    default="Cell Execution Has Finished!",
    help="Custom notification message"
)
@register_line_magic
def autonotify(line):
    # Record options
    args = parse_argstring(autonotify, line)
    message = args.message.lstrip("\'\"").rstrip("\'\"")
    if args.after:
        global autonotify_after
        autonotify_after = args.after
    ### Register events
    ip = get_ipython()

    # Register new events
    ip.events.register('pre_run_cell', pre_run_cell)
    # Accept (and ignore) whatever arguments IPython passes to the callback
    ip.events.register('post_run_cell', lambda *args: post_run_cell(message))

The handler to run before a cell is simple: we just record the start time of the run.

def pre_run_cell(*args):
    global run_start_time
    run_start_time = time.time()

The second handler is slightly more complex. We look at the output of the last cell and append it to the message if it's not an "empty" value. We check how much time has elapsed to know whether to show a notification or not:

def post_run_cell(message):
    # Set last output as notification message
    last_output = get_ipython().user_global_ns['_']
    # Don't use output if it's None or empty (but still allow False, 0, etc.)
    try:
        if last_output is not None and len(str(last_output)):
            message = message + "\n" + str(last_output)
    except ValueError:
        pass # can't convert to string. Use default message
    
    # Check autonotify options and perform checks
    if not check_after(): 
        return
    display_notification(message)


def check_after():
    # Check if the time elapsed is over the specified time.
    now, start = time.time(), run_start_time
    return autonotify_after >= 0 and start and (now - start) >= autonotify_after
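The timing check is easier to reason about (and to test) as a pure function that takes the timestamps as arguments instead of reading globals. Here is a sketch that mirrors the logic of check_after; should_notify is a hypothetical name.

```python
def should_notify(start, now, threshold):
    """True when a run that began at `start` has lasted at least `threshold` seconds."""
    if start is None or threshold < 0:
        # No recorded start time, or notifications disabled with a negative threshold.
        return False
    return (now - start) >= threshold


print(should_notify(100.0, 161.0, 60))  # True: 61 seconds elapsed
print(should_notify(100.0, 130.0, 60))  # False: only 30 seconds elapsed
```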

The last piece of magic is to run this function. The other blog post I was inspired by said you should delete the function for things to work properly, so I did that as well:

ipython = get_ipython()
ipython.run_line_magic('autonotify', '')
del autonotify

And voila, now you will get Telegram notifications automatically when your model finishes training! Setting up a Telegram bot is left as an exercise for the reader.

TIL: Recreating tmux socket

Roland Szabo — Fri, 28 Apr 2023 11:38:00 +0300

I use tmux as a multiplexer and for running some long-running commands on servers. Today I encountered a weird issue: when trying to attach to an existing tmux session (or when trying to do anything with tmux), I would get the following error:

can't create socket: Permission denied

A quick search on Kagi revealed that it might be a permissions issue. I tried resetting the permissions of the /tmp/tmux-* folders, but it didn't help.

What did work however was recreating the socket the tmux server uses to communicate with the tmux client. To do that, you have to run the following command:

killall -s SIGUSR1 tmux

And then the tmux attach worked perfectly, showing me all the windows from my old session.

I have no idea why this happened though.

Making a loudness monitor for online meetings

Roland Szabo — Thu, 02 Feb 2023 18:43:56 +0300

As I work from home 90% of the time, I run into a small issue during meetings: I sometimes speak too loudly. Before my daughter Gloria arrived, this was something that annoyed my wife and others in the house, but now, when Gloria is sleeping, this is not just an annoyance, it's a BIG problem, because nobody wants to wake up a toddler.

While I do have monitoring that alerts me via Signal that I'm speaking too loud (my wife), I wanted to write a program to do that all the time, in the spirit of using programming to make my life nicer.

So I started to look for a Python library that can give me information about the sound level from my microphone. A quick Kagi search revealed several options, but sounddevice seemed like the best one.

The first step is to identify the microphone. For that I need the name as it's known to the operating system. I can get that by running the following code in a Python console:

> import sounddevice as sd
> sd.query_devices()
0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
1 Microphone (Yeti Stereo Microph, MME (2 in, 0 out)
2 Microphone (WO Mic Device), MME (2 in, 0 out)
....
> sd.query_devices()[1]['name']
'Microphone (Yeti Stereo Microph'

I get a long list of stuff, but I see something with Yeti in the name so I grab that one.

Now let's start listening to the microphone. Sounddevice offers a callback based API, where it passes along the raw audio data received from the microphone. From that, I estimate the loudness by calculating the norm of the sound:

import numpy as np
import sounddevice as sd


def print_sound(indata, frames, t, status):
    volume = np.linalg.norm(indata) * 10
    print(volume)


name = 'Microphone (Yeti Stereo Microph'
with sd.InputStream(device=name,callback=print_sound):
    for i in range(5):
        sd.sleep(1000)

Running this gives something as follows. Can you guess where I snapped my fingers?

0.3724626451730728
0.6015866994857788
0.9348087012767792
0.7427176833152771
0.8615989238023758
0.7162655889987946
0.5638395622372627
0.7117109000682831
59.17434215545654
50.70761203765869
20.951063632965088
14.069621562957764
9.29598331451416
5.908793210983276
3.782018721103668
2.402055263519287
1.7902085185050964
1.1522774398326874
0.793280228972435

The next step is to make it warn me when I speak too loud. For this I keep a buffer of the latest sound intensities in order to be able to detect when either something loud has been happening for a long time or if a really loud noise happened in the last frames:

import time
from collections import deque

import numpy as np
import sounddevice as sd


last_alert = time.time() - 10
q = deque(maxlen=200)


def print_sound(indata, frames, t, status):
    global last_alert
    volume_norm = np.linalg.norm(indata) * 10
    q.append(volume_norm)
    last_elements = [q[i] for i in range(-min(50, len(q)), 0)]
    recent_avg_sound = sum(last_elements) / len(last_elements)
    num_high_count = len([x for x in q if x > 20])
    if num_high_count > 30 or recent_avg_sound > 50:
        if time.time() - last_alert > 10:
            print(f"You are speaking at {volume_norm:.2f}. Think of Gloria!\a")
            last_alert = time.time()


name = 'Microphone (Yeti Stereo Microph'
with sd.InputStream(device=name,callback=print_sound):
    while True:
        sd.sleep(1000)

Now, when running from a Terminal (either on Windows or Linux), this will make a bell sound if in the last 5 seconds (that's about 200 frames) there have been more than 30 frames with loudness over 20 or if in the last second the average was over 50 (this would mean a really loud sound).
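The two alert conditions can be tried out without a microphone by moving them into a function that only looks at the history buffer. This is a sketch with a hypothetical function name and made-up loudness values; the default thresholds are the ones from the callback above.

```python
from collections import deque


def should_alert(history, high_level=20, high_count=30,
                 recent_frames=50, recent_avg_limit=50):
    """Decide whether the loudness history warrants an alert."""
    if not history:
        return False
    last = list(history)[-recent_frames:]
    recent_avg = sum(last) / len(last)
    num_high = sum(1 for x in history if x > high_level)
    # Either many moderately loud frames, or a very loud recent average.
    return num_high > high_count or recent_avg > recent_avg_limit


history = deque(maxlen=200)
history.extend([1.0] * 200)      # a quiet stretch
print(should_alert(history))     # False
history.extend([25.0] * 31)      # 31 moderately loud frames
print(should_alert(history))     # True
```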

If you want to run this outside of Terminal, you can use beepy for example to make sounds and replace the print statement with this:

from beepy import beep

beep(sound="error")

To run this on startup on Windows, I created the following ps1 script in the startup folder:

C:\Users\Roland\Programming\mic_check\.venv\Scripts\pythonw.exe C:\Users\Roland\Programming\mic_check\mic_check.py

Another improvement would be to make it easier to see current loudness and to be able to quit it easily (because the ps1 script runs in the background). For this I used the infi.systray library on Windows:

from infi.systray import SysTrayIcon

def quit(systray):
    global still_on
    still_on = False


menu_options = ()
systray = SysTrayIcon("icon.ico", "Mic check tray icon", menu_options, on_quit=quit)
systray.start()
still_on = True
name = 'Microphone (Yeti Stereo Microph'
with sd.InputStream(device=name,callback=print_sound):
    while still_on:
        sd.sleep(1000)

And now, hopefully I'll learn to control my loudness better!

You can find the full code here.

Facilitating Code Retreat in Oradea

Roland Szabo — Sun, 27 Feb 2022 17:47:00 +0300

After 8 years, I finally went again to a Code Retreat, but this time as a facilitator, not as a participant. As far as I know, it was the first time it was done in Oradea, so kudos to the Oradea Tech Hub team for organizing it.

My co-facilitator was Bogdan Bota, who is the co-founder of OptiOffer, with whom I share a surprising number of views on technology and programming. One of these common things is that neither of us is a big fan of OOP and TDD, so we didn't push that angle too much.

One surprise for me was that the mix of languages that people knew has changed a lot. Java and C# were much rarer, while Javascript was ubiquitous. There were a couple of high school students, who were working with C++ (because that's what's taught in Romania in programming classes).

The participants enjoyed the different challenges we gave them, even if some of them were annoying (but realistic), such as changing requirements or partners mid-session. Bogdi and I had the most fun though, guiding them through this.

We got really good feedback: people had fun and said they learned a lot, and almost everyone was asking when the next event is going to be. Naomi, I hope you'll organize many more great events!

Opening Jupyter Notebooks in the right browser from WSL

Roland Szabo — Thu, 23 Sep 2021 22:10:57 +0300

I mentioned last year that I've slowly moved back to using Windows more and more. In the meantime, my transition is almost completely done. This year I've booted into Arch Linux pretty much only to update it, about once a month, and that's it. I am otherwise very happy with WSL1 for when I need to run Linux-only tools, such as auto-sklearn.

There was one small hiccup: when opening a Jupyter Notebook from WSL, it would try to open the notebooks in the Linux environment, which is a CLI environment, so it opened them in Lynx, not in the Firefox instance that runs on the Windows side of things. While Lynx is cute, it's not the most useful interface for a Jupyter Notebook.

Jupyter Notebook opening in Lynx

I could quit Lynx by pressing q and then CTRL-click on the link shown in the terminal, and Jupyter would open in Firefox. But hey, I'm a programmer and I don't want to do extra clicks. Today I learned how to fix this problem.

First, we need to tell WSL to use the browser from Windows. This can be done by setting the BROWSER environment variable to point to the location of Firefox in Windows, but with the path as seen by WSL:

 export BROWSER=/mnt/c/Program\ Files/Mozilla\ Firefox/firefox.exe

Running jupyter notebook after this will correctly open a window in Firefox, but it will open it with a Linux path towards a redirect file that does the authentication for Jupyter. Because Firefox runs in Windows, it can't access the path on the Linux side.

But there is a way to tell Jupyter to open the normal localhost links, not the ones that point to a local redirect file. For this, you have to create a Jupyter config (unless you already have one):

> jupyter notebook --generate-config
Writing default config to: /home/rolisz/.jupyter/jupyter_notebook_config.py

Then edit this file and change the use_redirect_file parameter to False (and uncomment it if needed):

c.NotebookApp.use_redirect_file = False

From now on, running jupyter notebook in WSL will open properly!

Working across multiple machines

Roland Szabo — Wed, 19 May 2021 13:00:11 +0300

Until this year, I usually had a laptop from my employer, on which I did work stuff, and I had a personal desktop and laptop. The two personal devices got far too little usage coding-wise, so I didn't really have a need to make sure I had access to the same files in both places.

But since becoming self-employed at the beginning of this year, I find myself using both the desktop and the laptop a lot more and I need to sync files between them. I go to work from a co-working space 2-3 days a week. Sometimes I go to have a meeting with a client at their office. My desktop has a GPU and is much more powerful, so when at home I strongly prefer to work from it, instead of from a laptop that gets thermal throttling pretty fast.

I could transfer code using GitHub, but I'd rather not have to do a WIP commit every time I get up from the desk. I also need to sync things like business files (PDFs) and machine learning models. The most common solution for this is to use Dropbox, OneDrive or something similar, but I would like to avoid sending all my files to a centralized service run by a big company.

Trying Syncthing again

I've tried using Syncthing in the past for backups, but it didn't work out at the time. Probably because it's not meant for backups. But it is meant for syncing files between devices!

I've been using Syncthing for this purpose for 3 months now and it just works™️. It does NAT punching really well and syncing is super speedy. I've had problems with files not showing up right away on my laptop only once and I'm pretty sure it was because my laptop's Wifi sometimes acts weird.

My setup

I have three devices talking to each other on Syncthing: my desktop, my laptop and my NAS. The NAS is there to be the always-on replica of my data and it makes it easier to back things up. The desktop has the address of the NAS hardcoded because they are in the same LAN, but all the other devices use dynamic IP discovery to talk to each other.

I have several folders set up for syncing. Some of them go to all three devices, some of them are only between the desktop and the NAS.

For the programming folders I use ignore patterns generously: I don't sync virtual env folders or node_modules folders, because they usually don't play nice if they end up on a different device with different paths (or worse, different OS). Because of this, I set up my environment on each device separately and I only sync requirements.txt and then run pip install -r requirements.txt.

What do you use for synchronizing your workspace across devices? Do you have anything better than Syncthing?

How to ML - Monitoring

Roland Szabo — Fri, 22 Jan 2021 23:29:24 +0300

As much as machine learning developers like to think that once they've got a good enough model, the job is done, it's not quite so.

The first couple of weeks after deployment are critical. Is the model really as good as the offline tests said it is? Maybe something is different in production than in all your test data. Maybe the data you collected for offline predictions includes pieces of data that are not available at inference time. For example, if trying to predict click-through rates for items in a list and use that to rank the items, when building the training dataset it's easy to include the rank of the item in the data, but the model won't have that when making predictions, because it's what you're trying to infer. Surprise, the model will perform very poorly in production.
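One cheap guard against this kind of leak is to compare the columns used in training with the ones that are actually available at inference time. A minimal sketch; the function and column names are hypothetical.

```python
def leaked_features(training_columns, inference_columns):
    """Return training features that will not exist at prediction time."""
    return sorted(set(training_columns) - set(inference_columns))


training_columns = ["item_id", "price", "category", "rank"]
inference_columns = ["item_id", "price", "category"]

print(leaked_features(training_columns, inference_columns))  # ['rank']
```

Running a check like this before training won't catch subtler leaks (such as a feature computed from the label), but it does catch the "rank" case described above.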

Or maybe simply A/B testing reveals that the fancy ML model doesn't really perform better in production than the old rules written with lots of elbow grease by lots of developers and business analysts, using lots of domain knowledge and years of experience.

But even if the model does well at the beginning, will it continue to do so? Maybe there will be an external change in user behavior and they will start searching for other kinds of queries, which your model was not developed for. Or maybe your model will introduce a "positive" feedback loop: it suggests some items, users click on them, so those items get suggested more often, so more users click on them. This leads to a "rich get richer" kind of situation, but the algorithm is actually not making better and better suggestions.

Maybe you are on top of this and you keep retraining your model weekly to keep it in step with user behavior. But then you need to have a staggered release of the model, to make sure that the new one is really performing better across all relevant dimensions. Is inference speed still good enough? Are predictions relatively stable, meaning we don't recommend only action movies one week and then only comedies next week? Are models even comparable from one week to another or is there a significant random component to them which makes it really hard to see how they improved? For example, how are the clusters from the user post data built up? K-means starts with random centroids and clusters from one run have only passing similarity to the ones from another run. How will you deal with that?

GPT-3 and AGI

Roland Szabo — Thu, 21 Jan 2021 23:13:00 +0300

One of the most impressive/controversial papers from 2020 was GPT-3 from OpenAI. It's nothing particularly new, it's mostly a bigger version of GPT-2, which came out in 2019. It's a much bigger version, being by far the largest machine learning model at the time it was released, with 175 billion parameters.

It's a fairly simple algorithm: it's learning to predict the next word in a text [1]. It learns to do this by training on several hundred gigabytes of text gathered from the Internet. Then to use it, you give it a prompt (a starting sequence of words) and then it will start generating more words and eventually it will decide to finish the text by emitting a stop token.
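The generation loop can be illustrated with a toy model. The sketch below is not GPT-3: it just counts which word follows which in a tiny made-up corpus and always picks the most frequent successor, whereas GPT-3 uses a large neural network over tokens. The loop has the same shape though: prompt in, one word at a time, stop at a stop token.

```python
from collections import Counter, defaultdict

STOP = "<stop>"


def train(sentences):
    """Count, for every word, which words follow it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split() + [STOP]
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, prompt, max_words=20):
    """Extend the prompt greedily until the stop token or the word limit."""
    words = prompt.split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # never saw this word during training
        next_word = candidates.most_common(1)[0][0]
        if next_word == STOP:
            break
        words.append(next_word)
    return " ".join(words)


corpus = ["the cat sat down", "the cat sat down", "a dog sat up"]
model = train(corpus)
print(generate(model, "the"))    # the cat sat down
print(generate(model, "a dog"))  # a dog sat down
```

The second output shows the "parroting" in miniature: the model has only seen the dog sit up, but "down" follows "sat" more often overall, so that's what it says.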

Using this seemingly stupid approach, GPT-3 is capable of generating a wide variety of interesting texts: it can write poems (not prize-winning, but still), write news articles, imitate other well-known authors, make jokes, argue for its self-awareness, do basic math and, shockingly to programmers all over the world, who are now afraid the robots will take their jobs, it can code simple programs.

That's amazing for such a simple approach. The internet was divided upon seeing these results. Some were welcoming our GPT-3 AI overlords, while others were skeptical, calling it just fancy parroting, without a real understanding of what it says.

I think both sides have a grain of truth. On one hand, it's easy to find failure cases, make it say things like "a horse has five legs" and so on, where it shows it doesn't really know what a horse is. But are humans that different? Think of a small child who is being taught by his parents to say "Please" before his requests. I remember being amused by a small child saying "But I said please" when he was refused by his parents. The kid probably thought that "Please" is a magic word that can unlock anything. Well, not really, in real life we just use it because society likes polite people, but saying please when wishing for a unicorn won't make it any more likely to happen.

And it's not just little humans who do that. Sometimes even grownups parrot stuff without thinking about it, because that's what they heard all their life and they never questioned it. It actually takes a lot of effort to think, to ensure consistency in your thoughts and to produce novel ideas. In this sense, expecting an artificial intelligence that is around human level might be a disappointment.

On the other hand, I believe there is a reason why this amazing result happened in the field of natural language processing and not say, computer vision. It has been long recognized that language is a powerful tool, there is even a saying about it: "The pen is mightier than the sword". Human language is so powerful that we can encode everything that there is in this universe into it, and then some (think of all the sci-fi and fantasy books). More than that, we use language to get others to do our bidding, to motivate them, to cooperate with them and to change their inner state, making them happy or inciting them to anger.

While there is a common ground in the physical world, oftentimes that is not very relevant to the point we are making: "A rose by any other name would smell as sweet". Does it matter what a rose is when the rallying call is to get more roses? As long as the message gets across and is understood in the same way by all listeners, no, it doesn’t. Similarly, if GPTx can effect the desired change in its readers, it might be good enough, even if it doesn't have a mythical understanding of what those words mean.

[1] Technically, the next byte pair encoded token.

How to ML - Deploying

Roland Szabo — Wed, 20 Jan 2021 18:28:54 +0300

So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the models run reliably in production. We have to answer some more questions, in order to make some trade offs.

How important is latency? Is the model making an inference in response to a user action, so it's crucial to have the answer in tens of milliseconds? Then it's time to optimize the model: quantize weights, distill knowledge to a smaller model, weight pruning and so on. Hopefully, your metrics won't go down due to the optimization.
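To give an idea of what the first of these techniques does, here is a minimal sketch of symmetric 8-bit weight quantization on a plain list of floats. Real frameworks do this per tensor or per channel with their own tooling; the function names and the weights are made up for the example, and the list is assumed to be non-empty.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers using one shared scale factor."""
    max_int = 2 ** (bits - 1) - 1            # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / max_int if largest else 1.0
    quantized = [max(-max_int, min(max_int, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.31, 0.05, -1.27, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)
# Largest rounding error introduced by the round trip
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is off by at most half a quantization step, which is why the metrics usually drop a little rather than collapse.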

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night, does the inference for every user and stores the results in a database. Then when the user makes a request, the results are simply loaded from the database. This is possible only if you have a finite range of predictions to make.
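The precompute-then-lookup pattern could look roughly like this. It's a sketch using an in-memory SQLite database; recommend is a stand-in for the real (expensive) model call, and the table and function names are hypothetical.

```python
import sqlite3


def recommend(user_id):
    """Stand-in for an expensive model call (hypothetical)."""
    return f"movie-{user_id % 3}"


def nightly_batch(conn, user_ids):
    """Run inference for every known user and store the results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations "
        "(user_id INTEGER PRIMARY KEY, movie TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations VALUES (?, ?)",
        [(uid, recommend(uid)) for uid in user_ids],
    )
    conn.commit()


def lookup(conn, user_id):
    """At request time, read the stored answer instead of running the model."""
    row = conn.execute(
        "SELECT movie FROM recommendations WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
nightly_batch(conn, range(5))
print(lookup(conn, 4))   # movie-1
print(lookup(conn, 99))  # None: this user was not in the batch
```

The second lookup shows the limitation: a user who signed up after the batch ran has no stored prediction, so you still need a fallback for them.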

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don't even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. "Works on my machine" is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library and the security team won't let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team to help you out with setting up the CI/CD pipelines.

But you've killed all the dragons, now it's time to keep watch... aka monitoring the model's performance in production.

How to ML - Models

Roland Szabo — Mon, 18 Jan 2021 22:55:44 +0300

So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who want to do machine learning are excited about. I read Bishop's and Murphy's textbooks, watched Andrew Ng's online course about ML and learned about different kinds of ML algorithms and I couldn't wait to try them out and to see which one is the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, ADAM, convolutions, attention mechanism and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For most production projects, if it's not in one of the sklearn, Tensorflow or Pytorch libraries, it won't fly. For proof of concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain, trying to find all the dependencies of undocumented code and to make it work.

For the hyperparameter tuning, there are libraries to help you with that, and anyway, the time it takes to finish the training runs is much larger than the time you spend coding it up, for any real life datasets.

And in practice, you run into many issues with the data. You'll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You'll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.

If you do get a model that is good enough, it's time to deploy it, which comes with its own fun...

How to ML - Data

Roland Szabo — Tue, 29 Dec 2020 21:22:09 +0300

So we've decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment and there are service logs saying what was replaced for each piece of equipment. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifier that predicts if a piece of equipment will fail based on the last 10 values measured by its sensors. In practice, we'll find that sensors of the same type that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so that will have to be standardized as well. Or worse, the sensor data is good, but it's kept only for 1 month to save on storage costs, so we have to fix that and wait a couple of months for more training data to accumulate.

The next best case is that we don't have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought, for example, if we want to start a new ad network, there are plenty of datasets available online of personal data about everyone, which can be used then to predict the likelihood of clicking on an ad. That's the business model of many startups...

The worst case is that we don't have data and we can't find it out there. Maybe it's because we have a very specific niche, such as wanting to find defects in the manufacturing process of our specific widgets, so we can't use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming off the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell if a widget is good or defective. There needs to be a QA process in this, because even humans have an error rate, so each image will have to be labeled by at least three humans. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as Amazon Mechanical Turk to distribute the tasks to many workers across the world.
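Combining the three (or more) labels per image is usually some form of majority vote. A minimal sketch, with a hypothetical function name and made-up labels; images without a strict majority come back as None so they can be sent for another round of labeling.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, or None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


print(majority_label(["good", "good", "defective"]))    # good
print(majority_label(["good", "defective", "unsure"]))  # None
```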

Once all this is done, we finally have data. Time to start doing the actual ML...

How to ML - Metrics

Roland Szabo — Mon, 28 Dec 2020 21:19:07 +0300

We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we're doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That's a regression problem. Are we trying to predict who will win the election? That's a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That's a multi-class classification problem.

Another question that has to be answered is what kind of mistakes are worse. Machine learning is not all-knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better that we cry wolf too often and have false positives, rather than miss a tumor. Or maybe it's the opposite: we are trying to implement a facial recognition system. If the system recognizes a burglar incorrectly, then the wrong person will get sent to jail, which is a very bad consequence for a mistake made by "THE algorithm".
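This trade-off usually shows up as a decision threshold on the model's score. Here is a small sketch with made-up scores and labels (not from any real model) that counts each kind of mistake at two thresholds; `confusion_counts` is a hypothetical helper written just for this illustration.

```python
from typing import List, Tuple


def confusion_counts(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[int, int, int, int]:
    """Count (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for score, actual in zip(scores, labels):
        predicted = score >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented model scores and the true answers (True = tumor present).
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [False, False, True, True, False, True]

# Low threshold: no tumor is missed, but there are two false alarms.
cautious = confusion_counts(scores, labels, threshold=0.3)
# High threshold: no false alarms, but one tumor is missed.
strict = confusion_counts(scores, labels, threshold=0.7)
```

For the tumor detector we would lean towards the low threshold; for the facial recognition system, towards the high one.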

These are not just theoretical concerns; they matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn't decide by itself what to do; it merely makes a suggestion, which a human then confirms. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn't have to look at 20 screens at once, but can look only at the footage that was flagged as anomalous.

Tomorrow we'll look at the next step: gathering the data.

What is ML? part 3

Roland Szabo — Thu, 24 Dec 2020 18:18:51 +0300

Yesterday we saw that machine learning is behind some successful products and that it has the potential to bring many more changes to our lives.

So what is it?

Well, the textbook definition is that it's the building of algorithms that can perform tasks they were not explicitly programmed to do. In practice, this means that we have algorithms that analyze large quantities of data to learn some patterns in the data, which can then be used to make predictions about new data points.

This is in contrast with the classical way of programming computers, where a programmer would either use their domain knowledge or analyze the data themselves, and then write the program that produces the correct output.

Left: Classical Programming; Right: ML Programming

So one of the crucial distinctions is that in machine learning, the machine has to learn from the data. If a human being figures out the pattern and writes a regular expression to find addresses in text, that's human learning, and we all go to school to do that.
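A toy sketch of the difference, with invented numbers: in the first function a human picked the rule, while the second one finds the rule from labeled examples. The overheating scenario, the function names and the "try every candidate threshold" strategy are all assumptions made for illustration; real ML algorithms are far more sophisticated.

```python
from typing import List, Tuple


def hand_written_rule(temperature: float) -> bool:
    """Classical programming: a human looked at the data and chose 80."""
    return temperature >= 80.0


def learn_threshold(examples: List[Tuple[float, bool]]) -> float:
    """Machine learning in miniature: pick the threshold with the fewest mistakes."""
    candidates = sorted(temperature for temperature, _ in examples)
    best_threshold, best_errors = candidates[0], len(examples) + 1
    for candidate in candidates:
        errors = sum(
            (temperature >= candidate) != overheated
            for temperature, overheated in examples
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Made-up (temperature, overheated?) observations.
examples = [(60, False), (65, False), (70, False), (75, True), (85, True), (90, True)]
learned = learn_threshold(examples)
```

On this data the learned threshold is 75, so the hand-written rule would have missed the 75 degree case that the data clearly marks as overheated.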

Now does that mean that machine learning is a solution for everything? No. In some cases, it's easier or cheaper to have a data analyst or a programmer find the pattern and code it up.

But there are plenty of cases where, despite decades-long efforts of big teams of researchers, humans haven't been able to find an explicit pattern. The simplest example of this would be recognizing dogs in pictures. 99.99% of humans over the age of 5 have no problem recognizing a dog, whether a puppy, a golden retriever or a Saint Bernard, but they have zero insight into how they do it, what makes a bunch of pixels on the screen a dog and not a cat. And this is where machine learning shines: you give it a lot of photos (several thousand at least), pair each photo with a label of what it contains, and the neural network will learn by itself what makes a dog a dog and not a cat.

Machine learning is just one tool at our disposal, among many others. It's a very powerful tool and it's one that gets "sharpened" all the time, with lots of research being done all around the world to find better algorithms, to speed up their training and to make them more accurate.

Come back tomorrow to find out how the sausage is made, at a high level.