machine learning

How to ML - Models

Roland Szabo

18 Jan 2021

So we finally got our data and we can get to machine learning. Without the data, there is no machine learning; there is at best human learning, where somebody writes an algorithm by hand to do the task.

This is the part that most people who want to do machine learning are excited about. I read Bishop's and Murphy's textbooks, watched Andrew Ng's online course about ML and learned about different kinds of ML algorithms. I couldn't wait to try them out and see which one was the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, Adam, convolutions and attention mechanisms, and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For most production projects, if it's not in one of the sklearn, TensorFlow or PyTorch libraries, it won't fly. For proof-of-concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain: you have to hunt down all the dependencies of undocumented code and then make it work.
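To give an idea of how little code "trying out different algorithms" takes when you stick to an existing library, here is a minimal sketch with sklearn. The synthetic dataset from `make_classification`, the model settings and the choice of accuracy as the metric are all assumptions made for illustration, not something from a real project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models, from the simple baseline to something more complex.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Swapping in another model is one more entry in the `models` dict, which is why this part ends up being such a small slice of the job.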

As for hyperparameter tuning, there are libraries to help you with that. And anyway, for any real-life dataset, the time it takes to finish the training runs is much larger than the time you spend coding them up.
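As a rough sketch of what such a library does for you, here is a grid search using sklearn's `GridSearchCV`. The toy dataset and the tiny parameter grid are assumptions chosen so the example runs in seconds; on a real dataset the same few lines can keep a machine busy for hours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Deliberately tiny grid: 2 x 2 combinations, each evaluated with 3-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Even this grid means 12 training runs (4 combinations times 3 folds), plus a final refit on the full data, so the cost grows quickly once the grid or the dataset does.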

And in practice, you run into many issues with the data. You'll find that some of the columns in the data have lots of missing values. Or datapoints that come from different sources have different meanings for the same columns. You'll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.
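A few of these checks are cheap to automate. The sketch below uses plain Python on a handful of made-up records; the column names, the `VALID_LABELS` set and the helper functions are all hypothetical, and in a real pipeline you would probably do the same thing with pandas on the full dataset.

```python
from collections import defaultdict

# Hypothetical toy records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": None, "label": "spam"},
    {"id": 2, "age": None, "income": None, "label": "ham"},
    {"id": 1, "age": 34, "income": None, "label": "ham"},      # same id, different label
    {"id": 3, "age": 51, "income": 4200, "label": "unknown"},  # label outside the valid set
]
VALID_LABELS = {"spam", "ham"}  # assumed label vocabulary


def missing_fraction(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}


def conflicting_ids(rows):
    """Return ids that appear with more than one distinct label."""
    labels_by_id = defaultdict(set)
    for r in rows:
        labels_by_id[r["id"]].add(r["label"])
    return sorted(i for i, labels in labels_by_id.items() if len(labels) > 1)


def invalid_labels(rows):
    """Return ids of rows whose label is not in VALID_LABELS."""
    return [r["id"] for r in rows if r["label"] not in VALID_LABELS]


missing = missing_fraction(rows)
conflicts = conflicting_ids(rows)
invalid = invalid_labels(rows)

print(missing)
print(conflicts)
print(invalid)
```

Checks like these only tell you that something is wrong. Finding out why usually still means digging through the data pipeline.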

If you do get a model that is good enough, it's time to deploy it, which comes with its own fun...
