machine learning

How to ML - Deploying

Roland Szabo

20 Jan 2021 — 2 min read

So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the model runs reliably in production. To do that, we have to answer a few more questions and make some trade-offs.

How important is latency? Is the model making an inference in response to a user action, so that it's crucial to have the answer in tens of milliseconds? Then it's time to optimize the model: quantize the weights, distill its knowledge into a smaller model, prune weights and so on. Hopefully, your metrics won't go down due to the optimization.
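To make the quantization idea a bit more concrete, here is a minimal sketch in plain Python of symmetric int8 quantization of a list of weights. The function names and the sample weights are made up for illustration; real frameworks ship their own quantization tooling, which usually does much more than this (per-channel scales, calibration and so on).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]


# Hypothetical weights, just to have something to quantize.
weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error. Whether that error hurts your metrics is something you can only find out by re-running the evaluation.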

Can the results be precomputed? For example, if you want to make movie recommendations, maybe a batch job can run every night, do the inference for every user and store the results in a database. Then, when a user makes a request, the results are simply loaded from the database. This is possible only if you have a finite range of predictions to make.
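A rough sketch of the precompute-then-look-up pattern, using SQLite from the standard library. Everything here is an assumption made for illustration: `recommend` is a stand-in for the real (slow) model call, and the table layout and the in-memory database are just the simplest thing that works.

```python
import sqlite3


def recommend(user_id):
    """Hypothetical stand-in for an expensive model inference call."""
    return [f"movie-{(user_id * 7 + i) % 100}" for i in range(3)]


def run_nightly_batch(conn, user_ids):
    """Precompute recommendations for every known user and store them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id INTEGER PRIMARY KEY, movies TEXT NOT NULL)"
    )
    rows = [(uid, ",".join(recommend(uid))) for uid in user_ids]
    # INSERT OR REPLACE overwrites the previous night's results for the same user.
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations (user_id, movies) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def get_recommendations(conn, user_id):
    """Request-time path: a cheap lookup instead of running the model."""
    row = conn.execute(
        "SELECT movies FROM recommendations WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0].split(",") if row else None


conn = sqlite3.connect(":memory:")
run_nightly_batch(conn, user_ids=range(1, 1001))
print(get_recommendations(conn, 42))
```

Note that a user who signed up after the last batch run gets `None` back, so you still need a fallback for them, such as a default list or a slower live inference.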

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don't even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.
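A quick back-of-the-envelope check for the "fit in memory" part can be sketched like this. The parameter count and the memory budget are invented numbers, and the estimate covers only the weights: activations, the runtime and everything else on the device come on top.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(num_params, dtype):
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def fits(num_params, dtype, budget_mb):
    """True if the weights alone fit in the given memory budget."""
    return model_size_mb(num_params, dtype) <= budget_mb


num_params = 25_000_000  # hypothetical model size
budget_mb = 50           # hypothetical memory budget on the target device

for dtype in BYTES_PER_PARAM:
    size = model_size_mb(num_params, dtype)
    print(f"{dtype}: {size:.0f} MB, fits: {fits(num_params, dtype, budget_mb)}")
```

With these made-up numbers, the float32 model is twice the budget, which is the kind of result that pushes you towards a smaller format or a smaller model.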

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. "Works on my machine" is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library, and the security team won't let you update things. Simple, just use Docker, right? Right, but you'd better hope you are good friends with the DevOps team, so they can help you set up the CI/CD pipelines.
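One cheap way to catch "works on my machine" problems early is to record what each environment actually looks like and compare the two. The sketch below uses only the standard library. The package names passed in are just examples, the function names are made up, and it does not detect which BLAS library is linked. It is no replacement for Docker, only a way to see the drift.

```python
import json
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect the versions that commonly differ between dev and prod."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


def flatten(fingerprint):
    """Turn the nested fingerprint into a flat {key: value} dict."""
    flat = {k: v for k, v in fingerprint.items() if k != "packages"}
    flat.update({f"pkg:{n}": v for n, v in fingerprint["packages"].items()})
    return flat


def diff_fingerprints(dev, prod):
    """Return {key: (dev_value, prod_value)} for every entry that differs."""
    dev_flat, prod_flat = flatten(dev), flatten(prod)
    keys = sorted(set(dev_flat) | set(prod_flat))
    return {
        k: (dev_flat.get(k), prod_flat.get(k))
        for k in keys
        if dev_flat.get(k) != prod_flat.get(k)
    }


# Example package names; list whatever your model actually depends on.
local = environment_fingerprint(["numpy", "scipy"])
print(json.dumps(local, indent=2))
```

Run it in both places, save the JSON, and feed the two dicts to `diff_fingerprints` to see what differs before the model does something strange in production.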

But once you've killed all the dragons, it's time to keep watch... aka monitoring the model's performance in production.
