<![CDATA[rolisz's blog]]>https://rolisz.ro/https://rolisz.ro/favicon.pngrolisz's bloghttps://rolisz.ro/Ghost 4.4Fri, 09 Jul 2021 15:01:06 GMT60<![CDATA[Half a year as an indie consultant]]>It's hard to believe it's been more than half a year since I started my own company and became an independent machine learning consultant. It's been a very interesting ride.

There have been plenty of moments where the predominant feeling was "what now?

]]>
https://rolisz.ro/2021/07/09/half-a-year-as-an-indie-consultant/60d58d48a0cb673c37e93d2bFri, 09 Jul 2021 14:57:00 GMT

It's hard to believe it's been more than half a year since I started my own company and became an independent machine learning consultant. It's been a very interesting ride.

There have been plenty of moments where the predominant feeling was "what now?". How am I going to find more clients? How do I negotiate with this client? The Dip, as Seth Godin calls it, is very real and very scary. When you tally things up and see how much you've earned over six months... you start getting serious doubts. Was it worth it? Wouldn't it have been better (and much easier) to just find a nice job?

But there are other moments: when I realize I have the freedom to choose my clients and the projects that I work on; after working for a whole day on something that I love, ML, without any useless meetings; when deciding, with almost complete freedom, on the tech stack that will be used to build the ML side of things; when I take a day off almost whenever I want, just because I don't feel like working on that particular project on that particular day. Or when I realize that I am a consultant, that my clients look to me for advice and that they actually take my advice seriously. If I say that the way they did things previously won't work and they should do things differently? They'll get to it right away.

And then there are moments when I realize I barely have time to read any state of the art machine learning papers and instead I have to learn the basics of marketing, branding, business development, communication, coaching, explaining, teaching - and to put all of this into practice. Most of my clients don't care if I'm using the latest state of the art Transformer architecture (and don't even know what on earth that is). They don't even know what machine learning is. But they need someone to explain it to them - to people who have built successful companies in their own fields - and to help them understand if it's something that they need or not.

I am thankful to God for guiding me on this new path, of which I have dreamed for a long time. Faith in his faithfulness is what has kept me steady when my knees wavered.

I am grateful to my dear wife who was willing to take this risk alongside me and has been very supportive all along the way.

I am very glad I have a good accountant who can help me with all the paperwork of the company.

I am grateful to the whole team from Oradea Tech Hub, who have helped me get my name out there, and especially to my friend David Achim with whom I did many rounds of business strategy discussions.

And I am thankful to many others who have cheered me on, who have encouraged me and who have put in a good word for me to potential clients.

]]>
<![CDATA[Happy 11th Birthday!]]>My blog has circled the Sun for another year. You got 37 more posts in the meantime. The Obsidian post was very popular, as was the Rust Codenames series. People are even finding solutions to their Vmmem issues on my blog. The second half of last year was slower than

]]>
https://rolisz.ro/2021/06/08/happy-11th-birthday/60bfba22a3c3ed7839d6f32eTue, 08 Jun 2021 19:17:05 GMT

My blog has circled the Sun for another year. You got 37 more posts in the meantime. The Obsidian post was very popular, as was the Rust Codenames series. People are even finding solutions to their Vmmem issues on my blog. The second half of last year was slower than the first one, but that's OK.

I kinda split my blog into two: personal posts stayed here, anything related to machine learning goes to my new domain, which is for my consulting business. I still want to post some technical content here and I do hope I'll make it to the front page of HN again :D

I haven't had as much time to write posts because I've been busy with all kinds of other content: an in-person machine learning course here in Oradea, several presentations, some about machine learning, some about quick iteration, some held locally, some online. It turns out I only have so much creative juice in me every day.

I've resumed my goal of blogging regularly, but at a much humbler rate. Sometimes I'm tempted to try daily blogging, but I'm a bit afraid of that commitment and of the quality of the posts that would result from it. Some people say that writing daily turns on the faucets of creativity and you'll have plenty of ideas. But for now I'll stick to a more reasonable goal of two posts per month.

]]>
<![CDATA[Working across multiple machines]]>Until this year, I usually had a laptop from my employer, on which I did work stuff and I had a personal desktop and laptop. The two personal devices got far too little usage coding wise, so I didn't really have a need to make sure I have

]]>
https://rolisz.ro/2021/05/19/working-across-multiple-machines/60a401e6a3c3ed7839d6f28eWed, 19 May 2021 10:00:11 GMT

Until this year, I usually had a laptop from my employer, on which I did work stuff, and I had a personal desktop and laptop. The two personal devices got far too little usage coding-wise, so I didn't really have a need to make sure I had access to the same files in both places.

But since becoming self-employed at the beginning of this year, I find myself using both the desktop and the laptop a lot more and I need to sync files between them. I go to work from a co-working space 2-3 days a week. Sometimes I go to have a meeting with a client at their office. My desktop has a GPU and is much more powerful, so when at home I strongly prefer to work from it, instead of from a laptop that starts thermal throttling pretty fast.

I could transfer code using GitHub, but I'd rather not have to do a WIP commit every time I get up from the desk. And I also need to sync things like business files (PDFs) and machine learning models. The most common solution for this is to use Dropbox, OneDrive or something similar, but I would like to avoid sending all my files to a centralized service run by a big company.

Trying Syncthing again

I've tried using Syncthing in the past for backups, but it didn't work out at the time. Probably because it's not meant for backups. But it is meant for syncing files between devices!

I've been using Syncthing for this purpose for 3 months now and it just works™️. It does NAT punching really well and syncing is super speedy. I've had problems with files not showing up right away on my laptop only once and I'm pretty sure it was because my laptop's Wifi sometimes acts weird.

My setup

I have three devices talking to each other over Syncthing: my desktop, my laptop and my NAS. The NAS is there to be the always-on replica of my data and it makes it easier to back things up. The desktop has the address of the NAS hardcoded because they are in the same LAN, but all the other devices use dynamic IP discovery to talk to each other.

I have several folders set up for syncing. Some of them go to all three devices, some of them are only between the desktop and the NAS.

For the programming folders I use ignore patterns generously: I don't sync virtual env folders or node_modules folders, because they usually don't play nice if they end up on a different device with different paths (or worse, a different OS). Because of this, I set up my environment on each device separately and I only sync requirements.txt, then run pip install -r requirements.txt.
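For example, my ignore patterns look roughly like this (a sketch - the exact entries depend on your project layout; Syncthing reads them from a .stignore file in the root of each synced folder):

```
// .stignore in the synced project folder; lines starting with // are comments
.venv
venv
node_modules
__pycache__
*.pyc
```

Each device keeps its own copy of these folders, and only the small, portable files travel between machines.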

What do you use for syncronizing your workspace across devices? Do you have anything better than Syncthing?

]]>
<![CDATA[Productivity Tips: Time Blocks]]>As I've started my freelance machine learning consulting business this year, I found I need better ways to organize my time. When I was employed as a software engineer, there was a task board from which I would choose what to work on. The tasks would be mostly decided at

]]>
https://rolisz.ro/2021/04/04/productivity-tips/606a03ac88041c04f2008470Sun, 04 Apr 2021 19:51:13 GMT

As I've started my freelance machine learning consulting business this year, I found I need better ways to organize my time. When I was employed as a software engineer, there was a task board from which I would choose what to work on. The tasks would be mostly decided at the beginning of the sprint, so it was quite clear what to focus on most of the time. Of course, sometimes unexpected issues would come up, but usually those were urgent, so it was easy to decide to switch over to them.

But now, I have to juggle between working for different clients, talking to leads and doing marketing or administrative tasks. My to-do list just keeps growing longer and it's getting harder to pick something to work on. Should I write a new blog post? Should I work on a video? Should I do some exploratory data analysis for a client? Should I look into preparing an MLOps report for a client? Or maybe write a blog post so that my friends know I'm still alive?

Having to make a choice about this every time I want to start working is tiring, leading to choice paralysis. Often I have to work on 3-4 tasks a day. If I context switch between them too often, my efficiency drops. So last month I started applying a variant of time blocking, which I learned about from Cal Newport.

[Calendar screenshot: blue events are meetings, green ones are time blocks]

Instead of using a paper-based method like he suggests, I create an event in Google Calendar when I want to block off some time. Ideally I schedule them the day before, but sometimes I either forget or something comes up and I have to change what I'll work on that same day. I try to create blocks of one or two hours. Shorter blocks don't give you enough time to get immersed in deep work, while longer blocks are usually too tiring. I also make sure to leave some breaks between the time blocks.

I use a separate calendar so that I can easily toggle the visibility, leaving in the Calendar app only those events which have to take place at a given time (such as client meetings) and so that the time blocks don't interfere with Calendly, a meeting scheduling service I use.

I'm not very strict about the time blocks. If I find that I'm in the flow when a block ends, then I'll continue working on it. If something else is more urgent or I'm simply in a very strong mood for another task, I'll work on that and I'll simply move the calendar event to another time.

How do you organize your time and decide what to work on?

]]>
<![CDATA[Learning to machine learn]]>tl;dr: I'm launching an introductory course about machine learning in Romanian. It's aimed not just at developers, but at a more general audience.

For some time now I've been toying with the idea of taking my content creation to the next level.

]]>
https://rolisz.ro/2021/02/19/learning-to-machine-learn/602fdaddf2fbd3222b45f000Fri, 19 Feb 2021 15:51:29 GMT

tl;dr: I'm launching an introductory course about machine learning in Romanian. It's aimed not just at developers, but at a more general audience.

For some time now I've been toying with the idea of taking my content creation to the next level. I've been blogging for 10 years and I enjoy it. Some of the posts I've written about programming and machine learning have been successful. So I thought I'd make a machine learning course.

There are plenty of machine learning resources on the internet, courses of every kind. I learned from them myself, so there are good and even excellent courses among them. But to start, I'd like to make a course in Romanian, a language in which I don't think there are enough quality resources. Well, in practice it will be Romglish, since I can barely pronounce "învățare automată" - "machine learning" rolls off the tongue much more easily. Not to mention deep learning...

Another gap I've identified is that most courses are aimed at programmers who write code every day and want to learn how to use the tool called machine learning. But there is a big lack of understanding of how machine learning and artificial intelligence work among managers and, why not, non-technical people in general. If you go only by what you read in the news, the Terminator scenario is just around the corner, when in reality all ML systems have big and easy-to-find weaknesses.

This leads to unrealistic expectations from the leadership of some companies, who want to become more "hipster" and use ML, but come up with completely wrong ideas, which cannot be made to work well enough. I hope I can help such people too.

Many people believe that you need very advanced technical knowledge to use artificial intelligence. But the barrier keeps getting lower, and applications are appearing even in creative fields, such as image or text generation, which can be used relatively easily once you understand the basic concepts.

If you like the sound of what you've read above, head over to the course page.

]]>
<![CDATA[Design patterns in real life]]>In programming there are so-called design patterns, which are basically commonly repeated pieces of code that occur often enough that people thought it would be helpful to give them a name so that it’s easier to talk about them. One example is the iterator pattern,

]]>
https://rolisz.ro/2021/01/26/design-patterns-in-real-life/601075f3f896ad697fe0fe06Tue, 26 Jan 2021 20:07:52 GMT

In programming there are so-called design patterns, which are basically commonly repeated pieces of code that occur often enough that people thought it would be helpful to give them a name so that it’s easier to talk about them. One example is the iterator pattern, which is about an efficient method of traversing the elements of a container, whether they are an array, a hash table or something else. The builder pattern is used for building objects when we don’t know all their required parameters upfront.
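To make that concrete, here is a small Python sketch of the builder pattern (my own toy example, not taken from any particular library): parameters are collected step by step and the object is only assembled at the end:

```python
class RequestBuilder:
    """Builder pattern: collect parameters one at a time, assemble at the end."""

    def __init__(self):
        self._method = "GET"
        self._url = None
        self._headers = {}

    def url(self, url):
        self._url = url
        return self  # returning self enables method chaining

    def method(self, method):
        self._method = method
        return self

    def header(self, key, value):
        self._headers[key] = value
        return self

    def build(self):
        # only now do we check that everything required is present
        if self._url is None:
            raise ValueError("url is required")
        return {"method": self._method, "url": self._url, "headers": self._headers}


req = (RequestBuilder()
       .url("https://example.com")
       .method("POST")
       .header("Accept", "application/json")
       .build())
```

The caller never has to know in which order the parameters were set, which is exactly the point of the pattern.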

Sometimes, if you don’t know about a pattern and you read code that uses it, it might seem strange. Why is this extra layer of abstraction here? Why is this API broken down into these pieces? After learning about the pattern, you might realize that the extra layer of abstraction is needed because the layer below it changes often. Or that the API is broken into those specific pieces because this makes it easy to cover more use cases in an efficient way.

As I’ve started diving head first into the world of running my own consulting business, I’m starting to learn about a whole other world of “design patterns”, unrelated to programming. And suddenly many things that I’ve seen before started to make sense.

My friend David has been bugging me to start a community for people passionate about machine learning in Oradea, where I live, for almost two years. For a long time I was thinking, why does he push so much for this? Well, after taking Seth Godin’s Freelancer Workshop, now I know that being the person who organizes a community is one of the best ways to make yourself known.

Another example is that I saw a website offering a sort of business networking thing for a very high membership cost (or at least it seemed expensive at the time). Why would anyone pay that? Then I learned about a thing called an alchemy network and how, if it’s done well, it can bring great value to its members.

All my friends who are freelancers charge by the hour. That’s what I thought was normal. But then I heard about value-based pricing, from Jonathan Stark. A different pricing “design pattern”, which aligns the incentives of the client and of the service provider in a much better way. Let’s see if I can pull it off, though.

Just like in programming, design patterns help us find the correct solution faster and communicate more efficiently. The more patterns you know, the faster you can recognize a situation and react better to it.

What are your favorite design patterns?

]]>
<![CDATA[How to ML - Monitoring]]>As much as machine learning developers like to think that once they've got a good enough model, the job is done, it's not quite so.

The first couple of weeks after deployment are critical. Is the model really as good as the offline tests said it was?

]]>
https://rolisz.ro/2021/01/22/how-to-ml-monitoring/600b34dcf896ad697fe0fdf3Fri, 22 Jan 2021 20:29:24 GMT

As much as machine learning developers like to think that once they've got a good enough model, the job is done, it's not quite so.

The first couple of weeks after deployment are critical. Is the model really as good as the offline tests said it was? Maybe something is different in production than in all your test data. Maybe the data you collected for offline predictions includes pieces of data that are not available at inference time. For example, if you're trying to predict click-through rates for items in a list and use that to rank the items, when building the training dataset it's easy to include the rank of the item in the data, but the model won't have that when making predictions, because the rank is what you're trying to infer. Surprise: the model will perform very poorly in production.
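Here is a toy illustration of that trap, with made-up numbers: the offline dataset accidentally contains the item's rank, which won't exist at inference time because producing the ranking is the model's whole job:

```python
# Toy offline dataset for "will the user click this item?". It accidentally
# includes `rank`, the item's position in the list - but the rank is computed
# FROM the model's scores at serving time, so it cannot be an input feature.
train = [
    # (rank, n_images, clicked)
    (1, 3, 1), (1, 5, 1), (2, 2, 1),
    (3, 4, 0), (4, 1, 0), (5, 2, 0),
]

def predict(rank, n_images):
    # A "model" that just thresholds on rank looks perfect offline...
    return 1 if rank <= 2 else 0

offline_accuracy = sum(predict(r, n) == y for r, n, y in train) / len(train)
# offline_accuracy == 1.0, yet the model is useless in production: at
# inference time there is no rank yet.
```

The perfect offline score is exactly the warning sign: a feature that predicts the label *too* well is often leaking it.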

Or maybe simply A/B testing reveals that the fancy ML model doesn't really perform better in production than the old rules written with lots of elbow grease by lots of developers and business analysts, using lots of domain knowledge and years of experience.

But even if the model does well at the beginning, will it continue to do so? Maybe there will be an external change in user behavior and they will start searching for other kinds of queries, which your model was not developed for. Or maybe your model will introduce a "positive" feedback loop: it suggests some items, users click on them, so those items get suggested more often, so more users click on them. This leads to a "rich get richer" kind of situation, but the algorithm is actually not making better and better suggestions.
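A toy simulation of that feedback loop (hypothetical numbers, not a real system): two items with identical true appeal, where the recommender always shows the item with the higher observed click-through rate:

```python
import random

random.seed(0)
clicks = [0, 0]
impressions = [1, 1]  # start with one impression each to avoid division by zero

for _ in range(10_000):
    # always show the item with the higher observed CTR (ties go to item 0)
    ctr = [clicks[i] / impressions[i] for i in range(2)]
    shown = 0 if ctr[0] >= ctr[1] else 1
    impressions[shown] += 1
    clicks[shown] += random.random() < 0.1  # both items have the same 10% true appeal

# Item 0 wins the very first tie, gets all subsequent exposure and never lets
# item 1 catch up - even though the two items are exactly equally good.
```

Exposure ends up wildly lopsided purely because of the loop, not because one item is better; breaking it usually requires some exploration (showing under-exposed items occasionally).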

Maybe you are on top of this and you keep retraining your model weekly to keep it in step with user behavior. But then you need to have a staggered release of the model, to make sure that the new one is really performing better across all relevant dimensions. Is inference speed still good enough? Are predictions relatively stable, meaning we don't recommend only action movies one week and then only comedies next week? Are models even comparable from one week to another or is there a significant random component to them which makes it really hard to see how they improved? For example, how are the clusters from the user post data built up? K-means starts with random centroids and clusters from one run have only passing similarity to the ones from another run. How will you deal with that?
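One way to deal with the label-permutation part (a sketch I would start from, not a standard recipe) is to compare two clusterings only up to a renaming of the cluster ids:

```python
def same_partition(labels_a, labels_b):
    """True if two clusterings group the points identically, ignoring label names."""
    forward = {}   # label in A -> label in B
    backward = {}  # label in B -> label in A
    for a, b in zip(labels_a, labels_b):
        if forward.setdefault(a, b) != b:
            return False
        if backward.setdefault(b, a) != a:
            return False
    return True

# Two k-means runs may output permuted ids for the exact same grouping:
assert same_partition([0, 0, 1, 1, 2], [2, 2, 0, 0, 1])      # same clusters
assert not same_partition([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])  # genuinely different
```

For real data you would want a softer measure (e.g. how *similar* two partitions are, not just identical-or-not), but the principle is the same: never compare raw cluster ids across runs.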

]]>
<![CDATA[GPT-3 and AGI]]>One of the most impressive/controversial papers from 2020 was GPT-3 from OpenAI. It's nothing particularly new, it's mostly a bigger version of GPT-2, which came out in 2019. It's a much bigger version, being by far the largest machine learning model at the

]]>
https://rolisz.ro/2021/01/21/gpt3-agi/5f3517c94f71eb12e0abb8bfThu, 21 Jan 2021 20:13:00 GMT

One of the most impressive/controversial papers from 2020 was GPT-3 from OpenAI. It's nothing particularly new, it's mostly a bigger version of GPT-2, which came out in 2019. It's a much bigger version, being by far the largest machine learning model at the time it was released, with 175 billion parameters.

It's a fairly simple algorithm: it's learning to predict the next word in a text[1]. It learns to do this by training on several hundred gigabytes of text gathered from the Internet. Then to use it, you give it a prompt (a starting sequence of words) and then it will start generating more words and eventually it will decide to finish the text by emitting a stop token.
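A toy sketch of the same idea, shrunk down absurdly (a hand-written next-word table instead of 175 billion learned parameters - the words and probabilities below are invented for illustration):

```python
import random

# A tiny "language model": for each word, a distribution over the next word.
# GPT-3 does conceptually the same thing, with the table replaced by a huge
# learned network and words replaced by byte pair encoded tokens.
model = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.6, "<stop>": 0.4},
    "dog": {"ran": 0.5, "<stop>": 0.5},
    "sat": {"<stop>": 1.0},
    "ran": {"<stop>": 1.0},
}

def generate(prompt, seed=0):
    random.seed(seed)
    words = prompt.split()
    while words[-1] != "<stop>":
        options = model[words[-1]]
        words.append(random.choices(list(options), weights=options.values())[0])
    return " ".join(words[:-1])  # drop the stop token
```

Sampling word by word until a stop token appears is all the generation loop does; everything impressive about GPT-3 lives in how good its next-word distribution is.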

Using this seemingly stupid approach, GPT-3 is capable of generating a wide variety of interesting texts: it can write poems (not prize winning, but still), write news articles, imitate other well-known authors, make jokes, argue for its self-awareness, do basic math and, shockingly to programmers all over the world, who are now afraid the robots will take their jobs, it can code simple programs.

That's amazing for such a simple approach. The internet was divided upon seeing these results. Some were welcoming our GPT-3 AI overlords, while others were skeptical, calling it just fancy parroting, without a real understanding of what it says.

I think both sides have a grain of truth. On one hand, it's easy to find failure cases, make it say things like "a horse has five legs" and so on, where it shows it doesn't really know what a horse is. But are humans that different? Think of a small child who is being taught by his parents to say "Please" before his requests. I remember being amused by a small child saying "But I said please" when he was refused by his parents. The kid probably thought that "Please" is a magic word that can unlock anything. Well, not really, in real life we just use it because society likes polite people, but saying please when wishing for a unicorn won't make it any more likely to happen.

And it's not just little humans who do that. Sometimes even grownups parrot stuff without thinking about it, because that's what they heard all their life and they never questioned it. It actually takes a lot of effort to think, to ensure consistency in your thoughts and to produce novel ideas. In this sense, expecting an artificial intelligence that is around human level might be a disappointment.

On the other hand, I believe there is a reason why this amazing result happened in the field of natural language processing and not say, computer vision. It has been long recognized that language is a powerful tool, there is even a saying about it: "The pen is mightier than the sword". Human language is so powerful that we can encode everything that there is in this universe into it, and then some (think of all the sci-fi and fantasy books). More than that, we use language to get others to do our bidding, to motivate them, to cooperate with them and to change their inner state, making them happy or inciting them to anger.

While there is a common ground in the physical world, oftentimes it is not very relevant to the point we are making: "A rose by any other name would smell as sweet". Does it matter what a rose is when the rallying call is to get more roses? As long as the message gets across and is understood in the same way by all listeners, no, it doesn't. Similarly, if GPTx can effect the desired change in its readers, it might be good enough, even if it doesn't have a mythical understanding of what those words mean.


Technically, the next byte-pair-encoded token ↩︎

]]>
<![CDATA[How to ML - Deploying]]>So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the models run reliably in production. We have

]]>
https://rolisz.ro/2021/01/20/how-to-ml-deploying/60084bc7165bd14e3b33595dWed, 20 Jan 2021 15:28:54 GMT

So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it's time to deploy it to production.

So now we have to make sure the models run reliably in production. We have to answer some more questions, in order to make some trade offs.

How important is latency? Is the model making an inference in response to a user action, so it's crucial to have the answer in tens of milliseconds? Then it's time to optimize the model: quantize the weights, distill the knowledge into a smaller model, prune weights and so on. Hopefully, your metrics won't go down due to the optimization.
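As a toy illustration of one of these, magnitude-based weight pruning just zeroes out the smallest weights (real frameworks do this per layer, on tensors, and usually fine-tune afterwards - this is only the core idea):

```python
def prune(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping a fraction of them."""
    k = max(1, int(len(weights) * keep_ratio))
    # magnitude of the k-th largest weight becomes the cutoff
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

pruned = prune([0.9, -0.05, 0.3, 0.01, -0.7], keep_ratio=0.6)
# -> [0.9, 0.0, 0.3, 0.0, -0.7]: 40% fewer weights to store and multiply
```

Sparse weights compress well and can skip multiplications entirely, which is where the latency win comes from.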

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night that does the inference for every user and stores the results in a database. Then when the user makes a request, the recommendations are simply and quickly loaded from the database. This is possible only if you have a finite range of predictions to make.
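The shape of that setup, sketched in Python with a dict standing in for the database and a stub standing in for the real model:

```python
# Nightly batch job: run the (possibly slow) model for every known user up front.
def batch_precompute(users, recommend):
    return {user: recommend(user) for user in users}

# `recommend` stands in for real model inference; the movie names are made up.
cache = batch_precompute(["alice", "bob"], lambda user: ["movie_1", "movie_2"])

# At request time, serving is a fast lookup instead of a model call.
def serve(user):
    return cache.get(user, ["fallback_popular_movie"])
```

The fallback branch matters: users who didn't exist at batch time still need *some* answer, typically a generic popular-items list.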

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don't even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. "Works on my machine" is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library, and the security team won't let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team to help you out with setting up the CI/CD pipelines.
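A minimal Dockerfile along these lines (a sketch - serve.py and the directory layout are made up for illustration):

```dockerfile
FROM python:3.8-slim

# Pin dependencies so the library versions (BLAS included, via e.g. numpy)
# match what the model was tested with on the dev machine.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ ./model/
COPY serve.py .

CMD ["python", "serve.py"]
```

The image freezes the OS, the libraries and the model together, which is precisely what makes "works on my machine" reproducible in production.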

But you've killed all the dragons, now it's time to keep watch... aka monitoring the model's performance in production.

]]>
<![CDATA[How to ML - Models]]>So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who

]]>
https://rolisz.ro/2021/01/18/how-to-ml-models/6005e7293e8fc062a027dbe3Mon, 18 Jan 2021 19:55:44 GMT

So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who want to do machine learning are excited about. I read Bishop's and Murphy's textbooks, watched Andrew Ng's online course about ML and learned about different kinds of ML algorithms and I couldn't wait to try them out and to see which one is the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, ADAM, convolutions, attention mechanism and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For most production projects, if it's not in one of the sklearn, TensorFlow or PyTorch libraries, it won't fly. For proof-of-concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain: trying to find all the dependencies of undocumented code and making it work.

For the hyperparameter tuning, there are libraries to help you with that, and anyway, the time it takes to finish the training runs is much larger than the time you spend coding it up, for any real life datasets.

And in practice, you run into many issues with the data. You'll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You'll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.

If you do get a model that is good enough, it's time to deploy it, which comes with its own fun...

]]>
<![CDATA[2020 in Review]]>2020 might have been a bad year outside, but it was a good year for my blog. I wrote 62 posts, almost as many as in the previous 3 years combined (63). Part of it was due to more time because of Covid, part of it was because of the

]]>
https://rolisz.ro/2020/12/31/2020-in-review/5fee192e08f8f65d8ac7c91cThu, 31 Dec 2020 23:02:17 GMT

2020 might have been a bad year outside, but it was a good year for my blog. I wrote 62 posts, almost as many as in the previous 3 years combined (63). Part of it was due to having more time because of Covid, part of it was because of the 100 Days to Offload challenge (which I didn't finish), and part of it was because I have an interest in taking my blog in a new direction, to help get leads for my consulting business.

Visits were up: 60.000 sessions compared to 10.000 in 2019. Most of my sessions were from unique visitors: there were around 53.000 of those, compared to 8.700. Pageviews are at 73.500, versus 16.300.

Most of this is due to some posts that got very popular. The Moving away from Gmail post is now my most popular blog post ever, dethroning the neural network post that is 7 years old and still gets 2000 views per year. It was on the front page of HackerNews and got 36000 pageviews in 3 days. The Obsidian post was also quite popular, having been suggested in the Google app, getting 8000 views. My Rust posts all got over 800 views, with the web crawler one getting over 2400. Surprisingly, bridging networks with a Synology NAS is a very interesting topic, because that post also got 1000 views.

The Ghost platform has worked OK during the last year, but it has some small friction points, so I'm thinking about changing again. But regardless of how I'll post, I definitely plan to keep posting more content.

]]>
<![CDATA[World's best phone case]]>Yesterday I enjoyed the Australian Șuncuiuș Christmas weather while doing another Via Ferrata trail. It was much harder than the one I did last year. But as I finished the vertical ascent that is seen in the top picture, my phone slipped from my pocket, and fell about

]]>
https://rolisz.ro/2020/12/31/worlds-best-phone-case/5fedc94d08f8f65d8ac7c8c8Thu, 31 Dec 2020 13:10:44 GMT

Yesterday I enjoyed the Australian Șuncuiuș Christmas weather while doing another Via Ferrata trail. It was much harder than the one I did last year. But as I finished the vertical ascent that is seen in the top picture, my phone slipped from my pocket, and fell about 15m.

I immediately thought I'd have to buy myself a late Christmas present. After we finished the hike, we went to search for the phone. The case had come off the phone and we found it pretty quickly. The phone was on vibrate, so calling it didn't help. It had slipped under some rocks, so we had to look harder for it. But after we found it, we were all shocked that it was intact, without a scratch on it.

The case has some very minor scratches on it. Ladies and gentlemen, if until now I was a big fan of SupCase Unicorn Beetle Pro cases, from now on I probably won't buy a phone without a case from them. Kudos to the SupCase team!

]]>
<![CDATA[How to ML - Data]]>So we've decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory.

]]>
https://rolisz.ro/2020/12/29/how-to-ml-data/5feb735d2bc2360ef49da332Tue, 29 Dec 2020 18:22:09 GMT

So we've decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment and there are service logs saying what was replaced for each machine. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifier that predicts if a piece of equipment will fail based on the last 10 values measured from its sensors. In practice, we'll find that sensors of the same type that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so that will have to be standardized as well. Or worse, the sensor data is good, but it's kept only for 1 month to save on storage costs, so we have to fix that and wait a couple of months for more training data to accumulate.
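For the normalization step, the core idea is just mapping each sensor's raw range onto a common scale (the numbers below are toy values; real ranges would come from the sensors' spec sheets):

```python
def normalize(values, lo, hi):
    """Map a sensor's raw range [lo, hi] onto [0, 1] so different makes compare."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical: two temperature sensors of the same type, different manufacturers
a = normalize([0, 50, 100], lo=0, hi=100)       # one brand reports 0-100
b = normalize([400, 600, 800], lo=400, hi=800)  # another reports 400-800
# both become [0.0, 0.5, 1.0] - now comparable as model inputs
```

The fiddly part in practice is not the formula but tracking down the correct (lo, hi) for every sensor model in the fleet.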

The next best case is that we don't have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought: for example, if we want to start a new ad network, there are plenty of datasets of personal data about everyone available online, which can then be used to predict the likelihood of clicking on an ad. That's the business model of many startups...

The worst case is that we don't have data and we can't find it out there. Maybe it's because we have a very specific niche, such as finding defects in the manufacturing process of our specific widgets, so we can't use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming off the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell whether a widget is good or defective. There needs to be a QA process here, because even humans have an error rate, so each image will have to be labeled by at least three people. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as Amazon Mechanical Turk to distribute the tasks to many workers across the world.
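The "at least three labelers" idea boils down to majority voting plus a disagreement check. Here is a minimal sketch; the 2/3 agreement threshold is an assumption, not a universal rule.

```python
from collections import Counter

def majority_label(votes):
    """Aggregate labels from several annotators. Returns the majority
    label, plus a flag to send the sample back for review when the
    annotators disagree too much (assumed threshold: 2/3 agreement)."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    needs_review = agreement < 2 / 3
    return label, needs_review

# Three annotators per widget image, as described above.
print(majority_label(["good", "good", "defective"]))  # ('good', False)
```

Samples flagged `needs_review` go back into the labeling queue, which is exactly the QA loop that makes human labels trustworthy enough to train on.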

Once all this is done, we finally have data. Time to start doing the actual ML...

]]>
<![CDATA[How to ML - Metrics]]>We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we're doing it well?

]]>
https://rolisz.ro/2020/12/28/how-to-ml/5fea21292bc2360ef49da324Mon, 28 Dec 2020 18:19:07 GMT

We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we're doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That's a regression problem. Are we trying to predict who will win the election? That's a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That's a multi-class classification problem.

Another question that has to be answered is which kinds of mistakes are worse. Machine learning is not all-knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better to cry wolf too often and have false positives, rather than miss a tumor. Or maybe it's the opposite: we are implementing a facial recognition system. If the system misidentifies someone as a burglar, the wrong person will get sent to jail, which is a very bad consequence for a mistake made by "THE algorithm".
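This trade-off usually shows up as a decision threshold on the model's score. A toy example, with made-up scores and labels, shows how moving the threshold trades false positives against false negatives:

```python
def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives at a decision threshold.
    scores: model probabilities; labels: True means a tumor is present.
    Purely illustrative data, not a real model's output."""
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    return fp, fn

scores = [0.9, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, True, False, False]

# A higher threshold misses a tumor; a lower one "cries wolf" instead.
print(confusion_counts(scores, labels, 0.5))
print(confusion_counts(scores, labels, 0.2))
```

For the X-ray system we'd pick the low threshold and accept the false alarms; for the facial recognition system we'd do the opposite.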

These are not just theoretical concerns, but they actually matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn't decide by itself what to do, it merely makes a suggestion which a human will then confirm. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn't have to look at 20 screens at once, but can only look at the footage that was flagged as anomalous.
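The human-in-the-loop routing described above is often just a confidence cutoff: confident predictions pass through automatically, everything else is queued for a person. A minimal sketch, where the camera scores and the 0.95 cutoff are assumptions:

```python
def route(score, auto_threshold=0.95):
    """Auto-accept confident predictions; send the rest to a human."""
    return "auto" if score >= auto_threshold else "human_review"

# Anomaly scores per camera feed (made-up values).
frames = {"cam1": 0.99, "cam2": 0.30, "cam3": 0.97}
flagged = [cam for cam, score in frames.items()
           if route(score) == "human_review"]
# Only the flagged feeds need the security guard's attention.
```

Lowering `auto_threshold` automates more decisions but puts more weight on the model being right.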

Tomorrow we'll look at the next step: gathering the data.

]]>
<![CDATA[What is ML? part 3]]>Yesterday we saw that machine learning is behind some successful products and it does have the potential to bring many more changes to our life.

So what is it?

Well, the textbook definition is that it's the building of algorithms that can perform tasks they were not explicitly

]]>
https://rolisz.ro/2020/12/24/what-is-ml-3/5fe4b0ab2bc2360ef49da308Thu, 24 Dec 2020 15:18:51 GMT

Yesterday we saw that machine learning is behind some successful products and it does have the potential to bring many more changes to our life.

So what is it?

Well, the textbook definition is that it's the building of algorithms that can perform tasks they were not explicitly programmed to do. In practice, this means that we have algorithms that analyze large quantities of data to learn some patterns in the data, which can then be used to make predictions about new data points.

This is in contrast with the classical way of programming computers, where a programmer would either use their domain knowledge or analyze the data themselves, and then write a program that produces the correct output.

So one of the crucial distinctions is that in machine learning, the machine has to learn from the data. If a human being figures out the pattern and writes a regular expression to find addresses in text, that's human learning, and we all go to school to do that.

Now does that mean that machine learning is a solution for everything? No. In some cases, it's easier or cheaper to have a data analyst or a programmer find the pattern and code it up.

But there are plenty of cases where, despite decades-long efforts by big teams of researchers, humans haven't been able to find an explicit pattern. The simplest example of this would be recognizing dogs in pictures. 99.99% of humans over the age of 5 have no problem recognizing a dog, whether it's a puppy, a golden retriever or a Saint Bernard, but they have zero insight into how they do it, into what makes a bunch of pixels on the screen a dog and not a cat. And this is where machine learning shines: you give it a lot of photos (several thousand at least), pair each photo with a label of what it contains, and the neural network will learn by itself what makes a dog a dog and not a cat.
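The "learn from labeled examples" idea can be shown with something far simpler than a neural network: a nearest-neighbor classifier over tiny made-up feature vectors standing in for photos. Everything here is an assumption for illustration; real image features have thousands of dimensions.

```python
def predict(features, training_set):
    """Label a new example with the label of its closest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    closest = min(training_set, key=lambda example: dist(features, example[0]))
    return closest[1]

# (feature vector, label) pairs stand in for (photo, "dog"/"cat") pairs.
training_set = [((0.9, 0.1), "dog"), ((0.8, 0.2), "dog"),
                ((0.1, 0.9), "cat"), ((0.2, 0.8), "cat")]

print(predict((0.85, 0.15), training_set))  # prints "dog"
```

Nobody wrote a rule saying what a "dog" is; the answer comes entirely from the labeled examples, which is the whole point.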

Machine learning is just one of many tools at our disposal. It's a very powerful tool, and one that gets "sharpened" all the time, with lots of research being done all around the world to find better algorithms, to speed up their training and to make them more accurate.

Come back tomorrow to find out how the sausage is made, on a high level.

]]>