The Master Algorithm

Having a goal to read books really helps with actually reading them. I finally started making my way through the mountain of machine learning books that have been sitting on my shelves for half a year.

The Master Algorithm by Pedro Domingos, a professor at the University of Washington, is a user-friendly book that introduces laypeople to machine learning. It contains maybe two or three lines of math and no code, but a lot of explanation.

Professor Domingos divides all of machine learning into five "tribes": symbolists, connectionists, evolutionaries, Bayesians and analogizers. He describes the main algorithms of each tribe, usually managing to make them quite understandable. However, for some reason, he glides over deep learning, which belongs to the connectionist tribe, without giving many details. Deep learning is the approach that has been making the rounds for the past five years, in everything from voice recognition to object localization, from machine translation to beating the world champion at Go. Other parts could have been shortened to make room for it, such as the chapters on genetic programming or on SVMs.

I find the author's view that there is one master algorithm to rule them all and in darkness bind them interesting, especially in light of things like the No Free Lunch theorem. This theorem says that, averaged over all possible problems, every learning algorithm performs the same, so if an algorithm is particularly good on a certain class of problems, it must be worse on all the others. In the book, the author tries to argue, starting from Hume's question of how learning is possible at all, that by feeding prior information to a learner you can get it to perform well on any dataset. I'm not sure I buy this argument. No matter how much "prior information" you give a decision tree, it will have a hard time learning XOR or parity problems. You can give it the prior information "this is the result of doing XOR on these variables", but then you have solved the learning problem for it, so it isn't doing much useful work for you.
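To make the parity point concrete, here is a minimal sketch of my own (it assumes NumPy and scikit-learn, which the book never uses, and the 16-bit setup is just my toy choice): a decision tree is trained to predict the parity of 16 random bits. Since no individual bit tells you anything about the label on its own, the greedy splits have nothing to latch onto, and the tree memorizes the training set without generalizing.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # 16 random bits per example; the label is their parity (XOR of all bits).
    # No single bit carries any information about the label on its own, which
    # is exactly what defeats a learner that picks one greedy split at a time.
    n_bits = 16
    X = rng.integers(0, 2, size=(4000, n_bits))
    y = X.sum(axis=1) % 2

    X_train, y_train = X[:2000], y[:2000]
    X_test, y_test = X[2000:], y[2000:]

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0: pure memorization
    print("test accuracy:", tree.score(X_test, y_test))     # close to 0.5: a coin flip on new data

The only way to rescue the tree is to hand it the XOR of the bits as an extra feature, which is exactly the scenario above: by then the hard part of the learning has already been done for it.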

But suppose there were such an even better algorithm, one that can learn anything if you give it enough prior information and all the data it needs. Such an algorithm is not that hard to imagine: a super naive and inefficient version would be "go through all possible parameters with a step of 0.00000001, throw out anything that contradicts the prior information, and choose from what remains whatever fits the data best". It's a terribly slow algorithm, but it does the job (the fact that it finishes only after the heat death of the universe is another problem). I would argue that the prior information we give it transforms it into another algorithm. It can be shown, for example, that Hidden Markov Models, PCA, factor analysis, Kalman filters and mixtures of Gaussians can all be described by the same general set of equations governing linear Gaussian models. The only difference between them is the "prior information" we impose upon the model: whether the hidden state is continuous or discrete, whether the model has dynamics over time, or whether a noise covariance is constrained to be a multiple of the identity matrix. Yet we consider them different algorithms, and for most of them we have different implementations, because that is what is most efficient. Similarly, even if we had a master algorithm, this prior information would twist it so much that it would become an entirely different algorithm.
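For what it's worth, here is what that naive enumeration looks like in code. It's a sketch of my own, not anything from the book: a one-parameter linear model, a made-up prior constraint (the respects_prior function below), and a far coarser grid than the 0.00000001 step above, so that it actually terminates.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: y is roughly 2*x plus a little noise.
    x = rng.uniform(-1.0, 1.0, size=50)
    y = 2.0 * x + rng.normal(scale=0.1, size=50)

    # "Prior information": we only accept non-negative slopes.
    def respects_prior(w):
        return w >= 0.0

    # The naive "master algorithm": enumerate candidate parameters on a grid,
    # discard those that contradict the prior, keep the best fit to the data.
    candidates = np.arange(-5.0, 5.0, 1e-4)
    best_w, best_loss = None, np.inf
    for w in candidates:
        if not respects_prior(w):
            continue
        loss = np.mean((y - w * x) ** 2)
        if loss < best_loss:
            best_w, best_loss = w, loss

    print("best slope found:", best_w)  # close to 2.0

Even in this toy form, the prior and the grid of candidates do most of the work of deciding what can be learned; change them and you have, for all practical purposes, a different algorithm, which is the point above.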

There is a weird episode where the author goes full-blown medieval fantasy novel and describes an imaginary world of the five tribes and what their cities would look like. It feels very out of place. I understand that he is trying to make the book appealing to the masses, but it could have been done differently, for example by describing high-dimensional spaces and how the various algorithms shape them differently.

The author ends on a quite positive note. He is not scared of the Singularity, and he makes some valid points with which I agree. He says that machine learning algorithms optimize loss functions and nothing more. If you build an algorithm to make you a good dinner, it will try to find out what you like best and will optimize its recipes based on that, but it won't turn on you and try to kill you. It might do that accidentally, say by not turning off the gas, but that's because it knows too little, not because it knows too much.

He is also not worried about the NSA, and he argues for more control over how companies use our data. He proposes building avatars out of our data that do all the modeling for us, so that we go out and enjoy only the best. It's an interesting idea, but... it feels a bit non-human. Then again, who knows what our kids will be doing in 30 years.

I can't really say I recommend the book. It has some good ideas and it explains some concepts well for laypeople, but it has big flaws in other areas and skips over some key topics.

Grade: 7