I'm working on a project that does OCR on receipts, and today, while working on character segmentation, I made a pretty stupid mistake that led to my model overfitting almost perfectly (in some cases I got 100% classification accuracy).

I already had my own labeled letter data (with the help of my parents, I labeled 7000 letters, with their bounding boxes, in about 25 receipts), and my classifier (a simple linear SVM) did pretty well on individual letters: between 90% and 94% accuracy. For something obtained with almost zero fiddling, that's good enough for my purposes. Also, it's pretty much impossible to tell apart 0 and O in a receipt without context, and there are a lot of 0s and Os in a receipt.

So I turned to the next problem: character segmentation. Looking for connected components wasn't good enough. On a receipt many letters are broken (especially r's and u's) and a lot are stuck together. So I needed to train a model to decide whether an image portion is centered on a place where we should cut or not.

I started building my training data set by looking for pairs of letters that were less than 2 pixels apart and taking the midpoint between the rightmost point of the first and the leftmost point of the second as a segmentation point. For the negative class, I wanted to include all my letters, plus copies of them shifted randomly 1-3 pixels to the left or right.
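As a rough illustration of that labeling rule, here is a minimal sketch. It assumes each letter on a line is described only by its leftmost and rightmost pixel column; `Box`, `segmentation_points`, and `jittered_left` are hypothetical names, and the exact gap definition (`next.left - prev.right < 2`) is my reading of "less than 2 pixels apart", not necessarily what the original code did.

```python
import random
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical horizontal extent of one labeled letter (pixel columns)."""
    left: int
    right: int


def segmentation_points(boxes: List[Box], max_gap: int = 2) -> List[int]:
    """Midpoints between neighbouring letters whose gap is below max_gap."""
    ordered = sorted(boxes, key=lambda b: b.left)
    points = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt.left - prev.right
        if gap < max_gap:
            # Middle between the rightmost point of one letter
            # and the leftmost point of the next one.
            points.append((prev.right + nxt.left) // 2)
    return points


def jittered_left(box: Box, rng: random.Random) -> int:
    """Left edge of a negative example, shifted 1-3 pixels left or right."""
    shift = rng.randint(1, 3) * rng.choice((-1, 1))
    return box.left + shift
```

For example, with letters spanning columns 0-5, 6-10 and 20-25, only the first two are close enough, so the only cut point is column 5.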

And here I made a small mistake: when cutting between letters, I needed to take x pixels to the left and x to the right. However, when cropping the letters, I started from the leftmost x coordinate and needed to take 2*x to the right. But because of some hasty copy-paste, I ended up taking only one x to the right.

Because I was adding some padding around everything that wasn't the correct size (line_height * 2*x), everything was working out "all right": I was getting the correct sized vectors from both operations.
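The sketch below reconstructs how such a bug can hide, under some assumptions: a text line is a NumPy array of shape `(line_height, width)` with 0 as blank background, and short crops are padded symmetrically with blank columns. All function names are hypothetical, and the padding scheme is a guess based on the description, not the original code.

```python
import numpy as np


def pad_to_width(patch: np.ndarray, width: int) -> np.ndarray:
    """Pad a patch with blank columns on both sides until it is `width` wide."""
    missing = width - patch.shape[1]
    if missing <= 0:
        return patch
    left = missing // 2
    right = missing - left
    return np.pad(patch, ((0, 0), (left, right)), mode="constant")


def crop_cut(line: np.ndarray, cut_x: int, x: int) -> np.ndarray:
    """Positive example: x columns on each side of a cut point."""
    start = max(cut_x - x, 0)
    return pad_to_width(line[:, start:cut_x + x], 2 * x)


def crop_letter_buggy(line: np.ndarray, left: int, x: int) -> np.ndarray:
    """Negative example with the copy-paste bug: only x columns are taken."""
    return pad_to_width(line[:, left:left + x], 2 * x)


def crop_letter_fixed(line: np.ndarray, left: int, x: int) -> np.ndarray:
    """Negative example as intended: 2*x columns from the left edge."""
    return pad_to_width(line[:, left:left + 2 * x], 2 * x)
```

Both the buggy and the fixed crop return arrays of shape `(line_height, 2*x)`, so nothing downstream complains. The only difference is that the buggy one always carries blank columns on its sides.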

When I first saw that I was getting over 99% accuracy on the test set, I was really happy. But when I started testing it on images for actual segmentation, I was in for a surprise: almost every column was predicted to be a segmentation point. The odd column around a T, I or . was classified correctly as not a segmentation, but that was it.

What was going on? I tried a lot of different classifiers (almost everything from the scikit-learn library), and I tried adding ZCA whitening. Nothing.

I started looking more closely at the images it cut out: most of the non-segmentation ones had a nice consistent 3-4px padding on their sides, while the segmentation images didn't. And that's what the classifiers were actually learning: if an image has padding, it is not a segmentation.
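A cheap sanity check that might have caught this earlier is to compare, per class, how often the outer columns of a patch are completely blank. This is only a sketch: `blank_border_fraction` is a hypothetical helper, and it assumes patches are stacked in an array of shape `(n, height, width)` with 0 as the background value.

```python
import numpy as np


def blank_border_fraction(patches: np.ndarray, border: int = 3) -> float:
    """Fraction of patches whose outer `border` columns on both sides are blank."""
    patches = np.asarray(patches)
    left_blank = (patches[:, :, :border] == 0).all(axis=(1, 2))
    right_blank = (patches[:, :, -border:] == 0).all(axis=(1, 2))
    return float((left_blank & right_blank).mean())
```

If this number is close to 1 for one class and close to 0 for the other, the classifier can separate the classes from the border alone, whatever is in the middle of the patch.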

After fixing this, the classification accuracy dropped to about 90%, but in this case, it was actually useful in the "real world".

Important lesson: make sure your classifiers are learning what you want them to learn, and not just some random correlations.

My consolation is that even some DoD researchers made the same mistake: http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/ :D