Character segmentation overfitting
I'm doing a project on OCR for receipts, and today, while working on character segmentation, I made a pretty stupid mistake that led to my model overfitting almost perfectly (in some cases I got 100% classification accuracy).
I already had my own letter data (with the help of my parents, I labeled 7000 letters, with their bounding boxes, across about 25 receipts), and my classifier (a simple linear SVM) did pretty well on individual letters: between 90% and 94% accuracy. For something obtained with almost zero fiddling, that's pretty good, and good enough for my purposes. Besides, it's pretty much impossible to tell 0 and O apart in a receipt, because you need context to do that, and receipts contain a lot of 0s and Os.
So I turned to the next problem: character segmentation. Looking for connected components wasn't good enough: on a receipt, many letters are broken (especially r's and u's) and a lot are stuck together. So I needed to train a model to decide whether an image window is centered on a place where we should cut or not.
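To see why plain connected components fail here, a toy sketch (assuming `scipy.ndimage` is available; the image is made up, not from my receipts) where two touching letters merge into one component and one broken letter splits into two:

```python
import numpy as np
from scipy import ndimage

# Toy binary text line: two letters that touch, plus one letter whose
# top and bottom strokes are disconnected (like a broken 'u').
img = np.zeros((7, 16), dtype=np.uint8)
img[1:6, 2:5] = 1    # first letter
img[1:6, 5:9] = 1    # second letter, stuck to the first
img[1, 11:14] = 1    # broken letter: top stroke
img[4:6, 11:14] = 1  # broken letter: bottom stroke, not connected

labels, n = ndimage.label(img)
print(n)  # 3 components for 3 letters, but grouped wrongly:
          # the stuck pair is one component, the broken letter is two
```

The count happens to match the number of letters, but the grouping is useless: you'd cut the broken letter in half and never cut between the stuck pair.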
I started building my training set by looking for pairs of letters that were less than 2 pixels apart and taking the midpoint between the rightmost point of the first and the leftmost point of the second as a segmentation point. For the negative class, I included all my letters, plus copies shifted randomly 1-3 pixels to the left and right.
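A hypothetical sketch of how the positive class could be computed from the labeled bounding boxes (the function name and the `(x_left, x_right)` box representation are my assumptions, not the actual code):

```python
def positive_cut_points(boxes, max_gap=2):
    """Midpoints between consecutive letters on one text line that are
    closer than max_gap pixels apart; boxes are sorted (x_left, x_right)."""
    cuts = []
    for (_, r1), (l2, _) in zip(boxes, boxes[1:]):
        if 0 <= l2 - r1 < max_gap:
            cuts.append((r1 + l2) // 2)
    return cuts

# Four letters; the first two and last two are 1 px apart, the middle gap is wide.
print(positive_cut_points([(0, 5), (6, 11), (20, 26), (27, 33)]))  # [5, 26]
```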
And here I made a small mistake: when cutting between letters, I took x pixels to the left and x pixels to the right of the cut point. When cropping the letters, however, I started from the leftmost x coordinate and needed to take 2*x pixels to the right; because of some hasty copy-pasting, I ended up taking only x.
Because I was adding padding around everything that wasn't the correct size (line_height * 2*x), everything seemed to work out "all right": both operations produced vectors of the correct size.
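A minimal reconstruction of the bug and why the padding step masked it (the crop and pad helpers are my sketch of the described logic, not the original code):

```python
import numpy as np

LINE_H, X = 20, 8                     # window half-width X; target width 2*X
rng = np.random.default_rng(0)
line = rng.random((LINE_H, 200))      # stand-in for one text line

def crop_cut(cut_x):
    # Correct: X pixels on each side of the candidate cut point.
    return line[:, cut_x - X : cut_x + X]

def crop_letter_buggy(left_x):
    # The bug: should take 2*X to the right of the letter's left edge,
    # but the hasty copy-paste took only X.
    return line[:, left_x : left_x + X]

def pad_to_width(img, width=2 * X):
    # The padding step silently "fixed" the shapes and hid the bug:
    # every undersized letter crop got consistent zero padding on its sides.
    deficit = width - img.shape[1]
    lpad = deficit // 2
    return np.pad(img, ((0, 0), (lpad, deficit - lpad)))

cut = pad_to_width(crop_cut(50))              # already 2*X wide, no padding
letter = pad_to_width(crop_letter_buggy(50))  # only X wide, gets X px of padding
print(cut.shape == letter.shape)              # True: same size, different meaning
```

Both crops end up as line_height * 2*x vectors, so nothing downstream complains, but every negative example now carries a telltale band of zeros on each side.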
When I first saw over 99% accuracy on the test set, I was really happy. But when I started testing it on images for actual segmentation, I was in for a surprise: almost every column was predicted to be a segmentation point. The odd column around a T, an I or a . was correctly classified as not a segmentation, but that was it.
What was going on? I tried a lot of different classifiers (almost everything in the scikit-learn library), and I tried adding ZCA whitening. Nothing helped.
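For reference, ZCA whitening decorrelates the pixel features while staying as close as possible to the original pixel space; a minimal numpy sketch (my own implementation, not what I actually ran):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA whitening: W = E diag(1/sqrt(s + eps)) E^T, where (s, E) are the
    eigenvalues/eigenvectors of the covariance of the centered data."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    s, E = np.linalg.eigh(cov)
    W = E @ np.diag(1.0 / np.sqrt(s + eps)) @ E.T
    return Xc @ W

X = np.random.default_rng(0).random((200, 30))
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]
print(np.allclose(cov_w, np.eye(30), atol=1e-2))  # near-identity covariance
```

Of course, no amount of preprocessing can help when the label is encoded in the padding itself.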
I started looking more closely at the images it cut out: most of the non-segmentation ones had a nice consistent 3-4px padding on their sides, while the segmentation images didn't. And that's what the classifiers were actually learning: if the image has padding, it is not a segmentation.
After fixing this, the classification accuracy dropped to about 90%, but this time the model was actually useful in the "real world".
Important lesson: make sure your classifiers are learning what you want them to learn, and not just some spurious correlation.
My consolation is that even some DoD researchers made the same mistake: http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/ :D