A couple weeks ago I started doing Project Euler problems. Unfortunately this was during their security breach so login was disabled so I couldn’t get credit for any of the problems I solved. Now that account functionality is restored, I want to resubmit my answers (about 60 of them) on my actual login. Since they have a 30 second rate limiting between answer submissions, it would’ve taken me half an hour to manually resubmit. This was a little too tedious and boring so I wanted to automate it. But they do have a CAPTCHA you need to fill out for each submission to prevent this…
Fortunately their CAPTCHAs are were really weak (EDIT: Project Euler made their CAPTCHAs significantly harder since this post): digits only, uniform font/font-size/font-color/background, well separated, unobstructed, etc. It seems like the only variation is that each digit is slightly rotated. Here are a few typical examples:
Writing a naive solver for these CAPTCHAs is pretty easy. The rest of this post will outline how (link to code is at the end).
First step is to extract each digit. Since all the digits are the same solid color, you can flood-fill to find all the contiguous region with the same color as the font, sweeping left to right. Here is what that looks like:
Once you have separated out the digits you need to classify them. Since there aren’t that many variations for each digit, you can actually just label enough training data to cover almost all the different cases.
So I went and labeled about 20 CAPTCHAs manually. This means 20 * 5 labeled digits, which is an average of 10 examples for each digit. This was good enough to cover most of the orientations each digit could be rotated in. For example these were all the labeled images for 2:
Then to classify new digits, you just need to match it up with the labeled digit it is most similar to. A pixel-wise comparison worked well as the distance function and can already classify more than half the CAPTCHAs correctly. Adding a threshold for matching only when the number of pixels is approximately correct made the classification perfect as far as I can tell. It makes sense that area is a good feature for distinguishing the digits since area is invariant under rotation (if not for pixel interpolation) and there are only a few digits with similar number of pixels (e.g., 6 and 9).
And that’s all there is to it! Once you have the solver it is easy to whip up a mechanize/beautifulsoup python script to submit/sleep(30)/repeat. Thirty minutes later it managed to submit all my answers with only 2 out of 60 CAPTCHAs failing, both succeeding on the automatic refresh. Both failures were due to overlapping digits where it only managed to extract 4 contiguous regions instead of 5 (which is a fixable problem but not worth the complexity in this case).
The solver is about 80 lines of python with the only external dependency being scipy for reading images.
The source code for the solver/submitter scripts can be found here (no answers included and please don’t use this for submitting answers to problems you didn’t solve yourself!).