Let’s say we want to find the difference between these two images:

You can do this by simply inverting the colors and then stacking them together at 50% opacity. This can be achieved with a single CSS filter:

`-webkit-filter: invert(100%) opacity(50%);`

What you should get back is a gray image, except in places where the two images differ. This works because inverting flips each color channel (`255 - c`) and opacity blends the two colors equally (`(c1 + c2) / 2`). So the resulting color is `((255 - c1) + c2) / 2`, which is just `rgb(127.5, 127.5, 127.5)` whenever the overlapping pixel colors `c1` and `c2` are the same.
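As a quick sanity check of that arithmetic, here is the per-channel math in a few lines of Python (just the formula above, not anything the browser does):

```python
def diff_pixel(c1, c2):
    """Invert the top pixel's channel, then blend 50/50 with the bottom pixel."""
    return ((255 - c1) + c2) / 2

# Identical channels land exactly on mid-gray:
print(diff_pixel(200, 200))  # 127.5
# Differing channels deviate from it:
print(diff_pixel(0, 255))    # 255.0
```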

Firefox 34 and older don’t support this, but you can fall back to an equivalent SVG filter:

`filter: url("data:image/svg+xml;utf8,<svg xmlns='http://www.w3.org/2000/svg'><filter id='invert' color-interpolation-filters='srgb'><feColorMatrix color-interpolation-filters='srgb' in='SourceGraphic' type='matrix' values='-1 0 0 0 1 0 -1 0 0 1 0 0 -1 0 1 0 0 0 0 0.5'/></filter></svg>#invert");`

Or use `mix-blend-mode: difference`, which actually produces prettier diffs ~~but isn’t supported on Chrome~~. *EDIT (10-2015): Ed in the comments below verified that mix-blend-mode now works on Chrome too! Check out his example.*

Doing the visual diff in pure CSS is nice because it works even if you don’t have access to the raw pixel data (which is usually only available by loading the image into a canvas or using a Chrome extension’s `captureVisibleTab`). But with CSS you can do a visual diff of any element, as long as you can overlap things on top of it.

In particular, let’s say your designer mocked up something in Photoshop and gave it to you as a PNG to build. To lay out elements matching the design, the usual workflow might involve manually measuring dimensions or overlaying a semitransparent mockup as a template. This process is usually pretty tedious, and since humans are surprisingly bad at detecting changes, you might miss subtle differences between the design and what you built.

So you can use this CSS trick to get a quick visual diff instead. For example, let’s say someone gave you a screenshot of google.com and asked you to replicate it. To help position things you can overlay the inverted mock (see the first section of the js tab) and then tweak the dimensions/margin/padding in the inspector until everything is aligned:

See the Pen vEGmKR by Franklin Ta (@fta) on CodePen.

When everything is solid gray, you know you got everything pixel perfect!

If you don’t want to manage your own servers, you can use some external service for signaling (and a public STUN server). But let’s say, just for fun, we don’t want to use any external services either. Then one interesting approach by Chris Ball is to manually exchange the session descriptions by pasting them into IM or email.

I thought it would be cool to extend his demo by transmitting the data using QR codes instead. Then you will be able to establish a WebRTC connection between a phone and a desktop (or two phones with each other) by just pointing their cameras at each other.

After gluing some QR code libraries together, this was the result:

These two are just plain static web pages running javascript (using jsqrcode for scanning QR codes, and jquery-qrcode for generating). On the left in the video is a remote debugger showing what the Android Chrome browser is seeing, and on the right is a regular desktop Chrome browser. The signaling process only requires one back and forth to generate an offer and reply with an answer. So both cameras are active the whole time looking for the other’s offer/answer while displaying QR codes of their own. Once the exchange is done they can open a data channel which in this case is used for a simple chat application.

The interesting part was actually how to encode the data into QR codes. The session description is about 2000 characters, which pushes the limit of what a single QR code was designed for:

This is too large to reliably scan especially since the image processing is done using javascript on a mobile device. Compressing can cut down the size by about 50% but this is still too large to read quickly. So you have to break up the data into smaller pieces and encode those into QR codes individually.
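A minimal sketch of that chunking idea in Python (the `index/total:` header format here is made up for illustration; the real demo’s framing may differ):

```python
def chunk_for_qr(data: str, size: int = 200):
    """Split a long payload into indexed pieces small enough for one QR code.
    Each piece gets a hypothetical 'index/total:' header so order survives."""
    pieces = [data[i:i + size] for i in range(0, len(data), size)]
    total = len(pieces)
    return [f"{i}/{total}:{p}" for i, p in enumerate(pieces)]

def reassemble(pieces):
    """Sort received pieces by index and strip the headers."""
    def index_of(p):
        return int(p.split("/", 1)[0])
    return "".join(p.split(":", 1)[1] for p in sorted(pieces, key=index_of))

# A stand-in for a ~2000 character session description:
sdp = "v=0 o=- 46117 2 IN IP4 127.0.0.1 ... " * 50
chunks = chunk_for_qr(sdp)
assert reassemble(chunks) == sdp
```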

But then you will need to scan multiple QR codes in series, and if you want to avoid requiring user interaction, you need to automatically know when to display the next QR code. At this stage the two devices can’t communicate yet, so it is hard to ‘ACK’ that you are done scanning a code unless both devices are simultaneously in view of each other’s cameras, which is very awkward to position physically. (Not that any of this is practical or serious in any way, but trying to build a full duplex connection out of QR codes is getting way deeper into IP-over-avian-carriers territory.)

So I ended up just flashing each QR code for a brief moment, over and over again. This doesn’t work super well: flip the codes too quickly and there will be tearing artifacts, but flip too slowly and you wait a long time for another chance whenever the scanner misses a frame. It did work consistently enough for a demo, so I was okay with declaring this experiment done!

Here’s a link to the demo you can try: https://db.tt/m3HdcC94. Some warnings:

- Requires latest Chrome on both desktop and Android (I used some Chrome specific stuff so Firefox isn’t supported). Tested with a MacBook Air and a Nexus 5.
- Requires camera permissions. If you don’t complete the flow, make sure to exit the page so it can release the camera.
- The exchange must complete within 30 seconds or it will timeout. It will probably take a couple of tries to learn the sweet spots for the scanner to do this quickly enough.
- Will probably heat up your CPU since it is naively checking for a QR code for every frame captured.

Code is on github.

This is certainly nothing new and was done over seven years ago. Oliver Kreylos, the creator of the video I just linked to, wrote a very good writeup on how he did the motion tracking, along with source code. But we will see how to reimplement the same effect with just one function using a computer vision library, OpenCV!

To see what we can do with the motion tracking, here’s a simple demo that plots the position of the wiimote as I trace letters in the air (also shown is a white square that represents the 4 LEDs, along with its projected image onto the wiimote’s camera):

And another that shows my hand:

First, a bit of background about the wiimote. In terms of sensing capabilities, it has an infrared camera, an accelerometer, and, if using Motion Plus, a gyroscope. If you want the exact specs you can check out WiiBrew’s wiki.

For our purposes we will only need the IR camera, which is normally used for tracking IR LEDs on the Wii’s sensor bar. The IR camera doesn’t capture traditional images; instead it has on-board image processing for detecting up to 4 blobs at 1024×768 resolution, which it reports back at up to 100Hz. This is pretty awesome since this is a relatively high report rate and we only have to deal with 4 coordinates rather than 1024×768 pixels like with a regular camera.

To use this data we need to connect the wiimote via bluetooth. There are a lot of libraries (most of which are abandoned) for interfacing with the wiimote. It doesn’t matter which one we choose since we just need very basic functionality for connecting to the device and reading raw IR values. The one I ended up using was wiiuse.

Once we have the wiimote talking to our computer we can capture “images” of up to 4 points like this with the IR camera:

This might not seem like much information but if I told you that the picture above was of a square, you might guess that the wiimote camera is looking at something like this:

Or to draw it a different way, it might be looking at it from this angle:

This process of guessing the camera's position/orientation is all we need for motion tracking! In other words, if we know the real shape of the object, we can just iteratively guess a location of the camera and see if it will take the same “photo” as the one we saw. In code this can be implemented with some variant of gradient descent that minimizes the reprojection error.
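Here’s a toy Python sketch of the reprojection error being minimized, assuming a simple pinhole model and, for brevity, a translation-only camera pose (the real problem also optimizes over rotation):

```python
def project(point, pose_t, f=1700.0, cx=512.0, cy=384.0):
    """Pinhole projection of a 3D world point for a camera translated by
    pose_t (rotation omitted to keep the sketch short)."""
    x, y, z = (p - t for p, t in zip(point, pose_t))
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(object_points, image_points, pose_t):
    """Sum of squared distances between observed and predicted image points,
    i.e. the quantity a PnP solver drives toward zero."""
    err = 0.0
    for obj, img in zip(object_points, image_points):
        u, v = project(obj, pose_t)
        err += (u - img[0]) ** 2 + (v - img[1]) ** 2
    return err

# A unit square of "LEDs" photographed by a camera 10 units back:
square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
true_pose = (0.5, 0.5, -10.0)
observed = [project(p, true_pose) for p in square]

# The true pose explains the photo exactly; a wrong guess leaves error.
assert reprojection_error(square, observed, true_pose) < 1e-9
assert reprojection_error(square, observed, (0.0, 0.0, -12.0)) > 1.0
```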

But you don’t need to implement it yourself. OpenCV has a powerful function that can do exactly that for you called solvePnP (where PnP stands for the perspective-n-point problem). The signature for it looks like this:

```
bool solvePnP(InputArray objectPoints,
              InputArray imagePoints,
              InputArray cameraMatrix,
              InputArray distCoeffs,
              OutputArray rvec,
              OutputArray tvec,
              bool useExtrinsicGuess=false,
              int flags=ITERATIVE)
```

Arguments:

- **objectPoints**: These are the points of the object that the camera is looking at, in world coordinates. For example, if we are looking at LEDs in the shape of a square this might be [(1, 1, 0), (2, 1, 0), (1, 2, 0), (2, 2, 0)].
- **imagePoints**: These are the 2D points representing the image that we got from the camera.
- **cameraMatrix/distCoeffs**: The camera matrix is also known as the intrinsic matrix and describes how your camera captures an image. It consists of the focal lengths, skew, and projection center. Usually we would do camera calibration to get exact values for these, but I wasn’t getting consistent measurements so I picked values close to the physical specs instead. The distortion coefficients can also be obtained from calibration, but I just assumed zero distortion.
- **rvec/tvec**: This is the output camera rotation/translation! It is used to build what is known as the extrinsic matrix, which gives us the rigid transform from the world frame to the camera frame. So if we have a point `p_w` in world coordinates and want to know where it is in camera coordinates, we compute `p_c = R p_w + t`; similarly, to take a point in camera coordinates back to the world, we compute `p_w = R^T (p_c - t)` (noting that the inverse of a rotation matrix is its transpose). For example, the position of the camera is (0, 0, 0) in the camera frame, so it is `-R^T t` in the world frame.
- **flags**: The algorithm to use. The default, CV_ITERATIVE, is based on Levenberg-Marquardt, which is the same algorithm Kreylos implemented.

Or in short, you need to give it objectPoints (3d coordinates defining the shape you’re looking at), imagePoints (the captured 2d points from the image) and camera intrinsics (how to reproject the object into a new image for calculating error). Then it will spit back out the camera position/rotation.
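A quick numeric check of those frame-change formulas, using a 90-degree rotation about z and an arbitrary translation (plain Python, no OpenCV):

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

# A 90-degree rotation about z, and an arbitrary translation:
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = (1.0, 2.0, 3.0)

# World -> camera: p_c = R p_w + t
p_w = (4.0, 5.0, 6.0)
p_c = tuple(a + b for a, b in zip(mat_vec(R, p_w), t))

# Camera -> world: p_w = R^T (p_c - t) recovers the original point.
back = mat_vec(transpose(R), tuple(a - b for a, b in zip(p_c, t)))
assert all(abs(a - b) < 1e-9 for a, b in zip(back, p_w))

# The camera sits at the camera-frame origin, i.e. at -R^T t in the world.
cam_pos = tuple(-c for c in mat_vec(transpose(R), t))
origin = tuple(a + b for a, b in zip(mat_vec(R, cam_pos), t))
assert all(abs(c) < 1e-9 for c in origin)
```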

Here’s a short snippet showing how it is used:

```
// Intrinsic
double fx = 1700;
double fy = 1700;
double cx = image_width / 2;
double cy = image_height / 2;
cv::Mat intrinsic = (cv::Mat_<double>(3, 3) <<
    fx,  0, cx,
     0, fy, cy,
     0,  0,  1
);

// Solve for rvec and tvec
cv::Mat rvec, tvec;
solvePnP(object_points, image_points, intrinsic,
         cv::noArray(), rvec, tvec);

// Extrinsic
cv::Mat R;
Rodrigues(rvec, R);
cv::Mat extrinsic = cv::Mat::eye(4, 4, CV_64F);
R.copyTo(extrinsic.rowRange(0, 3).colRange(0, 3));
tvec.copyTo(extrinsic.rowRange(0, 3).col(3));

// Extrinsic inverse
cv::Mat extrinsic_inv = extrinsic.inv();

// Set up where to view from
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(/* ... */);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
gluLookAt(/* ... */);

/*** Insert code here to draw in world frame ***/
glMultMatrixd(cv::Mat(extrinsic_inv.t()).ptr<double>(0));
/*** Insert code here to draw in camera frame ***/
```

Side note: This sometimes isn’t a unique solution. Especially for a symmetric shape like a square there are inherently many different positions that will explain the image you captured:

But the wiimote can only “see” 4 points so there’s not much you can do about it.

But the worst shape is probably the default wiimote sensor bar which is in a straight line:

If you just see a line of points, there are infinitely many camera poses that can explain your picture. To get around that we have to build our own IR LED marker!

It is easy to build one since all we are doing is making a circuit with four 940nm infrared LEDs. We will also need wires, resistors and batteries. I used this calculator to figure out what resistors I needed.

To make the LEDs into a square, I poked them through a piece of cardboard, using a thumbtack to make holes first, and then taped the pins of the LEDs together (I don’t have a soldering iron). I only chose a square because it was easy, but it could be whatever you want. LEDs are diodes, which means direction matters, so make sure you hook them up in the right order (the longer pin is positive). Here is what mine looks like:

If you are human you probably can’t see infrared but digital cameras can (as long as they don’t have an IR filter) so you can use that for debugging. Here is an image taken with my android phone that shows the LEDs glowing purple:

You can get the code on github if you want to experiment with it. I only tested on a Mac and with a Wii Remote Plus (which didn’t work with wiiuse at first without a small patch so it might be safer to go with the original wiimotes if you don’t need the gyro for anything else). I am not sure I would recommend using this for anything serious but it was cheap and fun to play with!

A few notes on limitations:

- The narrow field of view makes this very awkward to use since you have to keep the target visible to the camera at all times.
- LEDs have a narrow beamwidth so the wiimote might not detect them even if it is in its field of view. Retroreflective markers might work better here.
- Square shape for the LEDs was not a great idea because you can’t automatically tell which of the 4 (or 8, if you count looking from the back) directions you are looking from. I dealt with this by having one of the buttons cycle through the 4 orderings to select one manually, but it is annoying whenever the LEDs go out of view since then you have to do it all over again. It is probably better to have something asymmetric.
- You can scale up the physical size of the marker if you want to track from a farther distance. But regardless of physical scale it will usually start getting unusably jittery/jumpy once the projected image is smaller than, let’s say, about 1/10 of the imaging plane. This is probably because the camera’s real resolution is actually 128×96 (it does subpixel analysis to get to 1024×768 resolution). I didn’t do any filtering for the demos but maybe that would help (OpenCV actually has an implementation of a Kalman filter).

I wanted to add an easter egg where I could use the screens of those devices to display arbitrary things. I thought it would just be a simple matter of pixel pushing with translate/rotate/scale/etc. using CSS transforms, but I couldn't really get things to line up perfectly.

Frustrated, I started trying to solve it analytically instead. That means for any given shape, I need to solve for the perspective transform that warps an element into that shape. Once that is solved, it is easy to write a WYSIWYG helper script for outputting the CSS. Here's the final result:

See the Pen ifnqH by Franklin Ta (@fta) on CodePen.

See the coffeescript tab for the code. Or paste this gist into the console to try it out on any page that has jQuery. You will need to change the selector to whatever element you want to add the dots to. Using that, you can drag things into whatever (convex quadrilateral) shape you want. I ended up not using this for anything but hope someone else will find it useful!

The rest of this post will explain how to derive the equation for the transform since I remember not being able to find much about it back then. Looking at the code you will see that the core logic is just a few lines to set up and solve a system of linear equations. We will now see how to derive that system.

So let's say we have the 4 corners of the element we want to transform, \((x_i, y_i)\) where \(i \in \{0, 1, 2, 3\}\), and we want to map each \((x_i, y_i)\) to some \((u_i, v_i)\). From the docs of matrix3d, the transform we want is a homogeneous matrix, so we have to represent each point using homogeneous coordinates. In homogeneous coordinates, a point \((x, y)\) is represented by \((k x, k y, k)\) for any \(k \neq 0\). For example \((3, 2, 1)\) and \((6, 4, 2)\) both represent the point \((3, 2)\).

Thus the transformation matrix \(H\) that we want to solve for must satisfy

$$
\underbrace{
\begin{pmatrix}
h_0 & h_1 & h_2 \\
h_3 & h_4 & h_5 \\
h_6 & h_7 & h_8
\end{pmatrix}
}_{H}
\begin{pmatrix}
x_i \\ y_i \\ 1
\end{pmatrix}
= k_i
\begin{pmatrix}
u_i \\ v_i \\ 1
\end{pmatrix}
$$

for each \(i\), where the knowns are \(x_i, y_i, u_i, v_i\).

Notice that the \(H\) that satisfies this is not unique. For example, you can scale \(H\) by some constant and the resulting matrix will still map the points correctly (since you can also scale the \(k_i\) by the same amount and still represent the same homogeneous points). So assuming \(h_8 \neq 0\) (see footnote 1), we can always scale both sides until \(h_8 = 1\), which simplifies the problem for us a little bit:

$$
\begin{pmatrix}
h_0 & h_1 & h_2 \\
h_3 & h_4 & h_5 \\
h_6 & h_7 & 1
\end{pmatrix}
\begin{pmatrix}
x_i \\ y_i \\ 1
\end{pmatrix}
= k_i
\begin{pmatrix}
u_i \\ v_i \\ 1
\end{pmatrix}
$$

Now we should try to get it into a form that we can solve. Multiplying out we get:

$$
\begin{align*}
x_i h_0 + y_i h_1 + h_2 & = k_i u_i \\
x_i h_3 + y_i h_4 + h_5 & = k_i v_i \\
x_i h_6 + y_i h_7 + 1 & = k_i
\end{align*}
$$

We can get rid of \(k_i\) by substituting it from the third into the first two equations:

$$
\begin{align*}
x_i h_0 + y_i h_1 + h_2 & = u_i x_i h_6 + u_i y_i h_7 + u_i \\
x_i h_3 + y_i h_4 + h_5 & = v_i x_i h_6 + v_i y_i h_7 + v_i
\end{align*}
$$

Remember we are trying to solve for \(h_i\) so we should try to separate them out:

$$
\begin{align*}
x_i h_0 + y_i h_1 + h_2 - u_i x_i h_6 - u_i y_i h_7 & = u_i \\
x_i h_3 + y_i h_4 + h_5 - v_i x_i h_6 - v_i y_i h_7 & = v_i
\end{align*}
$$

Which in matrix notation is:

$$
\begin{pmatrix}
x_i & y_i & 1 & 0 & 0 & 0 & -u_i x_i & -u_i y_i \\
0 & 0 & 0 & x_i & y_i & 1 & -v_i x_i & -v_i y_i
\end{pmatrix}
\begin{pmatrix}
h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \\ h_6 \\ h_7
\end{pmatrix}
=
\begin{pmatrix}
u_i \\ v_i
\end{pmatrix}
$$

Since we have 4 of these mappings we can write them like this:

$$
\begin{pmatrix}
x_0 & y_0 & 1 & 0 & 0 & 0 & -u_0 x_0 & -u_0 y_0 \\
0 & 0 & 0 & x_0 & y_0 & 1 & -v_0 x_0 & -v_0 y_0 \\
x_1 & y_1 & 1 & 0 & 0 & 0 & -u_1 x_1 & -u_1 y_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -v_1 x_1 & -v_1 y_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -u_2 x_2 & -u_2 y_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -v_2 x_2 & -v_2 y_2 \\
x_3 & y_3 & 1 & 0 & 0 & 0 & -u_3 x_3 & -u_3 y_3 \\
0 & 0 & 0 & x_3 & y_3 & 1 & -v_3 x_3 & -v_3 y_3
\end{pmatrix}
\begin{pmatrix}
h_0 \\ h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \\ h_6 \\ h_7
\end{pmatrix}
=
\begin{pmatrix}
u_0 \\ v_0 \\ u_1 \\ v_1 \\ u_2 \\ v_2 \\ u_3 \\ v_3
\end{pmatrix}
$$

At this point we are done because it is in \(Ah = b\) form so we can just throw this at a matrix algebra library to solve for \(h\). It should spit back out the \(h_i\) which will let us recover the transform we want:

$$
H =
\begin{pmatrix}
h_0 & h_1 & h_2 \\
h_3 & h_4 & h_5 \\
h_6 & h_7 & h_8
\end{pmatrix}
$$

One last wrinkle is that matrix3d actually takes in a 4 by 4 matrix rather than a 3 by 3. Since we don't care about \(z\) values (because all our points are on the same plane, \(z=0\)) we can just make \(z\) map back to itself. Like so:

\[
\begin{pmatrix}
h_0 & h_1 & 0 & h_2 \\
h_3 & h_4 & 0 & h_5 \\
0 & 0 & 1 & 0 \\
h_6 & h_7 & 0 & h_8
\end{pmatrix}
\]

And that's the final matrix you use for matrix3d. Remember to specify it in column major order and also set the transform-origin to whatever you measured your points with respect to.

I didn't know what to google when I first did this, so I had to derive it by hand. I have recently been reading a book on computer vision and it turns out this is actually a basic problem in that field (see `getPerspectiveTransform` in OpenCV), and the technique for solving it is called direct linear transformation.
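The whole derivation fits in a short pure-Python sketch: build the 8×8 system from four correspondences and solve it (a tiny Gaussian elimination stands in for the matrix algebra library):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def find_homography(src, dst):
    """Build the 8x8 system from four (x, y) -> (u, v) correspondences and
    return the 3x3 matrix H (with h8 fixed to 1, as in the derivation)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def apply_homography(H, x, y):
    """Map (x, y) through H, dividing out the homogeneous coordinate k."""
    k = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / k,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / k)

# Map a unit square onto an arbitrary convex quadrilateral:
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 10), (50, 20), (60, 70), (5, 60)]
H = find_homography(src, dst)
for (x, y), (u, v) in zip(src, dst):
    pu, pv = apply_homography(H, x, y)
    assert abs(pu - u) < 1e-6 and abs(pv - v) < 1e-6
```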

Footnotes:

1. If you want to solve this rigorously, without assuming \(h_8 \neq 0\), you can still follow the same steps outlined in this post. You will just end up with a homogeneous system \(Ah = 0\) instead (where \(A\) is now an 8-by-9 matrix and \(h\) is a 9-vector). This can be solved by doing a singular value decomposition and then finding the singular vector corresponding to a singular value of zero. Then any scalar multiple of that singular vector is a solution. The js library I was using for math didn't support SVD very well though! ↩

EDIT (2016-11-09): As mentioned by Alexander in the comments, matrix3d used to be incorrect on Chrome under page zoom. I contributed a fix to chromium so it should be fine now. ↩

So I started digging through `java.util.Random`'s source and found that it is just a linear congruential generator, which can easily be cracked to obtain the original seed. The rest of this post will show how (if you just want the code, you can find it here).

A quick refresher on how LCGs work: starting with a seed `x0`, the next pseudorandom number generated is given by `x1 = (a * x0 + c) mod m`, for some fixed constants `a`, `c`, and `m`. So for example, let's say `a = 3`, `c = 5`, `m = 7` and we start with the seed `x0 = 0`; then the next few random numbers generated will be `5 = (3 * 0 + 5) mod 7`, `6 = (3 * 5 + 5) mod 7`, `2 = (3 * 6 + 5) mod 7`, `4 = (3 * 2 + 5) mod 7`, etc.

It should be clear that if I gave you a "random" number generated from this process (e.g., `x_i = 2`), you can predict the next number by applying the formula yourself (e.g., `x_{i+1} = (3 * 2 + 5) mod 7 = 4`).
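The toy generator above is a one-liner in Python:

```python
def lcg(x, a=3, c=5, m=7):
    """One step of the toy linear congruential generator."""
    return (a * x + c) % m

seed, sequence = 0, []
for _ in range(4):
    seed = lcg(seed)
    sequence.append(seed)
print(sequence)  # [5, 6, 2, 4]
```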

The relevant method in `java.util.Random` looks like this (where `m = 2^48`, or in other words, a 48-bit number is generated with each iteration):

```
public class Random implements java.io.Serializable {
    private final AtomicLong seed;

    private static final long multiplier = 0x5DEECE66DL;
    private static final long addend = 0xBL;
    private static final long mask = (1L << 48) - 1;

    protected int next(int bits) {
        long oldseed, nextseed;
        AtomicLong seed = this.seed;
        do {
            oldseed = seed.get();
            nextseed = (oldseed * multiplier + addend) & mask;
        } while (!seed.compareAndSet(oldseed, nextseed));
        return (int)(nextseed >>> (48 - bits));
    }

    /* ... */
}
```

Unlike our toy example, each call to `next` only returns the top `bits` bits of the 48-bit number generated, so there's a bit more work to do.

Now going back to the post title, how is `next(bits)` used for `Math.random()`? If you dig into the Java source code you will find that `Math.random()` is just a call to `nextDouble()` on a static instance of `Random`. And `nextDouble` calls `next(bits)` like this:

```
public double nextDouble() {
    return (((long)(next(26)) << 27) + next(27))
        / (double)(1L << 53);
}
```

Which means it concatenates the top 26 and 27 bits of two iterations of `next()` to make a 53-bit number, which it normalizes back into a double between 0 and 1.

We are interested in the original values returned by `next(26)` and `next(27)`, which can be recovered by reversing the operations done above. So assuming we have a value that we know was generated by calling `nextDouble()`, we can do the following:

```
long numerator = (long)(nextDoubleValue * (1L << 53));
int next26 = (int)(numerator >>> 27);
int next27 = (int)(numerator & ((1L << 27) - 1));
```

Now we have the top 26 and 27 bits of two consecutive seeds used by `next()`. But having only the top bits is not sufficient for generating future values. Fortunately, it is fairly quick to brute force the remaining unknown lower 48 - 26 = 22 bits (2^22 possibilities). You can do this by trying all 48-bit seeds that have the same top 26 bits as `next26` and applying the LCG formula to see if the next seed has the same top 27 bits as `next27`. Like so:

```
long upper27Of48Mask = ((1L << 27) - 1) << (48 - 27);
long oldSeedUpper26 = ((long)next26 << (48 - 26)) & mask;
long newSeedUpper27 = ((long)next27 << (48 - 27)) & mask;

ArrayList<Long> possibleSeeds = new ArrayList<Long>();
for (long oldSeed = oldSeedUpper26;
     oldSeed <= (oldSeedUpper26 + ((1L << (48 - 26)) - 1));
     oldSeed++) {
    long newSeed = (oldSeed * multiplier + addend) & mask;
    if ((newSeed & upper27Of48Mask) == newSeedUpper27) {
        possibleSeeds.add(newSeed);
    }
}
```
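Here is the same attack sketched end-to-end in Python (a port of the idea above rather than the post's actual Java; the constants are Java's). Note the brute force walks all 2^22 candidates:

```python
MULTIPLIER = 0x5DEECE66D
ADDEND = 0xB
MASK = (1 << 48) - 1

def next_seed(seed):
    """One step of Java's 48-bit LCG."""
    return (seed * MULTIPLIER + ADDEND) & MASK

def next_double(seed):
    """Java's nextDouble: top 26 bits of one step, top 27 of the next."""
    s1 = next_seed(seed)
    s2 = next_seed(s1)
    value = ((s1 >> 22) << 27) + (s2 >> 21)
    return value / float(1 << 53), s2  # (the double, the new internal state)

def crack(observed):
    """Recover candidate internal states from one observed nextDouble value."""
    numerator = int(observed * (1 << 53))
    next26 = numerator >> 27
    next27 = numerator & ((1 << 27) - 1)
    candidates = []
    base = next26 << 22
    for low in range(1 << 22):  # brute force the unknown low 22 bits
        new_seed = next_seed(base | low)
        if (new_seed >> 21) == next27:
            candidates.append(new_seed)
    return candidates

# Simulate one leaked Math.random() value, then predict the next one.
leaked, state = next_double(123456789)
prediction_target, _ = next_double(state)
recovered = crack(leaked)
assert state in recovered
assert any(next_double(s)[0] == prediction_target for s in recovered)
```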

It is possible that, out of the 2^22 seeds brute-forced, some other next value has the same top 27 bits purely by chance. Though if we make the (questionable) assumption that each `newSeed` is an independent random number, this probability is fairly small: 1 - (1 - 2^-27)^(2^22) = 0.03076676574.
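That probability is easy to confirm numerically:

```python
# Chance that at least one of the 2**22 candidate seeds collides on the
# top 27 bits by chance, under the independence assumption above.
p_false_match = 1 - (1 - 2**-27) ** (2**22)
print(p_false_match)  # ~0.0308
```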

So it is likely that we will find exactly one seed with the brute force. Using that seed we can now simulate the return value of all future calls to `next()`, and thus `nextDouble()` and `Math.random()` as well. So we just managed to predict all future values of this generator!

It is well known that `Math.random()` is insecure, so none of this is news to anyone. I just didn't expect it to be this easy, or that you could do it by observing just a single previous value!

Complete code for this can be found here.

As for the original competition this was written for, I actually ended up losing anyway. This only managed to guess a couple thousand out of 100000 correctly before exceeding the time limit. The other cheaters did a much better job by using java reflection to modify local variables to win.

Fortunately their CAPTCHAs ~~are~~ were really weak (EDIT: Project Euler made their CAPTCHAs significantly harder since this post): digits only, uniform font/font-size/font-color/background, well separated, unobstructed, etc. It seems like the only variation is that each digit is slightly rotated. Here are a few typical examples:

Writing a naive solver for these CAPTCHAs is pretty easy. The rest of this post will outline how (link to code is at the end).

The first step is to extract each digit. Since all the digits are the same solid color, you can flood-fill to find all the contiguous regions with the same color as the font, sweeping left to right. Here is what that looks like:
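A minimal version of that flood-fill extraction in Python (BFS over 4-connected pixels on a toy bitmap; the real solver works on image arrays loaded via scipy):

```python
from collections import deque

def extract_regions(grid, ink=1):
    """Flood-fill contiguous ink regions, sweeping columns left to right
    so the digits come out in reading order. Returns sets of (row, col)."""
    h, w = len(grid), len(grid[0])
    seen, regions = set(), []
    for x in range(w):
        for y in range(h):
            if grid[y][x] == ink and (y, x) not in seen:
                region, queue = set(), deque([(y, x)])
                seen.add((y, x))
                while queue:
                    cy, cx = queue.popleft()
                    region.add((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                           grid[ny][nx] == ink and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                regions.append(region)
    return regions

# Two separated blobs on a tiny bitmap stand in for two digits:
captcha = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
digits = extract_regions(captcha)
assert [len(d) for d in digits] == [3, 4]
```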

Once you have separated out the digits you need to classify them. Since there aren’t that many variations for each digit, you can actually just label enough training data to cover almost all the different cases.

So I went and labeled about 20 CAPTCHAs manually. This means 20 * 5 labeled digits, which is an average of 10 examples for each digit. This was good enough to cover most of the orientations each digit could be rotated in. For example these were all the labeled images for 2:

Then to classify new digits, you just need to match each one up with the labeled digit it is most similar to. A pixel-wise comparison worked well as the distance function and can already classify more than half the CAPTCHAs correctly. Adding a threshold to match only when the number of pixels is approximately correct made the classification perfect, as far as I can tell. It makes sense that area is a good feature for distinguishing the digits, since area is invariant under rotation (if not for pixel interpolation) and only a few digits have a similar number of pixels (e.g., 6 and 9).
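And a toy version of that classifier: nearest neighbor by pixel-wise distance, gated by an area threshold (the 3×3 "digits" and the tolerance here are made up for illustration):

```python
def pixel_distance(a, b):
    """Pixel-wise mismatch count between two equally sized binary bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def area(img):
    """Number of ink pixels -- roughly rotation-invariant, so a cheap gate."""
    return sum(sum(row) for row in img)

def classify(digit, labeled, area_tolerance=0.5):
    """Nearest neighbor over the labeled examples, considering only templates
    whose ink area is approximately right."""
    candidates = [(label, tmpl) for label, tmpl in labeled
                  if abs(area(tmpl) - area(digit)) <= area_tolerance * area(digit)]
    return min(candidates, key=lambda lt: pixel_distance(lt[1], digit))[0]

# Toy 3x3 "digits": a vertical bar ("1") and a cross ("+").
bar   = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
cross = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
labeled = [("1", bar), ("+", cross)]

# A bar with one missing pixel still matches the bar template.
noisy = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
assert classify(noisy, labeled) == "1"
```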

And that’s all there is to it! Once you have the solver it is easy to whip up a mechanize/beautifulsoup python script to submit/sleep(30)/repeat. Thirty minutes later it managed to submit all my answers with only 2 out of 60 CAPTCHAs failing, both succeeding on the automatic refresh. Both failures were due to overlapping digits where it only managed to extract 4 contiguous regions instead of 5 (which is a fixable problem but not worth the complexity in this case).

The solver is about 80 lines of python with the only external dependency being scipy for reading images.

The source code for the solver/submitter scripts can be found here (no answers included and please don’t use this for submitting answers to problems you didn’t solve yourself!).
