CAPTCHA - OCR breaker

Nitrax · July 20, 2016, 12:24pm

Hi fellas,

Today I will show you how to bypass OCR-based CAPTCHAs. The first thing to keep in mind is that every CAPTCHA is different and requires its own image processing. For the sake of this article, I selected two distinct types of CAPTCHA, each of which needs a specific approach to be solved.

Theory

OCR-based CAPTCHAs flooded the market around 2000. Their aim was to protect online web services against abuses such as ticket scalping, automated URL submission, or spam.

Those CAPTCHAs rely on the recognition of a string of characters in order to perform, as the acronym says, a reverse Turing test: verifying that the end user is human and preventing the use of bots.

To make this work and resist automation, companies process randomized strings by applying distortion, rotation and background randomization. However, those transformations have to be carefully balanced so that the CAPTCHA remains usable by humans, and it is this weakness that we will exploit to defeat it.

OCR-based CAPTCHAs are all about image processing. Our aim is to isolate the string to be recognized in order to pass the check. The methodology is as follows:

load image

remove background

if OCR successful
    then submit the result
else if rotation or distortion
    then rotation / distortion attenuation
        if OCR successful
            then submit the result
        else
            isolate each letter to improve the OCR result. Once the string is recognized, submit the result

As you can see in the pseudo code above, the methodology depends entirely on your CAPTCHA and on the degree of distortion and randomization applied to the image. That is why it is almost impossible to develop a universal CAPTCHA breaker: there are too many factors to take into consideration, and only a modular approach has a chance of reaching this goal.
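The pseudo code above can be sketched as a small modular pipeline. This is only an illustration under assumptions: `solveCaptcha`, the stage functions and the `recognize` / `looksValid` callbacks are hypothetical names, not part of any library, and plain strings stand in for real images and OCR.

```javascript
// Minimal sketch of the methodology: apply one processing stage at a time
// and retry the OCR after each of them. All names are hypothetical.
function solveCaptcha(image, stages, recognize, looksValid) {
    let current = image;

    for (const stage of stages) {
        current = stage(current);          // e.g. background removal, deskewing
        const text = recognize(current);   // OCR attempt on the current image

        if (looksValid(text)) {
            return text;                   // good enough: submit this result
        }
    }

    return null;                           // every stage failed
}

// Usage with stand-in stages (strings play the role of images here).
const stages = [
    (img) => img.replace(/#/g, ''),   // "remove background"
    (img) => img.replace(/~/g, ''),   // "attenuate distortion"
];
const recognize = (img) => img;                              // stand-in OCR
const looksValid = (text) => /^[a-zA-Z0-9]+$/.test(text);    // stand-in check

const solved = solveCaptcha('#a~b#c~', stages, recognize, looksValid);
console.log(solved);
```

Keeping each stage as an independent function is what makes the approach modular: a new CAPTCHA type only needs a different list of stages.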

Practical cases

The code below was written, for practical reasons, with NodeJS. This platform is easy to handle and provides a colossal number of modules, which reduces development time.

Example 1

This example is quite straightforward. The string color is completely different from the background, so a simple loop that replaces every value above 15 (that is, anything that is not near-black) with 255 (white) is enough to isolate the data.

for (var y = this.height * 4; y >= 0; y--) {
    for (var x = 0; x < this.width; x++) {
        let idx = (this.width * y + x);

        if (this.data[idx] > 15) {
            this.data[idx] = 255;
        }
    }
}

Result

Example 2

This time, it is far more complicated. The background is truly random and uses the same colors as the characters that we have to isolate. To deal with this and to reduce the impact of the processing on the string, I computed a color threshold for the image as the average of its values, after first excluding those close to white (234 and above). Then I replaced the values below this threshold with 255.

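// Running sum and count of the values that are not close to white.
let threshold = 0;
let counter = 0;

for (var y = this.height * 4; y >= 0; y--) {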
    for (var x = 0; x < this.width; x++) {
        let idx = (this.width * y + x);

        if (this.data[idx] < 234) {
            threshold += this.data[idx];
            ++counter;
        }
    }
}

for (var y = this.height * 4; y >= 0; y--) {
    for (var x = 0; x < this.width; x++) {
        let idx = (this.width * y + x);

        if (this.data[idx] < threshold / counter - 2) {
            this.data[idx] = 255;
        }
    }
}

Result

Real case

To show you a more concrete application, I picked a CAPTCHA challenge on http://canyouhack.it and developed a tool to solve it, following the methodology stated earlier.

'use strict';

const fs = require('fs');
const PNG = require('pngjs').PNG;
const request = require('request');
const tesseract = require('node-tesseract');

const FILENAME = 'captcha.png';

/**
 * Remove the background image in order to isolate the string to recognize
 * @param name Image name
 */
function removeBackgroung(name) {
    fs.createReadStream(name)
      .pipe(new PNG())
      .on('parsed', function() {

        for (var y = this.height * 4; y >= 0; y--) {
            for (var x = 0; x < this.width; x++) {
                let idx = (this.width * y + x);

                if (this.data[idx] != 0) {
                    this.data[idx] = 255;
                }
            }
        }

        this.pack().pipe(fs.createWriteStream(name));
    });
}

/**
 * Perform string recognition through the tesseract node module.
 */
function ocr() {
    tesseract.process(__dirname + '/../' + FILENAME, (err, text) => {
        if (err) {
            console.error(err);
        } else {
            let char = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890';
            let value = "";

            for (var i = 0; i < text.length; i++) {
                if (char.indexOf(text[i]) > -1) {
                    value += text[i];
                }
            }

            request.get(options.url + "?text=" + value, {jar: jar}).pipe(fs.createWriteStream('result.png'));
        }
    });
}

let options = {
    url: 'http://canyouhack.it/Content/Challenges/Captcha/Captcha1.php'
}

let jar = request.jar();
let cookie = "";

// Get the image and save the session cookie.
let stream = request.get(options)
    .on('response', (res) => {
        cookie = request.cookie(res.headers["set-cookie"].toString());
    })
    .pipe(fs.createWriteStream(FILENAME));

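// Once the image is saved, call the image processing and string recognition functions.
// Note: removeBackgroung() is asynchronous, so ocr() may start before the cleaned image is written.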
stream.on('finish', () => {
    setTimeout(function() {
        jar.setCookie(cookie, options.url);
        removeBackgroung(FILENAME);
        ocr();
    }, 1000);
});

Original CAPTCHA

Post processing

Token

Notes

Once again, every CAPTCHA is different and requires specific image processing. Consequently, this tool will not work with other types of CAPTCHA! However, you now have all the keys to create your own CAPTCHA breaker. Lastly, keep in mind that OCR tools such as tesseract or openCV are not perfect and recognition errors can occur. Nonetheless, it is easy to reach a 60% success rate, which is enough in most cases. Indeed, according to researchers, a CAPTCHA that can be bypassed more than 1% of the time is considered broken and not safe to use.

That’s it for today.

See you for the next tutorial of this series.

Best,
Nitrax

pry0cc · July 20, 2016, 12:28pm

Pretty decent article! Have you written your own captcha module?

Nitrax · July 20, 2016, 12:31pm

Do you mean a CAPTCHA generator?

oaktree · July 20, 2016, 3:48pm

Wait, hold on.

So what you’ve done here is clear up the image. What I’m missing is… where is the actual character/letter parsing going on and how is that accomplished? Or is that the tesseract module? How does tesseract work?

I saw a lecture once (on the internet) about training a CNN to recognize slightly altered characters/letters…

0x00pf · July 20, 2016, 4:27pm

In that case it will probably work better to use some mathematical morphology, or rank filters if you prefer. A combination of erosion and dilation operators, and maybe a median filter at the end, should help you get a clean image. Actually, a median filter applied to the result of your processing should remove all that salt-and-pepper noise.
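For anyone who wants to try it, a 3x3 median filter can be sketched as below. It assumes a grayscale image stored as a flat array with one value per pixel (not the RGBA buffer used in the tutorial); `medianFilter3x3` is a hypothetical helper name, and border pixels are simply copied.

```javascript
// Minimal 3x3 median filter sketch on a flat grayscale array
// (one value per pixel, row-major). Border pixels are left unchanged.
function medianFilter3x3(gray, width, height) {
    const out = Uint8Array.from(gray);

    for (let y = 1; y < height - 1; y++) {
        for (let x = 1; x < width - 1; x++) {
            const neighbours = [];

            // Collect the 9 values of the 3x3 neighbourhood.
            for (let dy = -1; dy <= 1; dy++) {
                for (let dx = -1; dx <= 1; dx++) {
                    neighbours.push(gray[(y + dy) * width + (x + dx)]);
                }
            }

            neighbours.sort((a, b) => a - b);
            out[y * width + x] = neighbours[4];   // middle of 9 sorted values
        }
    }

    return out;
}

// Usage: a white 5x5 image with a single black "pepper" pixel in the middle.
const noisy = Uint8Array.from([
    255, 255, 255, 255, 255,
    255, 255, 255, 255, 255,
    255, 255,   0, 255, 255,
    255, 255, 255, 255, 255,
    255, 255, 255, 255, 255,
]);
const cleaned = medianFilter3x3(noisy, 5, 5);
console.log(cleaned[12]);
```

An isolated pixel is outvoted by its eight neighbours, which is why this filter suits salt-and-pepper noise; thin character strokes can suffer too, so it is worth checking the result before the OCR step.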

I have a couple of questions:

Why do you loop height*4 instead of height? I’m not familiar with the environment you used, so this may be a stupid question.
What is the Token image in your last example? This may also be a stupid question.

The threshold algorithm you are using will fail miserably with many images. Otsu’s method works pretty well in many cases, as it is based on image statistical parameters… well it basically splits the histogram at the right place
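A minimal sketch of Otsu's method, for reference. It assumes a grayscale image given as a flat array of values in 0..255; `otsuThreshold` is a hypothetical helper name, and the returned value is the highest gray level assigned to the dark class.

```javascript
// Sketch of Otsu's method: pick the threshold that maximizes the
// between-class variance of the histogram. Input: values in 0..255.
function otsuThreshold(gray) {
    const histogram = new Array(256).fill(0);
    for (const value of gray) {
        histogram[value]++;
    }

    const total = gray.length;
    let sumAll = 0;
    for (let i = 0; i < 256; i++) {
        sumAll += i * histogram[i];
    }

    let weightDark = 0;      // number of pixels at or below t
    let sumDark = 0;         // sum of their values
    let bestVariance = -1;
    let bestThreshold = 0;

    for (let t = 0; t < 256; t++) {
        weightDark += histogram[t];
        if (weightDark === 0) continue;      // no dark pixel yet

        const weightLight = total - weightDark;
        if (weightLight === 0) break;        // every pixel already counted

        sumDark += t * histogram[t];
        const meanDark = sumDark / weightDark;
        const meanLight = (sumAll - sumDark) / weightLight;

        // Between-class variance (up to a constant factor).
        const variance = weightDark * weightLight * (meanDark - meanLight) ** 2;
        if (variance > bestVariance) {
            bestVariance = variance;
            bestThreshold = t;
        }
    }

    return bestThreshold;
}

// Usage: two clearly separated groups of gray levels.
const sample = [10, 10, 10, 12, 200, 200, 202, 200];
const level = otsuThreshold(sample);
console.log(level);
```

Pixels with a value at or below `level` would then go to one class and the rest to the other, instead of relying on a hand-picked constant.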

Congrats. Nice tut mate!

Nitrax · July 20, 2016, 4:48pm

Exactly! In most cases, removing the background is enough for OCR APIs or libraries to process the image, which tesseract, by the way, does well. However, in C++, openCV provides a lot of powerful functionality which considerably improves recognition performance. Moreover, splitting the image ‘manually’ is only necessary if a failure occurs and is quite trivial to develop (I can add that processing to this tut if you find it relevant).
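As a rough idea of what such a manual split could look like, here is a sketch based on a vertical projection: columns without any ink are treated as gaps between letters. It is an assumption rather than the processing used in this tut, `splitColumns` is a hypothetical helper, the input is a flat binarized array (0 = ink, 255 = background), and it only works when letters do not touch or overlap.

```javascript
// Sketch: find the horizontal ranges [startX, endX] that contain ink,
// scanning a binarized image column by column (0 = ink, 255 = background).
function splitColumns(binary, width, height) {
    const segments = [];
    let start = -1;   // first column of the letter being scanned, or -1

    for (let x = 0; x < width; x++) {
        let hasInk = false;
        for (let y = 0; y < height; y++) {
            if (binary[y * width + x] === 0) {
                hasInk = true;
                break;
            }
        }

        if (hasInk && start === -1) {
            start = x;                        // a new letter begins
        } else if (!hasInk && start !== -1) {
            segments.push([start, x - 1]);    // the letter ended on the previous column
            start = -1;
        }
    }

    if (start !== -1) {
        segments.push([start, width - 1]);    // letter touching the right edge
    }

    return segments;
}

// Usage: a 7x2 image with two separate blobs of ink.
const binary = [
    255,   0,   0, 255, 255,   0, 255,
    255,   0, 255, 255, 255,   0,   0,
];
const letters = splitColumns(binary, 7, 2);
console.log(letters);
```

Each range can then be cropped and passed to the OCR on its own.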

I will not be able to explain in detail how tesseract works. I just heard that it is one of the best-known OCR libraries available.

I hope it answers your question.

Best,
Nitrax

Cromical · July 20, 2016, 4:49pm

Isn’t tesseract some kind of optical character recognition software? Upon further study it does actually seem like a good option. And a good name to boot. (Been supported by Google since 2006) - (Open source since 2005 after Hewlett Packard released it) - (popular on Ubuntu and Windows). And it lacks a GUI, so you’d have to run it from the command line.
Here’s the architecture:

Nitrax · July 20, 2016, 5:00pm

I’m happy to see that you enjoyed this tut mate

You are right about the weakness of the algorithm implemented. In fact, it was more a proof of concept than anything else, and this code base can easily be improved by implementing the algorithms that you pointed out in your comment.

This part was weird for me too, but the size of the image downloaded and accessible from this callback was simply divided by 4… I didn’t find anything on the internet about this bug and, due to a lack of time, I wasn’t able to dig deeper into it.

That was just to prove that the tool works and validates this challenge.

P.S : No questions are stupid ^^

Nitrax · July 20, 2016, 5:03pm

Yep, it is and it has been wrapped in several languages e.g. NodeJS, C++, Go, …

Cromical · July 20, 2016, 5:14pm

It’s honestly great. Seems it’s died down though, although I can imagine it being extremely popular in the 2000s.

0x00pf · July 20, 2016, 5:15pm

I would say the library converts the image to a 32-bit representation, so you have 4 bytes per pixel. That would mean that what you actually have to multiply by 4 is the width. As you are applying the threshold to each of the pixel’s components individually it just works, but if you try to get the colour of the pixel you will get something strange.
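To make that concrete, here is a sketch of the same clean-up done per pixel rather than per byte. It assumes an RGBA buffer with 4 bytes per pixel, as described above; `whitenNonBlackPixels` is a hypothetical name and the cutoff of 15 is taken from the first example.

```javascript
// Sketch: whiten every pixel that is not near-black, addressing the
// RGBA buffer pixel by pixel (4 bytes per pixel: R, G, B, A).
function whitenNonBlackPixels(data, width, height, cutoff = 15) {
    for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
            const idx = (width * y + x) * 4;   // index of the R byte of pixel (x, y)
            const r = data[idx];
            const g = data[idx + 1];
            const b = data[idx + 2];

            // The whole pixel is judged at once, so its channels stay consistent.
            if (r > cutoff || g > cutoff || b > cutoff) {
                data[idx] = 255;
                data[idx + 1] = 255;
                data[idx + 2] = 255;
            }
        }
    }

    return data;
}

// Usage: a 2x1 image with one black pixel and one mostly green pixel.
const rgba = Uint8Array.from([
    0, 0, 0, 255,       // black, kept as is
    10, 200, 5, 255,    // not black, becomes white
]);
whitenNonBlackPixels(rgba, 2, 1);
console.log(Array.from(rgba));
```

With the byte-wise loop, the second pixel would keep its low red and blue values and only have its green byte whitened, which is the kind of strange colour mentioned above.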

Nitrax · July 20, 2016, 5:19pm

I didn’t think about that. Thanks for your input ! You solved it

Nitrax · July 20, 2016, 5:29pm

A lot of the CAPTCHAs used nowadays are, security-wise, as poor as the examples above ^^ I know it’s a shame but it’s the reality ^^

oaktree · July 20, 2016, 5:52pm

Yeah. Probably R G B A, then – a byte for each.

Cromical · July 20, 2016, 6:43pm

Yes, sad really.
