Hi fellas,
Today I will show you how to bypass OCR-based CAPTCHAs. The first thing to keep in mind is that every CAPTCHA is different and requires its own image processing. For this article, I selected two distinct types of CAPTCHA, each needing a specific approach to be solved.
Theory
OCR-based CAPTCHAs flooded the market around 2000, with the aim of securing online web services against abuses such as ticket scalping, automated URL submission or spam.
These CAPTCHAs rely on the recognition of a string of characters in order to perform, as the acronym says, a reverse Turing test that verifies the end user is human and keeps bots out.
To make this work and resist automation, companies generate randomized strings and apply distortion, rotation and background randomization. However, these transformations have to be carefully balanced so that humans can still read the result, and it is this weakness that we will exploit to defeat them.
OCR-based CAPTCHAs are all about image processing. Our aim is to isolate the string to be recognized in order to pass the check. The methodology is as follows:
load image
remove background
if OCR successful
    then submit the result
else if rotation or distortion
    then rotation / distortion attenuation
    if OCR successful
        then submit the result
    else
        isolate each letter to improve the OCR result. Once the string is recognized, submit the result
As you can see in the pseudocode above, the methodology depends entirely on your CAPTCHA and on the degree of distortion and randomization applied to the image. That is why it is almost impossible to develop a universal CAPTCHA breaker: there are too many factors to take into consideration, and only a modular approach can hope to reach this goal.
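The modular idea can be sketched as a list of preprocessing steps, with an OCR attempt after each one. This is only a sketch of the control flow: `solve`, the two stand-in steps and `fakeOcr` are hypothetical names that do not come from any library, and a real solver would plug in actual image filters and a real OCR call.

```javascript
'use strict';

// Apply each preprocessing step in turn and try OCR after every step.
// Returns the first non-empty OCR result, or null if every attempt fails.
function solve(image, steps, recognize) {
  let current = image;
  for (const step of steps) {
    current = step(current);
    const text = recognize(current);
    if (text) {
      return text;
    }
  }
  return null;
}

// Stand-in steps working on a tiny array of grey levels (hypothetical).
const removeBackground = (img) => img.map((v) => (v > 15 ? 255 : v));
const binarize = (img) => img.map((v) => (v < 128 ? 0 : 255));

// Fake OCR: "succeeds" only once the image is pure black and white.
const fakeOcr = (img) => (img.every((v) => v === 0 || v === 255) ? 'ok' : null);

const image = Uint8Array.from([0, 10, 200, 255]);
console.log(solve(image, [removeBackground, binarize], fakeOcr)); // ok
```

Adding support for a new CAPTCHA family then comes down to writing a new list of steps rather than a new program.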
Practical cases
For practical reasons, the code below was written for Node.js. JavaScript is easy to handle, and Node.js provides a colossal number of modules, which reduces development time.
Example 1
This example is quite straightforward. The string color is completely different from the background, so a simple loop that replaces every value above 15 (that is, anything that is not near-black) with 255 (white) is enough to isolate the data.
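// this.data is a flat RGBA buffer (4 bytes per pixel), so it holds
// height * 4 rows of width bytes; each byte is tested on its own.
for (let y = this.height * 4 - 1; y >= 0; y--) {
  for (let x = 0; x < this.width; x++) {
    const idx = this.width * y + x;
    // Anything brighter than near-black becomes white.
    if (this.data[idx] > 15) {
      this.data[idx] = 255;
    }
  }
}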
Result
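To see what the loop above does without a real image, here is a small sketch on a hand-made buffer of two RGBA pixels. `whitenAbove` is a hypothetical helper name, and the pixel values are made up for the illustration.

```javascript
'use strict';

// Hypothetical helper: same rule as the loop above, applied to any byte buffer.
function whitenAbove(data, limit) {
  for (let i = 0; i < data.length; i++) {
    if (data[i] > limit) {
      data[i] = 255;
    }
  }
  return data;
}

// Two RGBA pixels: one near-black (character), one grey (background).
const pixels = Buffer.from([
  10, 12, 8, 255,     // character pixel, every color channel <= 15
  120, 130, 125, 255  // background pixel
]);

console.log(Array.from(whitenAbove(pixels, 15)).join(' '));
// 10 12 8 255 255 255 255 255
```

The character pixel is left untouched while the background pixel turns white. The alpha byte is already 255, so treating it like a color channel is harmless here.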
Example 2
This time, it is much more complicated. The background is truly random and uses the same colors as the characters we have to isolate. To solve this and limit the impact of the processing on the string, I computed a color threshold for the image as the mean of all values, after first excluding those close to white (234 and above). Then I replaced every value lower than this threshold, minus a small margin of 2, with 255.
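let threshold = 0; // sum of the values that are not close to white
let counter = 0;   // number of values in that sum
// First pass: accumulate every byte below 234.
for (let y = this.height * 4 - 1; y >= 0; y--) {
  for (let x = 0; x < this.width; x++) {
    const idx = this.width * y + x;
    if (this.data[idx] < 234) {
      threshold += this.data[idx];
      ++counter;
    }
  }
}
// Second pass: whiten every byte below the mean, minus a margin of 2.
for (let y = this.height * 4 - 1; y >= 0; y--) {
  for (let x = 0; x < this.width; x++) {
    const idx = this.width * y + x;
    if (this.data[idx] < threshold / counter - 2) {
      this.data[idx] = 255;
    }
  }
}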
Result
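The arithmetic behind the two passes above is easier to follow on a handful of values. The sketch below uses a hypothetical helper, `whitenBelowMean`, and made-up byte values; the cutoff of 234 and the margin of 2 are the ones used in this example.

```javascript
'use strict';

// Hypothetical helper: mean of the bytes below `ceiling`, then whiten
// every byte that is lower than that mean minus `margin`.
function whitenBelowMean(data, ceiling, margin) {
  let sum = 0;
  let count = 0;
  for (let i = 0; i < data.length; i++) {
    if (data[i] < ceiling) {
      sum += data[i];
      count++;
    }
  }
  if (count === 0) {
    return data; // nothing below the ceiling, so there is no mean to compute
  }
  const limit = sum / count - margin;
  for (let i = 0; i < data.length; i++) {
    if (data[i] < limit) {
      data[i] = 255;
    }
  }
  return data;
}

// Bytes below 234: 40, 60 and 200, so the mean is 100 and the limit is 98.
const bytes = Buffer.from([40, 60, 200, 250, 255]);
console.log(Array.from(whitenBelowMean(bytes, 234, 2)).join(' '));
// 255 255 200 250 255
```

Only 40 and 60 fall below the limit of 98, so they are the only values replaced.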
Real case
To show a more concrete application, I picked a CAPTCHA challenge on http://canyouhack.it and developed a tool to solve it, following the methodology described earlier.
'use strict';

const fs = require('fs');
const path = require('path');
const PNG = require('pngjs').PNG;
const request = require('request');
const tesseract = require('node-tesseract');

const FILENAME = 'captcha.png';
// Absolute path, so that the download, the image processing and the OCR
// all work on the same file.
const IMAGE_PATH = path.resolve(FILENAME);

const options = {
  url: 'http://canyouhack.it/Content/Challenges/Captcha/Captcha1.php'
};
// Cookie jar shared by both requests, so that the answer is sent within
// the same session as the image.
const jar = request.jar();

/**
 * Remove the image background in order to isolate the string to recognize.
 * @param name Image name
 * @param done Called once the processed image has been written to disk
 */
function removeBackground(name, done) {
  fs.createReadStream(name)
    .pipe(new PNG())
    .on('parsed', function() {
      // this.data is a flat RGBA buffer: height * 4 rows of width bytes.
      for (let y = this.height * 4 - 1; y >= 0; y--) {
        for (let x = 0; x < this.width; x++) {
          const idx = this.width * y + x;
          // Every byte that is not 0 (pure black) becomes white.
          if (this.data[idx] !== 0) {
            this.data[idx] = 255;
          }
        }
      }
      this.pack().pipe(fs.createWriteStream(name)).on('finish', done);
    });
}

/**
 * Perform string recognition through the tesseract node module,
 * then submit the result.
 */
function ocr() {
  tesseract.process(IMAGE_PATH, (err, text) => {
    if (err) {
      console.error(err);
      return;
    }
    // Keep only alphanumeric characters (drops whitespace and OCR noise).
    const value = text.replace(/[^a-zA-Z0-9]/g, '');
    request.get(options.url + '?text=' + value, {jar: jar})
      .pipe(fs.createWriteStream('result.png'));
  });
}

// Get the image; the session cookie is stored in the jar.
request.get({url: options.url, jar: jar})
  .pipe(fs.createWriteStream(IMAGE_PATH))
  // Once the image is saved, run the image processing, then the string
  // recognition once the processed image is on disk.
  .on('finish', () => {
    removeBackground(IMAGE_PATH, ocr);
  });
Original CAPTCHA
Post processing
Token
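The character filter in `ocr()` is worth a quick look on its own, since raw OCR output often comes with spaces, stray symbols and trailing newlines. The sketch below isolates that step; `sanitize` is a hypothetical name and the input string is made up, not actual Tesseract output.

```javascript
'use strict';

// Keep only the characters the CAPTCHA can contain (letters and digits).
function sanitize(text) {
  return text.replace(/[^a-zA-Z0-9]/g, '');
}

// Made-up OCR output with a space, a stray symbol and trailing newlines.
console.log(sanitize('x7 Kp|2\n\n')); // x7Kp2
```

Because only letters and digits survive, the value can be appended to the query string without any extra encoding.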
Notes
Once again, every CAPTCHA is different and requires specific image processing. Consequently, this tool will not work with other types of CAPTCHA! However, you now have all the keys to create your own CAPTCHA breaker. Lastly, keep in mind that OCR and image-processing tools such as Tesseract or OpenCV are not perfect and misreads can occur. Nonetheless, it is easy to reach a 60% success rate, which is enough in most cases. Indeed, according to researchers, a CAPTCHA that can be bypassed more than 1% of the time is considered broken and not safe to use.
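To put these percentages in perspective, here is a rough sketch that computes how many attempts a bot needs before at least one of them succeeds. It assumes that every attempt gets a fresh CAPTCHA and succeeds independently of the others, which is a simplification; `attemptsFor` is a hypothetical helper name and the 95% target is an arbitrary choice.

```javascript
'use strict';

// Hypothetical helper: smallest number of independent attempts such that
// at least one succeeds with probability >= target.
// Valid for success rates and targets strictly between 0 and 1.
function attemptsFor(successRate, target) {
  // P(at least one success in n attempts) = 1 - (1 - successRate)^n
  return Math.ceil(Math.log(1 - target) / Math.log(1 - successRate));
}

console.log(attemptsFor(0.6, 0.95));  // 4
console.log(attemptsFor(0.01, 0.95)); // 299
```

Under these assumptions, a solver that is right only 1% of the time still gets through within a few hundred automated requests, which gives some intuition for why such a low rate already counts as broken.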
That’s it for today.
See you for the next tutorial of this series.
Best,
Nitrax