CAPTCHA - Randomness applicability

Hi fellas,

To continue my CAPTCHA series, I will introduce the concept of randomness and its applicability to the domain of CAPTCHA breaking, with a deeper digression into cryptography, steganography and forensics.

Without further ado, I would like to get to the heart of the matter by explaining the concept of randomness. In a few words, randomness represents the lack of purpose, logic and objectivity of an event. This notion has generated plenty of debate within the sociological and philosophical communities. Indeed, well-known thinkers such as Rob Weatherill stated that “Biologically speaking life is anti-entropic. Life opposes entropy. In the natural course of events in the universe, entropy increase, there is an increase in randomness” (Sovereignty of Death). Consequently, death would increase entropy? Weird, eh? :slight_smile:

Fortunately, this concept is far more practical in computer science. In fact, randomness is one of the most important aspects of encryption nowadays. However, generating a truly random sequence is currently impractical, so the term pseudo-random is preferred. As you probably already guessed, randomness is a big deal: most current encryption and steganography schemes rely on the quality of the pseudo-random generator used to seed the key generation. In order to assess their quality, each new design has to pass several batteries of tests, e.g. DieHarder, NIST, Sexton, etc., to reach the minimum standard required to be usable. Regardless, it is almost impossible to verify the degree of randomness produced, making it one of the most challenging areas.
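To make the “pseudo” part concrete, here is a tiny sketch of my own (assuming Python 3.9+ for `random.randbytes`): a seeded PRNG replays exactly the same “random” bytes on every run, which is precisely why its output has to be vetted by those statistical test batteries before anyone trusts it for key material.

```python
import random
import os

# A deterministic PRNG: the same seed always reproduces the same "random" bytes.
random.seed(1337)
print(random.randbytes(8).hex())   # identical on every run of the script

random.seed(1337)
print(random.randbytes(8).hex())   # ...and identical again: pseudo-random, not random

# By contrast, the operating system's entropy pool gives different bytes each time.
print(os.urandom(8).hex())
```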

Well, now that you are aware of the current state of the art of this odd and curious concept, I will present ENT, a tool used to assess the randomness properties of a file.

ENT

ENT highlights 6 fundamental parameters:

The values in parentheses below represent the typical values obtained for a large, random file. A rough Python re-implementation of these statistics is sketched after the list.

Entropy : Information density of the content of a file expressed as a number of bits per character (7.999982).

Compression ratio : The compression ratio is tightly tied to the entropy. The higher the entropy, the lower the compression ratio (0).

Chi-square distribution : This test is the reference in this domain and is extremely sensitive to pseudo-random generator errors. The value is calculated for the stream of bytes in the file and expressed as an absolute number and a percentage, which indicates how frequently a truly random sequence would exceed this value (253.42 with 50%).

Arithmetic mean : This value is calculated by summing all bytes in a file and dividing by its length (127.5).

Monte Carlo estimation : Each successive sequence of six bytes is used to generate point coordinates (the first three bytes for X, the rest for Y). If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, it is counted as a hit. For a large file, this estimation tends towards the value of Pi (3.142422057 with an error of 0.03%).

Serial correlation coefficient : This value shows to what extent each byte depends on its predecessor (0).
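To make these parameters less abstract, here is a rough Python sketch that recomputes them over a file. It is a simplified approximation of what ENT prints, not a faithful re-implementation (for instance, the real tool also reports the probability attached to the chi-square statistic).

```python
#!/usr/bin/env python3
"""Rough, simplified approximation of the statistics ENT reports for a file."""
import math
import sys
from collections import Counter

def analyse(path):
    data = open(path, "rb").read()
    n = len(data)
    counts = Counter(data)

    # Entropy: information density in bits per byte (8 = perfectly random).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Compression ratio: how much an ideal compressor could still shrink the file.
    compression = (8 - entropy) / 8 * 100

    # Chi-square statistic against a uniform distribution of the 256 byte values.
    expected = n / 256
    chi_square = sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))

    # Arithmetic mean of all byte values (127.5 for random data).
    mean = sum(data) / n

    # Monte Carlo estimation of Pi: each 6-byte block becomes an (x, y) point in a
    # square; the ratio of points falling inside the circle converges to Pi/4.
    hits = total = 0
    radius = 2 ** 24 - 1
    for i in range(0, n - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        total += 1
        if x * x + y * y <= radius * radius:
            hits += 1
    monte_carlo_pi = 4 * hits / total if total else float("nan")

    # Serial correlation: how strongly each byte depends on its predecessor (0 = none).
    xs, ys = data[:-1], data[1:]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    norm = math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
    serial = cov / norm if norm else 0.0

    print(f"Entropy                 : {entropy:.6f} bits per byte")
    print(f"Compression ratio       : {compression:.2f} %")
    print(f"Chi-square distribution : {chi_square:.2f}")
    print(f"Arithmetic mean         : {mean:.4f}")
    print(f"Monte Carlo Pi estimate : {monte_carlo_pi:.9f}")
    print(f"Serial correlation      : {serial:.6f}")

if __name__ == "__main__":
    analyse(sys.argv[1])
```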

Steganalysis practical case

Steganalysis is to steganography what cryptanalysis is to cryptography: it consists of detecting hidden content inside a carrier. In practice, detecting the use of steganography is quite difficult because it is extremely hard to define a norm for each file format. Indeed, several factors, e.g. color density, color palette, etc., impact the file properties, increasing the percentage of false positives. However, in the majority of cases, the difference is obvious.

Let's look at a typical JPEG file! Here is the output of ENT for such a format:

Entropy : 7.963678
Compression ratio : 0
Chi-square distribution : 16545.95 at 0.01%
Arithmetic mean : 128.0575
Monte Carlo estimation : 3.174692110 with an error of 1.05%
Serial correlation coefficient: -0.002625

As you can see, those values are already quite close to those of a random file, which is a common source of confusion.

Original image & output

After steganography processing

Analysis

Current steganography tools mainly use cryptography to hide information within the carrier. Consequently, the entropy increases, modifying the image properties. As you can see in the images above, the entropy, the arithmetic mean and the Monte Carlo estimation become abnormally close to those of a random file, revealing the use of encryption! Awesome, right? Now you are able to determine whether or not an image carries secret information!
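As a crude way to automate that observation, one could compare the entropy of a suspect image against a known clean one. The file names and the 7.99 threshold below are purely hypothetical; a real detector would be calibrated against a corpus of clean images of the same format.

```python
import math
from collections import Counter

def file_entropy(path):
    """Shannon entropy in bits per byte, as in the ENT sketch above."""
    data = open(path, "rb").read()
    return -sum((c / len(data)) * math.log2(c / len(data))
                for c in Counter(data).values())

# Hypothetical file names; the threshold is an arbitrary starting point.
clean, suspect = file_entropy("original.jpg"), file_entropy("stego.jpg")
print(f"original: {clean:.6f}  suspect: {suspect:.6f}")

if suspect > 7.99 and suspect > clean:
    print("Entropy abnormally close to a random stream: possible encrypted payload")
else:
    print("Nothing obviously suspicious")
```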

CAPTCHA breaking

You are probably wondering how this assessment could be used to break a CAPTCHA! Let me introduce HumanAuth.

HumanAuth is a CAPTCHA based on the differentiation of natural and artificial images: more specifically, detecting what comes from nature (lake, tree, …) and what has been created by humans (belt, car, …). But how can we break and automate an attack against such a security measure? It is easier than you might think!

The big difference between natural and artificial images is the presence of grass, trees, bushes, etc., which compress far less efficiently than the regular shapes and flat surfaces of man-made objects. As I stated before, the compression ratio is correlated with the entropy. Consequently, a natural image will have a greater entropy than an artificial one, as the sketch after the example images below illustrates!

Tree

Belt
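For instance, here is a rough sketch of how such a classifier could look for images like the tree and belt examples above. It assumes the Pillow library, hypothetical local file names, and an arbitrary threshold that would need calibrating against real HumanAuth samples.

```python
import zlib
from PIL import Image  # Pillow, assumed to be installed

def pixel_compression_ratio(path):
    """Compress the raw pixel data: the harder it compresses, the 'busier' the image."""
    raw = Image.open(path).convert("RGB").tobytes()
    return len(zlib.compress(raw, 9)) / len(raw)

def looks_natural(path, threshold=0.65):
    # Natural textures (grass, leaves, water) keep a high ratio (high entropy),
    # while man-made objects with flat, regular surfaces compress much better.
    return pixel_compression_ratio(path) > threshold

for image in ("tree.png", "belt.png"):  # hypothetical local copies of challenge images
    kind = "natural" if looks_natural(image) else "artificial"
    print(f"{image}: classified as {kind}")
```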

Conclusion

Randomness is an inescapable concept that anyone planning to evolve in the hacking area should be aware of. Indeed, it is used in many important fields, e.g. cryptography, steganography, etc. As you have seen, randomness and CAPTCHAs seem to have no relation at all, yet it helped us bypass the security measure in place.

Last words: Think outside the box!


Great article mate! I was looking forward to something like this, and honestly, props to you for starting the “CAPTCHA series”; so far it’s been nothing but interesting and fun to read. :smile:


I love it when universal constants like Pi and e just keep coming back.

I appreciate this article and am happy to finally see that it is easier than expected.


I’m sorry if I come across as a total noob, but what does the entropy of a file mean? I’m completely lost on how it can be calculated and what it means.


Correct me if I’m wrong, but I believe entropy in this case has to do with the best possible scenario of data transfer with the least possible loss. By “loss” I mean data (byte) loss. As the legend Shannon said, it is “an absolute limit on the best possible average length of lossless encoding or compression of an information source.” Hope that clears things up a little bit more.

FYI, you can find more just by googling information theory. An insane amount of communications, if not all, are based on this field.


@pry0cc In the context of randomness, the entropy defines to what extent a bit is tied to the others within the bytes. The higher the entropy, the lower the relation between them. That is why it is expressed as a number within a range from 0 to 8. As @_py described in his post above, entropy is affected by compression algorithms. Indeed, such algorithms exploit bit redundancy to reduce the size of a file without data loss. The consequence of such an operation is, of course, a rise in entropy!
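To make the 0-to-8 range concrete, here is a quick sketch: a file made of a single repeated byte scores 0, while OS-provided random bytes land very close to the 8 bits-per-byte ceiling.

```python
import math
import os
from collections import Counter

def entropy(data):
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())

print(entropy(b"\x00" * 4096))     # 0.0   -> one byte value only, nothing unpredictable
print(entropy(os.urandom(4096)))   # ~7.95 -> close to the 8 bits-per-byte upper bound
```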

To conclude, entropy is not, as @_py said, the best possible scenario of data transfer with the least possible loss; it simply refers to the degree of randomness within the bytes.

By the way, there is a bunch of algorithms used to calculate this parameter, but I’m not aware of all of them.

Hope it helps to clarify things.

Best,
Nitrax


Glad to know it! :slight_smile:


That really clears things up. So how is correlation between bits calculated?

Here is an explanation of how entropy is calculated!

https://asecuritysite.com/encryption/ent
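Regarding the correlation part of the question: ENT’s serial correlation coefficient is essentially a correlation between each byte and the one that follows it, the same calculation sketched in the article above. A minimal version could look like this:

```python
import math
import os

def serial_correlation(data):
    """Correlation between each byte and its successor (0 = no dependence)."""
    xs, ys = data[:-1], data[1:]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    norm = math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
    return cov / norm if norm else 0.0

print(serial_correlation(os.urandom(65536)))        # ~0: each byte is independent
print(serial_correlation(bytes(range(256)) * 256))  # close to 1: each byte predicts the next
```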

