To continue my CAPTCHA series, I will introduce you to the concept of randomness and its applicability to CAPTCHA breaking, with a deeper digression into cryptography, steganography and forensics.
Without further ado, let me get to the heart of the matter by explaining the concept of randomness. In a few words, randomness is the lack of purpose, logic and objectivity in an event. This idea has generated plenty of debate within the sociological and philosophical communities. Indeed, well-known thinkers such as Rob Weatherill have stated that “Biologically speaking life is anti-entropic. Life opposes entropy. In the natural course of events in the universe, entropy increase, there is an increase in randomness” (Sovereignty of Death). Consequently, death would increase entropy? Weird, huh?
Fortunately, the concept is far more practical in computer science. In fact, randomness is one of the most important aspects of encryption nowadays. However, generating a truly random sequence is currently impractical, so the term pseudo-random is preferred. As you have probably guessed, randomness is a big deal: most current encryption and steganography schemes rely on the quality of the pseudo-random generator used to seed key generation. To assess that quality, each new design has to pass several batteries of tests (e.g. DieHarder, NIST, Sexton) to reach the minimum standard required to be usable. Even so, it is almost impossible to verify the degree of randomness actually produced, which makes this one of the most challenging areas.
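To make the word pseudo-random concrete, here is a minimal Python sketch. It assumes nothing beyond the standard library; the helper name prng_bytes and the seed value are made up for the illustration. It shows that a seeded generator replays exactly the same "random" bytes, which is why the quality and secrecy of the seed matter so much for key generation.

```python
import random
import secrets


def prng_bytes(seed: int, n: int) -> bytes:
    """Return n bytes from a generator seeded with `seed` (hypothetical helper)."""
    rng = random.Random(seed)  # Mersenne Twister: deterministic, not for crypto
    return bytes(rng.randrange(256) for _ in range(n))


if __name__ == "__main__":
    first = prng_bytes(1234, 16)
    second = prng_bytes(1234, 16)
    # Same seed, same output: anyone who learns the seed can replay the stream.
    print("same seed, same bytes:", first == second)
    # For keys, draw from the operating system's generator instead.
    print("OS-backed bytes:", secrets.token_bytes(16).hex())
```

The random module is fine for simulations, but for anything key-related the secrets module (or os.urandom) is the usual choice in Python.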
Now that you are aware of the current state of the art of this odd and curious concept, I will present ENT, a tool used to assess the randomness properties of a file.
ENT highlights six fundamental parameters. The value in parentheses after each one is the typical value obtained for a large random file:
Entropy: Information density of the content of a file, expressed as a number of bits per character (7.999982).
Compression ratio: The compression ratio is closely tied to the entropy. The higher the entropy, the lower the achievable compression (0%).
Chi-square distribution: This test is the reference in this domain and is extremely sensitive to errors in pseudo-random generators. The value is calculated over the stream of bytes in the file and expressed as an absolute number plus a percentage, which indicates how frequently a truly random sequence would exceed this value (253.42 with 50%).
Arithmetic mean: This value is calculated by summing all the bytes in a file and dividing by the file length (127.5).
Monte Carlo estimation: Each successive sequence of six bytes is used to generate point coordinates (the first three bytes are used for X and the rest for Y). If the distance of the randomly generated point is less than the radius of a circle inscribed within the square, it is counted as a hit. The proportion of hits gives an estimate of PI, which for a large random file tends to the real value (3.142422057 with an error of 0.03%).
Serial correlation coefficient: This value shows to what extent a byte depends on its predecessor (0).
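The parameters listed above can be approximated in a few lines of Python. This is only a sketch of the underlying formulas, not ENT's actual source: the function names are mine, the chi-square percentage is left out because it needs a chi-square distribution table, and the serial correlation pairs the last byte with the first one, which is an assumption of this sketch.

```python
import math
import os
from collections import Counter


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (8.0 is the maximum)."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def compression_percent(data: bytes) -> float:
    """Best-case size reduction implied by the entropy, in percent."""
    return 100.0 * (8.0 - entropy(data)) / 8.0


def chi_square(data: bytes) -> float:
    """Chi-square statistic of the byte histogram against a uniform one."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


def arithmetic_mean(data: bytes) -> float:
    """Sum of all byte values divided by the length."""
    return sum(data) / len(data)


def monte_carlo_pi(data: bytes) -> float:
    """Estimate PI from six-byte groups read as 24-bit (x, y) points."""
    if len(data) < 6:
        raise ValueError("need at least six bytes")
    radius = 2 ** 24 - 1
    hits = tries = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        tries += 1
        # Points lie in the square [0, radius]^2; the share that falls inside
        # the quarter disc approaches PI/4, the same ratio as a circle
        # inscribed in a square.
        if x * x + y * y <= radius * radius:
            hits += 1
    return 4.0 * hits / tries


def serial_correlation(data: bytes) -> float:
    """Correlation between each byte and the next one (wrapping at the end)."""
    n = len(data)
    s1 = sum(data)
    s2 = sum(b * b for b in data)
    s_pair = sum(data[i] * data[(i + 1) % n] for i in range(n))
    denom = n * s2 - s1 * s1
    if denom == 0:  # constant data: the coefficient is undefined
        return math.nan
    return (n * s_pair - s1 * s1) / denom


if __name__ == "__main__":
    sample = os.urandom(600_000)  # stand-in for "a large random file"
    print("entropy           :", entropy(sample))
    print("compression (%)   :", compression_percent(sample))
    print("chi-square        :", chi_square(sample))
    print("arithmetic mean   :", arithmetic_mean(sample))
    print("Monte Carlo PI    :", monte_carlo_pi(sample))
    print("serial correlation:", serial_correlation(sample))
```

Running it on os.urandom output should give figures in the same neighbourhood as the typical values quoted above, though they change from run to run.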
Steganalysis practical case
Steganalysis is to steganography what cryptanalysis is to cryptography: it consists of detecting hidden content inside a carrier. In practice, detecting the use of steganography is quite difficult because it is extremely hard to define a norm for each file format. Indeed, several factors (e.g. color density, color palette) affect the file's properties and increase the percentage of false positives. However, in the majority of cases, the difference is obvious.
Let's look at a typical JPEG file! Here is the output of ENT for this format:
Entropy: 7.963678
Compression ratio: 0%
Chi-square distribution: 16545.95 at 0.01%
Arithmetic mean: 128.0575
Monte Carlo estimation: 3.174692110 with an error of 1.05%
Serial correlation coefficient: -0.002625
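As you can see, most of these values are quite close to those of a random file, which is why people often mistake a JPEG for random data. Only the chi-square value (16545.95 against 253.42 for a random file) clearly sets it apart.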
Original image & output
After steganography processing
Current steganography tools mainly use cryptography to hide information within the carrier. Consequently, the entropy increases, modifying the image properties. As you can see in the images above, the entropy, the arithmetic mean and the Monte Carlo estimation are abnormally close to those of a random file, revealing the use of encryption! Awesome, right? Now you are able to determine whether or not an image carries secret information!
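The comparison described above can be sketched in Python as follows. This is an illustration under my own assumptions, not a real detector: it only looks at the entropy and the arithmetic mean (not the Monte Carlo estimation), the function names are hypothetical, and "closer to random" simply means closer to the ideal values of 8 bits per byte and 127.5.

```python
import math
from collections import Counter

IDEAL_ENTROPY = 8.0   # bits per byte for a perfectly random file
IDEAL_MEAN = 127.5    # average byte value for a perfectly random file


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def arithmetic_mean(data: bytes) -> float:
    """Sum of all byte values divided by the length."""
    return sum(data) / len(data)


def closer_to_random(original: bytes, suspect: bytes) -> bool:
    """True if `suspect` is nearer the random ideal than `original` on both
    entropy and arithmetic mean (hypothetical decision rule)."""
    entropy_gap_before = IDEAL_ENTROPY - entropy(original)
    entropy_gap_after = IDEAL_ENTROPY - entropy(suspect)
    mean_gap_before = abs(IDEAL_MEAN - arithmetic_mean(original))
    mean_gap_after = abs(IDEAL_MEAN - arithmetic_mean(suspect))
    return entropy_gap_after < entropy_gap_before and mean_gap_after < mean_gap_before


if __name__ == "__main__":
    # The file names are placeholders for the two images being compared.
    with open("original.jpg", "rb") as f_orig, open("suspect.jpg", "rb") as f_susp:
        print("suspect looks more random:", closer_to_random(f_orig.read(), f_susp.read()))
```

Because a plain JPEG already scores close to a random file, a rule like this is only meaningful when the original carrier is available for comparison.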
You are probably wondering how this assessment could be used to break a CAPTCHA! Let me introduce HumanAuth.
HumanAuth is a CAPTCHA based on telling natural images apart from artificial ones. More specifically, it asks you to detect what comes from nature (lake, tree, …) and what has been created by humans (belt, car, …). But how can we break such a security measure and automate the attack? It is easier than you think!
The big difference between natural and artificial images is the presence of grass, trees, bushes, … which makes the downloaded image harder to compress than usual! As I stated before, the compression ratio is correlated with the entropy. Consequently, a natural image will have a greater entropy than an artificial one!
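The reasoning above boils down to a threshold on the entropy of the image file. Here is a minimal Python sketch of that idea. Everything specific in it is an assumption: the function names, the way the threshold is calibrated (midpoint between the average entropies of a few hand-labelled samples), and the file name in the usage part. It is not HumanAuth code, and a real attack would need a reasonable number of labelled images to calibrate on.

```python
import math
from collections import Counter
from typing import Iterable


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def calibrate_threshold(natural: Iterable[bytes], artificial: Iterable[bytes]) -> float:
    """Midpoint between the mean entropies of two labelled sample sets."""
    nat = [entropy(d) for d in natural]
    art = [entropy(d) for d in artificial]
    return (sum(nat) / len(nat) + sum(art) / len(art)) / 2.0


def classify(image_bytes: bytes, threshold: float) -> str:
    """Label an image file 'natural' when its entropy reaches the threshold."""
    return "natural" if entropy(image_bytes) >= threshold else "artificial"


if __name__ == "__main__":
    # Toy byte strings standing in for downloaded images; replace them with
    # the raw contents of labelled image files.
    natural_samples = [bytes(range(256)) * 2]
    artificial_samples = [bytes(range(16)) * 32]
    threshold = calibrate_threshold(natural_samples, artificial_samples)
    with open("captcha_image.jpg", "rb") as handle:  # hypothetical file name
        print(classify(handle.read(), threshold))
```

A single threshold is the simplest possible rule; it will misclassify some images, so it is worth checking the error rate on images that were not used for calibration.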
Randomness is an inescapable concept that anyone planning to work in the hacking field should be aware of. Indeed, it is used in many important areas, e.g. cryptography and steganography. As you have seen, randomness and CAPTCHAs seem to have no relation at all, yet randomness helped us bypass the security measure in place.
Last words: think outside the box!