Continuing the discussion from CAPTCHA - Randomness applicability:

When I first discovered Information Theory many years ago I was amazed. The fact of being able to quantify the amount of information in a sequence of bits just looked awesome. So I would try to explain this as simpler as I can. There will be a bit of Math, but despite of those logarithms all around the place, the idea behind is pretty simple and elegant.

The Entropy is a magnitude that measures the unpredictability of the information contained by a set of possible values. The idea is that, if we can predict the information in the message, then we can *reconstruct* the message, and, if we can predict what comes next in a message, we can do some interesting stuff: compress data or having a chance to find out the next number in a sequence.

The Entropy measures, in a sense, our chances to predict the next symbol in a message.

# Amount of Information

Soâ€¦ How can we measure the amount of information contained in a message?. Letâ€™s figure this out using one classic example:

Suppose we have the following two symbols:

- It is not raining at Atacama Desert
- It is raining at Atacama Desert

So, which one of those symbols gives you more information?. In case you do not know, the Atacama desert at South America is one of the driest places in the work. It does not rain very often there.

Yes, you are completely right. If somebody tells you that it is raining at Atacama desert it is giving you a lot of informationâ€¦ becauseâ€¦ the probability of raining at that place is very, very low. If somebody says that it is not raining at the Atacama desert, you are not getting much information, because that is happening all the time.

Therefore, the amount of information of a given symbol/message is inversely proportional to the probability of occurrence of that symbol/message. The lower the probability, the higher the information we are received.

# Entropy

Back to the original definition, the entropy should give us a number that will provide a measurement of the predictability/unpredictability of the messages we may compose with a given set of symbols.

The expression to calculate the Entropy for a given set of symbols is:

H = SUM {p

_{i}* log (1 / p_{i})} = - SUM {(p_{i}) * log (p_{i})}

*You know 1/x = x ^{ -1} and exponents get out of the log multiplying*

where p_{i} is the probability of the i-th symbol in our set.

Note

In case you do not know, the function log

_{ b }(x) = y such as b^{ y }= x

For instance: log_{ 2}(8) = 3 because 2^{ 3 }= 8 log_{ 10 }(1000) = 4 because 10^{ 4 }=1000

So, what we have is a weighted average (we are multiplying each value in the average by its probability) of the logarithm of the inverse of the probability.

As we said, the amount of information in a symbol is inversely proportional to its predictability (remember the rain at Atacama?). As probabilities can only take values between 0 and 1, using the log function allows us to scale those values conveniently, providing a higher value for lower probabilities. You knowâ€¦ the log (1) equals 0 (because b ^{ 0} = 1 for all b) and the log (0) equals -infinite.

Letâ€™s do some numbering to make more sense out of that expression works.

# Messages

So, letâ€™s assume we have an alphabet composed of 4 symbols: A, B, C, D (yes I took this example from the wiki page :).

Now, letâ€™s suppose that all symbols have the same probability (1/4 it should be). Letâ€™s calculate the entropy for this set of symbols:

H = - (1/4 * log (1/4) + 1/4 * log (1/4) +â€¦)

H = - 4 * 1/4 * log(1/4)

H = log (1/4)

Now, letâ€™s choose a base for our logarithm. Depending on the base we chose, the result will be bits (or shannon if we use base 2), nats (if we use base e) or hartleys (if we use base 10). Letâ€™s go for base 2 and get some bits!

H = - log

_{2}(1/4) = log_{2}(4) = 2 bits

*Note the - is gone in the second step*

So, this is the maximum entropy, or in other words the amount of bits we have to use to encode this four symbols in binary. As all of them are equally probable, we cannot *compress* them. In other words, any sequence of those values is random and the entropy is maximal. We cannot predict the next value, it may be any of the four symbols.

Now letâ€™s chose a different set of probabilities:

p

_{ a}= 1/2

p_{ b}= 1/4

p_{ c}= 1/8

p_{ d}= 1/8

H = - (1/2 * log

_{2}(1/2) + 1/4 * log_{2}(1/4) + 1/8 * log_{2}(1/8) + 1/8 * log_{2}(1/8))

H = - (1/2 * (-1) + 1/4 * (-2) + 1/8 * (-3) + 1/8 * (-3))

H = - (-1/2 - 1/2 - 2*3/8) = - ( -1 - 3/4) = - (-7/4) = 1.75 bits

So, we need 1.75 bits to encode a message using this alphabetâ€¦ but how?

# Prefix Codes and Huffman Compression

Now, you may be wondering how do you can code 4 symbols using less than 2 bits. Well the answer is: variable-length codes We will assign longest codes to the less probable symbols. The simplest way to build such codes is using a prefix code

A prefix code assigns a code to each symbol in a way that there is no whole codeword that is a prefix of any other code word. The Huffman code is a prefix code that is optimal. Well, letâ€™s forget about this for now.

For our previous example, letâ€™s build a prefix code so we can encode our messages using (in average) 1.75 bits instead of 2.

```
Symbol |A |B |C |D
--------+----+----+------+-----
Prob |.5 |.25 |0.125 |0.125
Code 1 |0 |1 |1 |1
Code 2 |0 |10 |11 |11
Code 3 |0 |10 |110 |111
```

`Code1`

to `Code3`

are different iterations were we add a bit to the symbols with higher probability. The last line represents our final code.

At `Code 1`

we assign `0`

to `A`

(most probable symbol) and `1`

as first bit for all other symbols, in order to fullfil our prefix condition. We are done with `A`

but we have to continue with the rest. Now `A`

, the symbol that will appear more often, has the sorted codewordâ€¦ just one bit

At `Code 2`

we assign an extra `0`

to `B`

(the second more probable symbol). So the symbol `B`

(the second most common symbol) will have a 2 bits codeword. We add a `1`

to the rest of the symbols that have lower probabilities.

Finally, at `Code 3`

we assign extra `0`

to symbol `C`

and an extra `1`

to symbol `D`

. Now, the two less frequent symbols will use 3 bits instead of 2, but as they appear a lot less than the symbols `A`

and `B`

, in average the length of the message should be smaller.

Also note that, none of the code words is a prefix of any other. This will allow us to always unambiguously decode a sequence composed of those symbols. For instance

Given the sequence:

`000101101010011110000010101000`

We can decode it:

```
Variable-Len 0 0 0 10 110 10 10 0 111 10 0 0 0 0 10 10 10 0 0
A A A B C B B A D B A A A A B B B A A
Fixed-Len 00 00 00 01 02 01 01 00 11 01 00 00 00 00 01 01 01 00 00
```

Our sequence has 19 symbols. Using the normal 2 bits per symbol, the whole sequence will require 38 bits to be encoded , but we just used 31 bits. This gave us 31/19 = 1.63 bits per symbol, that is pretty close to our theoretical 1.75 bits per symbol.

If you produce a longer sequence with proper symbol probabilities you should get a value even closer to the theoretical entropy value.

# Nice, but what does all this mean?

Well, roughly speaking the entropy measures the randomness of a message. For completely random messages entropy will be maximal. Otherwise, the sequence is not random and therefore it can be predicted at some extend.

The direct application of this predictability is compression. If we can predict the next symbol in a sequence, then we do not need to store it or transmitted itâ€¦ we are making the sequence shorterâ€¦ we are compressing it.

The other direct application of predictability is actually predicting the next value in the sequence. This is directly related to the great post from Nitrax that you should read.

However, the entropy itself, does not help to predict the values. The entropy just let us know how much redundancy (predictability) is in our symbol sequence and therefore how far we can go predicting values in the sequence. Nitrax post is great because it takes advantage of this property in an unexpected wayâ€¦ As he saidâ€¦ â€śThink out of the boxâ€ť.

By the way, Huffman codes (prefix codes build using the Huffman algorithm) are used all around the world. As an example, the JPEG standard uses Huffman codes internally.

Well, not sure if this has clarified something or actually add more noise to your brain!