Information Theory 101. Entropy

0x00pf · August 23, 2016, 4:07pm

Continuing the discussion from CAPTCHA - Randomness applicability:

When I first discovered Information Theory many years ago I was amazed. The fact of being able to quantify the amount of information in a sequence of bits just looked awesome. So I would try to explain this as simpler as I can. There will be a bit of Math, but despite of those logarithms all around the place, the idea behind is pretty simple and elegant.

The Entropy is a magnitude that measures the unpredictability of the information contained by a set of possible values. The idea is that, if we can predict the information in the message, then we can reconstruct the message, and, if we can predict what comes next in a message, we can do some interesting stuff: compress data or having a chance to find out the next number in a sequence.

The Entropy measures, in a sense, our chances to predict the next symbol in a message.

Amount of Information

So… How can we measure the amount of information contained in a message?. Let’s figure this out using one classic example:

Suppose we have the following two symbols:

It is not raining at Atacama Desert
It is raining at Atacama Desert

So, which one of those symbols gives you more information?. In case you do not know, the Atacama desert at South America is one of the driest places in the work. It does not rain very often there.

Yes, you are completely right. If somebody tells you that it is raining at Atacama desert it is giving you a lot of information… because… the probability of raining at that place is very, very low. If somebody says that it is not raining at the Atacama desert, you are not getting much information, because that is happening all the time.

Therefore, the amount of information of a given symbol/message is inversely proportional to the probability of occurrence of that symbol/message. The lower the probability, the higher the information we are received.

Entropy

Back to the original definition, the entropy should give us a number that will provide a measurement of the predictability/unpredictability of the messages we may compose with a given set of symbols.

The expression to calculate the Entropy for a given set of symbols is:

H = SUM {p_i * log (1 / p_i)} = - SUM {(p_i) * log (p_i)}

You know 1/x = x^-1 and exponents get out of the log multiplying

where p_i is the probability of the i-th symbol in our set.

Note

In case you do not know, the function log_b(x) = y such as b^y = x
For instance: log₂ (8) = 3 because 2³ = 8 log₁₀ (1000) = 4 because 10⁴=1000

So, what we have is a weighted average (we are multiplying each value in the average by its probability) of the logarithm of the inverse of the probability.

As we said, the amount of information in a symbol is inversely proportional to its predictability (remember the rain at Atacama?). As probabilities can only take values between 0 and 1, using the log function allows us to scale those values conveniently, providing a higher value for lower probabilities. You know… the log (1) equals 0 (because b ⁰ = 1 for all b) and the log (0) equals -infinite.

Let’s do some numbering to make more sense out of that expression works.

Messages

So, let’s assume we have an alphabet composed of 4 symbols: A, B, C, D (yes I took this example from the wiki page :).

Now, let’s suppose that all symbols have the same probability (1/4 it should be). Let’s calculate the entropy for this set of symbols:

H = - (1/4 * log (1/4) + 1/4 * log (1/4) +…)
H = - 4 * 1/4 * log(1/4)
H = log (1/4)

Now, let’s choose a base for our logarithm. Depending on the base we chose, the result will be bits (or shannon if we use base 2), nats (if we use base e) or hartleys (if we use base 10). Let’s go for base 2 and get some bits!

H = - log₂ (1/4) = log₂ (4) = 2 bits

Note the - is gone in the second step

So, this is the maximum entropy, or in other words the amount of bits we have to use to encode this four symbols in binary. As all of them are equally probable, we cannot compress them. In other words, any sequence of those values is random and the entropy is maximal. We cannot predict the next value, it may be any of the four symbols.

Now let’s chose a different set of probabilities:

p_a = 1/2
p_b = 1/4
p_c = 1/8
p_d = 1/8

H = - (1/2 * log₂(1/2) + 1/4 * log₂(1/4) + 1/8 * log₂(1/8) + 1/8 * log₂(1/8))
H = - (1/2 * (-1) + 1/4 * (-2) + 1/8 * (-3) + 1/8 * (-3))
H = - (-1/2 - 1/2 - 2*3/8) = - ( -1 - 3/4) = - (-7/4) = 1.75 bits

So, we need 1.75 bits to encode a message using this alphabet… but how?

Prefix Codes and Huffman Compression

Now, you may be wondering how do you can code 4 symbols using less than 2 bits. Well the answer is: variable-length codes We will assign longest codes to the less probable symbols. The simplest way to build such codes is using a prefix code

A prefix code assigns a code to each symbol in a way that there is no whole codeword that is a prefix of any other code word. The Huffman code is a prefix code that is optimal. Well, let’s forget about this for now.

For our previous example, let’s build a prefix code so we can encode our messages using (in average) 1.75 bits instead of 2.

Symbol  |A   |B   |C     |D
--------+----+----+------+-----
Prob    |.5  |.25 |0.125 |0.125
Code 1  |0   |1   |1     |1
Code 2  |0   |10  |11    |11
Code 3  |0   |10  |110   |111

Code1 to Code3 are different iterations were we add a bit to the symbols with higher probability. The last line represents our final code.

At Code 1 we assign 0 to A (most probable symbol) and 1 as first bit for all other symbols, in order to fullfil our prefix condition. We are done with A but we have to continue with the rest. Now A, the symbol that will appear more often, has the sorted codeword… just one bit

At Code 2 we assign an extra 0 to B (the second more probable symbol). So the symbol B (the second most common symbol) will have a 2 bits codeword. We add a 1 to the rest of the symbols that have lower probabilities.

Finally, at Code 3 we assign extra 0 to symbol C and an extra 1 to symbol D. Now, the two less frequent symbols will use 3 bits instead of 2, but as they appear a lot less than the symbols A and B, in average the length of the message should be smaller.

Also note that, none of the code words is a prefix of any other. This will allow us to always unambiguously decode a sequence composed of those symbols. For instance

Given the sequence:
000101101010011110000010101000

We can decode it:

Variable-Len 0  0  0  10 110 10 10 0  111 10 0  0  0  0  10 10 10 0  0
             A  A  A  B  C   B  B  A   D   B A  A  A  A  B  B  B  A  A
Fixed-Len    00 00 00 01 02  01 01 00 11  01 00 00 00 00 01 01 01 00 00

Our sequence has 19 symbols. Using the normal 2 bits per symbol, the whole sequence will require 38 bits to be encoded , but we just used 31 bits. This gave us 31/19 = 1.63 bits per symbol, that is pretty close to our theoretical 1.75 bits per symbol.

If you produce a longer sequence with proper symbol probabilities you should get a value even closer to the theoretical entropy value.

Nice, but what does all this mean?

Well, roughly speaking the entropy measures the randomness of a message. For completely random messages entropy will be maximal. Otherwise, the sequence is not random and therefore it can be predicted at some extend.

The direct application of this predictability is compression. If we can predict the next symbol in a sequence, then we do not need to store it or transmitted it… we are making the sequence shorter… we are compressing it.

The other direct application of predictability is actually predicting the next value in the sequence. This is directly related to the great post from Nitrax that you should read.

However, the entropy itself, does not help to predict the values. The entropy just let us know how much redundancy (predictability) is in our symbol sequence and therefore how far we can go predicting values in the sequence. Nitrax post is great because it takes advantage of this property in an unexpected way… As he said… “Think out of the box”.

By the way, Huffman codes (prefix codes build using the Huffman algorithm) are used all around the world. As an example, the JPEG standard uses Huffman codes internally.

Well, not sure if this has clarified something or actually add more noise to your brain!

Nitrax · August 23, 2016, 4:26pm

Love it mate ! I appreciate that you took the time to clarify the concept of entropy ! Very well explained with some self-explaining comparisons, simplifying the approach of such notion !

By the way, the Huffman algorithm is widely used in the area of malware classification too. It is a reliable alternative to the Kolmogorov complexity. I think that I will further develop this concept in my next article.

Once again, well done and great initiative !

Best,
Nitrax

0x00pf · August 23, 2016, 4:35pm

that sounds great… looking forward to that malware stuff!

_py · August 23, 2016, 5:45pm

Fantastic explanation! Right when I was having my refresher in Information Theory for my uni exam, you drop this bomb article. It’d be great to see some compression algos in the future (i.e Huffman, Shannon.)

oaktree · August 23, 2016, 9:57pm

@0x00pf, I’ve honestly started “liking” your articles before even reading them. I just expect them to be good.

oaktree · August 23, 2016, 11:37pm

OK Just one tiny correction:

The log function only approaches 0 asymptotically. Thus, logx(0) does not exist for any x, but can be noted as -infinity.

0x00pf · August 24, 2016, 5:24am

That’s completely right @oaktree thanks for the correction

Donnette · August 24, 2016, 8:31pm

You are wicked smart! Thank you!!!
I love your ability to explain!

“If you can’t explain it simply, you don’t understand it well enough.” Albert Einstein

system · January 21, 2018, 12:41am

This topic was automatically closed after 30 days. New replies are no longer allowed.