What’s Entropy? (Part-II)

In the previous part we look at entropy in the retrospect of energy, while this part will focus on the entropy in the paradigm of probability and Information theory.

So, like last time we’ll use an example to understand the entropy in probability theory.

Entropy in Probability Theory

Suppose we have 3 sets of balls, each set contains 4 balls. The First set consists of 4 green balls, the second contains 3 green and 1 yellow and the third or last set consists of 2 green and 2 yellow balls.

We have to pick a ball randomly from a set and then put that ball back, then pick again. We have to repeat this process 4 times, so that we got four outcomes.

What will be the probability that we’ll pick the green ball every time?

As we put the ball back every time, so the probability of all picks will be independent of each other. So, we can apply the multiplicative law of probability, thus the probability of getting green balls in every four picks from set 1 will be 1 as the chances of getting a yellow ball is zero.

In case of set 2, the probability of getting a green ball will be 3/4 and for yellow it’ll be 1/4. So, the total the probability of getting green balls each time will be 0.105.

In case of set 3, the probability of getting a green ball will be 1/2, and It’ll be the same for yellow as well. So, the total the probability of getting green balls each time will be 0.063.

As you can see the values of total probability are quite small, so we can use some function to make them large enough to get some clear pattern. The function that’s suitable in this condition is logarithm base 2.

Logarithm gives the negative value. So, to get positive values we take negative logarithm and take an average of that resultant positive values, thees values are the Entropy of each set.

Shannon Entropy

Let's understand the concept of Entropy by means of decision tree and for that we need to see another example here.

Consider we have two machines that generate output consist of four alphabets i.e. A, B, C, D. machine 1 generates all alphabets with equal probability of 25%. But machine 2 generates A with 50% probability, B and C with 12.5% probability and D with 25% probability. Now the question is which machine is producing more information?

Claude Shannon the father of information theory, rephrased the question, If you have to predict the next alphabet for each machine what is the minimum of yes and no (binary) questions you would expect to ask?

In case of machine 1, we have to ask minimum 2 questions (as you can see in the figure below) to know the alphabet.

Average number of questions to be asked can be calculated as

where n is the numbers of questions required to reach the alphabet and in the case of machine 1 its value is 2.

In case of machine 2, if we use the same approach as machine 1 we will get a similar tree and same average number of questions.

So, what’s the difference between these two machines?

We can get better out of this by asking questions differently, according to the probability of alphabets.

As the probability of A is much greater, so we can first question to separate A from other alphabets.

In this way we can be more efficient as our average number of question decreases by using this approach.

So we can say that machine 2 is producing less information because there’s less uncertainty about its output.

Claude Shannon calls this measure of average uncertainty, Entropy

He uses the letter H to represent it, and the unit of Entropy he chooses was based on the uncertainty of a fair coin flip, and he calls this bit

So the generalized formula for entropy H is

Here we can generalize the n, as we know the number of question asked are actually the height of the tree and the height of the tree is calculated by using logarithm of base 2. So here’s the another reason for using logarithm for calculating entropy.

So, now we take log base 2 of the number of outcomes at a level of tree.

But how can we calculate the number of outcomes at a level? The number of outcomes at a level is also based on the probability as shown below

where p is the probability of that outcome.

So, the equation becomes

Shannon writes this equation slight differently as

He inverts the expression inside the logarithm by using the property of logarithm because of that there’s a negative sign in it.

Entropy in Information Theory

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding).

— Page 56, Machine Learning: A Probabilistic Perspective, 2012


The number of bits required to represent the event is known as information

The choice of the base-2 logarithm means that the units of the information measure is in bits (binary digits). The negative sign ensures that the result is always positive or zero.

Information will be zero when the probability of an event is 1 or a certainty, e.g. there is no surprise. So, we can say

Information provides a way to quantify the amount of surprise for an event measured in bits

Suppose we have four words which are dog, cat, fish and bird. We need to transmit these words through some communication signal in a way that at the other end we can identify the word unambiguously.

So, how much information we need or say how many bits are required to represent these words uniquely.

The answer is 2, so we can represent these four words distinctively with 2 bits.

Consider the probability of these words are equally likely i.e. 1/4 each. Hence, we can calculate entropy by using the formula from the previous section and get that on average we need 2 bits to represent these four words.

But if we have skewed probabilities i.e. dog 50%, cat 25%, fish 12.5% and bird 12.5%. Then we can change the information of words as well.

By using this representation, we just need 1.75 bits on average to represent these four words.

There is no representation which will give us an average length of less than 1.75 bits on these words. So, there is simply a fundamental limit, and we call this fundamental limit the entropy of the distribution.

You can see we compress that data as now we need less number of bits to transmit data.

In case of equal probabilities, we need the same amount of information to represent each event. But in case of skewed probabilities, we need less information to represent likely (high probability) events, and need more information to represent unlikely (low probability)events.

Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution

There’s a cost of using less information i.e. when we shorten the code for some words, we have to increase the length of codes for other words like we do in our example.

So, actually we are sacrificing some space when we shorten the codes. For example if we use 01 code, we're actually sacrificing every other code that have prefix of 01 such as we can’t use 010 or 011010110 anymore because of ambiguity

We can calculate the sacrifice mathematically as

where L is the length of bits, if we apply this formula on above example we’ll lose a quarter of all possible codes.

We have to trade off between using less information and loss of space, when we are compressing data, that’s one of the core objective of information theory. That’s just a glimpse of information theory.


In information theory, understanding entropy helps you understand data compression. The entropy of a message is in a certain sense a measure of how much information it really contains.

Entropy is the measure of uncertainty, it’ll be high when probabilities are balanced(surprising) , and it’ll be low if the probabilities are more skewed(unsurprising)