Entropy

What it is

Entropy is a way to measure the amount of uncertainty a probability distribution exhibits. The more uncertain it is which values are sampled next, the higher the entropy value.

A second way to interpret entropy is to think about it in terms of information needed to communicate an outcome. The more uncertain the outcome, the more information needs to be communicated, the higher the entropy.

Entropy is a scientific concept that can also be found in other fields such as Physics or Chemistry. The information-theoretic approach that's used in Cryptography was defined by "the father of information theory" Claude Shannon in his 1948 paper A Mathematical Theory of Communication.

Entropy and its relation to information

As mentioned above, one can interpret entropy as the amount of information that needs to be shared to communicate an (uncertain) outcome.

To see why that is, let's work through some examples in which two characters called Alice and Bob want to communicate the results of coin flips over a long distance. They agreed prior to use a binary encoding to communicate the outcomes.

Example #1: Heavily biased coin

In the first example, Alice uses a heavily biased coin which will always turn heads. The fact that Alice uses such coin is also known to Bob. How much information does Alice need to share with Bob so that Bob knows what the result of the coin flip was?

If we think about it, there's no need to share any information at all. No matter how often Alice flips the coin, it will always land heads. So the entropy of every outcome that's shared is exactly $0$ bits.

Thinking about this mathematically we can express it as follows.

There's exactly one outcome, so to express this outcome we need $0$ bits which can be used to describe our $2^0 = 1$ possibilities. The entropy in bits can then be calculated as the binary logarithm (logarithm to the base $2$ ) of the possible outcomes which translates to $log_{2}(2^0) = 0$ bits of entropy.

Example #2: Fair coin

In the second example, Alice uses a fair coin which shows heads or tails with equal probability. Alice wants to communicate the result of two subsequent coin flips to Bob. How many bits does she need to do so and how much entropy does this information have?

To answer the first question we can see easily see that there are $4$ potential outcomes.

Heads, Heads
Head, Tails
Tails, Heads
Tails, Tails

To express this in binary we need $2$ bits which can be used to describe the $2^2 = 4$ listed outcomes. To calculate the entropy in bits we can once again use the binary logarithm which shows us that this information has $log_{2}{2^2} = 2$ bits of entropy.

This matches our intuition. Given heads and tails are equally likely to occur there's a lot of uncertainty for Bob as to what the result of Alice's two consecutive coin flips might be. Alice therefore needs to communicate exactly two bits of information to ensure that Bob receives the correct information.

The general formula we can derive is: $log_{2}(2^N) = N$ bits of entropy.

Aside: Calculating entropy can also help in finding data compression opportunities.

In our example above we could've sent the result of the two coin flips as two ASCII characters where H would represent heads and T represents tails. A heads, tails flip would be encoded as HT.

To represent one ASCII character we need to use 1 byte or 8 bits (the early ASCII standard used 7 bits, but most modern implementations use 8 bits). So if we send the string HT over the wire we'd need $2 \times 8 = 16$ bits.

Using our binary encoding we only need 2 bits. A whopping 8x improvement when it comes to data savings. Note that the information is still retained as it doesn't matter if we use ASCII characters or binary strings as long as there's an agreement on the data interpretation.

Entropy and its relation to uncertainty

Entropy can also be interpreted in relation to uncertainty. More specifically, entropy can be understood as the uncertainty that's present when a value is sampled from a random distribution.

To get a better understanding why that's the case we'll look at two examples in which Alice and Bob once again want to communicate the results of coin flips over a long distance using a binary encoding.

Example #1: Fair coin

The first example considers a fair coin, meaning that it's equally likely that flipping it results in heads or tails.

Alice flips the coin once and wants to send the result over to Bob. How much uncertainty will be expressed in the information she sends over to Bob?

The first thing to note is that Alice needs at most $1$ bit of information given that there are only two potential outcomes (heads or tails) and she can encode each outcome using a bit that's either set to $0$ or $1$ .

This $1$ bit of information can therefore be used to describe our $2^1 = 2$ outcomes.

Given that we are dealing with a fair coin and we have $2$ outcomes there's a $\frac{1}{2^1} = \frac{1}{2}$ chance that any one of such outcomes is sampled from the distribution. Calculating the binary logarithm of such probability results in $log_{2}(\frac{1}{2}) = -1$ . Note that the result is negative. More on that in a second.

The overall entropy can then be calculated via the "Shannon Entropy" formula which is the negative sum of all probabilities multiplied by their binary logarithm (it's the negative sum so that the individual binary logarithm results of probabilities turn positive).

\begin{aligned} H(x) &= -\sum_{i} P(x_i) \times log_{2}P(x_i) \\ &= - p_1 \times log_{2}(p_1) - p_2 \times log_{2}(p_2) - \dots p_n \times log_{2}(p_n) \end{aligned}

Plugging in our two potential probabilities that each have an equal likelihood of $\frac{1}{2}$ shows us that the entropy is $1$ bit.

\begin{aligned} H(x) &= \left[-\frac{1}{2} \times log_{2}\left(\frac{1}{2}\right)\right] + \left[-\frac{1}{2} \times log_{2}\left(\frac{1}{2}\right)\right] \\ &= \frac{1}{2} + \frac{1}{2} \\ &= 1 \end{aligned}

We can observe that the entropy of a bit string for which values were sampled form a uniform distribution and which has length $n$ is exactly $n$ bits. The reason being that a uniform distribution maximizes uncertainty. There's no bias when sampling values from a uniform distribution. It's equally likely (uncertain) whether a $0$ or $1$ is sampled (or not).

Example #2: Biased coin

In the second example, Alice uses a biased coin which shows heads $\frac{1}{3}$ and tails $\frac{2}{3}$ of the time.

There's once again $1$ bit of information that needs to be shared because the potential outcome of a coin flip can be encoded in one single bit ( $0$ for heads, $1$ for tails).

Again, this $1$ bit of information can therefore describe the $2^1 = 2$ possible outcomes.

This time around we're dealing with a biased coin so we have to calculate the binary logarithm for each individual probability.

p_1 = log_{2}\left(\frac{1}{3}\right) \approx -1.58

p_2 = log_{2}\left(\frac{2}{3}\right) \approx -0.58

Using the "Shannon Entropy" formula we can calculate the entropy as follows.

\begin{aligned} H(x) &= \left[-\frac{1}{3} \times log_{2}\left(\frac{1}{3}\right)\right] + \left[-\frac{2}{3} \times log_{2}\left(\frac{2}{3}\right)\right] \\ &\approx (0.53) + (0.39) \\ &\approx 0.92 \end{aligned}

As we can see, the entropy is lower. This is because there's more uncertainty as to which value is sampled from the distribution.

Entropy and its relation to randomness and keys

Now that we've explored the relationship between entropy and information as well as entropy and uncertainty, we can combine both relationships to see why entropy is such an integral part when it comes to randomness in Cryptography.

Key generation is a fundamental algorithm used in Cryptography to derive key material which oftentimes needs to be kept secret as per Kerckhoffs's Principle.

The process that generates keys heavily relies on randomness to ensure that keys are unpredictable. The bit string which represents the key should therefore exhibit very high entropy in order to make it "unguessable".

As we've learned, this means that individual bits should be sampled from a uniform distribution which ensures that there's an exact 50% chance that a $0$ or $1$ is chosen.

Assuming the distribution we're working with is in fact uniform we can calculate the entropy for a 256-bit key as follows using the "Shannon Entropy" formula.

\begin{aligned} H(x) &= 2^{256} \times (-2^{-256} \times log_{2}(2^{-256})) \\ &= -log_{2}(2^{-256}) \\ &= 256 \end{aligned}

As we can see, such a 256-bit key has an entropy of 256 bits which is the maximum amount of entropy possible for a 256-bit key. We can conclude that this bit string contains a lot of uncertainty when it comes to the bit values it's composed of. This is due to the uniform distribution which maximizes uncertainty.

References

The following resources have been invaluable for me to learn the concepts discussed in this article.

You should definitely give them a read if you want to dive deeper into the topic.