Skip to content

ChaCha20

Table of contents

Open Table of contents

What it is

ChaCha20 is a stream cipher invented by Daniel J. Bernstein which allows for fast encryption and decryption of arbitrary binary data.

Its design is optimized to take advantage of modern CPU architectures and is therefore fast in environments where hardware acceleration (e.g. via the AES-NI instructions for AES on Intel and AMD CPUs) is absent. The focus on simplicity and heavy usage of Add-Rotate-XOR (ARX) semantics makes it easy to implement without introducing side-channel attack vectors.

Besides Bernstein’s original ChaCha20 Whitepaper it’s also specified in RFC 8439 which describes how ChaCha20 should be implemented to be used in TLS, SSH and other environments.

This writeup explains how the ChaCha20 variant defined in RFC 8439 works. The basic principles translate to Bernstein’s original paper as well.

How it works

Before jumping straight into the specifics, it’s useful to take a few paragraphs and talk about the ChaCha20 mechanics from a high-level perspective.

ChaCha20’s input are a 32-byte key, a 12-byte nonce, as well as an optional 4-byte block counter. The ChaCha20 specification also defines a constant of 32-bytes which are the ASCII encoding of the Nothing-up-my-sleeve number “expand 32-byte k”.

These four values (key, nonce, block counter and constant) are turned into 32-bit little endian values (words) and laid out into a block like the following:

C C C C
K K K K
K K K K
B N N N

Here, C stands for constant, K for key, B for block counter and N for nonce.

As can be seen, this block is of size = 1632=51216 * 32 = 512 bits.

This block is ChaCha20’s internal state which is scrambled with various operations so that the original state (the layout we just examined) has nothing to do with the resulting state. The bits of the end state can then be XORed with 512 plaintext bits to produce 512 bits of ciphertext.

To encrypt more than 512 bits, we can use the counter value. To do so, we would do the following:

  1. Split the plaintext into nn 512-bit blocks
  2. Set the counter value to 1
  3. Setup the initial state with the key, nonce and current counter value
  4. Scramble the state until the end state is computed
  5. Take the 512 bits from that state and XOR them with the current 512-bit block of the plaintext
  6. Increment the counter value
  7. Go to step #3

This has to be done until all the plaintext blocks have been XORed with the state (non-used state bits can be discarded if the last plaintext block isn’t exactly 512 bits).

The astute reader might be thinking that this looks suspiciously similar to a slight variation of a One-Time Pad and this is exactly right.

Given tha ChaCha20 processes data in 512-bit blocks, one can also draw similarities to a block cipher operating in counter mode.

Thanks to ChaCha20’s usage of XOR to combine state bits with plaintext bits to produce the ciphertext, it’s trivial to “implement” the decryption algorithm. The decryption algorithm is the exact same as the encryption algorithm, so there’s nothing to implement!

With this high-level overview of ChaCha20 we’re well equipped to dive deeper into the internals.

Internal State

As already described above, ChaCha20’s core primitive is its 512-bit state that’s internally represented as 16 32-bit words in little endian order.

The choice of 32-bit words in little endian order isn’t arbitrary. Such a layout allows for fast processing on modern CPUs which in large parts are little endian based.

Imagining this state as a 2-Dimensional block we can identify each word via an index number starting from 0:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

The first four words (0 - 3) are reserved for the constants 0x61707865, 0x3320646e, 0x79622d32 and 0x6b206574 which written in ASCII translate to the Nothing-up-my-sleeve number “expand 32-byte k”.

The next eight words (4 - 11) are used for the user-provided key. Word 12 is used for an internal block counter value, while the last 3 words (13 - 15) are used for the user-provided nonce.

Knowing this, we can update the 2-Dimensional block schematic by replacing constant words with C, key words with K, block counter words with B and nonce words with N respectively:

C C C C
K K K K
K K K K
B N N N

This is the initial state that’s always constructed before being scrambled to produce a final block (more on that in a bit).

As we’ve seen, the key and nonce values are the only user-provided inputs. While not always necessary, it’s sometimes also possible to specifically set the block counter value in implementations of ChaCha20. The block counter is an internal state variable that usually starts with 0 or 1 and is incremented whenever a new block was produced.

Rounds

The main idea behind ChaCha20 is to take the aforementioned, initial state and permute it by manipulating its words via two core operations in multiple rounds.

Doing this long enough (20 rounds, hence the name ChaCha20) ensures that the final state’s words look nothing like the values we started with.

The question is just how we should scramble the state words such that the implementation is fast in software while also ensuring cryptographic security.

Quarter Round

At the core of the scrambling of the state words is the Quarter Round function which takes four word values a, b, c, and d and updates them like the following pseudocode illustrates:

a += b;
d ^= a;
d <<<= 16;

c += d;
b ^= c;
b <<<= 12;

a += b;
d ^= a;
d <<<= 8;

c += d;
b ^= c;
b <<<= 7;

As can be seen, values are added (+=), XORed (^=) as well as rotated by left bit-shifting (<<<=). The professional cryptographer immediately sees that this resembles Add-Rotate-XOR (ARX) semantics.

Performing these operations in that order on the four words ensures that their individual bits are manipulated multiple times.

The next question is which four words we should update. ChaCha20 uses two variants which use the Quarter Round function internally to ensure that all words in the state will be updated.

Column Round

The first Quarter Round variant is concerned with updating all the columns of the state and is therefore called Column Round.

As a reminder, here’s our schematic from above that describes the state with its word indexes:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

As we can see there are four columns and to update each individual column we can use one Quarter Round respectively:

  1. 0, 4, 8, 12
  2. 1, 5, 9, 13
  3. 2, 6, 10, 14
  4. 3, 7, 11, 15

The first Quarter Round, for example would set the a, b, c and d values to the word values at position 0, 4, 8 and 12 and use those values to run the Quarter Round logic.

[a]  1  2  3
[b]  5  6  7
[c]  9 10 11
[d] 13 14 15

To run one full Column Round which mixes all four columns, we would therefore run four individual Quarter Rounds.

Diagonal Round

The second Quarter Round variant is focused on updating the values located at diagonals across the state and is therefore called Diagonal Round.

Here’s once again the schematic from above that shows which index number corresponds to which word of the state:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

ChaCha20 defines four diagonals, each of which will use the Quarter Round logic to update the state:

  1. 0, 5, 10, 15
  2. 1, 6, 11, 12
  3. 2, 7, 8, 13
  4. 3, 4, 9, 14

The first Quarter Round, for example would set a, b, c and d to the word values at position 0, 5, 10 and 15 and use the Quarter Round internals to update such word values.

[a]  1  2  3
4  [b]  6  7
8  9  [c] 11
12 13 14 [d]

To run one full Diagonal Round which mixes the words at the diagonal indexes, we would therefore run four individual Quarter Rounds

Double Round

Having a way to updated all the columns via a Column Round and all the word values at the four diagonals via a Diagonal Round we can now use these two primitives and run them back-to-back to mix the state.

This operation is called a Double Round as we’re first updating the state via the Column Round followed by the Diagonal Round.

The more often we use the Double Round, the more scrambled the state will be. But how many iterations are enough?

Careful analysis showed that 20 rounds in total should be enough to be sure that the scrambled state is next to impossible to unscramble (with one caveat we’ll discuss shortly). We can therefore run 10 Double Rounds to implement this assumption.

Now it should also be clear where the 20 in ChaCha20’s name comes from: 102=2010 * 2 = 20, therefore ChaCha20.

Block Function

Now that we’ve seen how the state is manipulated via the different round functions we can put everything together to produce one block of 512-bit scrambled state that can then be used for encryption / decryption.

However, before doing so we need to address one minor issue. As it currently stands, all the operations we’ve performed on the state are invertible. It’s trivial for an attacker to use the Quarter Round function to undo the scrambling by simply running the function’s inverse operations. It’s easy to combat this, however.

All we need to do is to add the initial, unscrambled state word-by-word to the scrambled one. The main idea is that if you don’t know the initial state to begin with (and the attacker shouldn’t, because it includes the key), then adding this, initial state to the scrambled one will make it impossible for an attacker to invert the operations.

With that out of the way we can define the different steps that make up the Block Function:

  1. Setup the initial state with an initial counter value (usually that value is set to 0 or 1)
  2. Save this initial state in a separate variable so it can be added later
  3. Run 10 Double Rounds and save the output as the updated state
  4. Add the initial state word-by-word to the updated state
  5. Increment the counter value

Running these operations will produce one block with 512 bits of scrambled state.

ChaCha20 Overview

Encryption / Decryption

With the Block Function defined we can now use it to produce a so-called Key Stream which can be used to encrypt / decrypt data.

Encryption

To encrypt a plaintext using a key and nonce pair we’ll first setup the initial state with a counter value of 1 (setting the counter to 1 is a common practice but not strictly necessary).

Next up, we’ll split the plaintext we want to encrypt into 512-bit blocks. We’ll then use the Block Function to produce the scrambled state which we’ll then combine with the first plaintext block via the XOR operation to produce the first ciphertext block.

Then we increment the counter and setup the next, initial state which is then scrambled via the Block Function to be XORed with the second plaintext block to produce the second ciphertext block. This procedure is repeated until the whole plaintext is encrypted.

If the last plaintext block isn’t exactly 512 bits we can simply discard the unused bits from the scrambled state.

The bits that were produced by scrambling the state are called the Key Stream.

ChaCha20 Encryption

Taking a closer look at this process we can now see why there’s a key and nonce pair being used. Given that ChaCha20 is a symmetric cipher, there’s only one key that needs to be known to both, the party who encrypts data as well as the party who decrypts the data. Establishing a shared secret via Elliptic Curve Diffie-Hellman is quite involved, so it’s useful if the same key can be reused to encrypt multiple plaintexts.

This is where the nonce comes in. The nonce (number used once) can be utilized to “parameterize the key”, so that it can be reused multiple times. The nonce can be shared publicly but should only ever be used once per key as the same key + nonce pair will always produce the same sequence of Key Stream material (scrambled bits).

Decryption

Given that encryption is just an XOR of the produced scrambled state blocks (the Key Stream) with the plaintext blocks, similar to what a One-Time Pad construction does, we can use the exact same procedure again to decrypt a ciphertext.

To do so we’d split the ciphertext into 512-bit blocks and reuse the logic described above to produce the Key Stream material which is then XORed with the ciphertext blocks to produce the plaintext blocks.

ChaCha20 Decryption

Why it works

ChaCha20’s core primitive is the internal state consisting of 16 32-bit words (512 bits) that are manipulated to produce 512 bits which can be used via XOR for encryption / decryption.

The security of ChaCha20 relies on the attacker not being able to control more than 50% of such internal state. The following schematic depicts the layout of the initial state where C represents the constant words, K are the key words, B is the block counter word and N are the nonce words.

C C C C
K K K K
K K K K
B N N N

Examining the layout of this initial state we can see that half of it is already occupied by the key (K) which is unknown to the attacker. Furthermore, another 1/4th of the state is initialized with the constant (C) the attacker can’t control either. This only leaves the block counter (B) and the nonce (N) being controllable by the attacker, which is 1/4th of the whole state.

As observed by Claude Shannon in his seminal work ”A Mathematical Theory of Cryptography”, a symmetric cipher needs good Confusion and Diffusion properties for it to be secure. As a quick reminder, confusion ensure that the linkage between the key and the ciphertext is obfuscated, whereas diffusion is concerned with hiding statistical properties of the plaintext that make Cryptanalysis easier. The Quarter Round functions that’s implemented via ARX semantics and applied to the state’s columns and diagonals is designed in a way to ensure that the resulting output resembles good confusion and diffusion properties.

Given that ChaCha20 is designed around simple ARX operations, it’s very fast in software. This is due to the fact that state manipulations are performed on 32-bit words which allows for the exploitation of modern CPU architectures.

References

The following resources have been invaluable for me to learn the concepts discussed in this article.

You should definitely give them a read if you want to dive deeper into the topic.