What is a Hash Function and How Does it Work?
Use Hash Functions to generate a cryptographic fingerprint of your data
Hash functions are Cryptography's workhorse. They're used in Digital Signature Algorithms, Merkle Trees, Zero Knowledge Proofs, Password Storage and many other Protocols and Cryptographic applications.
A hash function is a primitive that provides integrity and authenticity.
You might've used them quite a lot without noticing it. Your operating system, for example, computes a hash of the files it downloads to update the system. This hash is then compared with the pre-computed one by the manufacturer to ensure that the download didn't corrupt any files (Note that there's a huge difference between a checksum function and a cryptographic hash function).
Generally speaking, a hash function is a deterministic, one-way function that turns arbitrary-length data into a randomly looking, unique fixed-size string of bytes.
Hash Function Properties
The aforementioned description is quite a mouthful so let's unpack it.
For any input you pass into the hash function it will always produce the same output.
While it's possible to always compute the same output based on an input it's impossible to reverse this process, i.e. one can't reconstruct the input based on the output.
There's no limit to the length of input data you can pass into the hash function.
Unique, Randomly Looking Fixed-size Output
For every input, there's exactly one output the input maps to.
The output that's produced should look random and a small change in the input should result in a "significant" change of output bytes that are generated.
Furthermore, the output of the hash function will always have the same byte length (determined by the hash function that's used).
There are three major security properties a hash function should possess.
The hash function should be one-way. It shouldn't be possible to reconstruct the input based on the output.
Second Pre-Image Resistance
If there's an input A that results in a hash X then it shouldn't be possible to find a second input B that results in the same hash X.
It shouldn't be possible to find two inputs that result in the same output hash.
The attentive reader might ask what the difference is between Second Pre-Image Resistance and Collision Resistance. The main difference is that with Second Pre-Image Resistance the first input is "fixed" and one has to find a second input that results in the same hash as the first to break the hash function. With Collision Resistance any two inputs that result in the same output would break the hash function.
The Birthday Bound
With most Cryptographic primitives 128-bit security is considered the minimum for it being called "secure". 128-bit security means that an attacker would need to perform 2^128 operations to break the system (e.g. exploring the whole input space if the key is of length 128-bit).
With hash functions, we aim for 256-bit outputs (32 bytes) to derive the same level of security. The reason for this is the so-called Birthday Bound (sometimes called "Birthday Paradox", though it's not really a paradox):
How many people do you think, have to gather in a room such that the probability that two share the same birthday is roughly 50%? The surprising answer is ~23.
This observation can be translated into the world of hash functions where "sharing the same birthday" means "producing a collision" (see "Security Properties" above).
When one randomly generates output byte strings via a hash function from a space of 2^N possibilities we can expect that a collision will be found with a ~50% chance after 2^N/2 outputs were generated.
So to be on the safe side we want our output space to be 256-bit because 256/2 = 128 which is the minimum level of security we should support.
How SHA-2 Works
Now that we know the properties as well as the security requirements a cryptographic hash function should meet we can take a look at a real-world example by analyzing how the SHA-2 family of hash functions (SHA-256, SHA-384, SHA-512, etc.) works under the covers.
The abbreviation "SHA" stands for "Secure Hash Algorithm" and the 2 in the name indicates that it's the next version after SHA-1 which was designed in the early 90s by the NSA (Note: SHA-1 shouldn't be used anymore as collisions were found which renders it insecure).
Generally speaking, there are four different versions in the SHA-2 family of hash functions:
The most used versions are SHA-256 and SHA-512 which generate outputs of 256 and 512 bits respectively.
The core building block of SHA-2 is a one-way compression function that takes two inputs of size X and Y and generates an output of size either X or Y. This compression function is implemented via the Davies-Meyer structure that in turn uses a block cipher that encrypts a fixed-size block of data per iteration.
Multiple compression functions are chained together in a so-called Merkle-Damgård construction which splits up the input to be hashed into individual fixed-size input chunks to be fed into their respective compression function to produce the hash output (also called digest).
The first "intermediate input" that's passed into the first compression function is called "Initialization Vector" (IV) which is a value that can be found in the SHA-2 specification to ensure that the generated output is deterministic. Such values are common in Cryptography and are also called "Nothing-up-my-sleeve number".
Here's a full description of the SHA-2 Algorithm:
Apply padding to the input that should be hashed so that its length is a multiple of the block size
Slice the input into blocks that are passed into the compression function
Iteratively apply the compression function with the input blocks and the output of the previous compression function as the inputs
The hash output (digest) is the output of the last compression function
SHA-2 and length-extension attacks
Studying the SHA-2 construction in detail shows that one can think of the outputs of the compression functions as the output of the hash function's intermediate state. So if there's a chain of five compression functions, the output of the second compression function can be interpreted as the second state (out of five) of the SHA-2 hash function.
This property makes the SHA-2 function vulnerable to length-extension attacks in which a hash that's considered "final" is fed into a compression function again so that additional information is included in a new, "final" hash.
Other Hash Functions
Given that there are different needs a hash function should satisfy, there are also other families of hash functions one should know about that address such needs.
The SHA-3 hash function is the successor of the aforementioned SHA-2 hash function that's built around a permutation function called "keccak-f" and a "sponge construction" that first "absorbs" inputs so that it can be "squeezed" later on to generate the digest.
SHA-3 comes in four different versions (for 224, 256, 384 and 512-bit outputs), allows for variable output lengths and isn't vulnerable to length extension attacks. If you're looking for a robust cryptographic hash function, then SHA-3 should be top of mind.
A close relative to SHA-3 is the Keccak hash function which is SHA-3 with a change in padding rules. If one doesn't trust the NIST standardization process, then going with Keccak provides a good alternative.
Extendable Output Functions (XOF)
Sometimes it's useful to have fine-grained control over the output size of the hash digest. That's where XOFs (Extendable Output Functions) can help.
Two of the most prominent XOFs are SHAKE and cSHAKE, which are both based on the SHA-3 hash function.
As you might remember, SHA-3 had two phases called "absorb" and "squeeze". With SHA-3, "squeeze" is called as often as there are still bits to read. Once all the bits were read, the digest is finalized. With SHAKE and cSHAKE one can call the "squeeze" functionality an arbitrary amount of times until the desired output length was obtained.
The major difference between SHAKE and cSHAKE is that cSHAKE takes a string input that allows for further customization of SHAKE (the c stands for "customizable").
A very important thing to keep in mind while using SHAKE and cSHAKE is that their output length determines the hash function's security. If in doubt use a length of 256 bits.
Password Hash Functions
Hash functions are commonly used to implement user account systems in which the user's password is stored as a hash in the database (ideally in combination with a user-specific salt). That way user passwords aren't at risk if there's a data breach as the attacker only has access to the (salted) hashes of the passwords. The pre-image resistance property (see above) prevents the attacker from easily "undoing" the hash to recover the password in plaintext.
When the user tries to sign in, the only thing required is to take the password, hash it and compare it against the stored value in the database. If there's a match the user used the correct password and access can be granted.
Cryptographers derived hash functions for this very specific use case. Those password hash functions are constructed in a way that makes optimizing their efficiency via hardware hard (e.g. by using memory-heavy computations that are expensive to optimize). The reason is that attackers usually try to guess a victim's password via techniques like dictionary attacks. If computing a hash is expensive, then such attacks take much longer.
Popular password hash functions are PBKDF2, bcrypt, scrypt and Argon2.
Proper use of Hash Functions
Stick to standards and the state of the art
If you're starting a greenfield project and you don't have to maintain an existing codebase that uses old hash functions it makes sense to do some research to find the hash function that's best suited for your project needs.
Try to migrate away from SHA-1 and MD5 if possible as collisions have been found for both of them.
Also, be aware of the length extension attack SHA-2 suffers from (see above). If you have to use SHA-2 for some reason hash the input twice to mitigate this problem.
Cryptographic vs. non-cryptographic hash functions
There's a huge difference between cryptographic- and non-cryptographic hash functions.
If your application requires certain security properties from a hash function such as pre-image resistance, second pre-image resistance or collision resistance, then there's no way around using a proper cryptographic hash function such as SHA-3.
If, however, your application uses hashes in a context where security doesn't matter that much (e.g. to implement a Bloom filter), then it's perfectly fine to use simpler and usually faster hash functions such as MurmurHash.
Serialize and Deserialization
When hashing a list of data points make sure that you use a serialization format to protect yourself against attacks due to ambiguity.
For example, hashing the following list of values that resembles a transaction of money from Alice to Bob
hash("alice""bob""2""1000") introduces such a loophole as an attacker can update the hash payload to look like this
hash("alice""bob""2100""0"). Both hashes would result in the same digest.
To mitigate this problem serialization can be used such as prefixing the list elements with their length. Here's the updated example
There's a special hash function called "TupleHash" which should be used to hash a list of values.
Don't reuse keys
One of the main rules of cryptography also applies to hash functions...
"Don't reuse keys!"
This is especially important when you're dealing with customizable XOFs.
Be aware that truncating a hash function's output might reduce its security against collision resistance. Depending on the application's requirements it might be fine. If you're not sure which length is still secure, err on the side of using 256 bits.
If you want to dive deeper into the topics discussed in this blog post you might want to take a look at the following resources: