A hash function takes a message, m, and returns a pseudo-random string of letters/numbers which should be unique to that message. Let's say the hash function returns "aBc67D" for the message "I love dogs". This function should not return the same "aBc67D" for "Donuts are cool".
Hashing algorithms have 3 requirements:
- A hashing algorithm needs to be reasonably fast to compute and reasonably fast to verify.
- If you change one single bit anywhere in the message, the outputted string must look completely different.
- You must avoid collisions.
For the first point, the algorithms have to be fast. Hashing algorithms are often used by much slower algorithms (such as RSA) to speed up the algorithm. As a side note, most hashing algorithms are designed to be one-way functions.
The algorithm has to be fast, but it does not have to be efficient. Efficient hashing algorithms make causing collisions easier for attacks. Hashing algorithms need to be resistant towards "pre-image attacks". That is, given a hash, it should be extremely difficult to calculate retrace the deterministic steps taken to reproduce the value that created the hash (i.e. the pre-image).
For the second point, if using SHA-5 and we input:
abc = a9993e364706816aba3e25717850c26c9cd0d89d
An important feature we want is that if we were to change the original string even slightly by say one letter the entire outputted hash has to change. If we input "abd" we get:
abd = cb4cc28df0fdbe0ecf9d9662e294b118092a5735
So the idea of randomness here is that the first hash does not look anything like the second hash despite us only changing 1 letter. In fact, with some code:
For the third point, this is the donut/dogs example above. Sometimes, it's okay to not avoid collisions because there are billions of different messages and a fixed length hash message. But, by not avoiding collisions other people can artificially make a message with the same hash as another file and this is an issue - because it's a fake message.
Let's walk through how SHA-1, an old hashing algorithm, works in detail. Although all hashing algorithms work differently and SHA-1 isn't used in the real world anymore, seeing how SHA-1 works in detail will enable us to generalise how hashing algorithms work.
SHA-1 takes a bit string of length less than 264 bits and produces a 160-bit hash value known as the message digest. Note: I will be using hexadecimal numbering for brevity. Hashing algorithms never take a message of size X, and return a message of size X. They always 'compress' and create a digest of the message. Remember rule 1 for hashing algorithms from earlier? We want hashing algorithms to be fast. Producing large messages is not fast.
SHA-1 was published in 1993 by NIST, but was developed by the NSA (yes, that NSA). Not much is known about the history of SHA-1, so I apologise for not including history here.
Suppose we want to encode the message 'abc' using SHA-1. We start off by converting it into binary:
$$ abc = 01100001 01100010 01100011$$
SHA-1 starts with an internal state. Let's say our internal state of SHA-1 is:
$$ h_0, h_1, h_2, h_3, h_4$$
The internal state size is exactly the same as the length it produces (160). So each internal state H has 160/5 = 32-bit words which is 4 bytes each. We split it into chunks because it's easier to calculate. We start by initializing these internal states with five random strings of hex characters: