Often I'll hear about how you can optimise a for loop to be faster, or how switch statements are faster than if statements. Most computers have more than one core, with the ability to support multiple threads. Before worrying about optimising for loops or if statements, try to attack your problem from a different angle.
Divide and Conquer is one way to attack a problem from a different angle. Don't worry if you have zero experience or knowledge on the topic. This article is designed to be read by someone with very little programming knowledge.
I will explain this using 3 examples. The first will be a simple explanation. The second will be some code. The final will get into the mathematical core of divide and conquer techniques. (Don't worry, I hate maths too).
Divide and conquer is where you divide a large problem up into many smaller, much easier to solve problems. The rather small example below illustrates this.
We take the equation "3 + 6 + 2 + 4" and cut it down into the smallest set of equations, which is [3 + 6, 2 + 4]. It could also be [2 + 3, 4 + 6]. The order doesn't matter, as long as we turn this one long equation into many smaller equations.
Let’s say we have 8 numbers:
$$4 + 6 + 3 + 2 + 8 + 7 + 5 + 1$$
We want to add them all together. We first divide the problem into 8 equal sub-problems. We do this by breaking the addition up into individual numbers.
$$4 \quad 6 \quad 3 \quad 2 \quad 8 \quad 7 \quad 5 \quad 1$$
We then add 2 numbers at a time.
Then we combine the pairs into sums of 4 numbers, and those into the sum of all 8 numbers, which is our result.
Why do we break it down to individual numbers in stage 1? Why don't we just start from stage 2? Because although this list has an even number of elements, if the list had an odd number you would need to break it down to individual numbers to handle it properly.
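The splitting-and-recombining above can be sketched as a short recursive function (a sketch; the name `dc_sum` is my own):

```python
def dc_sum(numbers):
    # Base case: a single number is already "solved"
    if len(numbers) == 1:
        return numbers[0]
    # Divide: split the list into two halves
    mid = len(numbers) // 2
    left, right = numbers[:mid], numbers[mid:]
    # Conquer each half recursively, then combine with one addition
    return dc_sum(left) + dc_sum(right)

print(dc_sum([4, 6, 3, 2, 8, 7, 5, 1]))  # 36
```

Because the base case is a single number, this handles odd-length lists just as well as even ones.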
A divide and conquer algorithm tries to break a problem down into as many little chunks as possible since it is easier to solve with little chunks. It does this with recursion.
Before we get into the rest of the article, let's learn about recursion first.
Recursion is when a function calls itself. It's a hard concept to understand if you've never heard of it before. This page provides a good explanation.
Matryoshka dolls are these cute little things:
We open up the bigger one, and inside is a slightly smaller one. Inside that one is another slightly small doll. Let's say, inside the last doll is a key. But we do not know how many dolls there are. How do we write a function that opens up the dolls until we find a key?
We could use a while loop, but recursion is preferred here.
To program this, we can write:
    def getKey(doll):
        item = doll.open()
        if item == key:          # base case: we found the key
            return key
        else:
            return getKey(item)  # otherwise, open the next doll down

    getKey(doll)
The function repeatedly calls itself until it finds the key, at which point it stops. The point where it finds the key is called the base case (also called a break case or exit condition).
We always add a break case to a recursive function. If we didn't, it'd just be an infinite loop! Never ending.
Computer scientists love recursion. Because it's so hard for normal people to understand, we have a schadenfreude sensation watching people struggle. Haha just kidding!
We love recursion because it's used in maths all the time. Computer scientists are mathematicians first, coders second. Anything that brings code closer to real-life mathematics is good.
Not just because some people love maths, but because it makes it easier to implement. Need to calculate the Fibonacci numbers? The maths equation for this is:
$$F(n) = \begin{cases} n, & \text{if } n = 0 \text{ or } 1 \\ F(n - 1) + F(n - 2), & \text{if } n > 1 \end{cases}$$
A natural recurrence in our formula! Instead of translating it into loops, we can just calculate it:
    def F(n):
        if n == 0 or n == 1:
            return n
        else:
            return F(n - 1) + F(n - 2)
This is one of the reasons why functional programming is so cool.
Also, as you'll see throughout this article, recursion reads so much nicer than loops. And hey, maybe you can feel a little happier when your coworker doesn't understand recursion but you do ;)
The technique, as defined in the famous Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein, is:
1. Divide
If the problem is small, then solve it directly. Otherwise, divide the problem into smaller subsets of the same problem.
2. Conquer
Conquer the smaller problems by solving them recursively. If the sub-problems are small enough, recursion is not needed and you can solve them directly.
3. Combine
Take the solutions to the sub-problems and merge them into a solution to the original problem.
Let's look at another example, calculating the factorial of a number.
    n = 6

    def recur_factorial(n):
        if n == 1:
            return n
        else:
            return n * recur_factorial(n - 1)

    print(recur_factorial(n))
With the code above, there are some important things to note. The Divide part is also the recursion part. We divide the problem up at `return n * recur_factorial(n-1)`. The `recur_factorial(n-1)` part is where we divide the problem up.

The Conquer part is the recursion too, but also the if statement. If the problem is small enough, we solve it directly (by returning n). Otherwise, we perform `return n * recur_factorial(n-1)`.

Combine. We do this with the multiplication symbol. Eventually, we return the factorial of the number. If we didn't have the symbol there, and it was just `return recur_factorial(n-1)`, the results wouldn't combine and the output wouldn't resemble the factorial. (It would output 1, for those interested.)
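To make the combine step concrete, here is a worked expansion of a factorial call (my own trace, not from the book):

```python
def recur_factorial(n):
    if n == 1:
        return n
    return n * recur_factorial(n - 1)

# The recursion divides on the way down, hits the base case,
# then the multiplications combine on the way back up:
#   recur_factorial(4)
#   = 4 * recur_factorial(3)
#   = 4 * (3 * recur_factorial(2))
#   = 4 * (3 * (2 * recur_factorial(1)))
#   = 4 * (3 * (2 * 1))
#   = 24
print(recur_factorial(4))  # 24
```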
We'll explore how divide and conquer works in some famous algorithms, Merge Sort and the solution to the Towers of Hanoi.
Merge Sort is a sorting algorithm. The algorithm works as follows:
In this image, we break down the 8 numbers into separate digits. Just like we did earlier. Once we've done this, we can begin the sorting process.
It compares 51 and 13. Since 13 is smaller, it puts it in the left-hand side. It does this for (10, 64), (34, 5), (32, 21).
It then merges (13, 51) with (10, 64). It knows that 13 is the smallest in the first list, and 10 is the smallest in the right list. 10 is smaller than 13, therefore we don't need to compare 13 to 64. We're comparing & merging two sorted lists.
In recursion we use the term base case to refer to the absolute smallest value we can deal with. With Merge Sort, the base case is 1. That means we split the list up until we get sub-lists of length 1. That's also why we go down all the way to 1 and not 2. If the base case was 2, we would stop at the 2 numbers.
If the length of the list (n) is larger than 1, then we divide the list and each sub-list by 2 until we get sub-lists of size 1. If n = 1, the list is already sorted so we do nothing.
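A possible Python implementation of Merge Sort following the steps above (a sketch; variable names are my own), run on the numbers from the walkthrough:

```python
def merge_sort(alist):
    # Base case: lists of length 0 or 1 are already sorted
    if len(alist) <= 1:
        return alist
    # Divide: split the list in half and sort each half recursively
    mid = len(alist) // 2
    left = merge_sort(alist[:mid])
    right = merge_sort(alist[mid:])
    # Combine: merge the two sorted halves
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One of the halves may still have items left over
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([51, 13, 10, 64, 34, 5, 32, 21]))
# [5, 10, 13, 21, 32, 34, 51, 64]
```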
Merge Sort is an example of a divide and conquer algorithm. Let's look at one more algorithm to understand how divide and conquer works.
The Towers of Hanoi is a mathematical problem which comprises 3 pegs and 3 discs. This problem is mostly used to teach recursion, but it has some real-world uses. The number of pegs & discs can change.
Each disc is a different size. We want to move all discs to peg C so that the largest is on the bottom, second largest on top of the largest, third largest (smallest) on top of all of them. There are some rules to this game:
We want to use the smallest number of moves possible. If we have 1 disc, we only need to move it once. 2 discs, we need to move it 3 times.
The number of moves is a power of 2 minus 1. Say we have 4 discs; we calculate the minimum number of moves as $2^4 - 1 = 16 - 1 = 15$.
To solve the above example we want to store the smallest disc in a buffer peg (1 move). See below for a gif on solving Tower of Hanoi with 3 pegs and 3 discs.
Notice how we need to have a buffer to store the discs.
We can generalise this problem. If we have n discs: move n-1 from A to B recursively, move largest from A to C, move n-1 from B to C recursively.
If there is an even number of pieces the first move is always into the middle. If it is odd the first move is always to the other end.
Let's code the algorithm for ToH, in pseudocode.
    function MoveTower(disk, source, dest, spare):
        if disk == 0, then:
            move disk from source to dest
We start with a base case, `disk == 0`. `source` is the peg you're starting at. `dest` is the final destination peg. `spare` is the spare peg.
    FUNCTION MoveTower(disk, source, dest, spare):
        IF disk == 0, THEN:
            move disk from source to dest
        ELSE:
            MoveTower(disk - 1, source, spare, dest)   // Step 1
            move disk from source to dest              // Step 2
            MoveTower(disk - 1, spare, dest, source)   // Step 3
        END IF
Notice that in step 1 we swap `dest` and `spare`, so the smaller discs move from `source` to the spare peg. In step 3 we swap `source` and `spare`, moving those discs from the spare peg onto the destination.
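The pseudocode translates almost line-for-line into Python. This sketch also records the moves (the `moves` list is my own addition), so we can check the $2^n - 1$ formula from earlier:

```python
def move_tower(disk, source, dest, spare, moves):
    if disk == 0:
        moves.append((source, dest))  # base case: move the smallest disc
    else:
        move_tower(disk - 1, source, spare, dest, moves)  # Step 1
        moves.append((source, dest))                      # Step 2
        move_tower(disk - 1, spare, dest, source, moves)  # Step 3

moves = []
move_tower(2, "A", "C", "B", moves)  # discs numbered 0, 1, 2: three discs
print(len(moves))  # 7, which is 2**3 - 1
```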
With recursion, we know 2 things: it has a base case, and it calls itself.
The algorithm gets a little confusing with steps 1 and 3. They both call the same function. In many divide and conquer algorithms the recursive calls are independent, so you can run them on different threads at the same time. (Note that in Towers of Hanoi specifically, step 3 depends on the move in step 2, so these particular steps must run in order.)
So far you've seen what the divide and conquer technique is. You should understand how it works and what the code looks like. Next, let's learn how to define an algorithm to a problem using divide and conquer. This part is the most important. Once you know this, it'll be easier to create divide and conquer algorithms.
When we have a problem that looks similar to a famous divide & conquer algorithm (such as merge sort), it will be useful.
Most of the time, the algorithms we design will be most similar to merge sort. If we have an algorithm that takes a list and does something with each element of the list, it might be able to use divide & conquer.
For example, working out the largest item of a list. Given a list of words, how many times does the letter "e" appear?
If we have an algorithm that is slow and we would like to speed it up, one of our first options is divide and conquer.
There aren't any obvious tell-tale signs other than "similar to a famous example". But as we'll see in the next section, we can check whether a problem is solvable using divide & conquer.
Now we know how divide and conquer algorithms work, we can build up our own solution. In this example, we'll walk through how to build a solution to the Fibonacci numbers.
We can find Fibonacci numbers in nature. The way rabbits reproduce follows the Fibonacci numbers: 2 rabbits make 3, 3 rabbits make 5, 5 rabbits make 8, and so on.
The numbers start at 0, and the next number is the current number + the previous number. By mathematical definition, the first 2 numbers are 0 and 1.
Let's say we want to find the 5th Fibonacci number. We can do this:
    # [0, 1]
    0 + 1 = 1  # 3rd fib number
    # [0, 1, 1]
    1 + 1 = 2  # 4th fib number
    # [0, 1, 1, 2]
    2 + 1 = 3  # 5th fib number
    # [0, 1, 1, 2, 3]
Now the first thing when designing a divide and conquer algorithm is to design the recurrence. The recurrence always starts with a base case.
We can describe this relation using a recurrence. A recurrence is an equation that defines a function in terms of its smaller inputs. Recurrence and recursion sound similar, and they are similar.
As we saw, our base case is the 2 numbers at the start.
    def f(n):
        if n == 0 or n == 1:
            return n
To calculate the 4th Fibonacci number, we do F(4 - 1) + F(4 - 2). This means (last number in the sequence) + (the number before the last). Or in other words:
The next number is the current number + the previous number.
If our number is not 0 or 1, we want to add the last 2 Fibonacci numbers together.
Let's take a look at our table quickly:
    # [0, 1]
    0 + 1 = 1
    # [0, 1, 1]
    1 + 1 = 2
    # [0, 1, 1, 2]
    2 + 1 = 3
    # [0, 1, 1, 2, 3]
    2 + 3 = 5
    # [0, 1, 1, 2, 3, 5]
But what if we don't have this list stored? How do we calculate the 6th number without creating a list at all? Well we know that the 6th number is the 5th number + the 4th number. Okay, what are those? The 5th number is the 4th number + the 3rd number. The 4th number is the 3rd number + the second number.
We know that the second number is always 1, as we've reached a base case.
Eventually we break it down to the base cases. Okay, so we know our code calls itself to calculate the Fibonacci numbers of the previous ones:
    def f(n):
        if n == 0 or n == 1:
            return n
        else:
            f(n - 1)
            f(n - 2)
Okay, how do we merge the Fibonacci numbers at the end? As we saw, it is the last number added to the current number.
    def f(n):
        if n == 0 or n == 1:
            return n
        else:
            return f(n - 1) + f(n - 2)
Now we've seen this, let's write it as a recurrence. Luckily for us, it's incredibly easy to go from a recurrence to code or from code to a recurrence, as they mirror each other!
$$F(n) = \begin{cases} n, & \text{if } n = 0 \text{ or } 1 \\ F(n - 1) + F(n - 2), & \text{if } n > 1 \end{cases}$$
We often calculate the result of a recurrence using an execution tree. We saw this earlier when exploring how to build it in code. For F(6) this looks like:
n is 6, and 6 is larger than 1, so we do f(n-1) + f(n-2). We ignore the addition for now. This results in 2 new nodes, 5 and 4. 5 is larger than 1, so we do the same, and the same for 4. We do this until we get a bunch of nodes which are either 0 or 1.
We then add all the leaf nodes together. The tree for F(6) bottoms out in eight 1s and five 0s, which sum to 8.
Normally, if our algorithm resembles a famous divide & conquer algorithm, we can infer its Big O from that. Otherwise, this is no different from calculating the Big O notation of our own algorithms.
Greedy vs Divide & Conquer vs Dynamic Programming

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking a problem down into simpler versions of itself, using recursion (and sometimes multi-threading) to solve them | Same as Divide & Conquer, but optimises by caching the answer to each subproblem so as not to repeat a calculation |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but could be overkill on small datasets |
| Requires almost no memory | Requires some memory to remember the recursive calls | Requires a lot of memory for memoisation / tabulation |
Once you've identified how to break a problem down into many smaller pieces, you can use concurrent programming to execute these pieces at the same time (on different threads) speeding up the whole algorithm.
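As a taste of that next step, here is a sketch that splits a summation across two worker threads using Python's `concurrent.futures`. (Caveat: for CPU-bound pure-Python work the GIL limits the actual speedup; this only illustrates the structure.)

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(numbers):
    return sum(numbers)

numbers = [4, 6, 3, 2, 8, 7, 5, 1]
mid = len(numbers) // 2

# Divide: one half per thread. Combine: add the two partial results.
with ThreadPoolExecutor(max_workers=2) as pool:
    left = pool.submit(partial_sum, numbers[:mid])
    right = pool.submit(partial_sum, numbers[mid:])
    total = left.result() + right.result()

print(total)  # 36
```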
Divide and conquer algorithms are one of the fastest and perhaps easiest ways to increase the speed of an algorithm and are useful in everyday programming. Here are the most important topics we covered in this article:
The next step is to explore multi-threading. Choose your programming language of choice and Google, as an example, "Python multi-threading". Figure out how it works and see if you can attack any problems in your own code from this new angle.
You can also learn about how to solve recurrences (finding out the asymptotic running time of a recurrence), which is the next article I'm going to write. If you don't want to miss it, or you liked this article do consider subscribing to my email list 😁✨
Greedy algorithms aim to make the optimal choice at that given moment. Each step it chooses the optimal choice, without knowing the future. It attempts to find the globally optimal way to solve the entire problem using this method.
We call algorithms greedy when they utilise the greedy property. The greedy property is:
At that exact moment in time, what is the optimal choice to make?
Greedy algorithms are greedy. They do not look into the future to decide the global optimal solution. They are only concerned with the optimal solution locally. This means that the overall optimal solution may differ from the solution the algorithm chooses.
They never look backwards at what they've done to see if they could optimise globally. This is the main difference between Greedy and Dynamic Programming.
To be extra clear, one of the most Googled questions about greedy algorithms is:
"What problem-solving strategies don't guarantee solutions but make efficient use of time?"
The answer is "Greedy algorithms". They don't guarantee solutions, but are very time efficient. However, in the next section we'll learn that sometimes Greedy solutions give us the optimal solutions.
Greedy algorithms are quick. A lot faster than the two other alternatives (Divide & Conquer, and Dynamic Programming). They're used because they're fast.
Sometimes, Greedy algorithms give the globally optimal solution every time. Some of these algorithms are:
These algorithms are Greedy, and their Greedy solution gives the optimal solution.
We're going to explore greedy algorithms using examples, and learning how it all works.
Your algorithm needs to follow this property:
At that exact moment in time, what is the optimal choice to make?
And that's it. There isn't much to it. Greedy algorithms are easier to code than Divide & Conquer or Dynamic Programming.
Imagine you're a vending machine. Someone gives you £1 and buys a drink for 70p. There's no 30p coin in pound sterling; how do you calculate how much change to return?
For reference, this is the denomination of each coin in the UK:
1p, 2p, 5p, 10p, 20p, 50p, £1
The greedy algorithm starts from the highest denomination and works backwards. Our algorithm starts at £1. £1 is more than 30p, so it can't use it. The same goes for 50p. It reaches 20p. 20p < 30p, so it takes one 20p.
The algorithm needs to return change of 10p. It tries 20p again, but 20p > 10p. It next goes to 10p. It chooses 1 10p, and now our return is 0 we stop the algorithm.
We return 1x20p and 1x10p.
This algorithm works well in real life. Let's use another example. This time, each denomination is listed next to how many of that coin is in the machine, as `(denomination, how many)`:

    (1p, 10), (2p, 3), (5p, 1), (10p, 0), (20p, 1), (50p, 19), (100p, 16)
The algorithm is asked to return change of 30p again. 100p (£1) is too big, and the same goes for 50p. 20p works, so we pick 1x 20p. We now need to return 10p. 20p has run out, so we move down one.
10p has run out, so we move down 1.
We have 5p, so we choose 1x5p. We now need to return 5p. 5p has run out, so we move down one.
We choose 1 2p coin. We now need to return 3p. We choose another 2p coin. We now need to return 1p. We move down one.
We choose 1x 1p coin.
Our algorithm selected these coins to return as change:
    # (value, how many we return as change)
    (20, 1)
    (5, 1)
    (2, 2)
    (1, 1)
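The limited-coin walkthrough above can be sketched in Python (a sketch; `return_change_limited` and the machine layout are names of my own):

```python
def return_change_limited(change, machine):
    # machine: list of (denomination, how_many) pairs, largest coin first
    to_give_back = []
    for coin, available in machine:
        count = 0
        # greedily use this coin while it fits AND the machine has one left
        while coin <= change and count < available:
            change -= coin
            count += 1
        if count:
            to_give_back.append((coin, count))
    return to_give_back

machine = [(100, 16), (50, 19), (20, 1), (10, 0), (5, 1), (2, 3), (1, 10)]
print(return_change_limited(30, machine))
# [(20, 1), (5, 1), (2, 2), (1, 1)]
```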
Let's code something. First, we need to define the problem. We'll start with the denominations.
denominations = [1, 2, 5, 10, 20, 50, 100]
# 100p is £1
Now onto the core function. Given denominations and an amount to give change, we want to return a list of how many times that coin was returned.
If our `denominations` list is as above, `[6, 3, 0, 0, 0, 0, 0]` represents taking 6 1p coins and 3 2p coins, but 0 of all other coins.
    denominations = [1, 2, 5, 10, 20, 50, 100]
    # 100p is £1

    def returnChange(change, denominations):
        toGiveBack = [0] * len(denominations)
        for pos, coin in reversed(list(enumerate(denominations))):
We create a list, the size of denominations long and fill it with 0's.
We want to loop backwards, from largest to smallest. `reversed(x)` reverses x and lets us loop backwards, while `enumerate` means "loop through this list, but keep the position in another variable". We wrap the enumeration in `list()` before reversing so each coin keeps its original position. In our example, when we start the loop, `coin = 100` and `pos = 6`.
Our next step is choosing a coin for as long as we can use that coin. If we need to give `change = 40`, we want our algorithm to choose 20, then 20 again until it can no longer use 20. We do this using a while loop.
    denominations = [1, 2, 5, 10, 20, 50, 100]
    # 100p is £1

    def returnChange(change, denominations):
        # makes a list the length of denominations, filled with 0
        toGiveBack = [0] * len(denominations)
        # goes backwards through the denominations list
        # and also keeps track of each coin's original position, pos
        for pos, coin in reversed(list(enumerate(denominations))):
            # while we can still use coin, use it until we can't
            while coin <= change:
While the coin can still fit into change, add that coin to our return list, `toGiveBack`, and remove it from change.
    denominations = [1, 2, 5, 10, 20, 50, 100]
    # 100p is £1

    def returnChange(change, denominations):
        # makes a list the length of denominations, filled with 0
        toGiveBack = [0] * len(denominations)
        # goes backwards through the denominations list
        # and also keeps track of each coin's original position, pos
        for pos, coin in reversed(list(enumerate(denominations))):
            # while we can still use coin, use it until we can't
            while coin <= change:
                change = change - coin
                toGiveBack[pos] += 1
        return toGiveBack

    print(returnChange(30, denominations))
    # returns [0, 0, 0, 1, 1, 0, 0]
    # 1x 10p, 1x 20p
The runtime of this algorithm is dominated by the two loops: the outer loop runs once per denomination, and the inner loop runs once per coin handed back. Treating both as up to n iterations gives an upper bound of $O(n^2)$.
It is optimal locally, but sometimes it isn't optimal globally. In the change giving algorithm, we can force a point at which it isn't optimal globally.
The algorithm for doing this is:
We'll pick 1, 15, 25.
We'll ask for change of 30. Now, let's see what our Greedy algorithm does.
[5, 0, 1]
It chooses 1x 25p and 5x 1p. The optimal solution is 2x 15p.
Our Greedy algorithm failed because it didn't look at 15p. It looked at 25p and thought "yup, that fits. Let's take it."
It then looked at 15p and thought "that doesn't fit, let's move on".
This is an example of where Greedy Algorithms fail.
To get around this, you would either have to create currency where this doesn't work or to brute-force the solution. Or use Dynamic Programming.
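As a quick check, here's a dynamic-programming sketch (`min_coins` is a name I've made up) that finds the true minimum number of coins for 30p with denominations 1, 15 and 25:

```python
def min_coins(change, denominations):
    # best[a] = fewest coins needed to make amount a
    INF = float("inf")
    best = [0] + [INF] * change
    for amount in range(1, change + 1):
        for coin in denominations:
            if coin <= amount and best[amount - coin] + 1 < best[amount]:
                best[amount] = best[amount - coin] + 1
    return best[change]

print(min_coins(30, [1, 15, 25]))  # 2 (two 15p coins); greedy used 6 coins
```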
Dijkstra's algorithm finds the shortest path from a node to every other node in the graph. In our example, we'll be using a weighted directed graph. Each edge has a direction, and each edge has a weight.
Dijkstra's algorithm has many uses. It can be very useful within road networks where you need to find the fastest route to a place. We also use the algorithm for:
The algorithm follows these rules:
Our first step is to pick the starting node. Let's choose A. All the distances start at infinity, as we don't know their distance until we reach a node that knows the distance.
We mark off A on our unvisited nodes list. The distance from A to A is 0. The distance from A to B is 4. The distance from A to C is 2. We updated our distance listing on the right-hand side.
We pick the smallest edge where the vertex hasn't been chosen. The smallest edge is A -> C, and we haven't chosen C yet. We visit C.
Notice how we're picking the smallest distance from our current node to a node we haven't visited yet. We're being greedy. Here, the greedy method is the global optimal solution.
We can get to B from C. We now need to pick a minimum. $min(4, 2 + 1) = 3$.
Since A -> C -> B is smaller than A -> B, we update B with this information. We then add in the distances from the other nodes we can now reach.
Our next smallest vertex with a node we haven't visited yet is B, with 3. We visit B.
We do the same for B. Then we pick the smallest vertex we haven't visited yet, D.
We don't update any of the distances this time. Our last node is then E.
There are no updates again. To find the shortest path from A to the other nodes, we walk back through our graph.
We pick A first, C second, B third. If you need to create the shortest path from A to every other node as a graph, you can run this algorithm using a table on the right-hand side.
Dijkstra's Table

| Node | Distance from A | Previous node |
|---|---|---|
| A | 0 | N/A |
| B | 3 | C |
| C | 2 | A |
| D | 5 | B |
| E | 6 | B |
Using this table it is easy to draw out the shortest distance from A to every other node in the graph:
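Here's a sketch of Dijkstra's algorithm in Python using a priority queue (`heapq`). The edge weights are inferred from the walkthrough and its table, so treat the graph itself as an assumption:

```python
import heapq

def dijkstra(graph, start):
    distances = {node: float("inf") for node in graph}
    distances[start] = 0
    queue = [(0, start)]  # (distance so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances[node]:
            continue  # stale entry: a shorter path was already found
        for neighbour, weight in graph[node]:
            new_dist = dist + weight
            # greedy relaxation: keep the smallest known distance
            if new_dist < distances[neighbour]:
                distances[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return distances

# Edge weights inferred from the walkthrough's table (an assumption)
graph = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 2), ("E", 3)],
    "C": [("B", 1)],
    "D": [],
    "E": [],
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 3, 'C': 2, 'D': 5, 'E': 6}
```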
Prim's algorithm is a greedy algorithm that finds a minimum spanning tree for a weighted undirected graph: a subset of the edges that connects every node with the minimum possible total edge weight.
With a small change to Dijkstra's algorithm, we can build a new algorithm - Prim's algorithm!
We informally describe the algorithm as:
We have this graph.
Our next step is to pick an arbitrary node.
We pick the node A. We then examine all the edges connecting A to other vertices. Prim's algorithm is greedy. That means it picks the shortest edge that connects to an unvisited vertex.
In our example, it picks B.
We now look at all nodes reachable from A and B. This is the distinction between Dijkstra's and Prim's. With Dijkstra's, we're looking for the shortest path from one node to the others. With Prim's, we want the minimum spanning tree.
We have 3 edges with equal weights of 3. We pick 1 randomly.
It is helpful to highlight our graph as we go along, because it makes it easier to create the minimum spanning tree.
Now we look at all edges of A, B, and C. The shortest edge is C > E with a weight of 1.
And we repeat:
The edge B > E with a weight of 3 is the smallest edge. However, both of its vertices are already in our `VISITED` list, meaning we do not pick this edge. We instead choose C > F, as we have not visited F yet.
The only node left is G, so let's visit it.
Note that if the edge weights are distinct, the minimum spanning tree is unique.
We can add the edge weights to get the minimum spanning tree's total edge weight:
$$2 + 3 + 3 + 1 + 6 + 9 = 24$$
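The full edge list of the example graph isn't reproduced here, so this sketch runs Prim's algorithm on a small hypothetical graph of my own instead:

```python
import heapq

def prims(graph, start):
    visited = {start}
    # candidate edges leaving the visited set: (weight, from, to)
    edges = [(w, start, v) for v, w in graph[start]]
    heapq.heapify(edges)
    mst, total = [], 0
    while edges:
        weight, u, v = heapq.heappop(edges)
        if v in visited:
            continue  # both endpoints already visited; skip this edge
        visited.add(v)
        mst.append((u, v, weight))
        total += weight
        for neighbour, w in graph[v]:
            if neighbour not in visited:
                heapq.heappush(edges, (w, v, neighbour))
    return mst, total

# A small undirected example graph (hypothetical, not the figure's)
graph = {
    "A": [("B", 2), ("C", 3)],
    "B": [("A", 2), ("C", 4), ("D", 3)],
    "C": [("A", 3), ("B", 4), ("D", 5)],
    "D": [("B", 3), ("C", 5)],
}
mst, total = prims(graph, "A")
print(total)  # 8: edges A-B (2), A-C (3), B-D (3)
```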
Imagine you are a thief. You break into the house of Judy Holliday - 1951 Oscar winner for Best Actress. Judy is a hoarder of gems. Judy's house is lined to the brim with gems.
You brought with you a bag - a knapsack if you will. This bag can hold a total weight of 7. You happened to have a listing of Judy's items, from some insurance paper. The items read as:
Judy's Items

| Name | Value | Weight |
|---|---|---|
| Diamonds | 16 | 5 |
| Francium | 3 | 1 |
| Sapphire | 6 | 2 |
| Emerald | 2 | 1 |
The first step to solving the fractional knapsack problem is to calculate $\frac{value}{weight}$ for each item.
Judy's Items

| Name | Value | Weight | Value / weight |
|---|---|---|---|
| Diamonds | 16 | 5 | 3.2 |
| Francium | 3 | 1 | 3 |
| Sapphire | 6 | 2 | 3 |
| Emerald | 2 | 1 | 2 |
And now we greedily select the largest ones. To do this, we can sort them according to $\frac{value}{weight}$ in descending order. Luckily for us, they are already sorted. The largest one is 3.2.
knapsack value = 16
knapsack total weight = 5 (out of 7)
Then we select Francium (I know it's not a gem, but Judy is a bit strange 😉)
knapsack value = 19
knapsack weight = 6
Now, we add Sapphire. But if we add Sapphire, our total weight will come to 8.
In the fractional knapsack problem, we can cut items up to take fractions of them. We have a weight of 1 left in the bag. Our sapphire is weight 2. We calculate the ratio of:
$$\frac{weight\;of\;knapsack\;left}{weight\;of\;item}$$
And then multiply this ratio by the value of the item to get how much value of that item we can take.
$$\frac{1}{2} * 6 = 3$$
knapsack value = 22
knapsack weight = 7
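Putting the whole procedure together for Judy's items (a sketch; the function and variable names are mine). Taking all the Diamonds and Francium plus half the Sapphire gives $16 + 3 + 3 = 22$:

```python
def fractional_knapsack(items, capacity):
    # Greedy: sort by value/weight and take the best ratio first
    items = sorted(items, key=lambda item: item[1] / item[2], reverse=True)
    total_value = 0.0
    for name, value, weight in items:
        if capacity == 0:
            break
        take = min(weight, capacity)  # whole item, or the fraction that fits
        total_value += value * take / weight
        capacity -= take
    return total_value

items = [("Diamonds", 16, 5), ("Francium", 3, 1),
         ("Sapphire", 6, 2), ("Emerald", 2, 1)]
print(fractional_knapsack(items, 7))  # 22.0
```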
The greedy algorithm can optimally solve the fractional knapsack problem, but it cannot optimally solve the {0, 1} knapsack problem. In this problem instead of taking a fraction of an item, you either take it {1} or you don't {0}. To solve this, you need to use Dynamic Programming.
The runtime for this algorithm is O(n log n). Calculating $\frac{value}{weight}$ is O(1). Our main step is sorting from largest $\frac{value}{weight}$, which takes O(n log n) time.
Greedy vs Divide & Conquer vs Dynamic Programming

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking a problem down into simpler versions of itself, using recursion (and sometimes multi-threading) to solve them | Same as Divide & Conquer, but optimises by caching the answer to each subproblem so as not to repeat a calculation |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but could be overkill on small datasets |
| Requires almost no memory | Requires some memory to remember the recursive calls | Requires a lot of memory for memoisation / tabulation |
To learn more about Divide & Conquer and Dynamic Programming, check out these 2 posts I wrote:
Greedy algorithms are very fast, but may not provide the optimal solution. They are also easier to code than their counterparts.
By the end of this article, you'll thoroughly understand Big O notation. You'll also know how to use it in the real world, and even the mathematics behind it!
In computer science, time complexity is the computational complexity that describes the amount of time it takes to run an algorithm.
Big O notation is a method for determining how fast an algorithm is. Using Big O notation, we can learn whether our algorithm is fast or slow. This knowledge lets us design better algorithms.
This article is written using agnostic Python. That means it will be easy to port the Big O notation code over to Java, or any other language. If the code isn't agnostic, there's Java code accompanying it.
We could run an algorithm 10,000 times and measure the average time taken.
    ➜ python3 -m timeit '[print(x) for x in range(100)]'
    100 loops, best of 3: 11.1 msec per loop
    ➜ python3 -m timeit '[print(x) for x in range(10)]'
    1000 loops, best of 3: 1.09 msec per loop
    # We can see that the time per loop changes depending on the input!
Say we have an algorithm that takes a shopping list and prints out every item on the shopping list. If the shopping list has 3 items, it'll execute quickly. If it has 10 billion items, it'll take a long time.
What is the “perfect” input size to get the “perfect” measure of how long the algorithm takes?
Other things we need to consider:
For this reason, we use Big O (pronounced Big Oh) notation.
Big O is a formal notation that describes the behaviour of a function as its argument tends towards infinity (the largest possible input).
It was invented by Paul Bachmann, Edmund Landau and others between 1894 and the 1920s, and popularised in the 1970s by Donald Knuth.
Big O takes the upper bound. The worst-case results in the worst execution of the algorithm. For our shopping list example, the worst-case is an infinite list.
Instead of saying the input is 10 billion, or infinite - we say the input is n size. The exact size of the input doesn't matter, only how our algorithm performs with the worst input. We can still work out Big O without knowing the exact size of an input.
Big O is easy to read once we learn this table:
| Constant | Logarithm | Linear | Polynomial | Exponential |
|---|---|---|---|---|
| $O(1)$ | $O(\log n)$ | $O(n)$ | $O(n^2)$, $O(n^3)$, $O(n^x)$ | $O(2^n)$ |
The further right they are, the longer the algorithm takes. `n` is the size of the input. Big O notation uses these functions to describe algorithm efficiency.
In our shopping list example, in the worst case our algorithm prints out every item in the list sequentially. Since there are `n` items in the list, it takes $O(n)$ time to complete the algorithm.
Other asymptotic (time-measuring) notations are:
| Big Omega (lower bound) | Big Theta (average bound) | Big O (upper bound) |
|---|---|---|
| $\Omega(n)$ | $\Theta(n)$ | $O(n)$ |
Informally this is:
Let's walk through every single column in our "The Big O Notation Table".
Constant algorithms do not scale with the input size, they are constant no matter how big the input. An example of this is addition. $1 + 2$ takes the same time as $500 + 700$. They may take more physical time, but we do not add more steps in the algorithm for addition of big numbers. The underlying algorithm doesn't change at all.
We often see constant as $O(1)$, but any number could be used and it would still be constant. We sometimes change the number to a 1, because it doesn't matter at all about how many steps it takes. What matters is that it takes a constant number of steps.
Constant time is the fastest of all Big O time complexities. The formal definition of constant time is:
It is upper-bounded by a constant
An example is:
def OddOrEven(n):
    return "Even" if n % 2 == 0 else "Odd"
Or in Java:
boolean isEven(double num) { return ((num % 2) == 0); }
In most programming languages, all integers have limits. Primitive operations (such as modulo, %) are all upper-bounded by this limit. If we go over this limit, we get an overflow error.
Because of this upper-bound, it satisfies $O(1)$.
Here's a quick explainer of what a logarithm is.
$$\log_{3} 9$$
What is being asked here is "3 to what power gives us 9?" Since 3 to the power of 2 gives us 9, the whole expression looks like:
$$\log_{3} 9 = 2$$
A logarithmic algorithm halves the list every time it’s run.
Let's look at binary search. Given the below sorted list:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
We want to find the number "2".
We implement Binary Search as:
def binarySearch(alist, item):
    first = 0
    last = len(alist) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            found = True
        elif item < alist[midpoint]:
            last = midpoint - 1
        else:
            first = midpoint + 1
    return found
In English this is:
The algorithm halves the input every single time it iterates. Therefore it is logarithmic. Other examples include:
Linear time means that every single element of the input is visited exactly once, so an input of size $n$ takes $n$ operations. As the input size $n$ grows, our algorithm's run time scales exactly with it.
Remember our shopping list app from earlier? The algorithm ran in O(n) which is linear time!
Linear time is where every single item in a list is visited once, in a worst-case scenario.
To read out our shopping list, our algorithm has to read out each item. It can't halve the list, and it can't skip items. It has to list all $n$ items, one at a time.
shopping_list = ["Bread", "Butter", "The Nacho Libre soundtrack from the 2006 film Nacho Libre", "Reusable Water Bottle"]
for item in shopping_list:
    print(item)
Let's look at another example.
Given the list:
a = [2, 16, 7, 9, 8, 23, 12]
How do we work out what the largest item is?
We need to program it like this:
a = [2, 16, 7, 9, 8, 23, 12]
max_item = a[0]
for item in a:
    if item > max_item:
        max_item = item
We have to go through every item in the list, 1 by 1.
Polynomial time means the running time is a polynomial function of the input size. A polynomial function looks like $n^2$ or $n^3$ and so on.
If one loop through a list is $O(n)$, 2 nested loops must be $O(n^2)$. The outer loop goes over the list once; for each item, the inner loop goes over the entire list again, resulting in $n^2$ operations.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in a:
    for x in a:
        print("x")
Each extra loop nested over the same list adds 1 to the exponent, so a triple-nested loop is $O(n^3)$.
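As a quick sanity check, we can count the iterations of a triple-nested loop directly (a small sketch):

```python
# Counting iterations of a triple-nested loop over one list.
# With n items, the innermost body runs n * n * n = n^3 times.
a = [1, 2, 3, 4]
count = 0
for i in a:
    for j in a:
        for k in a:
            count += 1
print(count)  # 64, which is len(a) ** 3
```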
Bubblesort is a good example of an $O(n^2)$ algorithm. The sorting algorithm takes the first number and swaps it with the adjacent number if they are in the wrong order. It does this for each number, until all numbers are in the right order - and thus sorted.
def bubbleSort(arr):
    n = len(arr)
    # Traverse through all array elements
    for i in range(n):
        # Last i elements are already in place,
        # so traverse the array from 0 to n-i-1
        for j in range(0, n - i - 1):
            # Swap if the element found is greater than the next element
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

# Driver code to test the above
arr = [64, 34, 25, 12, 22, 11, 90]
bubbleSort(arr)
As a side note, my professor refers to any algorithm with a time of polynomial or above as:
A complete and utter disaster! This is a disaster! A catastrophe!
But the thing with large time complexities is that they often show us that something can be sped up.
For instance, here's a problem I had: given a sentence, how many of its words appear in the English dictionary? We can imagine the $O(n^2)$ method: one for loop through the sentence, another through the dictionary.
dictionary = ["a", "an"]  # imagine if this was the whole dictionary
sentence = "hello uu988j my nadjjrjejas is brandon nanndwifjasj banana".split(" ")
counter = 0
for word in sentence:
    for item in dictionary:
        if word == item:
            counter = counter + 1
$O(n^2)$! A disaster! But, knowing that this is a disaster means we can speed it up.
The English dictionary is sorted alphabetically by default. What if we also sort the list of words in the sentence, and check them that way? Then we only need to loop through the dictionary once: if the word we want to check comes alphabetically before the dictionary word we're on, we switch to the next word in the sentence.
Now our algorithm is $O(n \log n)$, dominated by the sort. We recognise that this isn't a disaster, so we can move on! Knowing time complexities isn't only useful in interviews; it's an essential tool for improving our algorithms.
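One way to sketch this faster approach (the function and variable names here are my own, and I assume the dictionary list is already alphabetically sorted): sort the sentence's words in $O(n \log n)$, then walk the sorted words and the sorted dictionary together in a single merge-style pass.

```python
# Sketch: count how many words of a sentence appear in a sorted dictionary.
# Sorting the words costs O(n log n); the merge-style scan is O(n + m).
def count_known_words(sentence_words, sorted_dictionary):
    words = sorted(sentence_words)  # O(n log n)
    count = 0
    i = j = 0
    while i < len(words) and j < len(sorted_dictionary):
        if words[i] == sorted_dictionary[j]:
            count += 1
            i += 1                  # move to the next sentence word
        elif words[i] < sorted_dictionary[j]:
            i += 1                  # this word can't be in the dictionary
        else:
            j += 1                  # advance through the dictionary
    return count

dictionary = ["a", "an", "banana", "hello", "is", "my"]
sentence = "hello uu988j my nadjjrjejas is brandon banana".split(" ")
print(count_known_words(sentence, dictionary))  # 4
```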
With some knowledge of algorithmic design, we can speed up many of the polynomial algorithms we write.
Exponential time is $O(2^n)$, where the base (2 here) depends on how many possibilities there are at each step.
This class of algorithm is the slowest of them all. You saw how my professor reacted to polynomial algorithms; he jumps up and down in fury at exponential ones!
Say we have a password consisting only of numbers (10 digits, 0 through to 9), and we want to crack a password of length n. To bruteforce through every combination we'll have:
$$10^n$$
Combinations to work through.
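We can check the arithmetic with `itertools.product`, which enumerates every combination (a small sketch; $n = 3$ here so the list stays tiny):

```python
from itertools import product

# Every numeric password of length n over digits 0-9: 10^n combinations.
n = 3
combinations = list(product("0123456789", repeat=n))
print(len(combinations))          # 1000, i.e. 10 ** 3
print("".join(combinations[0]))   # 000
print("".join(combinations[-1]))  # 999
```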
One example of exponential time is to find all the subsets of a set.
>>> subsets([''])
['']
>>> subsets(['x'])
['', 'x']
>>> subsets(['a', 'b'])
['', 'a', 'b', 'ab']
We can see that when we have an input size of 2, the output size is $2^2 = 4$.
Now, let's code up subsets.
from itertools import chain, combinations

def subsets(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
Taken from the documentation for itertools. What's important here is that the output grows exponentially with the input size. Java code can be found here.
Exponential algorithms are horrific, but like polynomial algorithms we can learn a thing or two. Let's say we have to calculate $10^4$. We need to do this:
$$10 \times 10 \times 10 \times 10 = 10^2 \times 10^2$$
We have to calculate $10^2$ twice! What if we store that value somewhere and use it later so we do not have to recalculate it? This is the principle of Dynamic Programming, which you can read about here.
When we see an exponential algorithm, dynamic programming can often be used to speed it up.
Again, knowing time complexities allows us to build better algorithms.
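One classic form of this reuse is exponentiation by squaring: compute the half power once and use it twice. A minimal sketch (the function name is mine, not a standard library one):

```python
def fast_power(base, exponent):
    """Compute base ** exponent, reusing the half power instead of
    recalculating it - the "store it and use it later" idea."""
    if exponent == 0:
        return 1
    half = fast_power(base, exponent // 2)  # computed once, used twice
    if exponent % 2 == 0:
        return half * half
    return half * half * base

print(fast_power(10, 4))  # 10000 - 10^2 is computed once, then squared
```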
Here's our Big O notation graph where the numbers are reduced so we can see all the different lines.
You can find the code for this graph here.
Rarely will time complexity be as easy as counting how many for loops we have. What if our algorithm looks like $O(n + n^2)$? We can simplify our algorithm using these simple rules:
- If we have an algorithm described as $O(2n)$, we drop the constant $2$ so it becomes $O(n)$.
- $O(n^2 + n)$ becomes $O(n^2)$: keep only the fastest-growing term.
- If we have a sum of different variables, such as $O(b^2 + a)$, we can't drop either term without knowing what $b$ and $a$ are.
Yup! The hardest part is figuring out what our program's complexity is first. Simplifying is the easy part! Just remember the golden rule of Big O notation:
"What is the worst-case scenario here?"
This is the fastest time possible for a comparison sort. We cannot get any faster unless we exploit some special property of the input, as Radix sort does.
It's rather famous because Mergesort runs in $O(n \log n)$. Mergesort is often used to teach divide & conquer algorithms, and for good reason: it's a fantastic sorting algorithm whose core idea is used to build other algorithms too.
Mergesort works by breaking down the list of numbers into individual numbers:
And then sorting each list, before merging them:
def mergeSort(alist):
    print("Splitting ", alist)
    if len(alist) > 1:
        mid = len(alist) // 2
        lefthalf = alist[:mid]
        righthalf = alist[mid:]

        mergeSort(lefthalf)
        mergeSort(righthalf)

        i = 0
        j = 0
        k = 0
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] <= righthalf[j]:
                alist[k] = lefthalf[i]
                i = i + 1
            else:
                alist[k] = righthalf[j]
                j = j + 1
            k = k + 1

        while i < len(lefthalf):
            alist[k] = lefthalf[i]
            i = i + 1
            k = k + 1

        while j < len(righthalf):
            alist[k] = righthalf[j]
            j = j + 1
            k = k + 1
    print("Merging ", alist)

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
mergeSort(alist)
print(alist)
Notice the 1e10 at the top of the graph? This curve is so large, it makes all other times look constant!
This time complexity is often used as a joke, referring to Bogosort. I have yet to find a real-life (not-a-joke) algorithm that runs in $O(n!)$ and isn't simply generating all $n!$ permutations of its input.
Our own algorithms will normally be based on some famous algorithm that already has a Big O notation. If it's not, do not worry! Working out the Big O of our algorithm is easy.
Just think:
"What is the absolute worst input for my program?"
Take, for instance, a sequential searching algorithm.
def search(listInput, toFind):
    for counter, item in enumerate(listInput):
        if toFind == item:
            return (counter, item)
    return "did not find the item!"
The best input would be:
search(["apples"], "apples")
But the worst input is if the item was at the end of a long list.
search(["apples", "oranges", "The soundtrack from the 2006 film Nacho Libre", "Shrek"], "Shrek")
The worst-case scenario is $O(n)$, because we may have to go through every item in the list before we find the one we want.
What if our search algorithm was binary search? We learnt that binary search halves the list every time. That sounds like $O(\log n)$!
What if our binary search looks for an object, and then looks to find other similar objects?
# Here we want to find the film Shrek, look up its IMDB rating, and then
# find other films with that rating. We use binary search for the title,
# then sequential search for the rating.
toFind = {"title": "Shrek", "IMDBrating": None}
ret = search(toFind)
ret = search(ret["IMDBrating"])
We find Shrek with an IMDB score of 7.8. But we're only sorted on the title, not the IMDB rating. We have to use sequential search to find all other films with the same rating.
Binary search is $O(\log n)$ and sequential search is $O(n)$. Running one after the other gives $O(\log n + n)$, which simplifies to $O(n)$. This isn't a disaster, so we can be sure it's not a terrible algorithm.
Even in the instances where our algorithms are not strictly related to other algorithms, we can still compare them to things we know. $O(\log n)$ means halving. $O(n^2)$ means a nested for loop.
One last thing: we don't always deal with just $n$. Take the algorithm below:
from itertools import cycle

x = [1, 2, 3, 4, 5]
y = cycle([2, 6])  # cycle repeats the y list: 2, 6, 2, 6, ...

counter = 0
total = 0.0
while counter != len(x):
    # multiplies 1 by 2, then 2 by 6, then 3 by 2, and so on
    total = total + x[counter] * next(y)
    counter += 1
print(total)
We have 2 inputs, x and y. Our notation is then $O(x + y)$. Sometimes we cannot make our notation smaller without knowing more about the data.
I made this little infographic for you! The "add +1 for every nested for loop" depends on the for loop, as we saw earlier. But explaining that all over again would take up too much space 😅
Okay, this is where it gets hard. A lot of the complaints against Big O notation are along the lines of:
"You didn't really teach it, to really understand it you have to understand the maths!"
And I kinda agree. The surface level knowledge above will be good for most interviews, but the stuff here is the stuff needed to master Big O notation.
Just as a reminder, we want to master asymptotic time complexity as it allows us to create better algorithms.
I'm going to be writing out the formal notation, and then explaining it simply. Over-simplification causes misinformation, so if you are studying for a test take my simplifications as generalities and not the truth.
The mathematics is the whole truth, and you would be better off studying the maths rather than my simplifications. As I once read on the internet:
Shut up and calculate.
First things first, when I said:
Big O notation is the worst-case
That's not true at all. It's a white lie designed to help you learn the basics, and it's often enough to pass interviews, but not enough to use Big O in the real world.
The formal definition of Big O notation is:
The upper-bounded time limit of the algorithm
Now, this often means "the worst-case" but not always. We can put upper bounds on whatever we want. But more often than not, we put upper-bounds on the worst-case. In one of our examples, we'll come across a weird formula where "the worst-case" isn't necessarily the one we choose for Big O.
This is an important distinction to make, because some caveats will confuse us otherwise.
Given 2 positive functions, $f(n)$ and $g(n)$ we say $f(n)$ is $O(g(n))$, written $f(n) \in O(g(n))$, if there are constants $c$ and $n_0$ such that:
$$f(n) \le c \cdot g(n) \quad \forall\, n \geq n_0$$
Also, sometimes $n_0$ is called $k$. But $c$ stays the same.
This is saying that the function (algorithm) is a part of another function (the Big O notation used). Simplifying again: Our algorithm falls within the range of a Big O notation time complexity (O(n), O(log n), etc). So our algorithm is that time complexity (to simplify it).
Let's see an example.
$$7n - 4 \in O(n)$$
Here we are claiming that $7n - 4$ is in $O(n)$ time. In formal Big O notation, we don't say it is that time. We say it falls within the range of that time.
We need to find constants $c$ and $n_0$ such that $7n-4 \le cn$ for all $n \geq n_0$.
One choice is $c = 7$ and $n_0 = 1$: for every $n \geq 1$ we have $7n - 4 \le 7n$, since $-4 \le 0$.
This is just one of the many choices, because any real number $c \geq 7$ and any integer $n_0 \geq 1$ would be okay.
Another way to rephrase this is:
$$7n - 4 \le 7n \quad \text{where } n \geq 1$$
The left-hand side, $7n - 4$, is $f(n)$. Here $c = 7$ and $g(n) = n$. Therefore we can say $f(n) \in O(n)$.
All we have to do is substitute values into the formula until we find values for $c$ and $n_0$ that work. Let's go through some examples now.
$$f(n) = 4n^2 + 16n + 2$$
Is $f(n)$ $O(n^4)$?
We need to take this function:
$$f(n) = 4n^2 + 16n + 2$$
and say "is this less than some constant times $n^4$?" We need to find out if there is such a constant.
$$4n^2 + 16n + 2 \le n^4$$
Let's make a chart. If $n = 0$ we get:
$$4(0)^2 + 16(0) + 2 = 2 \le 0^4 = 0$$
This isn't true, so $n = 0$ doesn't work.
When $n = 1$:
$$4(1)^2 + 16(1) + 2 = 22 \le 1^4 = 1$$
This is not true either. Let's try again with $n = 2$:
$$4(2)^2 + 16(2) + 2 = 50 \le 2^4 = 16$$
Not true, so let's try another one, $n = 3$:
$$4(3)^2 + 16(3) + 2 = 86 \le 3^4 = 81$$
Not true. Looks like the next one should work, as we're approaching the tipping point. $n = 4$:
$$4(4)^2 + 16(4) + 2 = 130 \le 4^4 = 256$$
This is true: once $n = 4$ or greater, the function stays below $n^4$. So the answer to "is $4n^2 + 16n + 2$ Big O of $n^4$?" is yes, when $c = 1$ and $n_0 = 4$.
Note: I'm saying $c=1$ but I'm not writing $cn$ every time. Later on, using c will become important. But for these starter examples we'll just assume $c = 1$ until said otherwise.
$$3n^2 + n + 16$$
Is this $O(n^2)$?
We know that $n \le n^2$ for all $n \geq 1$. Also, $16 \le n^2$ for $n \geq 4$.
So:
$$3n^2 + n + 16 \le 3n^2 + n^2 + n^2 = 5n^2$$
for all $n \geq 4$.
The constant $c$ is 5, and $n_0 = 4$.
$$13n^3 + 7n \log n + 3$$
is $O(n^3)$?
Because $\log n \le n$ (so $n \log n \le n^2 \le n^3$) for all $n \geq 1$, and for similar reasons as above, we may conclude that:
$13n^3 + 7n \log n + 3 \le 13n^3 + 7n^3 + 3n^3 = 23n^3$ for all 'large enough' $n$.
In this instance, $c = 23$.
$$45n^5 - 10n^2 + 6n - 12$$
is $O(n^5)$?
Any polynomial $a_k n^k + ... + a_2 n^2 + a_1 n + a_0$ with $a_k > 0$ is $O(n^k)$.
Along the same lines, we can argue that any polynomial $a_k n^k + ... + a_2 n^2 + a_1 n + a_0$ with $a_k > 0$ is also $O(n^j)$ for all $j \geq k$.
Therefore $45n^5 - 10n^2 + 6n - 12$ is $O(n^5)$ (and is also $O(n^8)$, $O(n^9)$, or $O(n^j)$ for any $j \geq 5$).
$$n$$
is $O(\sqrt{n})$?
This doesn't hold true. $\sqrt{n} = n^{1/2}$ grows strictly slower than $n$, so no constant multiple of $\sqrt{n}$ can stay above $n$ forever.
I hope you appreciate the easy example to break up the hard maths 😉
$$3 \log n + \log \log n$$
is $O(\log n)$?
First, we have the inequality $\log n \le n$ for every $n \geq 1$. Replacing $n$ with $\log n$ on both sides gives $\log \log n \le \log n$: the double log is smaller than the single log. So:
$$3 \log n + \log \log n \le 4 \log n$$
So:
$$c = 4, n_0 = 1$$
$\log n$
is $O(\sqrt{n})$?
$\log m$ grows slower than any positive power of $m$:
$$\log m \le m^\epsilon$$
for every $\epsilon > 0$, no matter how small, as long as it is positive (and $m$ is large enough).
Plugging in $\epsilon = \frac{1}{2}$ gives $m^{\frac{1}{2}} = \sqrt{m}$.
Knowing that $\log m \le \sqrt{m}$ for large enough $m$, we conclude that $\log n$ is $O(\sqrt{n})$.
$$2n + 3$$
What is the Big O of this?
$$2n + 3 \le 10n, n \geq 1$$
$$f(n) = O(n)$$.
This is because for $n \geq 1$, $10n$ is always at least as large as $f(n) = 2n + 3$. Therefore, we have $O(n)$.
$$2n + 3 \le 10n$$
We don't have to write 10, it can be whatever we want so long as the equation holds true.
$$2n + 3 \le 2n + 3n$$
$$2n + 3 \le 5n, n \geq 1$$
Therefore $f(n) = O(n)$.
Or we can write:
$$2n + 3 \le 5n^2 , n \geq 1$$
$f(n) = 2n + 3$
$c = 5$
$g(n) = n^2$
Can this same function be both $O(n)$ and $O(n^2)$? Yes, it can. This is where our definition of Big O comes into play: it's an upper bound. We can say the function is $O(n^2)$, $O(2^n)$, or anything higher, but we cannot say it's anything lower than $O(n)$.
When we write Big O, we usually want to use the closest function. Otherwise we could say that every fast algorithm has an upper bound of $O(2^n)$, which is unhelpful.
Note: preferring the closest function is just convention in most courses. Any function that is a genuine upper bound is true.
There's a fantastic video on this strange concept here (and I took this example from there).
Big O represents how long an algorithm takes but sometimes we care about how much memory (space complexity) an algorithm takes too. If you're ever stuck, come back to this page and check out the infographics!
In this short 10 minute article, you’ll learn what the functional paradigm is in Python. You’ll also learn about list comprehensions.
In an imperative paradigm, we do things by giving the computer a sequence of tasks, which it then executes. While executing them, it can change state. For example, we might set A = 5, then later change the value of A. We have variables in the sense that the value inside the variable varies.
In a functional paradigm, we don’t tell the computer what to do but we tell it what stuff is. What the greatest common divisor of a number is, and so on.
Variables cannot vary. Once we set a variable, it stays that way forever. Because of this, functions have no side effects in the functional paradigm. A side effect is where the function changes something outside of it. Let’s look at an example:
a = 3

def some_func():
    global a
    a = 5

some_func()
print(a)
The output is 5. In the functional paradigm, changing variables is a big no-no and having functions affect things outside of their scope is also a big no-no. The only thing a function can do is calculate something and return it.
Now you might think "no variables, no side effects? Why is this good?". Good question, gnarly stranger reading this.
If a function is called twice with the same parameters, it's guaranteed to return the same result. If you've learnt about mathematical functions, you'll appreciate this benefit. We call this referential transparency. Because functions have no side effects, if the program knows that func(2) equates to 3, we can store this result in a table. This prevents the program from re-running the same function when we already know the answer.
Typically, in functional programming, we do not use loops. We use recursion. Recursion is a mathematical concept, it means “feeding into itself”. With a recursive function, the function calls itself as a sub-function. Here’s a nice example of a recursive function in Python:
def factorial_recursive(n):
    # Base case: 1! = 1
    if n == 1:
        return 1
    # Recursive case: n! = n * (n-1)!
    else:
        return n * factorial_recursive(n - 1)
Some programming languages are also lazy. This means they don't compute or do anything until the very last second. If we write some code to perform 2 + 2, a lazy functional program only calculates it when the result is actually needed. We'll explore laziness in Python soon.
To understand map, let's first look at what iterables are. An iterable is anything we can iterate over. These are lists or arrays, but Python has many different iterables. We can even create our own iterable objects by implementing magic methods. A magic method is like an API that helps our objects become more Pythonic. We need to implement 2 magic methods to make an object an iterable:
class Counter:
    def __init__(self, low, high):
        # set class attributes inside the magic method __init__
        # (for "initialise")
        self.current = low
        self.high = high

    def __iter__(self):
        # first magic method to make this object iterable
        return self

    def __next__(self):
        # second magic method
        if self.current > self.high:
            raise StopIteration
        else:
            self.current += 1
            return self.current - 1
The first magic method, __iter__ or dunder iter (double underscore iter), returns the iterable object; we often use this at the start of a loop. Dunder next, __next__, returns what the next object is.
Let’s check this out:
for c in Counter(3, 8):
print(c)
This will print:
3
4
5
6
7
8
In Python, an iterable is an object with an __iter__ magic method, which lets us loop over it. An iterator additionally has a __next__ magic method, which hands us the items one at a time. Some objects, such as sets (talked about later in this article), are iterable without being iterators themselves. For this article, we'll assume everything we touch is an iterable object.
Now we know what an iterable object is, let’s go back to the map function.
The map function lets us apply a function to every item in an iterable. We want to apply a function to every item in a list, but know that it’s possible for most iterables. The syntax for map takes 2 inputs, the function to apply and the iterable object.
map(function, iterable)
Say we have a list of numbers like:
[1, 2, 3, 4, 5]
And we want to square every number, we can write code like this:
x = [1, 2, 3, 4, 5]

def square(num):
    return num * num

print(list(map(square, x)))
Functional Python is lazy. If we didn't include the list(), map would return a lazy map object rather than the computed list. We need to tell Python "turn this into a list" for us to use it.
It’s weird to go from non-lazy evaluation to lazy evaluation suddenly in Python. You’ll get used to it if you think more in the functional mindset than an imperative mindset.
Now it’s nice to write a normal function like square(num)
but it doesn’t look right. Do we have to define a whole function just to use it once in a map? Well, we can define a function in map using a lambda (anonymous) function.
Lambda functions are a one-line function, used for a short period of time. We often use them with higher order functions along with filter, map, and reduce. This lambda expression squares a number given to it:
square = lambda x: x * x
Now let’s run this:
>>> square(3)
9
I hear you. “Brandon, where are the arguments? what the heck is this? that looks nothing like a function?”
Well, it’s confusing but can be explained. We’re assigning something to the variable square
. this part:
lambda x:
Tells Python that this is a lambda function, and the input is called x. Anything after the colon is what we do with the input, and it returns whatever the result of that is.
To simplify our square program into one line, we can do:
x = [1, 2, 3, 4, 5]
print(list(map(lambda num: num * num, x)))
In a lambda expression, all the arguments go on the left and the stuff we want to do with them go on the right. It gets a little messy, no one can deny that. There’s a certain pleasure in writing code that only other functional programmers can read. Also, it’s super cool to take a function and turn it into a one-liner.
Reduce is a function that applies a given function on an iterable and returns one thing. Normally we’ll perform a computation on a list to reduce it down to one number. Reduce looks like this:
reduce(function, list)
We can (and often will) use lambda expressions as the function.
The product of a list is every single number multiplied together. To program this:
product = 1
x = [1, 2, 3, 4]
for num in x:
    product = product * num
But with reduce we can write:
from functools import reduce
product = reduce((lambda x, y: x * y),[1, 2, 3, 4])
To get the same product. The code is shorter, and with knowledge of functional programming, it is neater.
The filter function takes an iterable and filters out all the things we don’t want in that iterable.
Filter takes a function and a list. It applies the function to each item in the list; if the function returns True, the item is kept. If it returns False, the item is removed from the result.
The syntax looks like:
filter(function, list)
Let’s see a small example, without filter we’ll write:
x = range(-5, 5)
new_list = []
for num in x:
    if num < 0:
        new_list.append(num)
With filter, this becomes:
x = range(-5, 5)
all_less_than_zero = list(filter(lambda num: num < 0, x))
Higher order functions can take functions as parameters and return functions. An example is:
def summation(nums):
    return sum(nums)

def action(func, numbers):
    return func(numbers)

print(action(summation, [1, 2, 3]))
# Output is 6
Or a simple example of the second definition, returning functions, is:
def rtnBrandon():
    return "brandon"

def rtnJohn():
    return "john"

def rtnPerson():
    age = int(input("What's your age?"))
    if age == 21:
        return rtnBrandon()
    else:
        return rtnJohn()
Higher-order functions make non-varying variables easier to work with. We need not store a variable anywhere if all we’re doing is passing data through a long tunnel of functions.
All functions in Python are first-class objects. We define a first-class object as having one or more of these features:
So all functions in Python are first-class and can be used as a higher-order function.
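As a small sketch of what first-class means in practice, we can assign functions to variables, store them in data structures, and pass them to other functions (all names here are made up for illustration):

```python
def shout(text):
    return text.upper()

def whisper(text):
    return text.lower()

# Assigned to a variable, stored in a dict, and passed as an argument:
speak = shout
styles = {"loud": shout, "quiet": whisper}

def apply_style(func, text):
    return func(text)

print(speak("hello"))                     # HELLO
print(styles["quiet"]("HELLO"))           # hello
print(apply_style(shout, "first-class"))  # FIRST-CLASS
```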
Partial application (often implemented with closures) is weird but cool. We can call a function without supplying all the arguments it requires. Let's see this in an example. We want to create a function which takes 2 arguments, a base and an exponent, and returns base to the power of the exponent, like so:
def power(base, exponent):
    return base ** exponent
Now we want to have a dedicated square function, to work out the square of a number using the power function:
def square(base):
    return power(base, 2)
This works, but what if we want a cube function? Or a function to the power of 4? Can we keep on writing them forever? Well, we could, but programmers are lazy: if we repeatedly do the same thing, it's a sign that there's a quicker way that avoids the repetition. We can use partial application here. Let's see the square function written using a partial application:
from functools import partial
square = partial(power, exponent=2)
print(square(2))
# output is 4
Isn’t that cool! We can call functions which require 2 arguments, using only 1 argument by telling Python what the second argument is.
We can also use a loop to generate power functions from squares up to powers of 1000.

from functools import partial

powers = []
for x in range(2, 1001):
    powers.append(partial(power, exponent=x))

print(powers[0](3))
# output is 9
You might have noticed, but a lot of the things we want to do in functional programming revolve around lists. Other than the reduce function & partial application, all the functions we have seen generate lists. Guido (the inventor of Python) dislikes functional stuff in Python because Python already has its own way to generate lists.
If we write “import this” into a Python IDLE session, we’ll get:
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let’s do more of those!
This is the Zen of Python. It’s a poem about what something being Pythonic means. The part we want to relate to here is:
There should be one — and preferably only one — obvious way to do it.
In Python, map & filter can do the same things as a list comprehension (discussed next). This breaks one rule of the Zen of Python, so these parts of functional programming aren't seen as very 'Pythonic'.
Another talking point is lambda. In Python, a lambda function is a normal function; lambda is just syntactic sugar. These two are equivalent:
foo = lambda a: 2
def foo(a):
    return 2
A regular function can do everything a lambda function can, but it doesn't work the other way around: a lambda cannot do everything a regular function can.
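One concrete limitation: a lambda's body must be a single expression, so anything needing statements (loops, assignments) requires a regular def. A small sketch:

```python
# Fine as a lambda: the body is a single expression.
double = lambda x: x * 2

# Not possible as a lambda: the body below needs statements
# (a loop and an assignment), so it must be a regular function.
def running_total(numbers):
    total = 0
    for n in numbers:
        total += n
    return total

print(double(5))                 # 10
print(running_total([1, 2, 3]))  # 6
```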
This was a short argument about why functional programming doesn’t fit into the whole Python ecosystem very well. You may have noticed I mentioned list comprehensions earlier, we’ll discuss them now.
Earlier, I mentioned that anything we could do with map or filter, we could do with a list comprehension. This is the part where we’ll learn about them.
A list comprehension is a way to generate lists in Python. The syntax is:
[expression for item in iterable]
So let’s square every number in a list, as an example:
print([x * x for x in [1, 2, 3, 4]])
Okay, so we can see how we can apply a function to every item in a list. How do we go around applying a filter? Well, look at this code from earlier:
x = range(-5, 5)
all_less_than_zero = list(filter(lambda num: num < 0, x))
print(all_less_than_zero)
We can convert this into a list comprehension like so:
x = range(-5, 5)
all_less_than_zero = [num for num in x if num < 0]
List comprehensions support if statements like this. We no longer need to apply a million functions to something to get what you want. If we’re trying to make some kind of list chances are that it’ll look cleaner and easier using a list comprehension.
What if we want to square every number below 0 in a list? Well, with lambda, map and filter we’ll write:
x = range(-5, 5)
all_less_than_zero = list(map(lambda num: num * num, list(filter(lambda num: num < 0, x))))
This is long and complicated. With a list comprehension it is:
x = range(-5, 5)
all_less_than_zero = [num * num for num in x if num < 0]
A list comprehension is only good for, well, lists. Map and filter work on any iterable, so what’s up with that? We can use any comprehension for any iterable object we encounter.
We can generate any iterable using a comprehension. Since Python 2.7, we can even generate a dictionary (hashmap).
# Taken from page 70, chapter 3 of Fluent Python by Luciano Ramalho
DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan'),
]
>>> country_code = {country: code for code, country in DIAL_CODES}
>>> country_code
{'Brazil': 55, 'Indonesia': 62, 'Pakistan': 92, 'Russia': 7, 'China': 86, 'United States': 1, 'Japan': 81, 'India': 91, 'Nigeria': 234, 'Bangladesh': 880}
>>> {code: country.upper() for country, code in country_code.items() if code < 66}
{1: 'UNITED STATES', 7: 'RUSSIA', 62: 'INDONESIA', 55: 'BRAZIL'}
If it’s an iterable, we can generate it. Let’s look at one last example of sets. If you don’t know what a set is, check out this other article I wrote. The TL;DR is:
# taken from page 87, chapter 3 of Fluent Python by Luciano Ramalho
>>> from unicodedata import name
>>> {chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}
{'×', '¥', '°', '£', '©', '#', '¬', '%', 'µ', '>', '¤', '±', '¶', '§', '<', '=', '®', '$', '÷', '¢', '+'}
You may notice that sets have the same curly braces as dictionaries. Python is smart. It’ll know whether we're writing a dictionary comprehension or a set comprehension based on whether we provide the extra value for the dictionary or not. If you want to learn more about comprehensions, check out this visual guide.
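A quick way to see the distinction (variable names are my own):

```python
nums = [1, 2, 3]

# A key: value pair makes it a dictionary comprehension...
squares_dict = {n: n * n for n in nums}
# ...while a lone expression makes it a set comprehension.
squares_set = {n * n for n in nums}

print(type(squares_dict).__name__)  # dict
print(type(squares_set).__name__)   # set
```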
Functional programming is beautiful and pure. Functional code can be clean, but it can also be messy. You should use what you want to use.
This isn't about downloading things over BitTorrent, or the best clients to do so.
It's a deep-dive into the technical side of the protocol.
Anyone can read this article - it requires zero knowledge of networking or BitTorrent.
BitTorrent is one of the most common protocols for transferring large files. In February 2013, BitTorrent was responsible for 3.35% of all worldwide bandwidth, more than half of the 6% of total bandwidth dedicated to file sharing.
Let's dive right in.
Bram Cohen invented the BitTorrent protocol in 2001. Cohen wrote the first client implementation in Python.
Cohen collected free pornography to lure beta testers to use BitTorrent in the summer of 2002.
In traditional downloading, the server uploads the file, and the client downloads the file.
For popular files, this isn't very effective.
First, 500 people downloading the same file will put the server under strain. This strain caps the upload speed, so clients cannot download the file quickly.
Second, client-server costs a lot of money. The amount we pay increases with how popular a file is.
Third, it’s centralised. If the server dies, the file no longer exists - no one can download it.
BitTorrent aims to solve these problems.
In a peer-to-peer network, every peer is connected to every other peer in the network.
Semi-centralised peer-to-peer networks possess one or more peers with higher authority than most peers.
BitTorrent is a way to share files, often large ones. It’s an alternative to a single source (such as a server) sharing a file, and it works well even on low-bandwidth connections.
The first release of the BitTorrent client had no search engine and no peer exchange. Users who wanted to share a file had to create a small torrent descriptor file and upload it to a torrent index site.
When a user wants to share a file, they seed their file. This user is called a seeder. They upload a torrent descriptor file to an exchange (we’ll talk about this later). Anyone who wants to download that file will download this torrent descriptor.
We call those who download peers. Their torrent client will connect to a tracker (discussed later), and the tracker will send them a list of IP addresses of other seeds and peers in the swarm. The swarm is all of the computers involved with a certain torrent.
The torrent descriptor file contains a list of trackers and metadata on the file we’re downloading.
A peer will connect to a seed and download parts of the file.
Once a peer completes a download, they can function as a seed - though it is also possible (and very common) to function as a seed while still downloading.
Once the seed has shared the file to a peer, that peer will act as a seed. Instead of the client-server model where only 1 server exists to upload the file, in BitTorrent, multiple people can upload the same file.
BitTorrent splits the file up into chunks called pieces, each of a certain size. Sometimes it’s 256KB, sometimes it’s 1MB. As each peer receives a piece, they become a seed of that piece for other peers.
With BitTorrent, we do not have a single source to download from. We could download a few pieces from peers in our home country, then download the pieces they don’t have from a far-away country.
The protocol hashes the pieces to make sure no seed has tampered with the original file, then stores the hashes in the torrent descriptor on the tracker.
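Here's a minimal sketch of that hashing scheme in Python. The piece length and the in-memory "file" are toy values - real torrents use pieces of 256 KiB or more - but the mechanics are the same:

```python
import hashlib

PIECE_LENGTH = 4  # toy value; real torrents use e.g. 262144 (256 KiB)

def make_pieces_field(data: bytes) -> bytes:
    """Split data into pieces and concatenate the SHA-1 hash of each,
    as stored in the 'pieces' field of a torrent descriptor."""
    hashes = b""
    for start in range(0, len(data), PIECE_LENGTH):
        piece = data[start:start + PIECE_LENGTH]
        hashes += hashlib.sha1(piece).digest()  # each digest is 20 bytes
    return hashes

def verify_piece(data: bytes, index: int, pieces_field: bytes) -> bool:
    """Check a downloaded piece against the stored hash list."""
    piece = data[index * PIECE_LENGTH:(index + 1) * PIECE_LENGTH]
    expected = pieces_field[index * 20:(index + 1) * 20]
    return hashlib.sha1(piece).digest() == expected

data = b"hello world!"              # 12 bytes -> 3 toy pieces
pieces = make_pieces_field(data)
print(len(pieces))                   # 60: 3 pieces x 20 bytes each
print(verify_piece(data, 1, pieces))  # True
```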
This is how BitTorrent works at a very high level. We will now go into detail. We aim to answer these questions:
And much more.
The torrent descriptor is a dictionary (or hashmap) file.
Its keys are described below:
Announce - the URL of the tracker. Remember earlier, when we contacted the tracker server to find other peers sharing the same file? We found that tracker by using the announce key in the torrent descriptor file.
Info - maps to a dictionary whose keys depend on whether one or more files are being shared. The keys are:
Files (child of info, is a list)
Files only exists when multiple files are being shared. It is a list of dictionaries, one per file.
Each of these dictionaries has 2 keys.
Length - the size of the file in bytes.
Path - A list of strings corresponding to subdirectory names, the last of which is the actual file name.
Length - the size of the file in bytes (only present when one file is being shared).
Name - the suggested file name, or the suggested directory name.
Piece length - the number of bytes per piece.
The piece's length must be a power of two and at least 16KiB.
This is $$2^8 \; KiB = 256 \; KiB = 262,144 \; B$$
Pieces - a hash list.
A list of hashes calculated on various chunks of data. We split the data into pieces. Calculate the hashes for those pieces, store them in a list.
BitTorrent uses SHA-1, which returns a 160-bit hash. Pieces will be a string whose length is a multiple of 20 bytes.
If the torrent contains multiple files, the pieces are formed by concatenating the files in the order they appear in the files list.
All pieces in the torrent are the full piece length except for the last piece which may be shorter.
Now, I can guess what you’re thinking.
"SHA-1? What is this? The early 2000s?"
And I agree. BitTorrent is moving from SHA-1 to SHA-256.
Still confused? Not to worry! I designed this JSON file that describes what a torrent file looks like. Note: I’ve concatenated some things. This makes it easier to read and understand the general layout. I made the numbers up, following the rules of BitTorrent’s torrent descriptor.
{
    "announce": "url of tracker",
    "info": {
        "files": [
            {
                "length": 16,
                "path": ["folder", "to", "path"]
            },
            {
                "length": 193,
                "path": ["another", "folder"]
            }
        ],
        "name": "Ubuntu.iso",
        "piece length": 262144,
        "pieces": [
            "AAF4C61DDCC5E8A2DABEDE0F3B482CD9AEA9434D",
            "CFEA2496442C091FDDD1BA215D62A69EC34E94D0"
        ]
    }
}
One of the largest questions in BitTorrent is “what pieces should I select to download?”
With a traditional client-server model, we download the whole file. But now, we get to pick what pieces to download.
The idea is to download the pieces that no one else has - the rare pieces. By downloading the rare pieces, we make them less rare by uploading them.
BitTorrent uses TCP, a transmission protocol for packets. TCP has a mechanism called slow start.
Slow start is a mechanism that balances the speed of a TCP network connection. It gradually escalates the amount of data transmitted until it finds the network's maximum carrying capacity. cwnd stands for Congestion Window.
TCP does this because if we send lots of data at once, the server may not be able to handle the traffic, and congestion will happen on the network.
If we’re not regularly sending data, TCP may cap our network connection at a slower speed than normal.
BitTorrent makes sure to always send data by breaking pieces down into further sub-pieces.
Each sub-piece is about 16KB in size. The size for a piece is not fixed, but it is somewhere around 1MB.
The protocol always keeps some number of requests (typically five) for sub-pieces pipelined. When a new sub-piece is downloaded, the client sends a new request. This helps speed things up.
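A toy simulation of that pipelining idea (the depth of five comes from the text; the "requests" here are just numbers, not real network calls):

```python
from collections import deque

PIPELINE_DEPTH = 5  # the client keeps this many sub-piece requests in flight

def download(sub_pieces):
    """Simulate pipelined sub-piece requests: whenever one completes,
    immediately queue the next so the TCP connection never idles."""
    pending = deque()
    remaining = deque(sub_pieces)
    completed = []
    # Fill the pipeline up to its depth.
    while remaining and len(pending) < PIPELINE_DEPTH:
        pending.append(remaining.popleft())
    while pending:
        completed.append(pending.popleft())      # one request completes...
        if remaining:
            pending.append(remaining.popleft())  # ...so send a new one
    return completed

print(download(list(range(12))))  # all 12 sub-pieces, in request order
```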
Sub-pieces can be downloaded from other peers.
Two core policies govern the Piece Selection Algorithm.
Once the BitTorrent client requests a sub-piece of a piece, any remaining sub-pieces of that piece are requested before any sub-pieces from other pieces.
If we already hold some sub-pieces of a piece, it makes sense to download the remaining sub-pieces of that piece first rather than start downloading another piece.
The main policy in BitTorrent is to pick the rarest first. We want to download the piece which the fewest other peers own.
This is so we can make it ‘un-rare’. If only one peer has a piece and they go offline, no one will get the complete file.
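A minimal sketch of rarest-first selection, assuming we can see each peer's bitfield (which pieces they hold) - the peer data below is made up:

```python
def rarest_first(my_pieces, peer_bitfields, num_pieces):
    """Pick the piece we still need that the fewest peers have."""
    availability = [0] * num_pieces
    for bitfield in peer_bitfields:
        for i, has in enumerate(bitfield):
            availability[i] += has
    # Pieces we don't own yet but at least one peer can serve.
    needed = [i for i in range(num_pieces)
              if i not in my_pieces and availability[i] > 0]
    return min(needed, key=lambda i: availability[i])

# 3 peers, 4 pieces: piece 2 is held by only one peer, so it's picked first.
peers = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]
print(rarest_first(my_pieces=set(), peer_bitfields=peers, num_pieces=4))  # 2
```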
A plethora of benefits exist for this policy.
Growing the seed
Rarest first makes sure that we download only new pieces from the seed.
The seed begins as a bottleneck: it is the one peer with the complete file.
A downloader can see what pieces their peers possess, and the rarest first policy will cause us to fetch the pieces from the seed which have not been uploaded by other peers.
Let's visualise this.
The list of nodes (peers) is fully inter-connected, which makes the complete network hard to draw.
Each arrow towards a sub-piece shows what that peer has downloaded. Say we downloaded a sub-piece that no one else has, other than the seed. This means the sub-piece is rare.
Our upload rate is higher than that of the seed, so all peers will want to download from us - especially since they want the rarest pieces first, and we are one of only two holders of the rarest piece.
When everyone downloads from us, we can download faster from them. This is the tit-for-tat algorithm (discussed later).
Increased download speed
The more peers that hold the piece, the faster the download can happen. This is because we can download sub-pieces from other peers.
Enable uploading
A rare piece is the most wanted by other peers, and getting a rare piece means peers will be interested in downloading it from us. As we will see later, the more we upload, the more we can download.
Most common last
It is sensible to leave the most common pieces to the end of the download. As many peers hold common pieces, the probability of being able to download them is much larger than that of rare pieces.
Prevent rarest piece missing
When the seed dies, all the different pieces of the file should be distributed somewhere among the remaining peers.
When we first join a swarm, we have nothing to upload, and we need our first piece fast. The rarest-first policy is slow here: rare pieces download slowly because their sub-pieces are held by only a few peers.
Sometimes a peer with a slow transfer rate will try to give us a sub-piece, delaying the download. To prevent this, there is "endgame mode".
Remember the pipe-lining principle? There are always several requests for sub-pieces pending.
When all the sub-pieces a peer lacks are requested, they broadcast this request to all peers. This helps us get the last chunk of the file.
If a peer has the missing sub-piece, they will send that back to our computer.
Once a sub-piece arrives, we send a cancel-message telling the other peers to ignore our request.
No centralised resource allocation exists in BitTorrent. Instead, every peer maximises their own download rate.
A peer will download from whoever they can. To decide who to upload to, they use a variant of the "tit-for-tat" algorithm.
The tit-for-tat strategy comes from game theory. The essence is:
"Do onto others as they do onto you"
Choking is a temporary refusal to upload to another peer, but we can still download from them.
To cooperate, peers upload; to refuse to cooperate, they "choke" the connection to their peers. The principle is to upload to peers who have uploaded to us.
We want several bidirectional connections at the same time and to achieve Pareto Efficiency.
We consider an allocation Pareto Efficient if there is no other allocation in which some individual is better off and no individual is worse off.
Thus the big question is: how do we determine which peers to choke and which to unchoke?
A peer always unchokes a fixed number of its peers (the default is 4).
Current download rates decide which peers to unchoke, using a 20-second rolling average. Because TCP's slow start makes rapid choking and unchoking wasteful, the set of unchoked peers is recalculated only every 10 seconds.
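As a rough sketch (peer names and rates are made up, and a real client would maintain 20-second rolling averages per peer), the periodic unchoke decision looks something like:

```python
UNCHOKE_SLOTS = 4  # default number of unchoked peers

def recompute_unchokes(download_rates):
    """Every 10 seconds, unchoke the peers with the best average
    download rate; everyone else stays choked."""
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    return set(ranked[:UNCHOKE_SLOTS])

# Hypothetical peers with their average download rates (KB/s).
rates = {"alice": 120.0, "bob": 35.5, "carol": 80.2, "dave": 10.1, "eve": 55.0}
print(sorted(recompute_unchokes(rates)))  # ['alice', 'bob', 'carol', 'eve']
```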
If our upload rate is high more peers will allow us to download from them. This means that we can get a higher download rate if we are a good uploader. This is the most important feature of the BitTorrent protocol.
The protocol discourages "free riders" - peers who only download and never upload.
For a peer-to-peer network to be efficient, all peers need to contribute to the network.
BitTorrent also allows one additional unchoked peer, for which the download-rate criterion isn't used.
We call this optimistic unchoking. It checks whether unused connections are better than the ones in use.
We shift the optimistic unchoke every 30 seconds - enough time for the upload (and download) to reach full speed. If this new connection turns out to be better than one of the existing unchoked connections, it will replace it.
The optimistic unchoke is randomly selected.
This also allows peers who only download and never upload to get the file, even if they refuse to cooperate - albeit at a much slower speed.
What happens if all the peers uploading to another peer decide to choke it? We then have to find new peers, but the optimistic unchoking mechanism only checks one unused connection every 30 seconds. To help the download rate recover faster, BitTorrent has snubbing.
If a client hasn’t received anything from a particular peer for 60 seconds, it will presume that it has been ‘snubbed’.
Following the mentality of tit-for-tat, we retaliate and refuse to upload to that peer (except if they become an optimistic unchoke).
The peer will then increase the number of optimistic unchokes to find new connections quicker.
We see that the choking algorithm implemented in BitTorrent favours peers who are kind to us: if we can download fast from them, we let them upload fast from us.
What about when there are no downloads - once we've finished and are only seeding? Then it's impossible to know which peers to unchoke using download rates, so when a download completes, we switch to a new choking algorithm.
This new choking algorithm unchokes peers with the highest upload rate. This ensures that pieces get uploaded faster, and therefore replicated faster.
Peers with good upload rates are also not being served by others.
Trackers are special types of servers that help peers communicate with each other.
Communication in BitTorrent is important. How do we learn what other peers exist?
The tracker knows who owns the file, and how much.
Once a peer-to-peer download has started, communication can continue without a tracker.
Since the creation of the distributed hash table method for trackerless torrents, BitTorrent trackers are largely redundant.
Public trackers are trackers that anyone can use.
The Pirate Bay operated one of the most popular public trackers until disabling it in 2009, opting only for magnet links (discussed soon).
Private trackers restrict use by requiring users to register with the site, most often via an invitation system: to use the tracker, we need an invitation.
Multi-tracker torrents contain multiple trackers in a single torrent file. This provides redundancy if one tracker fails, the other trackers can continue to maintain the swarm for the torrent.
With this configuration, it is possible to have multiple unconnected swarms for a single torrent - which is bad. Some users can connect to one specific tracker while being unable to connect to another. This can create a disjoint set which can impede the efficiency of a torrent to transfer the files it describes.
Earlier, I talked about how the Pirate Bay got rid of trackers and started using trackerless torrents.
When we download a torrent, we get a hash of that torrent. To download the torrent without a tracker, we need to find other peers also downloading the torrent. To do this, we need to use a distributed hash table.
Let's explore Distributed Hash Tables.
Distributed Hash Tables (DHT) give us a dictionary-like interface, but the nodes are distributed across a network. The trick with DHTs is that the node that gets to store a particular key is found by hashing that key.
In effect, each peer becomes a mini-tracker.
Each node (client/server implementing the DHT protocol) has a unique identifier known as the “node ID”. We choose node IDs at random from the same 160-bit space as BitTorrent infohashes.
Infohashes are a SHA-1 hash of the torrent's info dictionary.
We use a distance metric to compare two node IDs or a node ID and an infohash for “closeness”.
Nodes must have a routing table containing the contact information for a few other nodes.
Nodes know about each other in the DHT. They know many nodes with IDs that are close to their own but few with far-away IDs.
The distance metric is XOR and is interpreted as an integer.
$$distance(A, B) = |A \oplus B |$$
Smaller values are closer.
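In Python, the metric is just the `^` operator. Small numbers are used here for readability; real IDs are 160-bit:

```python
def distance(a: int, b: int) -> int:
    """XOR the two node IDs and interpret the result as an integer."""
    return a ^ b

node_id = 0b1011
print(distance(node_id, 0b1001))  # 2: differs only in a low bit, so 'close'
print(distance(node_id, 0b0011))  # 8: differs in a higher bit, so 'further'

# Finding the routing-table node closest to a torrent's infohash:
infohash = 0b1010
routing_table = [0b0001, 0b1000, 0b1110]
print(min(routing_table, key=lambda n: distance(n, infohash)))  # 8 (0b1000)
```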
When a node wants to find peers for a torrent, they use the distance metric to compare the infohash of the torrent with the IDs of the nodes in its routing table or the ID of one node with the ID of another node.
Then they contact the nodes in the routing table closest to the infohash and ask them for the contact information of peers downloading the torrent.
If a contacted node knows about peers for the torrent, it returns the peer contact information with its response. Otherwise, the contacted node must respond with the contact information of the nodes in its routing table closest to the infohash of the torrent.
The original node queries nodes that are closer to the target infohash until it cannot find any closer nodes. After the node exhausts the search, the client then inserts the peer contact information for itself onto the responding nodes with IDs closest to the infohash of the torrent. In the future, other nodes can easily find us.
The return value for a query for peers includes an opaque value known as the “token.” For a node to announce that its controlling peer is downloading a torrent, it must present the token received from the same queried node in a recent query for peers.
When a node attempts to “announce” a torrent, the queried node checks the token against the querying node’s IP address. This is to prevent malicious hosts from signing up other hosts for torrents.
The querying node returns the token to the same node that they receive the token from. We must accept tokens for a reasonable amount of time after they have been distributed. The BitTorrent implementation uses the SHA-1 hash of the IP address concatenated onto a secret that changes every five minutes and tokens up to ten minutes old are accepted.
Every node maintains a routing table of known good nodes. We use the routing table starting points for queries in the DHT. We return nodes from the routing table in response to queries from other nodes.
Not all nodes we learn about are equal. Some are “good” and some are not. Many nodes using the DHT can send queries and receive responses, but cannot respond to queries from other nodes. Each node’s routing table must contain only known good nodes.
A good node is a node that has responded to one of our queries within the last 15 minutes. A node is also good if it has ever responded to one of our queries and has sent us a query within the last 15 minutes. After 15 minutes of inactivity, a node becomes questionable. Nodes become bad when they fail to respond to multiple queries in a row. Nodes that we know are good are given priority over nodes with an unknown status.
The routing table covers the entire node ID space from 0 to 2^{160}. We subdivide the routing table into “buckets” that each cover a portion of the space.
An empty table has one bucket with an ID space range of min=0, max=2^{160}.
An empty table has only one bucket so any node must fit within it. Each bucket can only hold K nodes, currently eight, before becoming “full.”
When a bucket is full of known good nodes, no more nodes may be added unless our own node ID falls within the bucket's range. In that case, the bucket is replaced by two new buckets, each covering half of the old range, and the nodes from the old bucket are distributed between the new ones.
For a new table with only one bucket, we always split the full bucket into two new buckets covering the ranges 0..2^{159} and 2^{159}..2^{160}.
When the bucket is full of good nodes, we simply discard the new node. When nodes in the bucket become bad (if they do) we replace them with a new node.
When a bucket contains questionable nodes that haven't been heard from in the last 15 minutes, the least recently seen node is pinged. If it responds, we move on to the next least recently seen node, repeating until we find a node that fails to respond. If every node responds, the bucket is considered good.
When we do find an unresponsive node, we try pinging it one more time before discarding it and replacing it with a new good node.
Each bucket should maintain a “last changed” property to show how “fresh” the contents are.
When a node in a bucket is pinged and responds, or a node is added to a bucket, or a node is replaced with another node, the bucket’s last changed property is updated.
Buckets are refreshed if the last changed property has not been updated in the last 15 minutes.
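Here's a toy sketch of the bucket-splitting rule described above, using an 8-bit ID space instead of 160 bits so the numbers stay readable (the IDs and capacity handling are illustrative):

```python
K = 8  # bucket capacity ("currently eight")

class Bucket:
    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi  # covers node IDs in [lo, hi)
        self.nodes = []

def insert(buckets, node_id, my_id):
    bucket = next(b for b in buckets if b.lo <= node_id < b.hi)
    if len(bucket.nodes) < K:
        bucket.nodes.append(node_id)
    elif bucket.lo <= my_id < bucket.hi:
        # Full bucket containing our own ID: split it in half and retry.
        mid = (bucket.lo + bucket.hi) // 2
        left, right = Bucket(bucket.lo, mid), Bucket(mid, bucket.hi)
        for n in bucket.nodes:
            (left if n < mid else right).nodes.append(n)
        buckets.remove(bucket)
        buckets += [left, right]
        insert(buckets, node_id, my_id)
    # Otherwise the new node is simply discarded.

# Toy 8-bit ID space: an empty table has one bucket covering everything.
buckets = [Bucket(0, 2 ** 8)]
my_id = 5
for node in range(10, 100, 10):  # the 9th insert forces two splits
    insert(buckets, node, my_id)
print(len(buckets))  # 3
```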
Few attacks on the BitTorrent network exist. Everything is public. Our IP address, what we’re downloading - everything.
Why attack an open network?
Only 7 entries are listed on Exploit-DB - a database of known exploits against a service. And most of them relate to specific clients.
The principal attack on the BitTorrent network is to stop piracy. We’ve gone this far without talking about piracy, but it is often synonymous with BitTorrent.
The main attack on BitTorrent is Torrent Poisoning.
This attack aims to get the IP addresses of peers pirating content or to poison the content in some way.
Madonna’s American Life album release is an example of content poisoning. Before the release, tracks were released of similar length and file size. The tracks featured a clip of Madonna saying:
"What the fuck do you think you're doing?"
Followed by a few minutes of silence.
Here are some methods of poisoning a torrent.
The index allows users to locate the IP addresses of peers with the desired content. This method of attack makes searching for peers difficult.
The attacker inserts a large amount of invalid information into the index to prevent users from finding the correct information.
The idea is to slow down the download, by having the peer try to download pieces from an invalid peer.
They insert corrupted versions of a file into the network.
Imagine 500 copies of a file and only 2 of them being the real file, this deters pirates from finding the real file.
Most websites with lists of torrents have a voting system. This deters the attack, as the top of the search results fills with non-corrupted files.
In Game Dev Tycoon, the developers themselves released a cracked version of the game to piracy sites before anyone else could upload one. Unbeknownst to pirates, the file was altered: winning the game is impossible in the pirated version. Everything else was perfect.
Most popular torrents are released by individuals or groups who have built up a rapport over many years. On private trackers, individuals can be identified; poisoned torrents are quickly labelled and the poster can be banned.
Or, on public trackers, downloading torrents made by trusted groups is preferable. After all, would you prefer to download Ubuntu from the Ubuntu team, or the user xxx-HACKER-ELITE-GHOST-PROTOCOL-xxx?
On public trackers, if a torrent is poisoned the torrent is reported and removed.
The simplest way to defend against a BitTorrent attack is to use an IP address not associated with you. Whether this is through a VPN or some other service.
Hi! Bit of an unusual post. I'm writing about my first open source tool, Ciphey!
Ciphey answers the question:
"What does this encrypted text say?"
First, an important distinction.
I define encryption as:
Anything that you cannot read fluently
When normally encryption is defined as:
Text that has an algorithm applied so no one apart from the intended recipients can read it
My tool, Ciphey, isn't made for cryptographers. It's made for your aunt Linda who has never heard of the word cryptography but works in tech, so knows a little about some things.
Because of my target audience and my personal definition of encryption, Ciphey can automatically decrypt all of these:
Now, I know that you're probably cringing. SHA-1 can't be decrypted! And decrypting binary? That's not encryption, it's encoding!
Refer back to my definition and target market.
Okay, so Ciphey is cool. You input encrypted text, and Ciphey decrypts it. How?
First, Ciphey is made up of 2 core components.
Language Checker aims to answer the question:
"Is this text English?"
(and in the future, other languages)
It does this by utilising two popular algorithms.
The first is Chi Squared.
Chi Squared answers:
"How close is the frequency distribution of this text to the frequency distribution of English?"
Check out this cool video on the general idea by VSauce:
Chi-squared is very fast, but the accuracy isn't that good.
Onto the next algorithm.
What better way to check whether something is English than to loop through an entire English dictionary and count how many dictionary words appear in the text?
The problem is that this is very slow. My dictionary is 500k words. We can't loop through all of that every time we want to check whether something is English.
This is where I use both of these algorithms, combined.
Chi squared tells me when something looks like English, and the dictionary checker tells me when something consists primarily of English words.
Both together answer the question "is this English?" really well.
It's also a lot faster than running the dictionary check on its own.
Chi squared keeps a running average of all the scores it comes across. If it sees a score more than one standard deviation below that average, the text goes into stage 2 of the algorithm - the dictionary checker.
If 35% of the words in the string are English, it's likely to be English.
35%, because slang, passwords, usernames, or program names can all appear in the text.
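A minimal sketch of the two-stage check. The letter frequencies are standard published approximations, and the tiny dictionary here stands in for the real 500k-word list:

```python
import string
from collections import Counter

# Approximate relative frequencies of letters in English text.
ENGLISH_FREQ = {
    'a': 0.082, 'b': 0.015, 'c': 0.028, 'd': 0.043, 'e': 0.127, 'f': 0.022,
    'g': 0.020, 'h': 0.061, 'i': 0.070, 'j': 0.002, 'k': 0.008, 'l': 0.040,
    'm': 0.024, 'n': 0.067, 'o': 0.075, 'p': 0.019, 'q': 0.001, 'r': 0.060,
    's': 0.063, 't': 0.091, 'u': 0.028, 'v': 0.010, 'w': 0.024, 'x': 0.002,
    'y': 0.020, 'z': 0.001,
}

TINY_DICTIONARY = {"hello", "this", "is", "a", "test", "message"}

def chi_squared(text: str) -> float:
    """Stage 1: distance between the text's letter distribution and
    English's. Lower scores look more like English."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    score = 0.0
    for letter, freq in ENGLISH_FREQ.items():
        expected = freq * len(letters)
        score += (counts[letter] - expected) ** 2 / expected
    return score

def dictionary_check(text: str, threshold: float = 0.35) -> bool:
    """Stage 2: are at least 35% of the words real English words?"""
    words = text.lower().split()
    hits = sum(1 for word in words if word in TINY_DICTIONARY)
    return hits / len(words) >= threshold

english = "hello this is a test message"
gibberish = "qqq zzz xxw vvv"
print(chi_squared(english) < chi_squared(gibberish))  # True
print(dictionary_check(english), dictionary_check(gibberish))  # True False
```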
Yes, but I like to call it Brute Force Enhanced.
Ciphey uses a deep neural network (DNN) trained on Harry Potter to guess how likely a text is to be encrypted using a method.
As an example, the DNN might predict that the text is 81% likely to be SHA1, 1% likely to be Caesar and so on.
Ciphey then runs all of the decryption modules using multi-threading in the order of most likely to least likely.
If at any moment a decryption returns True, as it has found the plain-text, Ciphey stops and returns the answer.
This method of brute-force enhanced as well as language checker means Ciphey is very fast.
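Ciphey's real scheduler is multi-threaded; here's a simplified, single-threaded sketch of the idea, with made-up module names and probabilities (only the Caesar module, a fixed ROT13 here, ever "succeeds"):

```python
import codecs

def crack(ciphertext, predictions, modules):
    """Try decryption modules from most to least likely, per the DNN's
    predicted probabilities; stop at the first one that yields plaintext."""
    for name in sorted(predictions, key=predictions.get, reverse=True):
        result = modules[name](ciphertext)
        if result is not None:  # in real Ciphey, Language Checker decides this
            return name, result
    return None, None

modules = {
    "sha1": lambda text: None,    # a hash can't be reversed, so always fails
    "caesar": lambda text: codecs.decode(text, "rot13"),
    "binary": lambda text: None,
}
predictions = {"sha1": 0.81, "caesar": 0.10, "binary": 0.09}
print(crack("uryyb", predictions, modules))  # ('caesar', 'hello')
```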
Decryption modules don't just return True. I have an internal data packet that's passed around.
{
"lc": self.lc,
"IsPlaintext?": True,
"Plaintext": translated,
"Cipher": "Caesar",
"Extra Information": "The rotation used is {counter}"
}
self.lc is the Language Checker. When a decryption module is done, it passes its Language Checker back to the parent. The parent then merges the LC it received into the LC it holds as an attribute. This keeps the running average, and therefore the overall program, accurate.
Sure, it's cool to encrypt text using 1 level of encryption. But what if someone uses 2 levels? Ciphey needs to solve this.
The way this would work is by using files.
If the user enters the level command:
ciphey -l 3 "ha skjari isa"
Then Ciphey will run through 3 levels of decryption.
Every decryption Ciphey makes will be stored in a file. Ciphey will then have something like this:
for x in range(0, levels):
    with open('decryption_list.txt', 'r') as decryptions:
        for word in decryptions:
            one_level_of_decryption(word)
Ciphey supports a lot, but not enough for it to be considered super cool. I need to work on this.
Today, we're going to do a technical deep-dive into how Tor really works.
No mention of how to access Tor, no mention of what might be on Tor. This is how Tor works.
Without speculation and without exaggeration of what Tor is. Just a deep-dive into the technical stuff of how Tor works.
This article is designed to be read by anyone, with ZERO knowledge on networking or Tor.
Let's dive right in.
The United States Naval Research Laboratory developed The Onion Routing protocol (Tor) to protect U.S. intelligence communications online. Ironically, Tor has seen widespread use by everyone - even the organisations the U.S. Navy fights against.
You may know Tor as the hometown of online illegal activities - a place where you can buy any drug you want, a place for all things illegal. But Tor is much larger than what the media makes it out to be. According to King's College London, much of Tor is legal.
When you normally visit a website, your computer makes a direct TCP connection with the website's server. Anyone monitoring your internet could read the TCP packet. They can find out what website you're visiting and your IP address. As well as what port you're connecting to.
If you're using HTTPS, no one will know what the message said. But, sometimes all an adversary needs to know is who you're connecting to.
Using Tor, your computer never communicates with the server directly. Tor creates a twisted path through 3 Tor nodes, and sends the data via that circuit.
The core principle of Tor is onion routing which is a technique for anonymous & secure communication over a public network. In onion routing messages are encapsulated in several layers of encryption.
"Onions have layers" - Shrek
So does a message going through Tor. Each layer in Tor is a layer of encryption; we add several layers of encryption to a Tor message, as opposed to just one.
This is why it's called The Onion Routing Protocol, because it adds layers at each stage.
The resulting onion (fully encapsulated message) is then transmitted through a series of computers in a network (called onion routers) with each computer peeling away a layer of the ‘onion’. This series of computers is called a path. Each layer contains the next destination - the next router the packet has to go to. When the final layer is decrypted you get the plaintext (non-encrypted message).
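Here's a toy sketch of that layering. The XOR "cipher" stands in for real encryption, and the node names and keys are invented, but it shows how each router learns only the next destination:

```python
def xor_crypt(data: bytes, key: int) -> bytes:
    """Toy stand-in for real encryption; XOR is its own inverse."""
    return bytes(b ^ key for b in data)

def build_onion(message: bytes, path, destination: bytes) -> bytes:
    """Wrap the message in one encryption layer per node, innermost first.
    Each layer records only the NEXT destination."""
    onion, next_hop = message, destination
    for name, key in reversed(path):
        onion = xor_crypt(next_hop + b"|" + onion, key)
        next_hop = name
    return onion

def peel(onion: bytes, key: int):
    """What a single onion router does: strip one layer, learn only the
    next hop, and forward the still-encrypted remainder."""
    decrypted = xor_crypt(onion, key)
    next_hop, rest = decrypted.split(b"|", 1)
    return next_hop, rest

path = [(b"guard", 7), (b"middle", 42), (b"exit", 99)]  # hypothetical circuit
onion = build_onion(b"GET /", path, destination=b"server")

hop, onion = peel(onion, 7)    # guard learns only: forward to 'middle'
print(hop)                      # b'middle'
hop, onion = peel(onion, 42)   # middle learns only: forward to 'exit'
print(hop)                      # b'exit'
hop, onion = peel(onion, 99)   # exit learns the destination and plaintext
print(hop, onion)               # b'server' b'GET /'
```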
The original author remains anonymous because each node in the network is only aware of the preceding and following nodes in the path (except the first node that does know who the sender is, but doesn’t know the final destination).
This has led to attacks where large organisations with expansive resources run servers to attempt to be the first and last nodes in the network. If the organisation's server is the first node, it knows who sent the message. If the organisation server is the last node, it knows the final destination and what the message says.
Now we have a basic overview of Tor, let's start exploring how each part of Tor works. Don't worry if you're confused, every part of Tor will be explained using gnarly diagrams 😁✨
Onion Routing is a distributed overlay network designed to anonymise TCP-based applications like web browsing, secure shell and instant messaging.
Clients choose a path through the network and build a circuit where each onion router in the path knows the predecessor and the successor, but no other nodes in the circuit. Paths and circuits are synonyms.
The original author (the question mark on the far left) remains anonymous, unless you're the first node in the path, as you know who sent you the packet.
No one knows what data is being sent until it reaches the last node in the path; who knows the data but doesn't know who sent it. The second to last node in the path doesn't know what the data is, only the last node in the path does.
This has led to attacks whereby large organisations with expansive resources create Tor servers which aim to be the first and last onion routers in a path. If the organisation can do this, they get to know who sent the data and what data was sent, effectively breaking Tor.
It's incredibly hard to do this without being physically close to the organisation's servers; we'll explore this more later.
Throughout this article I'll be using Netflix as a normal service (Bob) and Amazon Prime Video as the adversary (Eve). In the real world, this is incredibly unlikely to be the case. I'm not here to speculate on what organisations might want to attack Tor, so I've used 2 unlikely examples to avoid the political side of it.
Each packet flows down the network in fixed-size cells. These cells have to be the same size so none of the data going through the Tor network looks suspiciously big.
These cells are unwrapped by a symmetric key at each router and then the cell is relayed further down the path. Let's go into Tor itself.
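To make the fixed-size cell idea concrete, here's a minimal Python sketch of splitting a payload into padded cells. The 512-byte size matches Tor's classic cell size, but the function name and padding scheme are invented for this illustration:

```python
CELL_SIZE = 512  # Tor's classic fixed cell size, in bytes

def to_cells(payload: bytes) -> list:
    """Split a payload into fixed-size cells, zero-padding the last one."""
    cells = []
    for i in range(0, len(payload), CELL_SIZE):
        chunk = payload[i:i + CELL_SIZE]
        cells.append(chunk.ljust(CELL_SIZE, b"\x00"))
    return cells

cells = to_cells(b"x" * 1000)
print(len(cells), len(cells[0]), len(cells[1]))  # 2 512 512
```

Because every cell is exactly the same size, an observer can't single out a message by how big it is.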
There is strength in numbers
Tor needs a lot of users to create anonymity. If Tor were hard to use, new users wouldn't adopt it so quickly; with fewer users, Tor becomes less anonymous. By this reasoning it's easy to see that usability isn't just a design choice of Tor but a security requirement.
If Tor isn't usable or designed nicely, it won't be used by many people. If it's not used by many people, it's less anonymous.
Tor has had to make some design choices that may not improve security but improve usability with the hopes that an improvement in usability is an improvement in security.
Tor is not a completely decentralised peer-to-peer system like many people believe it to be. If it was completely peer to peer it wouldn’t be very usable. Tor requires a set of directory servers that manage and keep the state of the network at any given time.
Tor is not secure against end to end attacks. An end to end attack is where an entity has control of both the first and last node in a path, as talked about earlier. This is a problem that cyber security experts have yet to solve, so Tor does not have a solution to this problem.
Tor does not hide the identity of the sender.
In 2013 during the Final Exams period at Harvard a student tried to delay the exam by sending in a fake bomb threat. The student used Tor and Guerrilla Mail (a service which allows people to make disposable email addresses) to send the bomb threat to school officials.
The student was caught, even though he took precautions to make sure he wasn’t caught.
Guerrilla Mail sends an originating IP address header along with the email that's sent, so the receiver knows where the original email came from. With Tor, the student expected the IP address to be scrambled, but the authorities knew it came from a Tor exit node (Tor keeps a list of all nodes in the directory service), so they simply looked for people who were accessing Tor (within the university) at the time the email was sent.
Tor isn't a perfect anonymising service, but it is a service that can encrypt all traffic from A to B (so long as an end-to-end attack isn't performed). Tor is also incredibly slow, so using it for Netflix isn't a good use case.
When you use a VPN, the VPN forwards all your internet traffic to the appropriate destination. When it does so, the VPN encrypts your traffic. All your internet service provider can see is encrypted traffic heading from your computer to the VPN.
They can’t see inside your packets. They don’t know who you’re talking to - other than the VPN.
VPNs aren’t private in the same way that Tor is. VPNs protect you against ISPs or local adversaries (ones monitoring your laptop’s WiFi), but they don’t protect you from the VPN provider itself.
The VPN is the man in the middle. It knows who you are and who you’re talking to, and depending on the traffic, it can also decrypt your packets, meaning it knows everything. With a VPN, you have to trust the provider completely.
With Tor, you don’t. One rogue node is survivable: if one of the nodes in our graph earlier were an adversary, it would only know either our IP address or our data packet, not both. Tor protects you from the Tor network itself; it doesn’t expect you to trust it.
No one, apart from you, should know the IP addresses of the origin and destination - and know the contents of the message.
Now that we have a good handle on what Tor is, let's explore onion routing.
Given the network above, we are going to simulate what Tor does. Your computer is the one on the far left, and you're sending a request to watch Stranger Things on Netflix (because what else is Tor used for 😉). This path of nodes is called a circuit. Later on, we're going to look into how circuits are made and how the encryption works. But for now we're trying to generalise how Tor works.
We start off with the message (we haven't sent it yet). We need to encrypt the message N times (where N is how many nodes are in the path). We encrypt it using AES, a symmetric key crypto-system. The key is agreed using Diffie-Hellman. Don't worry, we'll discuss all of this later. There are 4 nodes in the path (minus your computer and Netflix), so we encrypt the message 4 times.
Our packet (onion) has 4 layers. Blue, purple, orange, and teal. Each colour represents one layer of encryption.
We send the onion to the first node in our path. That node then removes the first layer of encryption.
Each node in the path knows what the key to decrypt their layer is (via Diffie-Hellman). Node 1 removes the blue layer with their symmetric key (that you both agreed on).
Node 1 knows you sent the message, but the message is still encrypted by 3 layers of encryption, it has no idea what the message is.
As it travels down the path, more and more layers are stripped away. The next node does not know who sent the packet. All it knows is that Node 1 sent them the packet, and it's to be delivered to Node 3.
Now Node 3 strips away a layer.
The final node knows what the message is and where it’s going, but it doesn’t know who sent it. All it knows is that Node 3 sent them the message, but it doesn't know about anyone else in the path. One of the key properties here is that once a node decrypts a layer, it cannot tell how many more layers there are to decrypt. It could be as small as 1 or 2 or as large as 200 layers of encryption.
Now there's no way Amazon can find out you watch Netflix! Netflix sends back a part of Stranger Things.
Let's see how it works in reverse.
Node 4 adds its layer of encryption now. It doesn't know who originally made the request, all it knows is that Node 3 sent the request to them so it sends the response message back to Node 3.
And so on for the next few nodes.
Now the response packet is fully encrypted.
Now the packet is fully encrypted, the only one who still knows what the message contains is Node 4. The only one who knows who made the message is Node 1. Now that we have the fully encrypted response back, we can use all the symmetric keys to decrypt it.
You might be thinking "I've seen snails 🐌 faster than this" and you would be right. This protocol isn't designed for speed, but at the same time it has to care about speed.
The algorithm could be much slower, but much more secure (using entirely public key cryptography instead of symmetric key cryptography) but the usability of the system matters. So yes, it's slow. No it's not as slow as it could be. But it's all a balancing act here.
The encryption used is normally AES with the key being shared via Diffie-Hellman.
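As an illustration of the layering itself, here's a toy Python sketch. The hash-based keystream below stands in for AES purely for demonstration (it is not secure, and all names are made up); in real Tor each hop's symmetric key is agreed via Diffie-Hellman:

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # toy keystream from repeated hashing; stands in for AES, NOT secure
    out, block = b"", key
    while len(out) < length:
        block = hashlib.sha256(block).digest()
        out += block
    return out[:length]

def crypt_layer(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying it twice removes the layer
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# one symmetric key per node in the circuit
keys = [b"node1-key", b"node2-key", b"node3-key", b"node4-key"]
message = b"watch stranger things"

# the client adds all four layers, innermost (exit node's) first
onion = message
for key in reversed(keys):
    onion = crypt_layer(onion, key)

# each node in the circuit peels exactly one layer
for key in keys:
    onion = crypt_layer(onion, key)

assert onion == message
```

Real Tor uses AES in counter mode per hop rather than this toy cipher, but the wrap-then-peel structure is the same.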
The paths Tor creates are called circuits. Let's explore how Tor chooses what nodes to use in a circuit.
Each machine, when it wants to create a circuit, chooses the exit node first, followed by the other nodes in the circuit. Tor circuits are always 3 nodes. Increasing the length of the circuit does not create better anonymity. If an attacker owns the first and last nodes in the network, you can have 1500 nodes in the circuit and it still wouldn't make you more secure.
When Tor selects the exit node, it selects it following these principles:
Does the user's torrc (the configuration file of Tor) have settings about which exit nodes not to choose? All paths in the circuit obey these rules:
If the same node were chosen twice, it would be guaranteed to sit in either the guard position (the node you enter at) or the exit position, both dangerous positions, and in 1 of the 3 possible placements it would be both the guard and the exit, which is even more dangerous. We want to avoid the entry / exit attacks.
Operators who run more than 1 Tor node can choose to signify their nodes as 'family'. This means that the nodes have all the same parent (the operator of their network). This is again a countermeasure against the entry / exit attacks, although operators do not have to declare family if they wish. If they want to become a guard node (discussed soon) it is recommended to declare family, although not required.
Subnets define networks. IPv4 addresses are made up of 4 octets of 8 bits each, 32 bits in total. As an example, Google's IP address in binary is:
01000000.11101001.10101001.01101010
The first 16 bits (the /16 subnet) are 01000000.11101001, which means that Tor does not choose any nodes whose addresses start with the same 16 bits as this IP address. Again, a counter-measure to the entry / exit attacks.
If subnets sound confusing, I've written this Python code to help explain them:
# IP addresses are in binary here, not the usual base 10.
# Subnets are usually powers of 2; this one is 2^4 = 16 bits.
IP = "01000000.11101001.10101001.01101010"
subnet = 16

# this will store the subnet address once we find it
subnet_ip = []
IP_list = list(IP)
counter = 0

for i in IP_list:
    # we want to end the loop when we reach the subnet number
    if counter >= subnet:
        break
    # the IP address separates each octet of bits with full stops;
    # we don't want to count a full stop as a bit,
    # but we do want to include it in the final subnet
    if i == ".":
        subnet_ip.append(".")
        continue
    else:
        # else it is a bit, so we append it and increment the counter
        subnet_ip.append(i)
        counter = counter + 1

print("Subnet is " + ''.join(subnet_ip))
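For comparison, here's a shorter sketch of the same /16 check using Python's standard `ipaddress` module on ordinary dotted-decimal addresses (the function name and example addresses are made up for illustration):

```python
import ipaddress

def same_16_subnet(a: str, b: str) -> bool:
    # two IPv4 addresses share a /16 if b falls inside a's /16 network
    network = ipaddress.ip_network(a + "/16", strict=False)
    return ipaddress.ip_address(b) in network

print(same_16_subnet("64.233.169.106", "64.233.1.1"))  # True
print(same_16_subnet("64.233.169.106", "64.1.1.1"))    # False
```

Tor would refuse to put two nodes in the same circuit if this check returned True for their addresses.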
Non-running means the node currently isn't online; you don't want to pick nodes that aren't online. Non-valid means that some configuration in the node's torrc is wrong; you don't want to accept strange configurations in case they are trying to hack or break something.
A guard node is a privileged node because it sees the real IP of the user. It’s ‘expensive’ to become a guard node (maintain a high uptime for weeks and have good bandwidth).
This is possible for large companies who have 99.9% uptime and high bandwidth (such as Netflix). Tor has no way to stop a powerful adversary from registering a load of guard nodes. Right now, Tor is configured to stick with a single guard node for 12 weeks at a time, so you choose 4 new guard nodes a year.
This means that if you use Tor once to watch Amazon Prime Video, it is relatively unlikely for Netflix to be your guard node. Of course, the more guard nodes Netflix creates, the more likely it becomes. And even if Netflix suspects you are connecting to the Tor network to watch Amazon Prime Video, it may have to wait weeks for its suspicions to be confirmed, unless it attacks your current guard node and takes it over.
Becoming a guard node is relatively easy for a large organisation. Becoming the exit node is slightly harder, but still possible. We have to assume that the large organisation has infinite computational power to be able to do this. The solution is to make the attack highly expensive with a low rate of success.
The more regular users Tor has, the harder it is for a large organisation to attack it. If Netflix controls $\frac{50}{100}$ nodes in the network:
The chance of you choosing a guard node from Netflix is 50%.
If suddenly 50 more normal user nodes join then that's $\frac{50}{150}$, reducing the probability of Netflix owning a guard node (and thus, a potential attack) and making it even more expensive.
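That arithmetic can be sketched in a couple of lines (the function name is mine, not from Tor):

```python
def p_adversary_guard(adversary_nodes: int, total_nodes: int) -> float:
    # chance a uniformly chosen guard belongs to the adversary
    return adversary_nodes / total_nodes

print(p_adversary_guard(50, 100))            # 0.5
print(round(p_adversary_guard(50, 150), 3))  # 0.333
```

Every honest node that joins shrinks the adversary's share without the adversary spending anything.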
There is strength in numbers within the Tor service.
When a Tor client starts up for the first time, it chooses a small & random set of guard nodes. For the next few months, it makes sure each circuit is using one of these pre-selected nodes as its guard node.
The official proposal from the Tor documentation states:
1. Introduction and motivation
Tor uses entry guards to prevent an attacker who controls some
fraction of the network from observing a fraction of every user's
traffic. If users chose their entries and exits uniformly at
random from the list of servers every time they build a circuit,
then an adversary who had (k/N) of the network would deanonymize
F=(k/N)^2 of all circuits... and after a given user had built C
circuits, the attacker would see them at least once with
probability 1-(1-F)^C. With large C, the attacker would get a
sample of every user's traffic with probability 1.
To prevent this from happening, Tor clients choose a small number
of guard nodes (currently 3). These guard nodes are the only
nodes that the client will connect to directly. If they are not
compromised, the user's paths are not compromised.
But attacks remain. Consider an attacker who can run a firewall
between a target user and the Tor network, and make many of the
guards they don't control appear to be unreachable. Or consider
an attacker who can identify a user's guards, and mount
denial-of-service attacks on them until the user picks a guard
that the attacker controls.
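The formulas in the quoted proposal are easy to check numerically. A minimal sketch, with variable names following the quote:

```python
def p_deanonymised(k: int, N: int, C: int) -> float:
    # F = (k/N)^2: chance one uniformly chosen circuit has both a
    # malicious entry and a malicious exit
    F = (k / N) ** 2
    # chance at least one of C circuits is fully observed
    return 1 - (1 - F) ** C

# an adversary with 10% of the network, a client building 1000 circuits
print(p_deanonymised(10, 100, 1000))
```

With 10% of the network and 1000 circuits, the probability of observing at least one circuit end-to-end is already very close to 1, which is exactly why clients pin a small set of guards instead of choosing entries uniformly.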
Guard node pinning is important because of Tor’s threat model. Tor assumes that it may only take a single opening for an adversary to work out who you are talking to, or who you are. Since a single vulnerable circuit can destroy your anonymity, Tor tries to minimise the probability that it will ever construct one or more vulnerable circuits.
Tor guard nodes can be DOS’d, or an attacker could have a majority share of guard nodes on the internet when you connect to try and get you. By guard node pinning, it aims to make this much harder.
If an attacker works out your guard nodes and shuts them down (forcing you to connect to their guard nodes), or if you connect to a guard node controlled by an adversary, Tor has algorithms in place to try to detect this, outlined here.
The state of the Tor network is tracked and publicised by a group of 9 trusted servers (as of 2019) known as directory nodes. Each of which is controlled by a different organisation.
Each node is run by a separate organisation because this provides redundancy and distributes trust. The integrity of the Tor network relies on the honesty and correctness of the directory nodes, so making the network resilient and distributing trust is critical.
Directory nodes maintain a list of currently running relays (publicly listed node in the Tor network). Once per hour directory nodes publish a consensus together. The consensus is a single document compiled and voted on by each directory node. It ensures that all clients have the same information about the relays that make up Tor.
When a Tor user (a client or a node) wants to know the current state of the network, it asks a directory node. As we’ll see later, directory nodes are essential for all parts of Tor, especially in hidden services.
Relays keep the directory nodes up to date. They send directory node(s) a notification whenever they come online or are updated. Whenever a directory node receives a notification, it updates its personal opinion on the current state of the Tor network. All directory nodes then use these opinions to form a consensus of the network.
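To make the voting idea concrete, here's a toy majority-vote consensus. The real consensus protocol votes on many more attributes than mere presence, and all names here are invented:

```python
from collections import Counter

def form_consensus(opinions):
    # each directory node submits the set of relays it believes are
    # running; a relay enters the consensus if a majority list it
    votes = Counter()
    for relay_set in opinions:
        votes.update(relay_set)
    majority = len(opinions) // 2 + 1
    return {relay for relay, n in votes.items() if n >= majority}

opinions = [
    {"relayA", "relayB", "relayC"},
    {"relayA", "relayB"},
    {"relayA", "relayC", "relayD"},
]
print(sorted(form_consensus(opinions)))  # ['relayA', 'relayB', 'relayC']
```

A single lying directory node (like the one listing relayD) can't push a relay into the consensus on its own, which is the whole point of voting.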
Let’s now look at what happens when disagreements arise in the directory services when forming a consensus.
The first version of Tor took a simple approach to conflict resolution. Each directory node gave the state of the network as it personally saw it. Each client believed whichever directory node it had spoken to recently. There is no consensus here among all directory nodes.
In Tor, this is a disaster. There was nothing ensuring that directory nodes were telling the truth. If an adversary took over one directory node, they would be able to lie about the state of the network.
If a client asked this adversary controlled directory for the state of the network, it’d return a list. This list contains only nodes that the adversary controlled. The client would then connect to these adversary nodes.
The second version of the Tor directory system made this attack harder. Instead of asking a single directory node for its opinion, clients asked every directory node and combined their opinions into a consensus. But, clients could form differing views on the network depending on when they had last spoken to each directory node. This gave way to statistical information leakage - not as bad as Tor 1.0. Besides, every client had to talk to every directory node, which took time and was expensive.
The third and current version of the directory system moved the responsibility of calculating a consensus from clients to directory nodes.
I’m not sure if you saw it earlier, but I made the distinction between nodes in the directory services and nodes that aren’t.
If a repressive state wants to block Tor, it uses the directory nodes. Directory nodes keep up-to-date lists of Tor relay nodes and are publicly available for anyone to download.
The state can query a directory node for a list of active Tor relays, and censor all traffic to them.
Tor keeps an up-to-date listing of countries where it is possibly blocked (censored) if you're interested.
Tor helps its users circumvent the censorship by hiding the fact they are using Tor. They do this through a proxy known as a Bridge Node. Tor users send their traffic to the bridge node, which forwards the traffic onto the user’s chosen guard nodes.
The full list of Bridge nodes is never published, making it difficult for states to completely block Tor. You can view some bridge nodes here. If this doesn’t work, Tor suggests:
Another way to get bridges is to send an email to bridges@torproject.org. Please note that you must send the email using an address from one of the following email providers: Riseup or Gmail.
It’s possible to block Tor another way. Censoring states can use Deep Packet Inspection (DPI) to analyse the shape, volume, and feel of each packet. Using DPI, states can recognise Tor traffic even when it's heading to unknown IP addresses or is encrypted.
To circumvent this, Tor developers have made Pluggable Transports (PT). These transform Tor traffic flow between the client and the bridge. In the words of Tor’s documentation:
This way, censors who monitor traffic between the client and the bridge will see innocent-looking transformed traffic instead of the actual Tor traffic. External programs can talk to Tor clients and Tor bridges using the pluggable transport API, to make it easier to build interoperable programs.
Ever heard those rumours "there are websites on the dark-web, on Tor that when you visit them you'll see people doing nasty things, selling illegal things or worse: watching The Hangover Part 3"
When people talk about these websites they are talking about Tor Hidden Services.
These are a wild concept and honestly deserve an entire blogpost on their own. Hidden services are servers, like any normal computer server.
Except in a Tor Hidden Service it is possible to communicate without the user and server knowing who each other are.
The device (the question mark) knows that it wants to access Netflix, but it doesn't know anything about the server and the server doesn't know anything about the device that's asked to access it. This is quite confusing, but don't worry, I'm going to explain it all with cool diagrams. ✨
When a server is set up on Tor to act as a hidden service, the server sends a message to some selected Onion Routers asking if they want to be an introduction point to the server. It is entirely up to the server as to who gets chosen as an introduction point, although usually they ask 3 routers to be their introduction points.
The introduction points know that they are going to be introducing people to the server.
The server will then create something called a hidden service descriptor which has a public key and the IP address of each introduction point. It will then send this hidden service descriptor to a distributed hash table which means that every onion router (not just the introduction points) will hold some part of the information of the hidden service.
If you try to look up a hidden service, the node responsible for that part of the hash table will give you the full hidden service descriptor, which contains the addresses of the hidden service's introduction points.
The key for this hash table is the onion address and the onion address is derived from the public key of the server.
The idea is that the onion address isn’t publicised over the whole Tor network but instead you find it another way like from a friend telling you or on the internet (addresses ending in .onion).
The way that the distributed hash table is programmed means that the vast majority of the nodes won't know what the descriptor is for a given key.
So almost every single onion router will have minimal knowledge about the hidden service unless they explicitly want to find it.
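Here's a toy sketch of that idea: the descriptor lives only at the router the onion address hashes to, so every other router knows nothing about it. The hashing scheme below is invented for illustration; real Tor uses a more involved distributed scheme:

```python
import hashlib

routers = [f"router{i}" for i in range(8)]

def responsible_router(onion_address: str) -> str:
    # hash the onion address to pick which router stores the descriptor
    digest = hashlib.sha1(onion_address.encode()).digest()
    return routers[digest[0] % len(routers)]

descriptor_store = {}  # (router, onion_address) -> descriptor

def publish(onion_address: str, descriptor: dict) -> None:
    key = (responsible_router(onion_address), onion_address)
    descriptor_store[key] = descriptor

def lookup(onion_address: str):
    key = (responsible_router(onion_address), onion_address)
    return descriptor_store.get(key)

publish("abc123.onion", {"introduction_points": ["ip1", "ip2", "ip3"]})
print(lookup("abc123.onion"))
```

Only a client that already knows the onion address can compute where to look, which is why the address isn't broadcast across the network.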
Let's say someone gave you the onion address. You request the descriptor from the hash table and get back, let's say, 4 or 5 IP addresses of the service's introduction points. You pick one at random, let's say the top one.
You’re going to ask the introduction point to introduce you to the server. Instead of making a connection directly to the server, you pick a rendezvous point at random in the network from a given set of Onion Routers.
You then make a circuit to that rendezvous point and send it a message asking it to introduce you to the server via the introduction point you picked. You also send the rendezvous point a one-time password (in this example, let's use 'Labrador').
The rendezvous point makes a circuit to the introduction point and sends it the word 'Labrador' and its IP address.
The introduction point sends the message to the server and the server can choose to accept it or do nothing.
If the server accepts the message it will then create a circuit to the rendezvous point.
The server sends the rendezvous point a message. The rendezvous point looks at both messages from your computer and the server. It says "well, I've received a message from this computer saying it wants to connect with this service and I’ve also received a message from the service asking if it can connect to a computer, therefore they must want to talk to each other".
The rendezvous point will then act as another hop on the circuit and connect them.
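The matching step can be sketched as a tiny state machine, where the rendezvous point pairs up the client and the service by the one-time password. The class and method names here are invented for illustration:

```python
class RendezvousPoint:
    def __init__(self):
        self.waiting = {}  # one-time password -> waiting client circuit

    def client_connect(self, password: str, client_circuit: str) -> None:
        # the client arrives first and leaves its one-time password
        self.waiting[password] = client_circuit

    def service_connect(self, password: str, service_circuit: str):
        # if the service presents a matching password, join the circuits
        client_circuit = self.waiting.pop(password, None)
        if client_circuit is None:
            return None
        return (client_circuit, service_circuit)

rp = RendezvousPoint()
rp.client_connect("Labrador", "client-circuit")
print(rp.service_connect("Labrador", "service-circuit"))
```

The rendezvous point never learns who is behind either circuit; it only matches two passwords and splices the connections together.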
In short, a hidden service works like this, taken from here:
Tor protects its users from traffic analysis attacks. The adversary wants to know who Alice is talking to. Yet Tor does not protect against confirmation attacks. In these attacks, the adversary aims to answer the question “Is Alice talking to Bob?”
Confirmation attacks are hard and need a lot of preparation and resources. The attacker needs to be able to track both ends of the circuit. The attacker can either directly track each device's internet connection, or the guard and exit nodes.
If Alice sends a packet like this:
# (timestamp, size, port, protocol)
(17284812, 3, 21, SSH)
And Bob receives this packet, the attacker can see that the packets are the same - even though the attacker cannot see what the packet is as it is encrypted. Does Bob tend to receive packets at the same time that Alice sends them? Are they the same size? If so, it is reasonable to infer that Alice and Bob are communicating with each other.
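A naive confirmation check could be sketched like this. The matching criteria and delay threshold are simplified assumptions for illustration, not a real attack tool:

```python
def correlation_score(sent, received, max_delay=5):
    # packets are (timestamp, size, port, protocol) tuples;
    # count sent packets with a matching received packet shortly after
    matches = 0
    for s_time, *s_shape in sent:
        for r_time, *r_shape in received:
            if s_shape == r_shape and 0 <= r_time - s_time <= max_delay:
                matches += 1
                break
    return matches / len(sent)

alice_sent   = [(17284812, 3, 21, "SSH"), (17284820, 3, 21, "SSH")]
bob_received = [(17284814, 3, 21, "SSH"), (17284823, 3, 21, "SSH")]
print(correlation_score(alice_sent, bob_received))  # 1.0
```

A score near 1 over many packets is strong statistical evidence that Alice and Bob are talking, even though no packet was ever decrypted.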
Tor breaks packets up into fixed-size cells for a reason: to try to prevent this kind of correlation. Tor is also working on padding all packets to make this harder.
They’re discussing adding packet order randomisation too. But this is too costly at the moment. The Tor browser does add some extra defences, such as reordering packets.
If Alice sends the packets, A, B, C and Bob receives them in B, A, C it is harder to detect that they are the same. It’s not foolproof, but it does become harder.
An attack where the attacker tries to control both ends of the circuit is called a Sybil Attack, named after the subject of the book Sybil by Flora Rheta Schreiber. We discussed some of this earlier, where an attacker controls both the guard & exit nodes.
Sybil attacks are not theoretical. In 2014 researchers at Carnegie Mellon University appeared to successfully carry out a Sybil Attack against the real-life Tor network.
When Lizard Squad (a group of hackers) tried to perform a Sybil attack, a detection system raised the alarm. Tor has built-in monitoring against these kinds of events, and the developers are working on more sophisticated monitoring against Sybil attacks.
In 2007 Dan Egerstad, a Swedish security consultant, revealed he had intercepted usernames and passwords sent through Tor by running an exit node. At the time, these were not TLS or SSL encrypted.
Interestingly, Dan Egerstad had this to say on the Tor nodes:
If you actually look into where these Tor nodes are hosted and how big they are, some of these nodes cost thousands of dollars each month just to host because they’re using lots of bandwidth, they’re heavy-duty servers and so on. Who would pay for this and be anonymous?
Tor does not normally hide the fact that you are using Tor. Many websites (such as BBC’S iPlayer or editing Wikipedia) block you when using a known Tor node.
Some applications, under Tor, reveal your true IP address. One such application is BitTorrent.
Jansen et al. described an attack where they DDoS exit nodes. By degrading the network (knocking exit nodes offline), an attacker increases the chance that users' circuits go through the exit nodes the attacker controls.
Tor users who visit a site twice, once on Tor and once off, can be tracked. The way you move your mouse is unique. There is a JavaScript time measurement bug report on the Tor project that shows how it’s possible to monitor the mouse locations on a site (even when on Tor). Once you fingerprint someone twice, you know they’re the same person.
It should be noted that the Tor browser offers 3 levels of security (located in the settings). The highest security level disables JavaScript, some images (as they can be used to track you), and some fonts. The lesson is: if you want high-security Tor, use the highest security level.
Now, all these attacks sound cool. But that’s not how most Tor users are caught. Most Tor users make mistakes and are caught because of themselves.
Take Dread Pirate Roberts, founder of the Silk Road dark marketplace. He gave himself away by posting about it on social media.
Most Tor users are caught (if they’re doing illegal things) by bad operational security, and not normally because of a security issue with Tor.
It’s worth repeating this story, that we saw earlier.
In 2013 during the Final Exams period at Harvard, a student tried to delay the exam by sending in a fake bomb threat. The student used Tor and Guerrilla Mail (a service which allows people to make disposable email addresses) to send the bomb threat to school officials.
The student was caught, even though he took precautions to make sure he wasn’t caught.
Guerilla mail sends an originating IP address header along with the email that’s sent to the receiver, so it knows where the original email came from. With Tor, the student expected the IP address to be scrambled but the authorities knew it came from a Tor exit node (Tor keeps a list of all nodes in the directory service) so the authorities looked for people who were accessing Tor (within the university) at the time the email was sent.
If this person had gone to a coffee shop or somewhere similar, he probably would have been fine.
There’s a fantastic talk at DEFCON 22 about how Tor users got caught. None of the stories mentioned were caused by Tor itself, but rather by bad OpSec.
Tor is a fascinating protocol full of algorithms that have been refined over the years. I've come to appreciate Tor, and I hope you have too. Here is a list of things we've covered:
If you want to learn more, check out the paper on Tor titled "Tor: The Second-Generation Onion Router".
If you liked this article and want more like it, sign up to my email list below ✨ I'll only send you an email when I have something new, which is every month / 2 months or so.
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time, so any and all donations are appreciated.
Dynamic programming is breaking down a problem into smaller sub-problems, solving each sub-problem and storing the solutions to each of these sub-problems in an array (or similar data structure) so each sub-problem is only calculated once.
It is both a mathematical optimisation method and a computer programming method.
Optimisation problems seek the maximum or minimum solution. The general rule is that if you encounter a problem where the initial algorithm runs in $O(2^n)$ time, it is better solved using Dynamic Programming.
Richard Bellman invented DP in the 1950s. Bellman named it Dynamic Programming because at the time, RAND (his employer), disliked mathematical research and didn't want to fund it. He named it Dynamic Programming to hide the fact he was really doing mathematical research.
Bellman explains the reasoning behind the term Dynamic Programming in his autobiography, Eye of the Hurricane: An Autobiography (1984, page 159). He explains:
"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
Sub-problems are smaller versions of the original problem. Let's see an example. With the equation below:
$$1 + 2 + 3 + 4$$
We can break this down to:
$$1 + 2$$
$$3 + 4$$
Once we solve these two smaller problems, we can add the solutions to these sub-problems to find the solution to the overall problem.
Notice how these sub-problems break the original problem down into components that build up the solution. This is a small example, but it illustrates the beauty of Dynamic Programming well. If we expand the problem to adding hundreds of numbers, it becomes clearer why we need Dynamic Programming. Take this example:
$$6 + 5 + 3 + 3 + 2 + 4 + 6 + 5$$
We have $6 + 5$ twice. The first time we see it, we work out $6 + 5$. When we see it the second time we think to ourselves:
"Ah, 6 + 5. I've seen this before. It's 11!"
In Dynamic Programming we store the solution to the problem so we do not need to recalculate it. By finding the solutions for every single sub-problem, we can tackle the original problem itself.
Memoisation is the act of storing a solution.
Let's see why storing answers to sub-problems makes sense. We're going to look at a famous problem, the Fibonacci sequence. This problem is normally solved with Divide and Conquer.
There are 3 main parts to divide and conquer:

1. Divide the problem into smaller sub-problems of the same type.
2. Conquer: solve the sub-problems recursively.
3. Combine the sub-problem solutions into the solution for the original problem.
Dynamic programming has one extra step added to step 2. This is memoisation.
The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. We start with 0 and 1.
$$1 + 0 = 1$$
$$1 + 1 = 2$$
$$2 + 1 = 3$$
$$3 + 2 = 5$$
$$5 + 3 = 8$$
In Python, this is:
def F(n):
    if n == 0 or n == 1:
        return n
    else:
        return F(n-1) + F(n-2)
If you're not familiar with recursion I have a blog post written for you that you should read first.
Let's calculate F(4). In an execution tree, this looks like:
We calculate F(2) twice. On bigger inputs (such as F(10)) the repetition builds up. The purpose of dynamic programming is to not calculate the same thing twice.
Instead of calculating F(2) twice, we store the solution somewhere and only calculate it once.
We'll store the solutions in an array, so that memo[2] = 1. Below is some Python code to calculate the Fibonacci sequence using Dynamic Programming.
def fibonacciVal(n):
    memo = [0] * (n + 2)  # extra slot so memo[1] exists even when n is 0
    memo[0], memo[1] = 0, 1
    for i in range(2, n + 1):
        memo[i] = memo[i-1] + memo[i-2]
    return memo[n]
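The version above works bottom-up. The same memoisation idea can also be applied top-down, keeping the recursive shape of F and caching each result the first time it's computed. A sketch (the dictionary cache is my own addition):

```python
def F(n, memo={}):
    # Return the cached answer if we've already computed F(n)
    if n in memo:
        return memo[n]
    if n == 0 or n == 1:
        return n
    # Compute once, store, and reuse on every later call
    memo[n] = F(n - 1, memo) + F(n - 2, memo)
    return memo[n]

print(F(10))  # 55
```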
In theory, Dynamic Programming can solve every problem that has optimal substructure and overlapping sub-problems. The question is then:
"When should I solve this problem with dynamic programming?"
We should use dynamic programming for problems that are between tractable and intractable problems.
Tractable problems are those that can be solved in polynomial time. That's a fancy way of saying we can solve them quickly. Binary search and sorting are both fast. Intractable problems are those that run in exponential time. They're slow; often the only known approach is to brute force through every single combination (NP-hard problems).
When we see terms like:
"shortest/longest, minimised/maximised, least/most, fewest/greatest, biggest/smallest"
we know it's an optimisation problem.
A Dynamic Programming algorithm's proof of correctness is usually self-evident. Other algorithmic strategies are often much harder to prove correct, and thus more error-prone.
When we see these kinds of terms, the problem may ask for a specific number ( "find the minimum number of edit operations") or it may ask for a result ( "find the longest common subsequence"). The latter type of problem is harder to recognize as a dynamic programming problem. If something sounds like optimisation, Dynamic Programming can solve it.
Imagine we've found a problem that's an optimisation problem, but we're not sure if it can be solved with Dynamic Programming. First, identify what we're optimising for. Once we realize what we're optimising for, we have to decide how easy it is to perform that optimisation. Sometimes, the greedy approach is enough for an optimal solution.
Dynamic programming takes the brute force approach. It identifies repeated work and eliminates the repetition.
Before we even start to plan the problem as a dynamic programming problem, think about what the brute force solution might look like. Are sub-steps repeated in the brute-force solution? If so, we try to frame the problem as a dynamic programming problem.
Mastering dynamic programming is all about understanding the problem. List all the inputs that can affect the answers. Once we've identified all the inputs and outputs, try to identify whether the problem can be broken into subproblems. If we can identify subproblems, we can probably use Dynamic Programming.
Then, figure out what the recurrence is and solve it. When we're trying to figure out the recurrence, remember that whatever recurrence we write has to help us find the answer. Sometimes the answer will be the result of the recurrence, and sometimes we will have to get the result by looking at a few results from the recurrence.
Dynamic Programming can solve many problems, but that does not mean there isn't a more efficient solution out there. Solving a problem with Dynamic Programming feels like magic, but remember that dynamic programming is merely a clever brute force. Sometimes it pays off well, and sometimes it helps only a little.
Now we have an understanding of what Dynamic Programming is and how it generally works. Let's look at how to create a Dynamic Programming solution to a problem. We're going to explore the process using the Weighted Interval Scheduling Problem.
Pretend you're the owner of a dry cleaner. You have n customers come in and give you clothes to clean. You can only clean one customer's pile of clothes (PoC) at a time. Each pile of clothes, i, must be cleaned at some pre-determined start time $s_i$ and some predetermined finish time $f_i$.
Each pile of clothes has an associated value, $v_i$, based on how important it is to your business. For example, some customers may pay more to have their clothes cleaned faster. Or some may be repeating customers and you want them to be happy.
As the owner of this dry cleaner, you must determine the optimal schedule of clothes that maximises the total value for the day. This problem is a re-wording of the Weighted Interval Scheduling Problem.
You will now see 4 steps to solving a Dynamic Programming problem. Sometimes, you can skip a step. Sometimes, your problem is already well defined and you don't need to worry about the first few steps.
Grab a piece of paper and write out what the sub-problems are, in plain words.
In the dry cleaner problem, let's put down into words the subproblems. What we want to determine is the maximum value schedule for each pile of clothes such that the clothes are sorted by start time.
Why sort by start time? Good question! We want to keep track of processes which are currently running. If we sort by finish time, it doesn't make much sense in our heads. We could have 2 with similar finish times, but different start times. Time moves in a linear fashion, from start to finish. If we have piles of clothes that start at 1 pm, we know to put them on when it reaches 1pm. If we have a pile of clothes that finishes at 3 pm, we might need to have put them on at 12 pm, but it's 1pm now.
We can find the maximum value schedule for piles $n - 1$ through to n. And then for $n - 2$ through to n. And so on. By finding the solution to every single sub-problem, we can tackle the original problem itself. The maximum value schedule for piles 1 through n. Sub-problems can be used to solve the original problem, since they are smaller versions of the original problem.
With the interval scheduling problem, the only way we can solve it is by brute-forcing all subsets of the problem until we find an optimal one. What we're saying is that instead of brute-forcing one by one, we divide it up. We brute force from $n-1$ through to n. Then we do the same for $n - 2$ through to n. Finally, we have loads of smaller problems, which we can solve dynamically. We want to build the solutions to our sub-problems such that each sub-problem builds on the previous problems.
I know, mathematics sucks. If you'll bear with me here you'll find that this isn't that hard. Mathematical recurrences are used to define the running time of a divide and conquer (dynamic programming) technique.
Recurrences are also used to define problems. If it's difficult to turn your subproblems into maths, then it may be the wrong subproblem.
There are 2 steps to creating a mathematical recurrence:

1. Define the base case.
2. Define what happens at step n (the recurrence itself).
Base cases are the smallest possible denomination of a problem.
When creating a recurrence, ask yourself these questions:
"What decision do I make at step 0?"
It doesn't have to be 0. The base case is the smallest possible denomination of a problem. We saw this with the Fibonacci sequence: the base was n == 0 and n == 1, where we simply return n.
It's important to know where the base case lies, so we can create the recurrence. In our problem, we have one decision to make: run the current pile of clothes, or don't run it.
If n is 0, that is, if we have 0 PoC then we do nothing. Our base case is:
if n == 0, return 0
Now we know what the base case is. If we're at step n, what do we do? We look at each pile of clothes that is compatible with the schedule so far. Compatible means that the start time is after the finish time of the pile of clothes currently being washed. The algorithm has 2 options: run pile n, or don't run it.
We know what happens at the base case, and what happens else. We now need to find out what information the algorithm needs to go backwards (or forwards).
"If my algorithm is at step i, what information would it need to decide what to do in step i+1?"
To decide between the two options, the algorithm needs to know the next compatible PoC (pile of clothes). The next compatible PoC for a given pile, p, is the PoC, n, such that $s_n$ (the start time for PoC n) happens after $f_p$ (the finish time for PoC p). The difference between $s_n$ and $f_p$ should be minimised.
In English, imagine we have one washing machine. We put in a pile of clothes at 13:00. Our next pile of clothes starts at 13:01. We can't open the washing machine and put in the one that starts at 13:00. Our next compatible pile of clothes is the one that starts after the finish time of the one currently being washed.
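As a sketch, the next compatible pile can be found with a simple linear scan over piles sorted by start time (the full program later uses Binary Search for this; the function name and the (start, finish) tuple layout here are illustrative assumptions):

```python
def next_compatible(piles, i):
    # piles is a list of (start, finish) times sorted by start time.
    # Return the index of the first pile that starts at or after
    # pile i's finish time, or len(piles) if there is none.
    for j in range(i + 1, len(piles)):
        if piles[j][0] >= piles[i][1]:
            return j
    return len(piles)

piles = [(1, 3), (2, 5), (4, 6)]
print(next_compatible(piles, 0))  # 2
```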
"If my algorithm is at step i, what information did it need to decide what to do in step i-1?"
The algorithm needs to know about the future decisions (the ones made for PoC i through n) in order to decide whether to run or not run PoC i-1.
Now that we’ve answered these questions, we’ve started to form a recurring mathematical decision in our mind. If not, that’s also okay, it becomes easier to write recurrences as we get exposed to more problems.
Here’s our recurrence:
$$ OPT(i) = \begin{cases} 0, \quad \text{if } i = 0 \\ \max\{v_i + OPT(next[i]),\; OPT(i+1)\}, \quad \text{if } i \geq 1 \end{cases}$$
Let's explore in detail what makes this mathematical recurrence. OPT(i) represents the maximum value schedule for PoC i through to n such that PoC is sorted by start times. OPT(i) is our subproblem from earlier.
We start with the base case. All recurrences need somewhere to stop. If we call OPT(0) we'll be returned with 0.
To determine the value of OPT(i), there are two options. We want to take the maximum of these options to meet our goal. Our goal is the maximum value schedule for all piles of clothes. Once we choose the option that gives the maximum result at step i, we memoize its value as OPT(i).
Mathematically, the two options - run or not run PoC i, are represented as:
$$v_i + OPT(next[i])$$
This represents the decision to run PoC i. It adds the value gained from PoC i to OPT(next[i]), where next[i] represents the next compatible pile of clothing following PoC i. When we add these two values together, we get the maximum value schedule from i through to n, such that the piles are sorted by start time, if i runs.
They are sorted by start time here because next[i] is the first compatible pile after PoC i, so by construction the schedule stays ordered by start time.
$$OPT(i + 1)$$
If we decide not to run i, our value is then OPT(i + 1). The value is not gained. OPT(i + 1) gives the maximum value schedule for i+1 through to n, such that they are sorted by start times.
The solution to our Dynamic Programming problem is OPT(1). We can write out the solution as the maximum value schedule for PoC 1 through n such that PoC is sorted by start time. This goes hand in hand with "maximum value schedule for PoC i through to n".
From step 2:
$$OPT(1) = max(v_1 + OPT(next[1]), OPT(2))$$
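This recurrence can be sketched as a short top-down function. Everything here is an illustrative assumption rather than the article's code: `values` and `next_compatible` are precomputed, 1-indexed lists (index 0 is a dummy entry), and the recursion bottoms out once i runs past the last pile:

```python
def max_schedule_value(values, next_compatible):
    # Top-down evaluation of OPT(i) = max(v_i + OPT(next[i]), OPT(i + 1))
    n = len(values) - 1  # index 0 is a dummy entry
    memo = {}

    def opt(i):
        if i > n:          # no piles of clothes left to consider
            return 0
        if i not in memo:  # memoise: compute each OPT(i) only once
            memo[i] = max(values[i] + opt(next_compatible[i]),
                          opt(i + 1))
        return memo[i]

    return opt(1)

# Piles sorted by start time with values 5, 1, 8. Pile 1's next
# compatible pile is pile 3; piles 2 and 3 have none (next = 4).
print(max_schedule_value([0, 5, 1, 8], [0, 3, 4, 4]))  # 13
```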
Going back to our Fibonacci numbers earlier, our Dynamic Programming solution relied on the fact that the Fibonacci numbers for 0 through to n - 1 were already memoised. That is, to find F(5) we already memoised F(0), F(1), F(2), F(3), F(4). We want to do the same thing here.
The problem we have is figuring out how to fill out a memoisation table. In the scheduling problem, we know that OPT(1) relies on the solutions to OPT(2) and OPT(next[1]). PoC 2 and next[1] have start times after PoC 1 due to sorting. We need to fill our memoisation table from OPT(n) to OPT(1).
We can see our array is one dimensional, from 1 to n. But, if we couldn't see that we can work it out another way. The dimensions of the array are equal to the number and size of the variables on which OPT(x) relies. In our algorithm, we have OPT(i) - one variable, i. This means our array will be 1-dimensional and its size will be n, as there are n piles of clothes.
If we know that n = 5, then our memoisation array might look like this:
memo = [0, OPT(1), OPT(2), OPT(3), OPT(4), OPT(5)]
0 is also the base case. memo[0] = 0, per our recurrence from earlier.
When I am coding a Dynamic Programming solution, I like to read the recurrence and try to recreate it. Our first step is to initialise the array to size (n + 1). In Python, we don't need to do this. But you may need to do it if you're using a different language.
Our second step is to set the base case.
To find the profit with the inclusion of job[i], we need to find the latest job that doesn't conflict with job[i]. The idea is to use Binary Search to find the latest non-conflicting job. I've copied the code from here but edited it.
First, let's define what a "job" is. As we saw, a job consists of 3 things:
# Class to represent a job
class Job:
    def __init__(self, start, finish, profit):
        self.start = start
        self.finish = finish
        self.profit = profit
Start time, finish time, and the total profit (benefit) of running that job.
The next step we want to program is the schedule.
# The main function that returns the maximum possible
# profit from a given array of jobs
def schedule(job):
    # Sort jobs according to start time
    job = sorted(job, key=lambda j: j.start)

    # Create an array to store solutions of subproblems. table[i]
    # stores the profit for jobs up to job[i] (including job[i])
    n = len(job)
    table = [0 for _ in range(n)]
    table[0] = job[0].profit
Earlier, we learnt that the table is 1 dimensional. We sort the jobs by start time, create this empty table and set table[0] to be the profit of job[0]. Since we've sorted by start times, the first compatible job is always job[0].
Our next step is to fill in the entries using the recurrence we learnt earlier. To find the next compatible job, we're using Binary Search. In the full code posted later, it'll include this. For now, let's worry about understanding the algorithm.
If the search for the next compatible job returns -1, all jobs before index i conflict with it (so none can be used). inclProf means we're including that item in the maximum value set. We then store the result in table[i], so we can use this calculation again later.
# Fill entries in table[] using the recursive property
for i in range(1, n):
    # Find profit including the current job
    inclProf = job[i].profit
    l = binarySearch(job, i)
    if l != -1:
        inclProf += table[l]

    # Store the maximum of including and excluding the current job
    table[i] = max(inclProf, table[i - 1])
Our final step is then to return the profit of all items up to n-1.
return table[n-1]
The full code can be seen below:
# Python program for weighted job scheduling using Dynamic
# Programming and Binary Search

# Class to represent a job
class Job:
    def __init__(self, start, finish, profit):
        self.start = start
        self.finish = finish
        self.profit = profit

# A Binary Search based function to find the latest job
# (before the current job) that doesn't conflict with the current
# job. "start_index" is the index of the current job. This function
# returns -1 if all jobs before it conflict with it.
def binarySearch(job, start_index):
    # https://en.wikipedia.org/wiki/Binary_search_algorithm
    # Initialise 'lo' and 'hi' for Binary Search
    lo = 0
    hi = start_index - 1

    # Perform Binary Search iteratively
    while lo <= hi:
        mid = (lo + hi) // 2
        if job[mid].finish <= job[start_index].start:
            if job[mid + 1].finish <= job[start_index].start:
                lo = mid + 1
            else:
                return mid
        else:
            hi = mid - 1
    return -1

# The main function that returns the maximum possible
# profit from a given array of jobs
def schedule(job):
    # Sort jobs according to start time
    job = sorted(job, key=lambda j: j.start)

    # Create an array to store solutions of subproblems. table[i]
    # stores the profit for jobs up to job[i] (including job[i])
    n = len(job)
    table = [0 for _ in range(n)]
    table[0] = job[0].profit

    # Fill entries in table[] using the recursive property
    for i in range(1, n):
        # Find profit including the current job
        inclProf = job[i].profit
        l = binarySearch(job, i)
        if l != -1:
            inclProf += table[l]

        # Store the maximum of including and excluding the current job
        table[i] = max(inclProf, table[i - 1])

    return table[n-1]

# Driver code to test the above function
job = [Job(1, 2, 50), Job(3, 5, 20),
       Job(6, 19, 100), Job(2, 100, 200)]
print("Optimal profit is", schedule(job))
Congrats! 🥳 We've just written our first dynamic program! Now that we've got our feet wet, let's walk through a different type of dynamic programming problem.
Imagine you are a criminal. Dastardly smart. You break into Bill Gates’s mansion. Wow, okay!?!? How many rooms is this? His washing machine room is larger than my entire house??? Ok, time to stop getting distracted. You brought a small bag with you. A knapsack - if you will.
You can only fit so much into it. Let’s give this an arbitrary number. The bag will support weight 15, but no more. What we want to do is maximise how much money we'll make, $b$.
The greedy approach is to pick the item with the highest value which can fit into the bag. Let's try that. We're going to steal Bill Gates's TV. £4000? Nice. But his TV weighs 15. So... We leave with £4000.
TV = (£4000, 15)
# (value, weight)
Bill Gates has a lot of watches. Let's say he has 2 watches. Each watch weighs 5 and each one is worth £2250. When we steal both, we get £4500 with a weight of 10.
watch1 = (£2250, 5)
watch2 = (£2250, 5)
watch1 + watch2
>>> (£4500, 10)
In the greedy approach, we wouldn't choose these watches first. But to us as humans, it makes sense to go for smaller items which have higher values. The Greedy approach cannot optimally solve the {0,1} Knapsack problem. The {0, 1} means we either take the item whole item {1} or we don't {0}. However, Dynamic programming can optimally solve the {0, 1} knapsack problem.
The simple solution to this problem is to consider all the subsets of all items. For every single combination of Bill Gates's stuff, we calculate the total weight and value of this combination.
Only combinations with weight less than $W_{max}$ are considered. We then pick the combination which has the highest value. This is a disaster! How long would this take? Bill Gates would come back home far before you're even 1/3rd of the way there! In Big O, this algorithm takes $O(2^n)$ time, since there are $2^n$ subsets to check.
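To make the brute force concrete, here is a sketch that literally tries every subset. The items are (value, weight) tuples and the names are my own:

```python
from itertools import combinations

def brute_force(items, max_weight):
    # Try all 2^n subsets of the items; keep the best value whose
    # total weight fits in the knapsack.
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if sum(w for _, w in subset) <= max_weight:
                best = max(best, sum(v for v, _ in subset))
    return best

print(brute_force([(1, 1), (4, 3), (5, 4), (7, 5)], 7))  # 9
```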
You can see we already have a rough idea of the solution and what the problem is, without having to write it down in maths!
Imagine we had a listing of every single thing in Bill Gates's house. We stole it from some insurance papers. Now, think about the future. What is the optimal solution to this problem?
We have a subset, L, which is the optimal solution. L is a subset of S, the set containing all of Bill Gates's stuff.
Let's pick a random item, N. L either contains N or it doesn't. If it doesn't contain N, the optimal solution for the problem is the same as the optimal solution for the remaining items ${1, 2, ..., N-1}$.
Why? Suppose the optimum of the original problem did not contain the optimum of this sub-problem. Then we could swap the sub-optimal part for the sub-problem's optimum and get a better overall solution, which is a contradiction: by assumption we already had the optimum of the whole problem.
If L contains N, then the optimal solution for the problem is N plus the optimal solution over the remaining items ${1, 2, ..., N-1}$. We know item N is in, so L already contains N. To complete the computation we focus on the remaining items, and find the optimal solution to them.
But, we now have a new maximum allowed weight of $W_{max} - W_n$. If item N is contained in the solution, the total weight is now the max weight take away item N (which is already in the knapsack).
These are the 2 cases. Either item N is in the optimal solution or it isn't.
If the weight of item N is greater than $W_{max}$, then it cannot be included so case 1 is the only possibility.
To better define this recursive solution, let $S_k = \{1, 2, ..., k\}$ and $S_0 = \emptyset$
Let B[k, w] be the maximum total benefit obtained using a subset of $S_k$, having total weight at most w.
Then we define B[0, w] = 0 for each $w \le W_{max}$.
Our desired solution is then B[n, $W_{max}$].
$$ B[k, w] = \begin{cases} B[k - 1, w], \quad \text{if } w < w_k \\ \max\{B[k-1, w],\; b_k + B[k - 1, w - w_k]\}, \quad \text{otherwise} \end{cases}$$
Okay, pull out some pen and paper. No, really. Things are about to get confusing real fast. This memoisation table is 2-dimensional. We have these items:
(1, 1), (4, 3), (5, 4), (7, 5)
Where the tuples are (value, weight), matching the rows of the tables that follow.
We have 2 variables, so our array is 2-dimensional. The first dimension is the total weight, from 0 to 7. The second dimension is the items.
And we want a weight of 7 with maximum benefit.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) |   |   |   |   |   |   |   |   |
| (4, 3) |   |   |   |   |   |   |   |   |
| (5, 4) |   |   |   |   |   |   |   |   |
| (7, 5) |   |   |   |   |   |   |   |   |
The weight is 7. We start counting at 0. We put each tuple on the left-hand side. Ok. Now to fill out the table!
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 |   |   |   |   |   |   |   |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
The columns are weight. At weight 0, we have a total weight of 0. At weight 1, we have a total weight of 1. Obvious, I know. But this is an important distinction to make which will be useful later on.
When our weight is 0, we can't carry anything no matter what. The total weight of everything at 0 is 0.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 |   |   |   |   |   |   |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If our total weight is 1, the best item we can take is (1, 1). As we go down through this array, we can take more items. At the row for (4, 3) we can either take (1, 1) or (4, 3). But for now, we can only take (1, 1). Our maximum benefit for this row then is 1.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If our total weight is 2, the best we can do is 1. We only have 1 of each item. We cannot duplicate items. So no matter where we are in row 1, the absolute best we can do is (1, 1).
Let's start using (4, 3) now. If the total weight is 1 or 2, the weight of (4, 3) is 3, so we cannot take the item until we have a total weight of at least 3.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we have a weight of 3. Let's compare some things. We want to take the max of:
$$\max(4 + T[0][0],\; 1)$$
At row (4, 3), column 3, we can either take the value from the row above or use the item on this row. To use the item, we go up one row and count back 3 columns (since the weight of this item is 3).
In general, we count back by whatever weight remains when we subtract the weight of the item on this row. The weight of (4, 3) is 3 and we're at weight 3, so 3 - 3 = 0. Therefore, we look at T[0][0]: T[previous row's number][current total weight - item weight].
The 1 is the value from the previous row at this weight. The max here is 4.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
$$\max(4 + T[0][1],\; 1)$$
Total weight is 4, item weight is 3. $4 - 3 = 1$. The previous row is row 0, so we look at T[0][1] = 1, giving $4 + 1 = 5$.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
I won't bore you with the rest of this row, as nothing exciting happens. We have 2 items. And we've used both of them to make 5. Since there are no new items, the maximum value is 5.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Onto our next row:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Here's a little secret: our tuples are ordered by weight! That means we can copy the values from the previous row until we reach this item's weight. We know those entries are already the maximum, so we can fill them in. This is where memoisation comes into play! We already have the data, so why bother re-calculating it?
For weight 4, we go up one row and head 4 steps back. That gives us:
$$\max(5 + T[1][0],\; 5) = 5$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we calculate it for total weight 5.
$$\max(5 + T[1][1],\; 5) = 6$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
We do the same thing again for total weight 6:
$$\max(5 + T[1][2],\; 5) = 6$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we have total weight 7. We choose the max of:
$$\max(5 + T[1][3],\; 5) = \max(5 + 4,\; 5) = 9$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If we had total weight 7 and we had the 3 items (1, 1), (4, 3), (5, 4) the best we can do is 9.
Since our new item starts at weight 5, we can copy from the previous row until we get to weight 5.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 |   |   |   |
We then do another max.
Total weight minus the new item's weight: $5 - 5 = 0$. We want the previous row at position 0.
$$\max(7 + T[2][0],\; 6)$$
The 6 comes from the best on the previous row for this total weight.
$$\max(7 + 0,\; 6) = 7$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 |   |   |
$$\max(7 + T[2][1],\; 6) = 8$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 |   |
$$\max(7 + T[2][2],\; 9) = 9$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
9 is the maximum value we can get by picking items from the set of items such that the total weight is $\le 7$.
Now, what items do we actually pick for the optimal set? We start with this item:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
We want to know where the 9 comes from. It comes from the top, because the number directly above the 9 on the final row is also 9. Since it comes from the top, the item (7, 5) is not used in the optimal set.
Where does this 9 come from?
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
This 9 is not coming from the row above it. Item (5, 4) must be in the optimal set.
We now go up one row, and go back 4 steps. 4 steps because the item, (5, 4), has weight 4.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
4 does not come from the row above. The item (4, 3) must be in the optimal set.
The weight of item (4, 3) is 3. We go up and we go back 3 steps and reach:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
As soon as we reach a point where the weight is 0, we're done. Our two selected items are (5, 4) and (4, 3). The total weight is 7 and our total benefit is 9. We add the two tuples together to find this out.
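This backtracking walk can be sketched in code. The function below rebuilds the table and then walks back from the bottom-right cell; whenever a cell differs from the one directly above it, that row's item was taken. The names are my own:

```python
def chosen_items(items, capacity):
    # items is a list of (value, weight) tuples.
    n = len(items)
    T = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = items[i - 1]
        for w in range(capacity + 1):
            if weight > w:
                T[i][w] = T[i - 1][w]  # item can't fit at this weight
            else:
                T[i][w] = max(T[i - 1][w], value + T[i - 1][w - weight])

    # Walk back from T[n][capacity]. A value copied from the row above
    # means the item was skipped; otherwise take the item and step
    # back by its weight.
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if T[i][w] != T[i - 1][w]:
            chosen.append(items[i - 1])
            w -= items[i - 1][1]
    return list(reversed(chosen))

print(chosen_items([(1, 1), (4, 3), (5, 4), (7, 5)], 7))  # [(4, 3), (5, 4)]
```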
Let's begin coding this.
Now we know how it works, and we've derived the recurrence for it, so it shouldn't be too hard to code. If our two-dimensional array is indexed by i (row, the item) and j (column, the total weight), then: if our weight j is less than the weight of item i, item i cannot contribute to j, so we take the value from the row above. Otherwise, we take the maximum of excluding and including the item:

if j < wt[i]:
    # item i cannot fit, so it cannot contribute to weight j
    T[i][j] = T[i - 1][j]
else:
    # weight of item i <= j: take the max of excluding the item
    # (row above) and including it (row above, minus the item's weight)
    T[i][j] = max(val[i] + T[i - 1][j - wt[i]], T[i - 1][j])
This is the heart of the program. I've copied some code from here to help explain it. I'm not going to explain this code much, as there isn't much more to it than what I've already explained. If you're confused by it, leave a comment below or email me 😁
# Returns the maximum value that can be put in a knapsack of
# capacity W
def knapSack(W, wt, val, n):
    # Base Case
    if n == 0 or W == 0:
        return 0

    # If the weight of the nth item is more than the knapsack
    # capacity W, then this item cannot be included in the
    # optimal solution
    if wt[n-1] > W:
        return knapSack(W, wt, val, n-1)

    # Return the maximum of two cases:
    # (1) nth item included
    # (2) not included
    else:
        return max(val[n-1] + knapSack(W - wt[n-1], wt, val, n-1),
                   knapSack(W, wt, val, n-1))

# To test the above function
val = [60, 100, 120]
wt = [10, 20, 30]
W = 50
n = len(val)
print(knapSack(W, wt, val, n))
# output 220
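The recursive code above recomputes overlapping subproblems. As a sketch, we can add memoisation by caching each (n, W) pair; the `memo` dictionary is my own addition to the copied code:

```python
def knapSackMemo(W, wt, val, n, memo=None):
    if memo is None:
        memo = {}
    if n == 0 or W == 0:
        return 0
    if (n, W) in memo:  # already solved this subproblem
        return memo[(n, W)]
    if wt[n-1] > W:
        result = knapSackMemo(W, wt, val, n-1, memo)
    else:
        result = max(val[n-1] + knapSackMemo(W - wt[n-1], wt, val, n-1, memo),
                     knapSackMemo(W, wt, val, n-1, memo))
    memo[(n, W)] = result  # cache before returning
    return result

print(knapSackMemo(50, [10, 20, 30], [60, 100, 120], 3))  # 220
```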
Time complexity in Dynamic Programming is calculated as:
$$\text{number of unique states} \times \text{time taken per state}$$
For our original problem, the Weighted Interval Scheduling Problem, we had n piles of clothes. Each pile of clothes is solved in constant time. The time complexity is:
$$O(n) \times O(1) = O(n)$$
I've written a post about Big O notation if you want to learn more about time complexities.
With our Knapsack problem, we had n items. The table grows depending on the total capacity of the knapsack, so our time complexity is:
$$O(nw)$$
Where n is the number of items, and w is the capacity of the knapsack.
I'm going to let you in on a little secret. It's possible to work out the time complexity of an algorithm from its recurrence. You can use something called the Master Theorem to work it out. This is the theorem in a nutshell:
Now, I'll be honest. The master theorem deserves a blog post of its own. For now, I've found this video to be excellent:
Dynamic Programming & Divide and Conquer are similar. Dynamic Programming is based on Divide and Conquer, except we memoise the results.
But, Greedy is different. It aims to optimise by making the best choice at that moment. Sometimes, this doesn't optimise for the whole problem. Take this question as an example. We have 3 coins:
1p, 15p, 25p
And someone wants us to give 30p in change. With Greedy, it would select 25p, then 5 × 1p, for a total of 6 coins. The optimal solution is 2 × 15p. Greedy works from largest to smallest: at the moment it made its first choice, picking 25p looked best.
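For comparison, here is a hedged sketch of the dynamic programming version of this coin problem, a classic minimum-coin-change table (the names are my own):

```python
import math

def min_coins(coins, amount):
    # best[a] = fewest coins needed to make amount a
    best = [0] + [math.inf] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a:
                best[a] = min(best[a], best[a - c] + 1)
    return best[amount]

print(min_coins([1, 15, 25], 30))  # 2
```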
Greedy vs Divide & Conquer vs Dynamic Programming

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking a problem down into simpler sub-problems of the same type and solving them recursively | Same as Divide and Conquer, but caches the answer to each sub-problem so that no calculation is repeated |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but could be overkill on small datasets |
| Requires almost no memory | Requires some memory to remember recursive calls | Requires a lot of memory for memoisation / tabulation |
There are 2 types of dynamic programming. Tabulation and Memoisation.
We've computed all the subproblems but have no idea what the optimal evaluation order is. We would then perform a recursive call from the root, and hope we get close to the optimal solution or obtain a proof that we will arrive at the optimal solution. Memoisation ensures you never recompute a subproblem because we cache the results, thus duplicate sub-trees are not recomputed.
From our Fibonacci sequence earlier, we start at the root node. The subtree F(2) isn't calculated twice.
The recursion starts at the top of the tree, and the subproblems are evaluated from the leaves/subtrees back up towards the root as the calls return. Memoisation is a top-down approach.
We've also seen Dynamic Programming being used as a 'table-filling' algorithm. Usually, this table is multidimensional. This is like memoisation, but with one major difference: we have to pick the exact order in which we will do our computations. In the knapsack problem we saw, we filled the table in from left to right, top to bottom. We knew the exact order in which to fill the table.
Sometimes the 'table' is not like the tables we've seen. It can be a more complicated structure such as trees. Or specific to the problem domain, such as cities within flying distance on a map.
Generally speaking, memoisation is easier to code than tabulation. We can write a 'memoiser' wrapper function that automatically does it for us. With tabulation, we have to come up with an ordering.
Memoisation has memory concerns. If we're computing something large such as F(10^8), each computation will be delayed as we have to place them into the array. And the array will grow in size very quickly.
Either approach may not be time-optimal if the order in which we happen (or try) to visit subproblems is not optimal, or if there is more than one way to calculate a subproblem (normally caching resolves this, but in exotic cases it might not). Memoisation also tends to add our time complexity onto our space complexity. With tabulation we have more liberty to throw away calculations: tabulated Fibonacci uses O(1) space, while memoised Fibonacci uses O(n) stack space.
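The 'memoiser' wrapper and the O(1)-space tabulated Fibonacci mentioned above can both be sketched like this (a minimal illustration, not a production decorator):

```python
from functools import wraps

def memoise(func):
    """Generic memoiser: cache each result so no subproblem repeats."""
    cache = {}
    @wraps(func)
    def wrapper(n):
        if n not in cache:
            cache[n] = func(n)
        return cache[n]
    return wrapper

@memoise
def fib(n):
    # top-down: recursion drills to the leaves, cache stops duplicates
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_tab(n):
    """Bottom-up tabulation: fixed order, old values thrown away → O(1) space."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(30), fib_tab(30))  # → 832040 832040
```

Note the trade-off in miniature: `fib` keeps a cache of every subproblem plus the recursion stack, while `fib_tab` only ever holds two numbers.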
Memoisation vs Tabulation

| | Tabulation | Memoisation |
|---|---|---|
| Code | Harder to code, as you have to work out the evaluation order | Easier to code, as functions to memoise may already exist |
| Speed | Fast, as you already know the order and dimensions of the table | Slower, as you're creating the entries on the fly |
| Table completeness | The table is fully computed | The table does not have to be fully computed |
Most of the problems you'll encounter within Dynamic Programming already exist in one shape or another. Often, your problem will build on from the answers for previous problems. Here's a list of common problems that use Dynamic Programming.
I hope that whenever you encounter a problem, you think to yourself "can this problem be solved with Dynamic Programming?" and try it.
An AVL tree (named after inventors Adelson-Velsky and Landis) is a self-balancing binary search tree.
Before we delve into AVL trees, we need to learn a few things.
The height of a tree is the length of the longest path from the root node (the top node) to some node. In the image above, the tree has a height of 3.
This tree is called balanced. Balance is an important topic in AVL trees. First, let's look at why balance is important. The speed of searching a balanced tree is $\Theta(\log n)$. The speed of searching a non-balanced tree is $\Theta(n)$.
This tree isn't balanced, it's a sad tree 😢. Formally, balance is defined as:
The tree is considered balanced if the height is $\Theta(\log n)$
This sucks as a definition. So, every time you make a tree you have to check this weird property? Balance can be better explained.
Balance is where each node has one left and one right child. In the above example, no nodes had a left child so it is unbalanced.
The depth of a node is a distance from it to the root. The height of a node is the length of the longest path from the node to a leaf.
Here is a tree with heights of each node listed next to them.
Each leaf has its own pointers, both pointing towards NULL (with a height of -1).
We often do not include the null pointers in diagrams, but they're important as they make our lives easier later on.
AVL trees require the heights of the left and right children of every node to differ by at most ±1. In fact, anything in the range $-1 \le x \le 1$ is fine. Preferably, we'd like them to differ by 0. That is, perfectly matched on left and right. This isn't realistic and is impossible for a tree with an even number of nodes.
$$S - R \le 1$$
Now, we know we can't always make the difference 0. And it's tiresome to calculate $S - R$ for every single pair of nodes. So we just make sure the right has at most 1 more node than the left. A very short proof of this working is:
$$ left = n$$
$$ right = n + 1$$
$$H = right - left = (n + 1) - n = 1$$
Let's start out with this tree. If you look at the root node, the right side has height 2. The left side has height 3. 2 - 3 = -1, which is fine.
Now, let's insert 23.
Right is 2, left is 4. 2 - 4 = -2. Uh oh! It's a sad tree now because it's not balanced. We can turn this tree into a balanced tree by rotating it.
Our tree is right heavy. We want to rotate it left. We can do it like this:
Rotations take O(1) time.
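A left rotation can be sketched in a few lines. This assumes a bare `Node` class invented for illustration; real AVL implementations also store and update heights or balance factors, which is omitted here:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    # null pointers count as height -1, as in the diagrams above
    return -1 if node is None else 1 + max(height(node.left), height(node.right))

def rotate_left(x):
    """The right child y comes up, x becomes y's left child,
    and y's old left subtree moves across to become x's right subtree."""
    y = x.right
    x.right = y.left
    y.left = x
    return y  # y is the new root of this subtree

# Right-heavy chain 23 → 26 → 29 becomes balanced with 26 on top
root = Node(23, right=Node(26, right=Node(29)))
root = rotate_left(root)
print(root.key, root.left.key, root.right.key)  # → 26 23 29
```

Only three pointers change, which is why a rotation is O(1).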
We rotate the tree we had above, now it is a happy tree. A colleague told me that 26 is between 23 and 29, therefore 26 goes on top. While this isn't official advice in any textbooks, it seems to work for most of the problems I'm going to present in this article.
We insert(55). Now it's a sad tree again, what a disaster! What we have here is a zig-zag:
If we rotate the zig-zag we get:
That's.... no better than before. Let's quickly rotate our tree to match this zig-zag pattern:
Now, the solution to the zig-zag pattern is to rotate again. A double rotation. 55 is in between 50 and 65, so it should go on top.
Now it's a happy tree! 🌴
Now we understand rotations, let's take a look at AVL insertions one more time.
Suppose X is the lowest node in the now non-AVL tree. Assume right(X) is higher (heavier).
AVL Deletion is the same as insertion, but instead of insertion we delete. The same 2 principles apply:
Diffie-Hellman-Merkle is a way to share a secret key with someone (or something) without actually sending them the key. Before we look into how we share keys let's first look into what keys are and why we would want to invent a method to share keys without giving the other person the key.
Your front door is usually locked by a key. This key unlocks & locks your front door. You have one key which you use to unlock and lock things.
Only people with the key or a copy of the key can unlock the door. Now, imagine you're going to be on holiday Friday, Saturday, Sunday in Bali. You want to invite your friend around to look after your cat 😺 while you're on the beautiful beaches 🏖️.
Your only friend is unfortunately on holiday Wednesday, Thursday and Friday. They get back right as you leave for holiday. You can't be there to physically give them the key, but hiding the key under a rock outside your door seems insecure. Anyone could lift up that rock and find the key, but you just want your friend to have the key.
This is where Diffie-Hellman comes in. Well, with Diffie-Hellman you're not exchanging physical keys but rather digital keys. Let's explore some basic cryptography to understand why digital key exchange sucks just as much as real life key exchange.
Julius Caesar used a cipher to send messages that no one else could read other than the intended recipient. Mainly because no one could read back in 100 BC, and those that could wouldn't understand a random string of letters. That's the whole point of cryptography. To create ways to communicate without third parties understanding the message. This cipher is Caesar's Cipher. Given an alphabet and a key (the key is an integer between 1 and 25), shift all of the alphabet letters by key.
With a shift of 3, as seen in the image above, A becomes D, B becomes E and so on until it wraps around with X = A. The original message is called the plaintext and the encrypted message is called the ciphertext.
The easiest way to perform Caesar's Cipher is to turn all of the letters into numbers, a = 1, b = 2, c = 3 and so on.
To encrypt, E, you calculate this for every letter (where s is the shift):
$$ E_{s}(letter) = (letter + shift)$$
To decrypt Caesar's cipher, D, you calculate this for every letter:
$$ D_{s}(letter) = (letter - shift)$$
Something important to note is that this version of the cipher doesn't support wraparound (for brevity).
As you can tell, it's not very secure. With 25 total shifts you just have to shift the text 25 times until you find the decrypted code, this is called a brute force attack. You take the encrypted text and shift it all 25 times until you find the decrypted text. But let's imagine for a second that this was a hard cipher - that brute force isn't feasible.
The shift is the key of Caesar's cipher. But the problem persists: how do you tell your friend you're using a shift of 9? You have to communicate it to them somehow, and any and all forms of communication can be listened in on, whether that's writing a letter or going to a hidden forest in Switzerland 30 miles from the nearest town and telling your friend.
The problem becomes even more apparent when you realise that communicating parties over the internet normally have no prior knowledge about each other and are thousands of miles apart. This is where the magic of Diffie-Hellman-Merkle key exchange comes in.
Diffie-Hellman is a way to securely exchange keys in public. It was conceptualised by Ralph Merkle and named after Whitfield Diffie and Martin Hellman. I have chosen to include Merkle's name in the title because he put in just as much work as Diffie and Hellman, yet his name rarely appears when this algorithm is talked about.
U.S. Patent 4,200,770, from 1977, is now expired and describes the now-public-domain algorithm. It credits Hellman, Diffie, and Merkle as inventors.
Let's go through how this algorithm works.
For this algorithm, we will also walk through the colour mixing method for explaining how it works.
Alice and Bob publicly agree to use a modulus p = 23 and g = 5 (which is a primitive root modulo 23, explained later). Modulus is just the remainder of the division. Note: this example comes from Wikipedia.
It's hard to describe the paint method in text, so if you want to know about this method I suggest watching this video:
We'll colour G yellow. We have 2 copies of G (yellow) as seen above.
When Alice and Bob agree on these numbers, Eve knows they are using these numbers.
2. Alice needs to calculate a private key.
She does this by picking a secret number (a). She computes $G^a \; mod \; p$ and sends that result to Bob.
Alice chooses a secret, random integer a = 4.
Alice computes $A = 5^4 \; mod \; 23 = 4$ and sends the number 4 to Bob.
She colours this private key reddish-brown.
Eve doesn't know Alice's secret number is 4, only that the result of this equation is 4. It's not feasible for Eve to calculate what Alice's secret number is from the resultant of this equation.
3. Bob makes his own private key. Its colour is dark green.
He calculates this by picking a secret number (b) and computing $g^b \; mod \; p$. He then sends the result to Alice. Bob creates a random private key, for this example we'll use 3.
Then Bob calculates $B = 5^3 \; mod \; 23 = 10$ and sends 10 to Alice.
4. Now Bob takes the number Alice sent him (A) and computes $A^b \; mod \; p$.
In the colour analogy, this is taking Alice's paint colour and adding it to Bob's paint colour.
Bob computes $s = 4^3 \; mod \; 23 = 18$
Bob doesn't send this to Alice.
5. Alice takes the number Bob sent her (B) and computes $B^a \; mod \; p$.
In the paint analogy, this is Alice adding Bob's paint (that Bob sent her) to her paint.
Alice calculates $s = 10^4 \; mod \; 23 = 18$
The magic is that Alice and Bob now have the same number, or the same paint colour.
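The whole exchange above fits in a few lines of Python, using the built-in three-argument `pow` for modular exponentiation:

```python
p, g = 23, 5          # public: the modulus and its primitive root
a, b = 4, 3           # Alice's and Bob's secret numbers (never transmitted)

A = pow(g, a, p)      # Alice sends 5^4 mod 23 = 4
B = pow(g, b, p)      # Bob sends 5^3 mod 23 = 10

# Each side raises what they received to their own secret power
alice_secret = pow(B, a, p)   # 10^4 mod 23
bob_secret = pow(A, b, p)     # 4^3 mod 23

print(A, B, alice_secret, bob_secret)  # → 4 10 18 18
```

Eve sees `p`, `g`, `A` and `B` go across the wire, but never `a`, `b` or the shared 18.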
Let's discuss in detail the mathematics behind this cool algorithm.
Diffie-Hellman-Merkle works because of a cool modulus exponent principle. First, let's explain what modulus is before we try to understand this principle.
Imagine a finite range of numbers, for example, 1 to 12. These numbers are arranged in a circle, much like a clock (modular arithmetic is sometimes called clock arithmetic because of this).
Count 13 around this clock. You get to 12 and then you need to count 1 more - so you go back to 1. Modular arithmetic is still defined as the remainder of division, however it can also be defined (and is more commonly defined) as a clock.
Functions using modular arithmetic tend to perform erratically, which in turn sometimes makes them one-way functions. Let's see this with an example by taking a regular function and seeing how it works when it becomes a modular arithmetic function.
$$3x$$
When we insert 2 into this function, we get $3 \times 2 = 6$. Insert 3 and we get $3 \times 3 = 9$.
This function is easy to reverse. If we're given 9, we can tell that the function had an input of 3, because $3 \times 3 = 9$.
However, take a similar function, $3^x$, with modular arithmetic added. It no longer behaves sensibly.
Imagine we had this formula:
$$ 3^x \; mod \; 7 = 1$$
How would you find out what x is? You can't put the mod on the other side, because there isn't really an inverse of modular arithmetic. What about guessing? Let's input 5:
$$ 3^5 \; mod \; 7 = 5$$
Okay, that was too big. You might want to go lower, maybe 4 or 3 but actually this is the wrong direction. When x is 6, it is equal to 1.
In normal arithmetic, we can test numbers and get a feel for whether we are getting warmer or colder, but this isn't the case with modular arithmetic.
Often the easiest way to reverse modular arithmetic is to compile a table for all values of x until the right answer is found. Although this may work for smaller numbers, it is computationally infeasible to do for much larger numbers. This is often why modular arithmetic is known as a one-way function.
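The table-compiling approach can be sketched as a brute-force search (the function name is illustrative; $x = 0$ trivially gives 1, so we search from 1):

```python
def brute_force_discrete_log(base, target, modulus):
    """Tabulate base^x mod modulus for each x until we hit the target."""
    for x in range(1, modulus):
        if pow(base, x, modulus) == target:
            return x
    return None

# Solving 3^x mod 7 = 1 by exhaustion
print(brute_force_discrete_log(3, 1, 7))  # → 6
```

For a modulus of 7 this is instant; for the hundreds-of-digits moduli used in real cryptography, the loop would outlive the universe.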
If I gave you a number such as 5787 and told you to find the function for it, it would be infeasible. In fact, if I gave you the ability to input any number into the function it would still be hard. It took me a mere few seconds to make this function, but it'll take you hours or maybe even days to work out what x is.
Diffie-Hellman-Merkle is a one-way function. While it is relatively easy to carry out this function, it is computationally infeasible to do the reverse of the function and find out what the keys are. Although, it is possible to reverse an RSA encryption if you know some numbers such as N.
The primitive root of a prime number, p, is a number, a, such that all numbers:
$$a \; mod \; p, \; a^2 \; mod \; p, \; a^3 \; mod \; p, \; a^4 \; mod \; p, \; ...$$
are different. There is a formula for counting what the indices are, but I think it's far more intuitive to think "the second one is to the power of 2, the third one is to the power of 3" and so on.
Let's see an example where $p = 7$. Let's set $a_1 = 2$ and $a_2 = 3$.
$$2^0 = 1 ( mod \; 7) = 1$$
$$2^1 = 2 ( mod \; 7) = 2$$
$$2^2 = 4 ( mod \; 7) = 4$$
$$2^3 = 8 ( mod \; 7) = 1$$
Uh oh! $2^0$ is the same as $2^3$. This means that 2 is not a primitive root of 7. Let's try again with 3.
$$3^0 = 1 ( mod \; 7) = 1$$
$$3^1 = 3 (mod \; 7) = 3$$
$$3^2 = 9 (mod \; 7) = 2$$
$$3^3 = 27 (mod \; 7) = 6$$
$$3^4 = 81 ( mod \; 7) = 4$$
$$3^5 = 243 ( mod \; 7) = 5$$
$$3^6 = 729 \; (mod \; 7) = 1$$
Now we've gotten a cycle in these powers.
$3^6 = 1$, and $3^0 = 1$. Because we are using a modulus, the powers repeat in this cycle, so we can stop now. Unlike before, where $2^3$ cycled too early, it's fine to cycle at this point: for any prime number $p$ and any number $a$ such that $a \ne 0 \; (mod \; p)$ and $a \ne 1 \; (mod \; p)$, the consecutive powers of $a$ may cover no more than $p - 1$ values modulo $p$. That is, we go from $1, ..., p - 1$. When $p$ is 7, the consecutive powers cover up to 6 values.
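For small numbers, a primitive-root check can be sketched directly from the definition: the powers of $a$ must cover every value from 1 to $p - 1$.

```python
def is_primitive_root(a, p):
    """a is a primitive root of prime p if its powers hit every value 1..p-1."""
    powers = {pow(a, k, p) for k in range(1, p)}
    return powers == set(range(1, p))

print(is_primitive_root(2, 7))  # → False (powers cycle through just 1, 2, 4)
print(is_primitive_root(3, 7))  # → True  (powers cover all of 1..6)
```

This matches the hand calculations above: 2 repeats too early, while 3 covers the full range.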
$$a^b = c \; mod \; n$$
For some numbers, such an equation can be written differently as:
$$log_a c = b \; mod \; n$$
Logarithms are the inverse of exponents; we've just inverted the equation here.
Now it's a well defined function, we can say in discrete terms that $log_3 5 = 5 \; (mod \; 7)$ (looking at the table above).
If you use a non-primitive root, reversing becomes easier, as we have a smaller number of outcomes (because the powers repeat earlier), as seen below.
$$2^0 = 1 (mod \; 7) = 1$$
$$2^1 = 2 ( mod \; 7) = 2$$
$$2^2 = 4 ( mod \; 7) = 4$$
$$2^3 = 8 ( mod \; 7) = 1$$
By using a primitive root, we get a much larger outcome which makes it harder.
$$3^0 = 1 ( mod \; 7) = 1$$
$$3^1 = 3 (mod \; 7) = 3$$
$$3^2 = 9 ( mod \; 7) = 2$$
$$3^3 = 27 ( mod \; 7) = 6$$
$$3^4 = 81 ( mod \; 7) = 4$$
$$3^5 = 243 ( mod \; 7) = 5$$
$$3^6 = 729 \; (mod \; 7) = 1$$
It is relatively easy to calculate exponentials modulo a prime: that is, given $a$, $i$ and $p$, calculate $a^i \; mod \; p$.
Exponentiation is a cheap operation: you can do it even for very large numbers, while the logarithm is a much more difficult function to calculate for large numbers.
To calculate exponentiation, you are given an exponent and you respond with the answer. That's exponentiation, going from left to right:
$$3^0 = 1$$
$$3^1 = 3$$
$$3^2 = 2$$
$$3^3 = 6$$
$$3^4 = 4$$
$$3^5 = 5$$
Logarithm is how to go back, from right to left. Logarithms are much harder than exponentiation.
Let's go back to seeing how Diffie-Hellman worked, but this time with a lot more knowledge of how the mathematics works.
We have 2 people, Alice and Bob. They have to agree in advance on some prime number q (publicly known) and its primitive root $\alpha$ (publicly known).
We randomly select the prime number 7 with the primitive root 3.
Let Alice and Bob wish to exchange a key, then they do the following:
$$3^0 = 1$$
$$3^1 = 3$$
$$3^2 = 2$$
$$3^3 = 6$$
$$3^4 = 4$$
$$3^5 = 5$$
and they each choose one of the exponents, randomly, and keep it secret. Now Alice computes $y_a = \alpha^{x_a} \; mod \; q$ and sends it to Bob.
This example isn't very impressive (results like $3^5 = 5$ even give the exponent away), but with much larger numbers everything changes. The idea is close to RSA encryption, though not quite the same: this is key exchange, not encryption.
Bob then does the same as Alice. Both Alice and Bob are now capable of calculating the shared key.
Alice calculates $k = (y_b)^{x_a} \; mod \; q$
Bob calculates $k = (y_a)^{x_b} \; mod \; q$
Now they have the same numbers, k is the common secret key.
$$(\alpha ^ {x_b})^{x_a} = (\alpha ^ {x_a})^{x_b}$$
This equation above is what makes it all work. Alice computes one side of the equation, Bob computes the other, and they end up with the same number.
This really is the equation that puts it all together. Most of this blog post led up to this equation.
$x_a$ and $x_b$ are secret, and without these numbers there is no easy way to repeat these computations, because to do so you need to know the secrets.
The above formula shows that the two methods are exactly equal. If you do the left equation, you get the same result as the right equation.
Diffie-Hellman-Merkle is a fascinating way of sharing a secret over an unsecured communications medium, by not sharing it at all over that medium.
If you want to learn about the different attacks that have been attempted on Diffie-Hellman and one attack which might be a very big problem, sign up to my email list below and I'll send you the PDF 😁✨
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time. This blog is my full time job, so any and all donations are appreciated
Public key cryptography seems magical to everyone, even those who understand it. In this post, I'm going to explain public key cryptography. Public Key Cryptography is based on asymmetric cryptography, so first let us talk about symmetric cryptography.
Your front door is usually locked by a key. This key unlocks & locks your front door. With symmetric cryptography, you have one key which you use to unlock and lock things.
Only people with the key or a copy of the key can unlock the door. Now, imagine you're on holiday in Bali. You want to invite your friend around to look after your cat 😺 while you're on the beautiful beaches 🏖️.
Before the holiday, you give your friend the key to your door. Your friend is then robbed, so someone else has your front door key now. Or your friend leaves it laying around and someone clones it. The problem with symmetric key cryptography is that this one key is easy to clone, it's easy to attack your house in many different ways.
Let's take this from an analogy to a real-life example of symmetric cryptography.
Julius Caesar used a cipher to send messages that no one else could read other than the intended recipient. Mainly because no one could read back in 100 BC, and those that could wouldn't understand a random string of letters. That's the whole point of cryptography: to create ways to communicate without third parties listening in. This cipher is Caesar's Cipher. Given an alphabet and a key (the key is an integer between 1 and 25), shift all of the alphabet letters by key.
With a shift of 3, as seen in the image above, A becomes D, B becomes E and so on until it wraps around with X = A. The original message is called the plaintext and the encrypted message is called the ciphertext.
The easiest way to perform Caesar's Cipher is to turn all of the letters into numbers, a = 1, b = 2, c = 3 and so on.
To encrypt, E, you calculate this for every letter (where s is the shift):
$$ E_{s}(letter) = (letter + shift) mod \; 26$$
This is called a function. You put an input into it, and an output comes out. A lot of functions are known as two-way functions. You can encrypt by using the function above, and it makes sense that to decrypt you just do the opposite. Given a function that doubles a number, if you have a doubled number and you want to reverse the function do the opposite of multiplying by 2, divide the number by 2.
mod is the modulus operator. It's the remainder of dividing. We take the modulus because there isn't a 27th letter in the alphabet; you just wrap around from "z" back to "a". We'll talk more about modular arithmetic later on in this article. Look at this small example below:
$$ 4 \; mod \; 3 = 1 $$
Because 4 divided by 3 has a remainder of 1.
To decrypt Caesar's cipher, D, you calculate this for every letter:
$$ D_{s}(letter) = (letter - shift) \; mod \; 26$$
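Both formulas can be sketched together in a minimal Python version; shifting by a negative amount performs the decryption:

```python
def caesar(text, shift):
    """Shift each letter by `shift`, wrapping z back to a with mod 26."""
    result = ""
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            result += chr((ord(ch) - base + shift) % 26 + base)
        else:
            result += ch  # leave spaces and punctuation alone
    return result

ciphertext = caesar("hello", 3)
print(ciphertext)              # → "khoor"
print(caesar(ciphertext, -3))  # → "hello"
```

Decryption really is just the encryption function run in reverse, which is exactly what makes this a two-way function.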
As you can tell, it's not very secure. With 25 total shifts you just have to shift the text 25 times until you find the decrypted code, this is called a brute force attack. You take the encrypted text and shift it all 25 times until you find the decrypted text. But let's imagine for a second that this was a hard cipher - that brute force isn't feasible.
How do you tell your friend you're using a shift of 9, for example? You have to communicate it to them somehow. Any and all forms of communication can be listened in on - whether that's writing a letter or going to a hidden forest in Switzerland 30 miles from the nearest town and telling your friend.
The latter isn't very feasible, but it is a lot more secure than telling your friend in Times Square, New York 🗽 what the shift is.
Now, imagine you brought your lunch to work in a special lunchbox - the same you've had since nursery school. Someone steals your food and your lunchbox. You don't mind losing the food, but you do want the lunchbox back. You want a way for them to securely return your lunchbox without you knowing who took it - because that takes the pressure off of them.
You place a box in the staff room with a lock & key. You give copies of keys to everyone in the office and hope for the best - that someone will return the lunchbox by placing it in the box.
Unfortunately, the keys everyone has unlock the box as well as lock it. This means that someone could unlock the box and re-steal your lunchbox.
We need to find a way to get rid of this idea of sharing keys, get rid of the idea of 'any key can lock and unlock', and this is where asymmetric cryptography comes in.
You install an extraordinary lock on this box, one that has two separate keys. The first key 🔑 can only turn clockwise, from A (locked) to B (unlocked) to C (locked).
The second key 🗝️ can only turn anti-clockwise, from C to B to A. You pick the first key and keep it to yourself. This is called a private key. The second key is called the public key. This key is given out to everyone in the office. You want everyone to have this key.
When someone returns your prized lunchbox, they can leave it in this box. All the public keys can do is lock the box. Your private key is the only one that can open it.
This is public key cryptography. Everyone knows that if they put something in the box and lock it, only you can open it with your private key.
With symmetric cryptography, everyone could open your box if they had the key. Now, no one apart from you can open the box.
Public key cryptography was first formulated by Whitfield Diffie and Martin Hellman, or by James Ellis (Ellis discovered it first, but he didn't publish; Diffie and Hellman published first). Both Ellis and Diffie saw that public key cryptography could work in theory, but neither managed to figure out how it would work in practice.
The production of a working Public Key Encryption system is attributed to Rivest–Shamir–Adleman (RSA) or Clifford Cocks. Like above, Cocks discovered first, but he didn't publish it. Rivest–Shamir–Adleman published first.
Let's look at how this works mathematically.
While the box analogy was something physical, we're going to go back to encrypting messages much like we did with Caesar's Cipher.
$$ K^-_b(K^+_b(m)) = m$$
When you apply the public key ($K^+$) to the message, and then the private key ($K^-$) to the result, you get the plaintext message back. We are also looking for these attributes:
It is computationally easy to:
But it is also computationally infeasible to:
We want to turn a message into numbers. Previously we assigned a number to each letter, A = 1 and so on. The American Standard Code for Information Interchange (ASCII) is a table of all English letters and most symbols along with their associated ASCII code & Binary output.
When you press a key on the keyboard, the keyboard converts this to ASCII, as numbers are easier than letters for a computer to work with. If you want to learn more about ASCII, check out this video.
Let's encrypt the word "cats". In binary, according to ASCII, this is:
$$c = 01100011$$
$$a = 01100001$$
$$t=01110100$$
$$s = 01110011$$
If you add them all together and convert to base 10, you get 4430123. For our example, we're going to look at how Rivest–Shamir–Adleman (RSA), a public key cipher, calculates public & private keys. We're also going to use much smaller numbers, so the maths isn't as hard to read.
Below is a calculator I created for turning ASCII into Binary.
# https://stackoverflow.com/a/40949538
def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

# edit the text below to see how this works
text = 'cats'
bits = string2bits(text)
for x in bits:
    print(x)
1. Choose 2 prime numbers, p and q. Prime numbers are numbers that only have 2 factors: 1 and themselves. We're going to pick 5 & 7, not large prime numbers but small for brevity.
2. Compute n = pq, z = (p-1)(q-1)
$$n = 5 * 7 = 35$$
$$z = (5 - 1)(7-1) = 4 * 6 = 24$$
3. Choose e (with e < z) such that e has no common factors with z.
$$ e = 5$$
5 has no common factors with 24, and it is smaller than 24.
4. Choose d such that $ed - 1$ is exactly divisible by z.
The easiest way to do this would be to loop over all possible values of d in code. The code below is written in Python, but the language and paradigm don't matter.
Since we're using such small numbers, we have overlap. Both e and d are 5. Let's set d to 29, just so we don't have this overlap.
$$ d = 29 $$
5. The public key is (n, e). The private key is (n, d)
$$key_{public} = (35, 5)$$
$$key_{private} = (35, 29)$$
Below is code to generate RSA keys. With p = 5 and q = 7 the first candidate for d would equal e, so the code skips that overlap, as discussed above.

from math import gcd

def rsa(p, q):
    n = p * q
    z = (p - 1) * (q - 1)
    # choose e such that e < z and e has no common factors with z
    e = next(i for i in range(2, z) if gcd(i, z) == 1)
    # choose d such that e*d - 1 is exactly divisible by z,
    # skipping d == e to avoid the overlap discussed above
    d = next(x for x in range(2, 200) if (x * e - 1) % z == 0 and x != e)
    return {"Public key": [n, e], "Private key": [n, d]}

# change p and q here to any prime numbers of your choice
p = 5
q = 7
print(rsa(p, q))
To send an encrypted message, Bob computes $C = m^e \; mod \; n$ for message m and key e. To decrypt the message, Alice computes $m = c^d \; mod \; n$.
Because the message must be smaller than $n$, our tiny $n = 35$ can't encrypt the whole of "cats" (4430123) at once; real RSA uses a far larger $n$. As a small example, encrypting $m = 2$ gives $c = 2^5 \; mod \; 35 = 32$, and decrypting gives $32^{29} \; mod \; 35 = 2$, the original message.
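The encryption and decryption steps can be sketched with the keys generated above ($n = 35$, $e = 5$, $d = 29$), remembering the message must be smaller than $n$:

```python
n, e = 35, 5    # Alice's public key (n, e)
d = 29          # Alice's private key exponent

m = 2                # the message, which must be smaller than n
c = pow(m, e, n)     # Bob encrypts: 2^5 mod 35
print(c)             # → 32
print(pow(c, d, n))  # → 2, the original message back
```

Python's three-argument `pow` does the modular exponentiation efficiently, which matters once n is hundreds of digits long.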
Prime factorisation. It's easy to multiply two prime numbers together, but it's incredibly hard to find out what prime numbers were used to make that number. You can easily multiply these two together:
$$ 23,719 * 41,843 = 992,474,117$$
But if I gave you 992,474,117 and told you to find the prime numbers that were used to make this number, it's not computationally feasible. Even more so when you realise the prime numbers used are very, very large.
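You can feel this asymmetry in a sketch: multiplying is one operation, while naive trial division has to grind through thousands of candidates (and for the key sizes used in practice, more than any computer could ever try):

```python
def factorise(n):
    """Trial division: test every candidate factor up to sqrt(n)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    return n, 1  # n itself is prime

# One multiplication versus ~24,000 trial divisions
print(23719 * 41843)         # → 992474117
print(factorise(992474117))  # → (23719, 41843)
```

This gap between multiplying and factoring only widens as the primes grow, which is the whole basis of RSA's security.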
This is known as a trap-door function or a one-way function. While it is easy to go through one way, it is computationally infeasible to go the other way. Boiling an egg is a one-way function: it is easy to boil an egg, but not possible to un-boil it. Let's go deeper into the mathematics and explore modular arithmetic.
Imagine a finite range of numbers, for example, 1 to 12. These numbers are arranged in a circle, much like a clock (modular arithmetic is sometimes called clock arithmetic because of this)
Count 13 around this clock. You get to 12 and then you need to count 1 more - so you go back to 1. Modular arithmetic is still defined as the remainder of division, however it can also be defined (and is more commonly defined) as a clock.
Functions using modular arithmetic tend to perform erratically, which in turn sometimes makes them one-way functions. Let's see this with an example by taking a regular function and seeing how it works when it becomes a modular arithmetic function.
$$3x$$
When we insert 2 into this function, we get $3 \times 2 = 6$. Insert 3 and we get $3 \times 3 = 9$.
This function is easy to reverse. If we're given 9, we can tell that the function had an input of 3, because $3 \times 3 = 9$.
However, take a similar function, $3^x$, with modular arithmetic added. It no longer behaves sensibly.
Imagine we had this formula:
$$ 3^x \; mod \; 7 = 1$$
How would you find out what x is? You can't put the mod on the other side, because there isn't really an inverse of modular arithmetic. What about guessing? Let's input 5:
$$ 3^5 mod \; 7 = 5$$
Okay, that was too big. You might want to go lower, maybe 4 or 3 but actually this is the wrong direction. When x is 6, it is equal to 1.
In normal arithmetic, we can test numbers and get a feel for whether we are getting warmer or colder, but this isn't the case with modular arithmetic.
Often the easiest way to reverse modular arithmetic is to compile a table for all values of x until the right answer is found. Although this may work for smaller numbers, it is computationally infeasible to do for much larger numbers. This is often why modular arithmetic is known as a one-way function.
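The table approach can be sketched in a few lines of Python: try every exponent until the target residue appears. This works fine for a modulus of 7, but is hopeless for the hundreds-of-digits moduli used in practice.

```python
def brute_force_dlog(base: int, target: int, modulus: int) -> int:
    """Tabulate base^x mod modulus for x = 1, 2, ... until we hit target."""
    for x in range(1, modulus):
        if pow(base, x, modulus) == target:
            return x
    raise ValueError("no exponent found")

print(brute_force_dlog(3, 1, 7))  # prints 6, since 3^6 mod 7 = 1
```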
If I gave you a number such as 5787 and told you to find the input $x$ that produced it, it would be infeasible. In fact, even if I gave you the ability to input any number into the function, it would still be hard. It took me a mere few seconds to make this function, but it could take you hours or maybe even days to work out what $x$ is.
RSA is a one-way function. While it is relatively easy to carry out, it is computationally infeasible to reverse it and find out what the keys are, unless you know certain numbers.
Recall earlier where I detailed how the RSA algorithm worked. There was one number, $n$. This $n$ is special because, under some circumstances, $n$ can make this one-way function reversible.
$n$ is the product of 2 prime numbers. As we saw earlier, if we take $5$ and $7$ and multiply them together, we get:
$$n = 35$$
In order for Bob to send Alice a message, he encrypts the message using Alice's public key. Now that the message is encrypted, there has to be some way for Alice to decrypt it, and only for Alice to decrypt it. You can't have Eve or Niamh or Hannah reversing it, because that defeats the point of encrypting it.
RSA is designed so the person who knows P and Q (the two prime numbers that are multiplied together to give N) can decrypt the message.
Although Alice has told the world her public key is $n = 35$, no one apart from Alice knows that $P = 7, Q = 5$. Note that the prime numbers are intentionally small for brevity.
You may be thinking "it's easy to guess that 35's prime factors are 5 and 7" and you would be right. But, with large enough numbers it is virtually impossible to find $p$ and $q$.
In fact, with large enough numbers, multiplying $p$ and $q$ is essentially a one-way function. It is possible that in the future, perhaps the near future (with the advent of quantum computers), factoring numbers becomes easy. Mathematicians have tried and failed for thousands of years to find an efficient way to factor numbers, so for now it is considered secure.
Okay, let's look at how modulus works in all of this. You understand why multiplication works, and how modulus works. But what about the other equations? What are they for?
Let's demonstrate the deciphering algorithm using an identity due to Euler and Fermat:
For any integer $M$ that is relatively prime to $n$:
$$M^{\phi(n)} \equiv 1 \pmod{n}$$
Here, $\phi(n)$ is the Euler totient function, giving the number of positive integers less than $n$ which are relatively prime to $n$. Relatively prime means that 2 numbers share only the factor $1$ with each other. Nowadays we use Carmichael's function instead of Euler's, as Euler's function can sometimes produce numbers too large to work with. However, we're using Euler's totient function because it is what the original RSA paper used.
This sounds confusing, but let's break it down. By elementary properties of the totient function:
$$\phi (n) = \phi (p) * \phi (q)$$
$$ = (p - 1) * (q - 1)$$
$$ = n - (p + q) + 1$$
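With the toy primes used in this article, $p = 5$ and $q = 7$, all three forms agree:

$$\phi(35) = (5 - 1)(7 - 1) = 24 = 35 - (5 + 7) + 1$$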
Since $d$ is relatively prime to $\phi(n)$, it has a multiplicative inverse $e$ in the ring of integers modulo $\phi(n)$. What this means is that the formula we used for RSA can be reversed (the trap door can be opened) given some knowledge of the numbers used.
Without this special mathematical property it wouldn't be possible to reverse the encryption and recover the plaintext, even if you know some of the numbers used.
The modular multiplicative inverse of the encryption algorithm $c = m^e \bmod n$ is $m = c^d \bmod n$. All of this maths has built up to this. Modular arithmetic and one-way functions are heavily involved here. In order to encrypt, you calculate $c$. In order to decrypt, you calculate $m$. Both of these require knowledge of $n$, which is the special number we talked about earlier.
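Here is a minimal sketch of the whole scheme in Python, using this article's toy primes $p = 5$, $q = 7$. Real RSA uses primes hundreds of digits long, plus padding schemes; this only shows the arithmetic:

```python
p, q = 5, 7
n = p * q                 # public modulus: 35
phi = (p - 1) * (q - 1)   # Euler's totient: 24
e = 5                     # public exponent, chosen coprime to phi
d = pow(e, -1, phi)       # private exponent: the inverse of e mod phi

def encrypt(m: int) -> int:
    return pow(m, e, n)   # c = m^e mod n

def decrypt(c: int) -> int:
    return pow(c, d, n)   # m = c^d mod n

c = encrypt(2)
print(c, decrypt(c))      # the round trip recovers the original 2
```

Anyone can run `encrypt` knowing only $n$ and $e$; computing $d$ requires $\phi(n)$, which requires knowing $p$ and $q$.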
If you want to learn more about the maths of RSA, I highly recommend the readable, original RSA paper.
How do you prove that a message sent by Bob was actually sent by Bob, and not sent by Eve? You need a way to authenticate them. In the real world, we authenticate using signatures. Although these can be forged, you can authenticate using a biometric scanner, but your fingerprints can be lifted and copied.
You can use a passcode, but much like how the Caesar cipher and its single key is weak, authentication methods that use a single key aren't perfect either.
$$ K^-_b(K^+_b(m)) = m = K^+_b(K^-_b(m)) $$
Let's say Bob wants to prove to Alice that Bob wrote the message he sent her. Bob sends his original message with an encrypted version of the message with his private key ($k^-$). Alice uses Bob's public key($k^+$) which, using the formula above, turns the encrypted message back into the normal message. Then Alice checks the message Bob sent with the message she got from the encrypted message. If they match, she can be sure that someone with Bob's private key (probably Bob) sent it.
This method sucks for encrypting, because if Bob encrypts his message with his private key, anyone with his public key can read it. It's also computationally expensive to encrypt a whole message just to prove that Bob sent it. This is why we create a digest of the message and encrypt that instead to verify Bob. This digest of a message is produced by a hash function.
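A minimal sketch of why signing a digest is cheaper: whatever the size of the message, the digest is a fixed 32 bytes. I'm using SHA-256 from Python's standard `hashlib` here; the choice of hash is mine, not the article's:

```python
import hashlib

short_message = b"4 pepperoni pizzas please"
long_message = b"x" * 10_000_000  # a 10 MB email

for m in (short_message, long_message):
    digest = hashlib.sha256(m).digest()
    print(len(m), "->", len(digest))  # both digests are 32 bytes
```

Signing the 32-byte digest costs the same no matter how big the email is.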
To learn more about hash functions, I wrote a sister article which explains them here.
By encrypting the hash of the message we speed up the process of encrypting it, which makes authentication a lot faster. Now, let's play a prank on Bob.
We create an e-mail order to a pizza shop asking for 4 pepperoni pizzas. We sign this email with our private key. We send the pizza store our public key, but we tell them that Bob's phone is dead and that our public key is actually Bob's public key.
The pizza store verifies the signature and sends 4 pepperoni pizzas 🍕 to Bob. The worst part is, Bob doesn't even like pepperoni. This is where a certification authority comes into play.
Certificate authorities (CA) bind a public key to a specific entity. This entity provides proof of identity to the CA, the CA then creates a certificate binding the entity to its public key. The idea is to take the trust out of trusting an individual for public keys. You still have to trust an organisation, but many people find trusting an organisation is better than trusting an individual.
The certificate containing the entity's public key is digitally signed by the CA. This signing is the CA saying "this is the entity's public key".
When Alice wants Bob's public key, she gets Bob's certificate. She then applies the CA's public key to Bob's certificate to get Bob's public key.
Cloudflare has an amazing article on certificate authorities here.
Phil Zimmermann invented Pretty Good Privacy (PGP), the de facto standard for email encryption. Zimmermann used RSA in PGP. RSA was patented, and he did not have permission from RSA Inc (the company that held the patent) to publish another cipher using RSA.
Zimmermann was also the target of a 3-year U.S. federal investigation, because at the time cryptography programs were considered munitions under U.S. law. When asked whether all the trouble was worth it to publish PGP, he said he had "no regrets". Let's look at how this once-illegal algorithm works.
When Alice wants to send a confidential email to Bob, she creates a fresh symmetric key and encrypts the email with it, encrypts that symmetric key with Bob's public key, and signs the email with her own private key.
In total, Alice uses three keys. Her private key, Bob's public key, and the newly created symmetric key.
This idea of encrypting a symmetric key with a public key is called a Hybrid Cryptosystem. Some email messages can be incredibly large, encrypting these with a public key system would take a very long time.
The solution: use a symmetric key system such as AES, which is incredibly hard to break, and encrypt the AES key (and only the key, not the whole email) with the public key system. This way, the receiver can apply their private key to recover the AES symmetric key and use it to decrypt the email.
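Here's a toy sketch of that shape, reusing this article's tiny RSA numbers and a keyed XOR stream as a stand-in for AES. Nothing here is secure; it only shows how a hybrid cryptosystem fits together:

```python
import random

# Toy parameters -- real systems use 2048-bit RSA and real AES.
P, Q = 5, 7
N = P * Q                      # public modulus
E = 5                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent

def xor_stream(key: int, data: bytes) -> bytes:
    """Stand-in for AES: XOR the data with a keyed pseudo-random stream."""
    rng = random.Random(key)
    return bytes(b ^ rng.randrange(256) for b in data)

def send(email: bytes):
    session_key = random.randrange(2, N)          # fresh symmetric key
    ciphertext = xor_stream(session_key, email)   # bulk-encrypt the email
    wrapped_key = pow(session_key, E, N)          # RSA-encrypt only the key
    return ciphertext, wrapped_key

def receive(ciphertext: bytes, wrapped_key: int) -> bytes:
    session_key = pow(wrapped_key, D, N)          # recover the symmetric key
    return xor_stream(session_key, ciphertext)
```

Only the short session key ever goes through the expensive public-key operation; the megabytes of email go through the fast symmetric cipher.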
Not many people use PGP, because of how difficult it is to set up. At minimum, you need to download a program you trust to correctly implement PGP. In 2018 it was shown that email clients such as Apple Mail, Thunderbird, and Outlook, which have settings to enable PGP, can be forced to show the non-encrypted versions.
Not to mention how suspicious it looks for one person to send encrypted emails on a network of non-encrypted emails. The only email client (and address provider) which enables PGP by default is ProtonMail, but even then it's only for Proton-to-Proton emails and you have to trust the company to implement it correctly.
Most of them do a good job, but we understand your point. We built ProtonMail to make PGP encryption accessible to non-technical people. We will make sure this goal is reached 100%. ;) Thanks again!
— ProtonMail (@ProtonMail) January 21, 2018
Cryptography has been used for thousands of years, almost as long as mankind has held secrets. In our constant effort to keep our secrets secret to everyone apart from a select few, we've found this magical algorithm that works pretty well. No doubt, in 300 or 400 years it will have been broken, much like how Caesar thought his cipher would never be broken.
Hey 👋 Want to subscribe to my blog and stay up to date with posts similar to this one? Subscribe to my email list below. I won't spam you. I will only send you posts similar to this one 😊✨
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time. This blog is my full time job, so any and all donations are appreciated
A hash function takes a message, m, and returns a pseudo-random string of letters/numbers which should be unique to that message. Let's say the hash function returns "aBc67D" for the message "I love dogs". This function should not return the same "aBc67D" for "Donuts are cool".
Hashing algorithms have 3 requirements: they must be fast to compute, a small change to the input must produce a drastically different output, and collisions (two messages hashing to the same value) must be extremely hard to find.
For the first point, the algorithms have to be fast. Hashing algorithms are often used inside much slower algorithms (such as RSA) to speed them up. As a side note, most hashing algorithms are designed to be one-way functions.
The algorithm has to be fast, but being too cheap to compute makes collisions easier for attackers to find. Hashing algorithms also need to be resistant to "pre-image attacks". That is, given a hash, it should be extremely difficult to retrace the deterministic steps taken and reproduce the value that created the hash (i.e. the pre-image).
For the second point, if using SHA-1 and we input:
abc = a9993e364706816aba3e25717850c26c9cd0d89d
An important feature we want is that if we were to change the original string even slightly by say one letter the entire outputted hash has to change. If we input "abd" we get:
abd = cb4cc28df0fdbe0ecf9d9662e294b118092a5735
So the idea of randomness here is that the first hash does not look anything like the second hash despite us only changing 1 letter. In fact, with some code:
We find out that a and b are 0% similar (using the Hamming Distance). This is called the Avalanche effect. When the input changes slightly, the output changes significantly.
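We can check that claim with a short snippet, comparing the two digests character by character (a rough stand-in for the bitwise Hamming distance):

```python
a = "a9993e364706816aba3e25717850c26c9cd0d89d"  # hash of "abc"
b = "cb4cc28df0fdbe0ecf9d9662e294b118092a5735"  # hash of "abd"

matches = sum(x == y for x, y in zip(a, b))
print(f"{matches / len(a):.0%} similar")  # not a single position matches
```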
For the third point, this is the donut/dogs example above. Collisions can never be avoided entirely: there are infinitely many possible messages but only a fixed-length hash, so some messages must share a hash. But if collisions are easy to create artificially, other people can forge a message with the same hash as another file, and this is an issue, because it's a fake message.
Let's walk through how SHA-1, an old hashing algorithm, works in detail. Although all hashing algorithms work differently and SHA-1 isn't used in the real world anymore, seeing how SHA-1 works in detail will enable us to generalise how hashing algorithms work.
SHA-1 takes a bit string of length less than $2^{64}$ bits and produces a 160-bit hash value known as the message digest. Note: I will be using hexadecimal numbering for brevity. Hashing algorithms never take a message of size X and return a message of size X. They always 'compress' and create a digest of the message. Remember rule 1 for hashing algorithms from earlier? We want hashing algorithms to be fast. Producing large messages is not fast.
SHA-1 was published in 1993 by NIST, but was developed by the NSA (yes, that NSA). Not much is known about the history of SHA-1, so I apologise for not including history here.
Suppose we want to hash the message 'abc' using SHA-1. We start off by converting it into binary:
$$ abc = 01100001 01100010 01100011$$
SHA-1 starts with an internal state. Let's say our internal state of SHA-1 is:
$$H_0, H_1, H_2, H_3, H_4$$
The internal state is exactly as long as the hash it produces (160 bits), so each of the five words $H_0$ to $H_4$ is 160/5 = 32 bits, or 4 bytes. We split the state into words because it's easier to calculate with. We start by initialising these internal states with five strings of hex characters (in the real SHA-1 these are fixed constants defined by the standard; the values below are illustrative):
$$H_0 = 67DE2A01$$
$$H_1 = BB03E28C$$
$$H_2 = 011EF1DC$$
$$H_3 = 9293E9E2$$
$$H_4 = CDEF23A9$$
The message is then padded by appending a 1 bit, followed by enough 0 bits to bring the length to 448 bits (modulo 512). The original message length, represented as a 64-bit number, is then appended to the end, producing a padded message that is a multiple of 512 bits long; for 'abc', exactly 512 bits.
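A sketch of this padding step in Python (big-endian byte order, as SHA-1 specifies):

```python
import struct

def sha1_pad(message: bytes) -> bytes:
    bit_length = len(message) * 8
    padded = message + b"\x80"                     # a single 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded)) % 64)  # fill to 448 bits mod 512
    padded += struct.pack(">Q", bit_length)        # 64-bit big-endian length
    return padded

print(len(sha1_pad(b"abc")) * 8)  # prints 512: exactly one chunk
```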
The padded input obtained above, $M$, is then divided into 512-bit chunks, and each chunk is further divided into sixteen 32-bit words, $W_0, \ldots, W_{15}$. In the case of 'abc' there is only one chunk, as the original message is less than 512 bits total.
For each chunk, begin the 80 iterations, $i$, necessary for hashing (80 is the determined number for SHA-1), and execute the following steps on each chunk, $M_n$:
For iterations 16 through to 79, where $16 \le i \le 79$, perform the following operation:
$$W(i) = S^1(W(i-3) \oplus W(i-8) \oplus W(i-14) \oplus W(i-16))$$
The symbol $\oplus $ is XOR which is exclusive or. The resultant is True (1) if and only if either the left hand side or right hand side is 1, but not both.
As an example, when $i$ is 16, the words chosen are W(13), W(8), W(2) and W(0), and the output is a new word, W(16), obtained by XORing these words together and rotating the result.
Now, the circular shift operation $S^n(X)$, which rotates the 32-bit word $X$ by $n$ bits ($n$ being an integer with $0 \le n < 32$), is defined by
$$S^n(X) = (X \ll n) \vee (X \gg (32 - n))$$
where $X \ll n$ is the left-shift operation, obtained by discarding the leftmost $n$ bits of $X$ and padding the result with zeroes on the right. If you don't know what the logic symbols mean, I've written an article explaining them here.
$X \gg (32 - n)$ is the right-shift operation, obtained by discarding the rightmost $32 - n$ bits of $X$ and padding the result with zeroes on the left. Thus $S^n(X)$ is equivalent to a circular shift of $X$ by $n$ positions, and in this case the circular left-shift is used.
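The definition translates directly into Python; the `& 0xFFFFFFFF` mask keeps the result within 32 bits, an assumption the formula makes implicitly:

```python
def rol32(x: int, n: int) -> int:
    """Circular left-shift of a 32-bit word x by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

print(hex(rol32(0x80000000, 1)))  # the top bit wraps around: 0x1
```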
A circular left shift $S^1(W(i))$, where $W(i)$ is $10010$, would produce $00101$, as the leftmost bit, 1, wraps around to the right side of the string. Therefore $W(16)$ ends up being:
$$ W(16) = 11000010 11000100 11000111 00000000$$
Now store the hash values defined in step 1 in the following variables:
$$A = H_0$$
$$B = H_1$$
$$C = H_2$$
$$D = H_3$$
$$E = H_4$$
For 80 iterations, where $0 \le i \le 79$, compute:
$$TEMP = S^5(A) + f(i; B, C, D) + E + W(i) + K(i)$$
Here we use a function, $f$, which takes in our state values and a bit of the message and turns them into a new set of values.
A sequence of functions are used in SHA-1 depending on the value of $i$ and on three 32-bit words, B, C, and D, in order to produce a 32-bit output.
The following equations describe these logical functions.
$f(i; B, C, D) = (B \land C) \vee ((\lnot B) \land D)$ for $0 \le i \le 19$
$f(i; B, C, D) = B \oplus C \oplus D$ for $20 \le i \le 39$
$f(i; B, C, D) = (B \land C) \vee (B \land D) \vee (C \land D)$ for $40 \le i \le 59$
$f(i; B, C, D) = B \oplus C \oplus D$ for $60 \le i \le 79$
A sequence of constant words $K(0), K(1), \ldots, K(79)$ is used in SHA-1. In hex, these are:
$K(i) = 5A827999$ for $0 \le i \le 19$
$K(i) = 6ED9EBA1$ for $20 \le i \le 39$
$K(i) = 8F1BBCDC$ for $40 \le i \le 59$
$K(i) = CA62C1D6$ for $60 \le i \le 79$
Reassign the following values:
$$ E = D$$
$$D = C$$
$$C = S^{30}(B)$$
$$B = A$$
$$ A = TEMP$$
Store the result of the chunk's hash to the overall hash value of all chunks as shown below, and proceed to execute the next chunk:
$$H_0 = H_0 + A$$
$$H_1 = H_1 + B$$
$$H_2 = H_2 + C$$
$$H_3 = H_3 + D$$
$$H_4 = H_4 + E$$
As a final step, when all chunks have been processed, the message digest is the 160-bit string formed by shifting the 5 hash values into position and combining them with OR:
$$H = S^{128}(H_0) \vee S^{96}(H_1) \vee S^{64}(H_2) \vee S^{32}(H_3) \vee H_4$$
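Putting every step together gives a compact, purely educational SHA-1 in Python (do not use it in production). Unlike the walkthrough above, it uses the real initial constants from the standard:

```python
import struct

def rol32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def sha1(message: bytes) -> bytes:
    # Real SHA-1 initial state (the walkthrough used illustrative values).
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    # Padding: a 1 bit, zeros to 448 mod 512, then the 64-bit length.
    bit_length = len(message) * 8
    message += b"\x80"
    message += b"\x00" * ((56 - len(message)) % 64)
    message += struct.pack(">Q", bit_length)
    for chunk_start in range(0, len(message), 64):
        # Sixteen 32-bit words, expanded to eighty.
        w = list(struct.unpack(">16I", message[chunk_start:chunk_start + 64]))
        for i in range(16, 80):
            w.append(rol32(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1))
        a, b, c, d, e = h
        for i in range(80):
            if i < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif i < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif i < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            temp = (rol32(a, 5) + f + e + w[i] + k) & 0xFFFFFFFF
            a, b, c, d, e = temp, a, rol32(b, 30), c, d
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, [a, b, c, d, e])]
    return b"".join(struct.pack(">I", x) for x in h)

print(sha1(b"abc").hex())  # a9993e364706816aba3e25717850c26c9cd0d89d
```

Note that it reproduces the well-known digest of "abc" from earlier in the article.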
SHA-1 was broken quite a few years ago. By broken, I mean that it's possible to artificially create collisions. That is, given a document with an SHA-1 hash of "cb4cc28df0fdbe0ecf9d9662e294b118092a5735" it is possible to produce a different document that has the same hash.
This article used SHA-1 because it's relatively easy to explain compared to unbroken hash functions such as SHA-256. If you want to learn more about why you shouldn't use SHA-1, read this article.
As a side note, do not try to implement SHA-1 yourself. Remember Schneier's law:
"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."
While your implementation of SHA-1 may appear to work, it may have some underlying issue that you didn't spot. It's always best to use implementations created and monitored by the community (open source).
MD5 was a pretty popular hashing algorithm which produced 128-bit outputs. It is susceptible to a birthday attack.
Birthday attacks are formulated on the birthday problem. If there are 23 people in a room, there is a 50% chance two of them will share the same birthday. If there are 70 people in a room, this is a 99.9% chance 2 people share the same birthday.
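The probability is easy to compute: multiply together the chances that each successive person misses every birthday taken so far, then subtract from 1. A quick check of the figures above:

```python
def p_shared_birthday(people: int) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(f"{p_shared_birthday(23):.1%}")  # about 50.7%
print(f"{p_shared_birthday(70):.2%}")  # about 99.92%
```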
This comes from what is called the pigeonhole principle. If you have 9 pigeonholes (boxes to put pigeons in) and 10 pigeons, 2 pigeons will have to share the same hole.
In fact, MD5's collision resistance is so weak that an ordinary household 2.4GHz Pentium processor can compute artificial hash collisions within seconds. Moreover, its extensive usage in the earlier days of the web has left tons of leaked MD5 pre-images online that can be found with a simple Google search of their hash.
The 3 rules of hashing algorithms are: they must be fast, a small change in the input must completely change the output, and collisions must be extremely hard to find.
As we saw with SHA-1, a hashing algorithm uses a compression function to turn a message into a much smaller, hashed version. SHA-1 isn't used in the real world anymore, and it's not secure in the sense that collisions can be created, although it's a nice historical example of how hashing algorithms work.
For more on SHA-1, check out the official document describing it here. If you're interested in which hash functions are used now (especially in cryptocurrencies), as well as the attacks on them and their differences, this article is very good.
5/5 ⭐
Buy the book here (this is an affiliate link. I'll get some of the money you pay for the book at no extra cost to you.)
Atomic Habits builds on the sensational work of Charles Duhigg's The Power of Habit, applying newer research as well as the work of many others in this short book on habits. Note: all words past this paragraph are taken directly from the book. This is more of a summary than a review.
Fundamentally, a habit is something that you repeatedly do. Habits are not bad or good. All habits are created to make you happy, to get that dopamine hit. Because of this, every habit provides some reward, even the ones we label bad.
You run every day because this habit provides you with good health & dopamine. You binge eat every night because this habit provides you with dopamine. Habits aren't evil, or good. They're neutral.
Atomic. Small or minute. Atomic habits. Small habits. These small habits snowball into something much, much larger.
Imagine you are flying an aeroplane from Los Angeles to New York. If you shift your heading just 3.5 degrees south on takeoff from LAX, you land in Washington D.C. instead. Such a small change is barely noticeable at takeoff, but magnified across the entire United States you end up hundreds of miles off course.
It doesn't matter how successful you are right now. What matters is whether your habits are putting you on the right trajectory. You should be far more concerned with the trajectory than your current results.
Successful people and non-successful people set the same goals. So, why do successful people achieve them? Their systems. Their habits.
You don't rise to the level of your goals. You fall to the levels of your habits.
Goals restrict your happiness. Once you achieve them, that's it. What next?
You are what you identify as. Imagine 2 people resisting a cigarette. When offered one, the first person says "No thanks, I'm trying to quit". This person still believes they are a smoker who is trying to be someone else. They are hoping their behaviour will change while carrying around the same beliefs. The second person declines, saying "No thanks, I'm not a smoker." Small change, but this signifies a shift in identity.
The ultimate form of intrinsic motivation is when a habit becomes part of your identity. It's one thing to say I'm the type of person who wants this, it's another to say I'm the type of person who is this.
The goal isn't to run. It's to become a runner. The goal isn't to read books. It's to become a reader.
When you've believed a story for so long, it is easy to slide into these mental grooves and accept them as fact. In time, you begin to resist certain actions because "that's not who I am".
Your habits embody your personality. You are your habits. Every action you take is a vote for the person you wish to become. Every fruit or veg you eat is a vote for a healthier you.
If you keep casting the same votes you've always cast, you'll never change.
Decide the type of person you want to be. Prove it to yourself with small wins.
The process of building a habit can be divided into four simple steps: cue, craving, response, and reward.
If the behaviour is insufficient in any of these 4 steps, it will not become a habit.
We can create habits with these rules: make it obvious, make it attractive, make it easy, and make it satisfying.
In order to break a habit, invert these rules: make it invisible, make it unattractive, make it difficult, and make it unsatisfying.
Until you make the unconscious conscious, it will direct your life and you will call it fate
Pointing and calling is a safety system designed to reduce mistakes. You point at something and say it out loud. The MTA subway system in New York City adopted a silent version of this and within 2 years of implementation, incidents of incorrectly berthed subways fell 57 percent.
Plans are what make the world go round. In a study, participants were split into 3 groups and asked to exercise. In the first group, which made no plans, 35% of people exercised. In the second group, where people made plans but with no definitive date, 38% exercised. In the third group, where people made plans with dates and times, 98% exercised.
One of the best ways to build a new habit is to identify a behaviour you do each day and build on top of that. You brush your teeth every day? Immediately after brushing your teeth meditate for 1 minute.
This allows you to take advantage of natural momentum that comes from one behaviour leading into the next.
Effectively, you can stack habits on top of each other in this way. Habit stacking is powerful.
The easiest way to stop a habit is to cut it off at the source.
Can't seem to get any work done? Leave your phone in another room. If you're continually feeling like you're not enough, stop following social media. If you're wasting too much time watching Netflix, log out on your work computer. Make it as hard as possible to log back in.
The anticipation of a habit releases just as much dopamine as performing the habit does. Your brain has far more neural circuitry allocated for wanting rewards than for liking the reward.
We can use this fact to our advantage with something called temptation bundling. You link an action you want to do with an action you need to do. As an example, you can watch Netflix (the thing you want to do) only while riding a stationary bike (the thing you need to do).
We often pick up habits from the people around us. We copy the way our parents put on socks (left foot first or right foot first), the way our friends flirt, the way our coworkers get results. When your friends smoke, you're more likely to smoke.
The closer we are to someone, the more likely we are to copy their habits.
The chances of becoming obese increase 57% when we have a friend who is obese.
You need to join a culture where your desired behaviour is the normal behaviour and you already have something in common with the group.
Do you want a first class degree? Become friends with the people who get firsts effortlessly. Do you want to become fit? Become friends with people who work out all the time.
You can make hard habits seem more attractive if you learn to associate them with a positive experience. Sometimes, all you need is a slight mindset shift. For instance, we often talk about everything we have to do on a given day. Now, imagine changing just one word. You don't "have" to do it. You "get" to do it.
When starting to build a habit, don't worry about the quality. A university professor of photography decided to test out the idea of quality vs quantity.
He split his classroom into two. On the left side is the quantity group. They would be graded on the amount of work they produced.
On the right is the quality group, they would be graded on the excellence of their work.
To his surprise, those in the quantity group took better quality photos than those in the quality group.
One of the most common questions about habits is "how long does it take to build a habit?", when really the question ought to be "how many times does it take to form a new habit?". That is, how many repetitions are needed before the behaviour becomes automatic.
Imagine you're holding a garden hose that is bent in the middle. Some water can flow through, but not very much. If you want to increase the rate at which water passes through the hose, you have 2 options: force more water through, or remove the bend.
Trying to pump up your motivation with a hard habit is like trying to force more water through a bent hose.
If you look at the most habit-forming products (Facebook, Uber, Netflix) they all remove friction from your life. When building new habits it is important to make them as frictionless as possible.
If you need to do something such as go to the gym, try going to the gym for 5 minutes maximum. Do this until your habit forms where you only go for 5 minutes. After the habit has formed, you'll want to stay there longer than 5 minutes.
The key is to change the task so it requires more work to get out of the good habit than to perform the habit.
If you reverse this - the best way to break a bad habit is to make it impractical to do. Increase the friction until you don't even have the option to act.
Say, for instance, you want to lose weight. You can use smaller plates. Do you want to increase focus? Turn off notifications.
In modern society, many of the choices you make will not affect you immediately.
The brain's tendency to prioritise the present moment means you can't rely on good intentions. When you make a plan to lose weight, write a book or learn a language - you're making a plan for your future self.
However, when the moment of decision arrives, instant gratification usually wins.
People who are better at delaying gratification have higher university grades, lower levels of substance abuse, lower likelihood of obesity, better responses to stress and superior social skills.
At some point, success in nearly every field requires you to ignore an immediate reward in favour of a delayed reward.
The most effective form of motivation is progress.
You don't realise how valuable it is to show up on your bad (or busy) days. Lost days hurt you more than successful days help you. If you start with £100, a 50% increase will take you to £150. However, you only need a 33 percent loss to take you back to £100. In other words, avoiding a 33% loss is just as valuable as achieving a 50% gain. As Charlie Munger says:
The first rule of compounding: never interrupt it unnecessarily
Note: these links are affiliate links. If you buy through these links I will get a small portion of the profit at no extra cost to you.
If you talk to other humans, if you interact with humans, you must read this book. No exceptions. All my life, people told me that making friends and influencing people is an art form, something you can't learn. This book showed me the hidden rules required to win friends.
Since reading this book, every aspect of my friendships and relationships has improved tenfold. As a result, everything else in my life has greatly improved. If you go to a foreign country and you're only allowed to bring 1 thing with you, make it this book. That is how good this work is.
This book has sold over 16 million copies - with good reason. It is one of the best books out there, ever, period. I've read this book 13 times. That's how ridiculously good this book is.
This book is about a gold lender from the age of Babylon. Published in the early 1900s but based on a story from Babylon, it truly is an old book. This gold lender started off with nothing. Less than nothing, they were a slave. This gold lender learnt the secrets to becoming rich and he became The Richest Man in Babylon.
The king was so impressed with his rules he paid scribes to inscribe the rules of money into stone tablets and sent them far and wide. He wanted everyone to know these rules and to prosper, just like Babylon did.
If you want to be rich, this is the book to read. Not only will it teach you personal finance lessons that have worked for thousands of years but it'll teach it to you in stories - interesting stories of The Richest Man in Babylon.
3. I Will Teach You To Be Rich
A modern day version of the Richest Man in Babylon. If you want to be rich, read this book. This book covers all the same principles as The Richest Man in Babylon, but with modern day takes on credit cards, travel hacking, investing and more.
Personal finance isn't about personal finance. It's about so much more. Having good personal finance makes you happy. It makes others in your life happy. It gives you discipline. And discipline gives you freedom.
If you're overweight, you don't have discipline. Can you afford to eat that McDonald's? No. If you're fit, you can. This discipline gives you more freedom. By learning about personal finance, you're creating more freedom for yourself.
I prefer The Richest Man in Babylon, so if you can - read both.
The Alchemist is a fictional fable with more life lessons in it than the average nonfiction book. It's been said that non-fiction cannot distill lessons as easily as fiction. Some things cannot be taught; they can only be experienced. By reading fiction and watching others experience them, you come a step closer to learning those fundamental life lessons.
This book is a story about following your dreams.
5. How to Become a Straight-A Student
I failed my GCSEs 8 times. I dropped out of sixth form. Yet, in university, I have straight A's. Getting marks of 95% on final exams. How did I do it? I read this book twice a semester. Once at the start, once at the end. This book is what helped me get those good grades. Highly recommend.
The single most influential book in my life. I read this book when I was a child, it got me into encryption and codes. Because of this, I started to program when I was 10. Because of that, I went on to study computer science at university. Because of that, this blog exists.
7. Shoe Dog
One of the best biographies I've read. This is a memoir by the creator of Nike. From his insane business partner to his insane team, this has it all. His story is good. He doesn't hide the fact he comes from a well-off family, but at the same time it's clear he's just like us.
One of the richest men in the world, and yet hardly anyone knows his story. This is a tour de force through the business world, through startups, through discovering yourself and learning to love.
8. Meditations
Marcus Aurelius ruled the Roman empire. He was also a philosopher who thought about life a lot. His books will teach you to deal with pain, to deal with loss, to deal with life. Some of my favourite quotes are:
"Many lumps of incense on the same altar. One crumbles now, one later, but it makes no difference."
"You have power over your mind - not outside events. Realize this, and you will find strength."
"At dawn, when you have trouble getting out of bed, tell yourself: “I have to go to work — as a human being. What do I have to complain of, if I’m going to do what I was born for — the things I was brought into the world to do? Or is this what I was created for? To huddle under the blankets and stay warm?”
So you were born to feel “nice”? Instead of doing things and experiencing them? Don’t you see the plants, the birds, the ants and spiders and bees going about their individual tasks, putting the world in order, as best they can? And you’re not willing to do your job as a human being? Why aren’t you running to do what your nature demands?
You don’t love yourself enough. Or you’d love your nature too, and what it demands of you."
His original books can be found online for free. But be warned, they're not in an English tongue you can decipher unless you've studied it. I've linked the modern day translation above. It's more expensive, but it's more understandable in my opinion.
An ode to veganism or a desperate plea to increase the longevity of human lives? Either way, it's a great book. This book backs up the facts. Eating more plants makes you live longer. Not only that, it goes through all the plants which make you live longer.
It references the China Study, a huge study of diet across China, which found that meat causes cancer in the same way that smoking causes cancer.
This book doesn't beg you to turn vegan, though. It just encourages you to eat more plants. An overall great book, and essential reading if you don't want to die anytime soon.
10. Turing's Vision
In 1936 Alan Turing invented computer science. This book explains his phenomenal, world-changing paper and gives background on it. If you study computer science, this is a must. It shows you the birth of computer science, the history of computer science and more.
Alan Turing was a mastermind. Inventing the computer. Inventing the Turing machine. Inventing artificial intelligence. Not many people know of his existence. This is a good book on that famous first paper.
On the rare occasion that I read fiction, I tend to stop halfway through because of how boring it is. This book is the exception. This book discusses all sorts. Love, quantum physics, friendships, being nice to others and above all else - magic.
I really enjoyed this series and would highly recommend it.
12. Atomic Habits
Losers and winners have the same goals. You don't rise to the level of your goals. You fall to the level of your systems.
Habits are what makes us. This book isn't just about making or breaking habits - it's about habits. What is a habit? How do they form?
Habits are such a vital part of life. It'd be stupid to not learn about them. Atomic - small. Habit - something you routinely do. By creating small habits, getting 1% better every day you'll improve every aspect of your life.
$$0.99^{365} \approx 0.03$$
$$1.01^{365} \approx 37.8$$
1% better every day. That's all you need.
13. On Writing Well
The author of this book is an expert on writing - well. If you write things, this book is essential for you. A lot of the content of this book is common knowledge. But common knowledge stated clearly is knowledge all the same.
Hey 👋 Want to subscribe to my blog and stay up to date with posts similar to this one? Subscribe to my email list below. I won't spam you. I will only send you posts similar to this one 😊✨
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these blogs in their spare time. This blog is my full time job, so any and all donations are appreciated!
Decision trees are one of the simplest and yet most useful machine learning structures. Decision trees, as the name implies, are trees of decisions.
You have a question, usually a yes or no (binary; 2 options) question, with two branches (yes and no) leading out of the node. You can have more than 2 options, but for this article we’re only using 2.
Trees are weird in computer science. Instead of growing from a root upwards, they grow downwards. Think of it as an upside down tree.
The top-most item, in this example “Am I hungry?”, is called the root. It’s where everything starts from. Branches are what we call the lines connecting nodes. A leaf is a node at the bottom of the tree with no branches leading out of it.
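To make the terminology concrete, here is a minimal sketch of such a tree in code. The class and field names are my own, not from any library:

```python
# A minimal sketch of the tree above. The root holds the first question;
# leaves are nodes with no branches leading out of them.
class Node:
    def __init__(self, text, yes=None, no=None):
        self.text = text  # the question (internal node) or decision (leaf)
        self.yes = yes    # branch taken when the answer is yes
        self.no = no      # branch taken when the answer is no

    def is_leaf(self):
        return self.yes is None and self.no is None

# "Am I hungry?" is the root; the two decisions below it are leaves.
root = Node("Am I hungry?",
            yes=Node("Eat something"),
            no=Node("Keep doing what I'm doing"))
```

Walking from the root and taking a branch at every question until you reach a leaf is all a decision tree does.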
Trees are important in machine learning as not only do they let us visualise an algorithm, but they are a type of machine learning. Take this algorithm as an example.
This algorithm predicts the probability that a passenger will survive on the Titanic.
“sibsp” is the number of spouses or siblings aboard the ship. The figures under each leaf show the probability of survival.
With machine learning trees, the bold text is a condition. It’s not data, it’s a question. The branches are still called branches. The leaves are “decisions”. The tree has decided whether someone would have survived or died.
This type of tree is a classification tree. I talk more about classification here. In short; we want to classify each person on the ship as more likely to die or to have survived.
In real life, decision trees aren’t always as easy. Take a look at this photo, and brace yourself. I’ll try to describe as much of it as I can.
This is a decision tree. It wants to answer the question “can I eat this mushroom?”
Given an image of a mushroom (like on the left) we want to find out if it’s edible.
You see those things at the top which look like variable assignments? Those are if statements. Let’s take a look at one of those.
“odor = a: e (400.0)”
If the smell (odor) of the mushroom is “a” for almond, then it is edible (e), and the 400.0 is the number of training examples that reached this rule. Each of these statements is a feature.
Features are just attributes of an object. The features of a bike are: it has wheels, it has handlebars etc.
We do this on and on until we reach a point where the odor is neutral (n) at which point we start to check more features of the mushroom.
Okay, we can draw them, but how do we write decision trees? There’s a nice notation for that.
Let’s jump right into an example.
The fancy little “^” means “and”. It’s some fancy mathematical notation. For more notation like this, check out this other article I wrote. In this notation, when we don’t see anything connecting 2 items (like x2 and x5) we assume it is “and”. We want a decision tree that returns True when both x2 and x5 are true.
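As a sketch, that tree asks about x2 first and then x5; only the path where both answers are yes reaches a True leaf:

```python
# Decision tree for "x2 AND x5" written as nested questions.
def decide(x2: bool, x5: bool) -> bool:
    if x2:                # root: is x2 true?
        if x5:            # next question: is x5 true?
            return True   # leaf: both are true
        return False      # leaf: x5 is false
    return False          # leaf: x2 is false, no need to ask about x5
```

Each `if` is a node, each indented block is a branch, and each `return` is a leaf.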
Okay, let’s see another one.
This one features a lot more logic symbols. You might want to check out this other article I wrote. Okay, the “∨” symbol means “or” and the “¬” means “not”.
Decision trees are made by taking data from the root node and splitting the data into parts.
Taking the Titanic example from earlier, we split the data so that it makes the most sense and is in alignment with the data we have.
One of the problems with decision trees is the question “what is the best way to split the data?” Sometimes you’ll instinctively know; other times you’ll need an algorithm.
We want to design a function which when given a dataset will split the data accordingly.
If we have numerical features, we can split based on the values we see. There are many different ways of splitting: we can sort all the values in the dataset and place split thresholds between instances of different classes, or we can cut straight down the middle. There are too many splitting algorithms to discuss here, so instead we’ll go through a simple one. Say we have this data as (value, class) pairs:

(1, a), (2, b), (1, c), (0, b), (3, b)
So we have 3 classes (a, b, c). The first thing we do is group the pairs by their value.
{(0, b)}, {(1, a), (1, c)}, {(2, b)}, {(3, b)}
Now we have 4 different sets. For more on set theory, click here.
Let’s just pick some arbitrary numbers here. We’ll split them like so:
Split 1 <= 0.5
Split 2 <= 1.5 but > 0.5
Split 3 > 1.5
We now have the data split up for a decision tree. If we didn’t split the data up, the tree wouldn’t look much like a tree. Imagine what the tree might look like if our only split was “all data less than 3”. Nearly everything would end up in one node! It wouldn’t be very tree-like.
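Those three splits can be sketched in code like this, using the thresholds 0.5 and 1.5 from above:

```python
# Group each (value, class) pair into one of the three splits above.
data = [(1, "a"), (2, "b"), (1, "c"), (0, "b"), (3, "b")]

def which_split(value):
    if value <= 0.5:
        return 1      # split 1: <= 0.5
    if value <= 1.5:
        return 2      # split 2: > 0.5 but <= 1.5
    return 3          # split 3: > 1.5

groups = {1: [], 2: [], 3: []}
for value, label in data:
    groups[which_split(value)].append((value, label))

# groups[1] == [(0, "b")]
# groups[2] == [(1, "a"), (1, "c")]
# groups[3] == [(2, "b"), (3, "b")]
```

Each group becomes one branch of the tree.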
Occam's razor is a philosophy attributed to William of Ockham in the 14th century. In short, the quote is:
“When you have two competing theories that make exactly the same predictions, the simpler one is the better one.”
We can use this principle in machine learning, especially when deciding when to split up decision trees.
“The simplest tree that classifies the training instances accurately will work well on previously unseen instances.”
The simplest tree will often be the best tree, so long as all other possible trees make the same results.
Trying to find and return the smallest possible decision tree that accurately classifies the training set is very very hard. In fact, it’s an NP-hard problem.
Instead, we’ll try to approximate the best result instead of getting the best result. We’re going to talk a lot about probability and statistics, if you want to know more about probability and statistics click here.
What we want is information that cleanly splits the data into two. We don’t want a split that can include both males and females; we want purity. One single class for each split.
This measure of purity is called information. It represents the expected amount of information that would be needed to specify whether a new instance should be classified as the left or right split.
To find the best splits, we must first learn a few interesting things.
This part talks about random variables. For more on random variables, check out this article on statistics & probability I wrote.
The expected value is exactly what it sounds like: what do you expect the value to be? You can use it to work out the average score of a die roll, or anything in probability where each outcome has a value attached.
Suppose we’re counting types of bikes, and we have 4 bikes. We assign a code to each bike like so:
For every bike, we give it a number. Each code uses 2 bits, where each bit is either 0 or 1. For the expected value, we need not only the value of each outcome but also its probability. Each bike has equal probability, so each bike has a 25% chance of appearing.
Calculating the expected value, we multiply each probability by 2 bits and add them up, which gets us:

$$4 \times (0.25 \times 2) = 2 \text{ bits}$$
What if the probability wasn’t equal?
What we need to do is multiply the number of bits by the probability.
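A quick sketch of both cases. The unequal probabilities and code lengths below are my own illustrative choices, not values from the original tables:

```python
# Expected number of bits = sum over outcomes of (probability * bits used).

# Equal probabilities: four bikes, each coded in 2 bits, each 25% likely.
equal = [(0.25, 2), (0.25, 2), (0.25, 2), (0.25, 2)]
expected_equal = sum(p * bits for p, bits in equal)      # 2.0 bits

# Unequal probabilities: more likely bikes get shorter codes.
unequal = [(0.5, 1), (0.25, 2), (0.125, 3), (0.125, 3)]
expected_unequal = sum(p * bits for p, bits in unequal)  # 1.75 bits
```

Notice how matching short codes to likely outcomes lowers the expected number of bits, which is exactly the idea entropy formalises.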
Remember earlier when we talked about purity? Entropy is a measure of impurity. It’s how uncertain something is. The formula for entropy is:

$$H(X) = -\sum_{x} p(x)\log_2 p(x)$$
Entropy is trying to give a number to how uncertain something is.
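Here is a small sketch of entropy in code, computed from a list of class probabilities:

```python
import math

# Entropy: -sum(p * log2(p)) over the class probabilities.
# Terms with p == 0 contribute nothing, so we skip them.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([1.0])        # a pure split: no uncertainty, 0.0 bits
entropy([0.5, 0.5])   # a fair coin flip: maximum uncertainty, 1.0 bit
```

A pure node (all one class) scores 0, and a 50/50 node scores 1, the most uncertain a two-class split can be.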
You can also have conditional entropy, which looks like this:

$$H(Y \mid X) = \sum_{x} p(x)\,H(Y \mid X = x)$$
Let’s show this using an example.
What’s the information gain of splitting on Humidity?
We have 9+ and 5-. What does that mean? It means the table has 9 examples where the outcome is positive (Yes) and 5 where it’s negative (No). So go down the PlayTennis table and count: 9 Yes rows and 5 No rows.
Now we want to find out the information gain of humidity. If humidity is high, we look at the data and count how many yes’s for humidity high. So when humidity is high, we have 3+ and 4-. 3 positives and 4 negatives.
The information gain measures how much uncertainty a split removes. We have 14 rows of data in total, so the denominator is always 14. Now we just calculate using the entropy formula. The entropy of the examples where humidity is high is:

$$-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985$$
And the entropy of the examples where humidity is normal (the remaining 6 positives and 1 negative) is:

$$-\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7} = 0.592$$
This isn’t how likely something is to happen, it’s just how much information we gain from this. We use information gain when we want to split something. In the below example we want to find out whether it is better to split on humidity or wind.
Now that we know the entropy of each split, we apply the information gain formula:

$$Gain(D, A) = H(D) - \sum_{v}\frac{|D_v|}{|D|}H(D_v)$$
The information gain on splitting by humidity amongst our sample, D, is 0.151.
If we use the same entropy formula on the wind split (8 Weak examples with 6 positives and 2 negatives, 6 Strong examples with 3 positives and 3 negatives), we get entropies of 0.811 for Weak and 1.0 for Strong. And if we put them into the information gain formula we get:

$$0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.0) = 0.048$$
It is better to split on humidity rather than wind as humidity has a higher information gain.
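Putting it all together, here is a sketch that reproduces both comparisons. The wind counts (6+/2- when Weak, 3+/3- when Strong) are my assumption based on the standard PlayTennis dataset this example appears to use; with full precision the gains come out around 0.152 and 0.048, matching the rounded figures in the text:

```python
import math

# Entropy of a node holding `pos` positive and `neg` negative examples.
def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

# Information gain: parent entropy minus the weighted entropy of the branches.
def gain(parent, branches):
    total = sum(p + n for p, n in branches)
    remainder = sum(((p + n) / total) * entropy(p, n) for p, n in branches)
    return entropy(*parent) - remainder

gain((9, 5), [(3, 4), (6, 1)])  # humidity: about 0.152
gain((9, 5), [(6, 2), (3, 3)])  # wind: about 0.048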
What we want to do is to check how accurate a machine learning model is.
M(x) means: given a sample x, the model M gives its predicted classification, the label. l_x is the true label; these samples have already been labeled, so we know the truth. A sample is correctly labeled when M(x) = l_x.
What we do is feed the algorithm a sample set where we already know the classification of every single item in that sample set. We then measure how many times the machine learning algorithm was right.
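As a sketch, accuracy is just the fraction of labeled samples the model gets right. The toy model and samples below are hypothetical, not the article's tree:

```python
# Accuracy = correct predictions / total samples.
def accuracy(model, labeled_samples):
    correct = sum(1 for x, true_label in labeled_samples
                  if model(x) == true_label)
    return correct / len(labeled_samples)

# Toy model: predicts "x1 AND x2", ignoring any other features.
model = lambda x: x[0] and x[1]
samples = [
    ((True, True), True),    # predicted True, correct
    ((True, False), False),  # predicted False, correct
    ((False, True), True),   # predicted False, wrong (a noisy label)
]
accuracy(model, samples)  # 2 correct out of 3
```

That one noisy label is exactly the kind of thing that drags accuracy below 100% on real data even when a model fits its training set perfectly.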
Look at the below example. We have this formula and noisy data.
Noisy data means that the data isn’t correct. Our formula is X1 and X2 = True. Our noisy data is True and False = True, which is wrong.
The x3, x4, x5 are all additional features. We don’t care about them, but this is just an example to show that sometimes we have many additional features in a machine learning model which we don’t care about.
We build a decision tree that can match the training data perfectly.
The accuracy on the training data is 100%.
The problem is that it matches the training data perfectly, 100%, but because of the noisy data it doesn’t perform very well on the true data. That one small error makes a larger decision tree and causes it to perform worse in the real world.
If we build a decision tree that works well with the true data, we’ll get this:
Even though it performs worse in the training set, due to not worrying about noisy data it performs perfectly with real-world data.
Let’s see another example of overfitting.
Here are the probabilities for each one:
There’s a 50% chance that the resultant, x3, is True. There’s a 66% chance that the resultant, Y, is True.
For our first model let’s have a quick look.
The accuracy is:
It’s good on training data, but on real world data (D_true) it doesn’t perform as well. From this, we can tell that overfitting has occurred.
The reason for overfitting is that the model tries to fit the training data as well as possible, even when there is noise within that data. The first suggestion is to try to reduce the noise in your data.
Another possibility is that there is no noise, but the training set is small, so it differs from the true distribution. More data would help.
It’s hard to give an exact idea of how to prevent overfitting as it differs from model to model.