I recently went through the trouble of distributing a Rust package. I wished that there was a simple guide on distributing one package to many platforms, so I wrote this guide.
Follow me as we publish my package, RustScan, to multiple distributions.
Semantic Versioning is a system defining how to write version numbers. The 3 numbers are:
Major.Minor.Bugs
If you have fixed some bugs, increment the bugs counter.
If you have added a minor feature, increment the minor counter.
If you have done something major, increment the major counter.
We can signify that a release is still being tested by adding "rc" (release candidate) to the end of the version. "5.0.0rc1" signifies "release candidate 1", meaning this is the first public testing release of version 5.0.0.
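The bump rules above can be sketched in a few lines of Python (the bump helper here is my own, purely for illustration):

```python
def bump(version: str, part: str) -> str:
    """Increment the major, minor, or bugs counter of a Major.Minor.Bugs version."""
    major, minor, bugs = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"         # something major: reset the lower counters
    if part == "minor":
        return f"{major}.{minor + 1}.0"   # minor feature: reset the bugs counter
    return f"{major}.{minor}.{bugs + 1}"  # bug fix

print(bump("1.0.1", "bugs"))   # 1.0.2
print(bump("1.0.1", "minor"))  # 1.1.0
print(bump("1.0.1", "major"))  # 2.0.0
```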
Cargo is Rust's package manager, and crates.io is its package registry. Imagine it as PyPI (pip for Python) or npm (for JavaScript).
As a Rustacean, you may have heard of this, and may even have used it to download packages yourself. So let's skip right to the good part.
Before publishing to Cargo, we need to make sure our Cargo.toml file has the required information. There are 3 things we need:
The name of our project.
A description of what the project does.
The license we use. Specifically, we need a license identifier code; see the Linux Foundation's SPDX website for the full list of license identifiers.
However, you will probably want more than these for your package. Some good ones are:
The location of your README file, which is used to fill out the README on the Cargo website.
These are tags for your project. When a user searches for a keyword such as "sewing" and your project has that keyword, your project will come up in the search results.
This is RustScan's Cargo.toml:
[package]
name = "rustscan"
version = "1.0.1"
authors = ["Brandon <my_email@skerritt.blog>"]
edition = "2018"
description = "Faster Nmap Scanning with Rust"
homepage = "https://github.com/brandonskerritt/rustscan"
repository = "https://github.com/brandonskerritt/rustscan"
license = "MIT"
keywords = ["port", "scanning", "nmap"]
categories = ["command-line-utilities"]
readme = "README.md"
For more information on the manifest file, look here:
Now we're ready to publish! Go to the Crates.io website and register an account. Then, go into the settings and create a new API key.
Now, in a terminal, execute cargo login <API_KEY>. You're now logged into Crates.io and can publish!
Build your Rust package using the release profile, which compiles with Rust's highest optimisation level:
cargo build --release
And then publish it.
cargo publish
Ta-da! Your package is now available on the Crates.io website, and can be installed with cargo install <your_package_name>.
Homebrew is a package manager used by macOS users, but it can also be used on Linux.
Unfortunately, I found the documentation to be lacklustre in explaining how to get a package into Homebrew.
Let's assume we are using GitHub to store our code.
Homebrew expects a TAR archive. To get this, we create a new release on GitHub.
On the GitHub repo's homepage, click "Releases" on the right hand side menu.
You should be taken to this page. Click "Draft a new release".
Now create a new release.
Use semantic versioning to create the Tag Version. Create a new release title, and describe the release.
A good format for release descriptions is:
# Features
# Maintenance
# Bugs
Similar to the semantic versioning rules. I normally pull these from pull requests, or write them down as I merge commits.
Once we've entered some information, click Publish release. We now have a published release of our app!
Our code is now in .tar.gz format, if we look on the releases page again. GitHub does it for us!
Right click Source code (tar.gz) and click "get link". Now we have the link to our .tar.gz file.
Go into a terminal, and type:
wget <link>
where <link> is replaced by the link you just copied.
We need the SHA256 Hash of the archive, so let's calculate it:
shasum -a 256 rustscan.tar.gz
where rustscan.tar.gz is the file you just downloaded with wget.
Note down the shasum; this is important for later. Also note down the link we used to download the archive.
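If you'd rather script this step, the same checksum can be computed with Python's standard hashlib module (a sketch equivalent to shasum -a 256):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256, matching the output of `shasum -a 256 <file>`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large release archives don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```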
Homebrew requires a separate GitHub repository for your project (or you can rename your current repository).
Homebrew calls these taps: third-party GitHub repositories with a specific name and a formula file.
Go to GitHub and create a new repository. Naming it:
homebrew-<project>
where <project> is the name of your project. Note that it must start with "homebrew-".
In my case, it is:
homebrew-rustscan
Now clone your new repo onto your machine:
git clone https://github.com/<username>/homebrew-<project>
Homebrew requires a file called a formula. This is a Ruby file that details your project along with how to install the binary. You do not need to know Ruby to create this.
cd into our newly cloned repo, and create the following file structure:
- Formula/
  - <project>.rb
In my case:
- Formula/
  - rustscan.rb
Capitalise the folder name if it is not already.
Now copy and paste the following file into your rustscan.rb (or whatever your project is called).
# Documentation: https://docs.brew.sh/Formula-Cookbook
# https://rubydoc.brew.sh/Formula
# PLEASE REMOVE ALL GENERATED COMMENTS BEFORE SUBMITTING YOUR PULL REQUEST!
class Rustscan < Formula
  desc "Faster Nmap Scanning with Rust"
  homepage "https://github.com/brandonskerritt/rustscan"
  url "https://github.com/RustScan/RustScan/archive/1.3.tar.gz"
  sha256 "3bbaf188fa4014a57596c4d4f928b75bdf42c058220424ae46b94f3a36b61f81"
  version "1.3.0"
  depends_on "rust" => :build

  def install
    system "cargo", "build", "--release", "--bin", "rustscan"
    bin.install "target/release/rustscan"
  end
end
Change the class name to match the name of your program:
class Rustscan < Formula
Then add a short description and link the homepage (in my case, the GitHub repo).
desc "Faster Nmap Scanning with Rust"
homepage "https://github.com/brandonskerritt/rustscan"
Now we need to fill out the download link and the SHA-256.
url "https://github.com/RustScan/RustScan/archive/1.3.tar.gz"
sha256 "3bbaf188fa4014a57596c4d4f928b75bdf42c058220424ae46b94f3a36b61f81"
Remember earlier when I told you to write down the link & the shasum? This is exactly where you'd place them!
Now insert your version number, the same one for the whole release:
version "1.3.0"
Our program relies on Rust to build the binary; we note that down here:
depends_on "rust" => :build
The next step is to detail how to build the binary and install our program. We tell Homebrew to build the binary using cargo build, and then to install it with bin.install.
def install
  system "cargo", "build", "--release", "--bin", "rustscan"
  bin.install "target/release/rustscan"
end
And just like that, we've made the formula file.
Upload this to your homebrew-<project> repository like so:
git add .
git commit -m 'First release'
git push
Let's install the package to double check everything went well.
brew tap brandonskerritt/rustscan
brew install rustscan
where brandonskerritt/rustscan is your GitHub username combined with the project's name. My username is brandonskerritt, and the project is called rustscan.
I created a one-command install for my users, which is just the 2 commands combined. You may find this helpful.
brew tap brandonskerritt/rustscan && brew install rustscan
The easiest way to create Debian packages is to use the crate cargo-deb, which is installed with:
cargo install cargo-deb
Once it is installed, run the command:
cargo deb
And we now have a .deb file for our project, built for our system's architecture.
But what if we wanted to package for other architectures?
Luckily, I've created an (albeit badly made) Docker script to package for other architectures.
The script packages the project for amd64, ARM (gnueabihf), and i386.
It requires some editing (as it was made for RustScan), but once done it will automatically package your project for you.
Create a separate folder in your main project's repo, such as rustscan-debbuilder.
Then place these 3 files in there:
entrypoint.sh
#!/bin/bash
cd /RustScan
git pull --force
#amd64
cargo deb
#arm64
rustup target add arm-unknown-linux-gnueabihf
cargo deb --target=arm-unknown-linux-gnueabihf
#i386
rustup target add i686-unknown-linux-gnu
cargo deb --target=i686-unknown-linux-gnu
find target/ -name \*.deb -exec cp {} /debs \;
Change cd /RustScan to your project's name.
run.sh
#!/bin/bash
docker build -t rustscan-builder . || exit
# This creates a volume which binds your current directory's debs/ folder to
# the location where the .deb files get spat out in the container.
# You don't need to worry about it. Just chmod +x run.sh && ./run.sh and
# you'll get your .deb files in a few minutes. It runs faster after you've used it the first time.
docker run -v "$(pwd)/debs:/debs" rustscan-builder
Dockerfile
FROM rust:latest
RUN git clone https://github.com/brandonskerritt/RustScan
WORKDIR "/RustScan"
RUN git pull --force
RUN cargo install cargo-deb
RUN apt update -y && apt upgrade -y
RUN apt install libc6-dev-i386 -y
RUN git clone --depth=1 https://github.com/raspberrypi/tools /raspberrypi-tools
ENV PATH=/raspberrypi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/:$PATH
ENV CARGO_TARGET_ARM_UNKNOWN_LINUX_GNUEABIHF_LINKER=arm-linux-gnueabihf-gcc
RUN mkdir /root/.cargo
RUN echo "[target.arm-unknown-linux-gnueabihf]" >> /root/.cargo/config
RUN echo "strip = { path = \"arm-linux-gnueabihf-strip\" }" >> /root/.cargo/config
RUN echo "objcopy = { path = \"arm-linux-gnueabihf-objcopy\" }" >> /root/.cargo/config
COPY ./entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
Change RUN git clone https://github.com/brandonskerritt/RustScan to the git repository link of your choice.
Change WORKDIR "/RustScan" to your project's name.
The directory should look like:
- rustscan-debbuilder/
  - Dockerfile
  - run.sh
  - entrypoint.sh
Now to run this builder:
cd rustscan-debbuilder
chmod +x run.sh
./run.sh
And it will build 3 .deb packages for you.
See how we did it with RustScan here:
To install .deb files, you can run dpkg -i on the file, or you can double-click the file (on some systems).
The easiest way to distribute for the AUR is to use the Cargo package cargo-aur.
Cargo-aur automatically builds a PKGBUILD file, which is the file needed by Arch package management tools to install the software.
The PKGBUILD file is similar to Cargo.toml, or our Homebrew formula.
Let's open up the file and edit some fields (if we want to).
# Maintainer: Bee <bee@fake.com>
pkgname=rustscan
pkgver=1.4.1
pkgrel=1
pkgdesc="Faster Nmap Scanning with Rust"
url="https://github.com/brandonskerritt/rustscan"
license=("MIT")
arch=("x86_64")
provides=("rustscan")
options=("strip")
source=("https://github.com/brandonskerritt/rustscan/releases/download/v$pkgver/rustscan-$pkgver-x86_64.tar.gz")
sha256sums=("7bed834f5df925b720316341150df74ac2533cc968de54bb1164c95ea9b65db8")
package() {
  install -Dm755 rustscan -t "$pkgdir/usr/bin/"
}
pkgname is the name of the package. Please see the Arch Wiki for guidance on naming conventions.
pkgver is the semantic version of our package. This is automatically taken from Cargo.toml.
pkgrel is the release number of the package: bump it when the package itself changes without a new upstream version. The Arch Wiki explains this concept in more detail.
pkgdesc is the description of our package.
arch lists the architectures our package will compile on.
provides is an array of packages whose features this software provides. Packages providing the same item can be installed side-by-side unless one of them has a conflicts array.
options, per the Arch Wiki:
"This array allows overriding some of the default behavior of makepkg, defined in /etc/makepkg.conf. To set an option, include the name in the array. To disable an option, place an ! before it."
Personally, I don't know why this is needed. But it's an automated generation, so we can't complain too much.
source is the location of the release on GitHub, and sha256sums holds the checksums of the package.
Finally, package() shows Arch how to install our package.
1. cargo aur built a tarball (.tar.gz). Create a new release on GitHub and attach the file that was just created.
2. Create an account on the AUR: https://aur.archlinux.org/
3. Upload your SSH public key to your account.
Check for SSH keys with:
ls -al ~/.ssh
You're likely looking for a file like id_rsa.pub.
If this doesn't exist, generate a new SSH key with:
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
And follow the on-screen prompts. Or follow this guide if you are still confused.
Next, go to your account page on the AUR and upload your public SSH key.
4. In a new directory, git clone your repo on the AUR.
This is kind of confusing, but say the package name is rustscan (confirm there is no other package on the AUR using your project's name by searching here).
git clone ssh://aur@aur.archlinux.org/rustscan.git
I normally clone this in a folder format like:
- rustscan/
  - rustscan/ # the rust package
  - rustscan/ # the package we have git cloned
  - homebrew-rustscan/
Make sure to change the package name rustscan to the name you want.
5. Copy the PKGBUILD you built in stage 1 into the new Git repo.
6. Run makepkg --printsrcinfo > .SRCINFO in the repo.
7. Your directory should now look like:
- rustscan/
  - rustscan/ # the rust package
  - rustscan/ # the package we have git cloned
    - PKGBUILD
    - .SRCINFO
  - homebrew-rustscan/
8. Now push these:
git add .
git commit -m 'initial release'
git push
And Ta-Da! We now have an Arch Linux AUR package!
Eventually, you may want to move on from the default generated PKGBUILD for whatever reason. This is the one RustScan uses; feel free to copy and change it however you wish:
# Maintainer: Hao Last_name_emited_for_privacy <email_emited_for_privacy>
pkgname=rustscan
_pkgname=RustScan
pkgver=1.6.0
pkgrel=1
pkgdesc="Faster Nmap Scanning with Rust"
arch=("x86_64" "i686")
url="https://github.com/rustscan/RustScan"
license=("GPL3")
provides=('rustscan')
conflicts=('rustscan')
depends=("nmap")
makedepends=("cargo")
source=("${pkgname}-${pkgver}.tar.gz::${url}/archive/${pkgver}.tar.gz")
sha256sums=('a4ebe4b8eda88dd10d52d961578c95b5427cc34b3bf41e5df729a37122c68965')
build() {
  cd ${_pkgname}-${pkgver}
  cargo build --release --locked --all-features --target-dir=target
}

package() {
  cd ${_pkgname}-${pkgver}
  install -Dm755 target/release/${pkgname} ${pkgdir}/usr/bin/${pkgname}
}
Note: someone else made this for RustScan.
Hey! This is a new newsletter I am trialling. Basically, with previous newsletters I had to stick to a certain topic, which sucks. I obsess over new things all the time. Soooo... this newsletter is just a mish-mash of things I think about, and things I learn. Enjoy! :)
I was looking up at the clouds one sunny day, and thought to myself, "that's an awfully lonely cloud."
And I was right. The cloud was by itself. Higher up, you can see it has cloud-friends looking over it. But otherwise, the cloud is alone.
And it got me thinking. Why was the cloud alone? Why does the cloud have smaller sub-clouds, almost like a tail? How is the distance & spacing between clouds determined?
Is there a mathematical algorithm to determine how clouds _work_?
I started to research the topic. It turns out, we know almost nothing. And most cloud simulations are about as good as my simulation of autonomous driving robots for my dissertation, which is to say, not very effective at real-world usage.
I read The Cloud Book, which only left me more confused about how clouds operate. The major players in cloud simulations are video games, but we don't really know how they operate.
This Scientific American article tries to explain how clouds form.
The premise is the same as how bubbles form in boiling water. A spot of the water gets hotter than others. When a bubble gets hot enough, the water's surface tension can no longer hold it, so it rises to the top, where it bursts into steam.
The same can be said about clouds. Hot buildings can form bubbles of hot air, which bubble to the top of the atmosphere (well, not always). There, the cool air condenses the hot air into water droplets, which become a cloud.
This makes me wonder: if we develop a mathematical model for how heat works, could we theoretically and perfectly predict cloud formation over an area? Assuming heat is not totally random.
If this turns out to be the case, I would like to see, at least once in my life, "cloud art": the ground heated and the air cooled perfectly to form the cloud into the shape of some art. Assuming these are the only known variables, which most certainly isn't the case this early on in this scientific regime.
Recently I have found some success on GitHub. 2 of my tools have reached some level of fame. One of them is Ciphey, an automated decryption tool using AI & NLP. Put in encrypted text, get back decrypted text.
The other is RustScan, a fast port scanner. I built this tool to learn Rust, but it's gained more popularity than I could have imagined.
Okay, enough of the flexing. Let's talk about why they gained popularity.
I believe this is down to 2 simple things:
The first point is harder than it sounds: solve a problem people have. If everyone could do this, we'd all be billionaires. The first step is seeing a problem you have, and then seeing if other people have it too.
For Ciphey, this was the fact that in capture the flag events (hacking competitions), we'd often face some stupid encryption scheme: encoded with Base64, then encrypted with the Caesar cipher. Good luck figuring that out by hand!
This meant that a very simple encryption scheme (base64 -> Caesar cipher) could easily take hours and hours of just manual, boring, brute force work. Ciphey solves this problem.
RustScan solves the problem of slow Nmap scans. In pentesting, the very first thing you'd do is run Nmap. In fact, you wouldn't pentest without a port scan; it's essential. But Nmap frequently takes more than 20 minutes to scan.
In a CTF, where time is quite literally money (faster = more points = more chance to get money), Nmap just wasn't cutting it. RustScan fixes this problem.
The second point, show it clearly as soon as the user sees it, is the most important.
You can have the best tool in the world, but no one has time to read the source code to understand why your tool is worth using. You've got to advertise it front and centre.
Ciphey's tagline is "automated decryption tool with AI & NLP". This explains immediately that it's a decryption tool like CyberChef, but automated. The automated part is what people want. They don't want to think about doing it by hand.
RustScan's tagline is "Faster Nmap scans", and right below it, it shows clearly how it makes Nmap faster.
Notice how the logos both say exactly what the tool does. This is because when people share the GitHub repo on social media, the image appears.
And so does the GitHub description. Look at this social card from RustScan:
You don't even need to click on the repo to learn what it does; it says it right there. Once in the description, and once in the image.
Finally, a closing remark. Very early on in the README, I clearly take the largest possible competitor, and I show exactly the need for the tool using GIFs and real timings. CyberChef takes too long, with too many steps to do what Ciphey can do in 2 seconds. Nmap's manual scan takes too long.
Show your readers not only the purpose of your tool, but exactly why their current tool isn't suitable for the job.
This is why I believe my tools are popular.
Since I've been learning Rust lately, I decided to explore tools rewritten in Rust, mostly based on this article.
A cat clone but better in almost every single way. Syntax highlighting, line numbers, and more.
Exa is ls, rewritten in Rust.
It uses colour by default, so it's automatically awesome.
I've been learning about entropy recently. Entropy is a measure of how "chaotic" a string is, how much "order" it has. This principle has uses in many applications, such as determining the difference between an attack and normal traffic on a network (see here).
We use entropy in Ciphey to determine when we are "going in the right direction" with decryption. Here's a cool example.
This is encrypted with Base64 -> Rot13 -> Vigenère (key: "key").
The Shannon entropy is 5.23.
Now if we decode the Base64 and get the entropy again:
It's now 3.88. Even though we have no idea how many more decryptions there are, or whether our Base64 decode resulted in the correct output (just because something decodes as Base64 does not mean it is Base64, fun fact), we know we are going in the right direction because the entropy is decreasing.
Ciphey is using this as part of a heuristic in A* search (we have many other things that go into the algorithm; more on it when I actually code it :P).
But I thought this was a cool application of entropy. If you ever need to tell the difference between order and randomness, entropy is very useful.
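For reference, Shannon entropy is just H = -Σ p·log2(p) over the symbol frequencies. A minimal sketch of the measure (my own helper, not Ciphey's actual code):

```python
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character: H = -sum(p * log2(p))."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("aaaa"))  # only one symbol: zero bits of entropy
print(shannon_entropy("abcd"))  # four equally likely symbols: 2.0 bits
```

A highly ordered string scores low; random-looking text scores high, which is why a drop in entropy after a decoding step suggests progress.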
Recently I took 2 cuttings of my spider plant, and created 2 cute new spider plant babies.
I had actually taken 3 cuttings: 1 in its own pot, and 2 in the same pot. And it got me thinking: do these plants fight each other to survive in the plant pot?
Well, it turns out plants do know who their siblings are. And they don't compete for the same space with siblings; they co-exist. But when thrown in with foreign plants, they do fight for root control and space.
This helped clarify that I am not a sadistic plant murderer, and that placing 2 sibling plants together wasn't such a bad idea overall.
Anyway, this has been my first newsletter in this weird series of things I am interested in. Have a good week :D <3
If you've spent time on HackerRank or LeetCode, you might have noticed most of the optimal solutions use Set Theory.
By reading this article, you will gain a deeper knowledge into set theory and how to use it to create optimal algorithms.
Set Theory was invented (or discovered, depending on how you view maths) by Georg Cantor.
It revolves around the idea of sets, as in collections of objects. Set theory is incredibly powerful and can be used to write some beautiful & elegant code.
A set is simply a collection of objects. They don't have to be the same type, or have any relation to one another.
set = {apples, pears, oranges}
Sets are usually denoted with curly braces { }, with elements separated by commas, though you will sometimes see parentheses ( ).
Sets follow 2 rules:
Sets can only contain unique items. It cannot contain the same item twice or more.
There is no order or structure to a set.
The first rule is quite useful. Say, for example we have a list of words from a book:
words = ["hello", "my", "name", "is", "brandon", "and",
"your", "name", "is", "brandon"]
To find all the unique words, we put the words into a set:
words = ["hello", "my", "name", "is", "brandon", "and",
"your", "name", "is", "brandon"]
unique = set(words)
# {'hello', 'is', 'and', 'brandon', 'my', 'your', 'name'}
The second rule ties in nicely with the first. A set aims to replicate how we see things in our lives.
Imagine we are on a road trip with Jahan, Olivia, and Ryan.
Next year, we want to go on a road trip again. It doesn't make much sense to invite Jahan, Olivia, Jahan, Ryan, Jahan.
And it also doesn't make sense to assign an "order" to them. They're people, not elements in an array!
There are other types of sets too, such as a multiset, which allows non-unique elements but still has no order.
The cardinality of a set is the length of the set. For the set:
$$A = \{1, 2, 3, 4, 5, 6\}$$
The cardinality is
$$|A| = 6$$
2 sets are equal when all the elements match.
$$a = \{1, 2, 3\}$$
$$b = \{3, 2, 1\}$$
$$a = b$$
Remember, order doesn't matter - so sets are equal despite being in the 'wrong' order!
Some sets are infinite. Such as the set containing all the natural numbers, or all the real numbers.
We can express infinite sets using a little bit of maths. Heard of list comprehensions? Set-builder notation looks the same syntactically, but can describe infinite sets.
The set of all even numbers is:
$$A = \{2n: n \in \mathbb{Z}\}$$
Read this as "for each number, n, in the set of all integers, $\mathbb{Z}$, multiply the number by 2". Which will give us the set of all even numbers.
In a list comprehension, it would look like:
[2 * n for n in naturalNumbers]
However, this exact code won't work, as Python can't represent an infinite set.
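What Python can do is express the same rule lazily with a generator, producing even numbers on demand rather than storing an infinite set:

```python
from itertools import count, islice

# {2n : n a natural number}, evaluated on demand rather than stored
evens = (2 * n for n in count())
print(list(islice(evens, 5)))  # [0, 2, 4, 6, 8]
```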
The empty set is the set containing nothing at all. It is much like 0. It is nothing, and serves the purpose as a negated value. We don't have 0 apples, we have no apples.
The empty set serves the same purpose. We don't have a set containing all the words, we have nothing.
Its symbol is $\varnothing$
The Universal Set $\mathbb{U}$ is the set which contains all objects, including itself.
If our universe contains only integers, 1, 2, 3, ..., then the universal set is the set of all integers.
Just like with other data types, sets have operations! Let's start with some operations you may already know.
The union is combining 2 sets together (think addition). The symbol for the union is $\cup$. Let's look at an example.
$$A = \{1, 2, 3\}$$
$$B = \{3, 4, 5\}$$
$$B \cup A = \{1, 2, 3, 4, 5\}$$
Notice how when we perform a union, we would have two 3's. Since sets only contain unique elements, we simply toss the extra 3 away. Also note there is no order, so B $\cup$ A does not result in $\{3, 4, 5, ...\}$: the order doesn't exist. We can even say B $\cup$ A = $\{3, 5, 1, 4, 2\}$.
What if we union an empty set with a non-empty set? Or an empty set with another empty set?
$$\varnothing \cup \{1, 2, 3\} = \{1, 2, 3\}$$
$$\varnothing \cup \varnothing = \varnothing$$
When we get onto subsets and the likes we'll learn to appreciate the empty set and this simplistic maths.
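In Python, the union rules above look like this (| is the set union operator):

```python
a = {1, 2, 3}
b = {3, 4, 5}

print(a | b == {1, 2, 3, 4, 5})        # True: the duplicate 3 is tossed away
print(set() | {1, 2, 3} == {1, 2, 3})  # True: the empty set adds nothing
print(set() | set() == set())          # True: empty union empty is empty
```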
The intersection of two sets is every item that is in both set A and set B.
Think of it like the boolean operator AND. Its symbol is $\cap$.
$$A = \{1, 2, 3\}$$
$$B = \{3, 4, 5\}$$
$$A \cap B = \{3\}$$
Let's say we have a list of flights with passengers on board. And we have a list of the people that took my daughter (Taken reference).
We can use the intersection operation to find the list of flights that contain the people who took my daughter.
Because set lookups are fast, it is very quick to calculate this. However, creating the set itself may take time. But if we do not know all the names of the people in the child trafficking ring up front, we can build the set first and use the intersection to query it.
We can create sets which belong to other sets. $1$ belongs to the set $\{1, 2, 3\}$. In notation this is:
$$1 \in \{1, 2, 3\}$$
We say that "1 is in the set {1, 2, 3}".
But don't get caught out!
$$\{1\} \in \{1, 2, 3\}$$
Is not true. We say the element is not in $\notin$ the set.
$$\{1\} \notin \{1, 2, 3\}$$
It would be true that:
$$\{1\} \in \{\{1\}, 2, 3\}$$
This is a very, very useful feature. Let's say we have the word "FLAG{" that we want to find in some text.
We turn the text into a set, like so:
text = "FLAG{"
corpus = set(["Hello", "my", "name", "is"])
if text in corpus:
    print("It's in there!")
Because sets are backed by hashing, this membership check is rather fast. However, you have to take into account that turning the text into a set is expensive.
The symbol for subset is $\subset$; for example, $\{1\} \subset \{1, 2\}$.
The empty set is a subset of all sets, but it is not an element of all sets.
Fun fact, we can do:
$$\mathbb{U} \setminus \{\varnothing\}$$
This equates to a new set, which contains all the elements of the universal set apart from the empty set itself. However, it is not the universal set - it's a new set! (Note that we subtract $\{\varnothing\}$, the set containing the empty set; subtracting $\varnothing$ alone would leave $\mathbb{U}$ unchanged.)
The power in subsets, in my opinion, comes from the power set.
The Power set is the set of all subsets of a set, including the empty set and the set itself.
For the set $\{2, 9\}$ we have 4 subsets.
The power set will be:
$$\mathcal{P}(A) = \{\varnothing, \{2\}, \{9\}, \{2, 9\}\}$$
Note: For home use or exams we can denote it P(A). For websites where we want to impress people with our fancy use of the alphabet, use LaTeX.
When building subsets, the element can either be included in the subset or not. This is binary. It's either in it, or it isn't.
That leads us to the formula for calculating how large a powerset is.
$$2^s$$
Where $s$ is the size of the set, and the 2 is because each element is either in a subset or not.
Let's say we have a list of ingredients.
{cabbage, lettuce, tomato, chips, carrots}
It makes sense to use a set because the order doesn't matter for a list of ingredients and we don't want to have cabbage twice for some reason.
Now, how many combinations of food can we make with this?
$$2^5 = 32$$
The empty set doesn't count here (unless we want to starve), so we can reduce this to 31.
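We can check the $2^s$ count in Python with itertools; the power_set helper below is my own sketch, not a standard-library function:

```python
from itertools import chain, combinations

def power_set(items):
    """All subsets of every size, from the empty tuple up to the full set."""
    s = list(items)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

ingredients = ["cabbage", "lettuce", "tomato", "chips", "carrots"]
subsets = power_set(ingredients)
print(len(subsets))  # 32, i.e. 2**5, including the empty "meal"
```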
I made this for you! Feel free to print and use it however you want. Personally, when I was studying for exams, the symbols section proved to be most useful!
In Python, sets do not actually remember insertion order; they are unordered. The example below only looks ordered because small integers hash to themselves:
>>> a = set()
>>> a.add(1)
>>> a.add(2)
>>> a.add(3)
>>> a.remove(2)
>>> a.add(4)
>>> a
{1, 3, 4}
Because Python's sets are unordered (and in Set Theory maths, sets cannot be sorted at all), we cannot rely on them for sorted data. If we want uniqueness and order, we can maintain a sorted list ourselves, inserting elements in a sorted manner. That lets us combine Binary Search with the no-repeating-elements rule of sets.
It takes longer to program this up-front, but it will be much faster if we do not know what we want to find in the text, and we have time to spare at the start of the program. We can always save the sorted structure so the next time the program runs, it is faster.
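Here is a sketch of that idea using the standard bisect module: keep a plain list sorted and duplicate-free by binary-searching before each insert (add_unique is a hypothetical helper of mine):

```python
from bisect import bisect_left

def add_unique(sorted_list, item):
    """Insert item into sorted_list, keeping it sorted and free of duplicates."""
    i = bisect_left(sorted_list, item)  # binary search for the insertion point
    if i == len(sorted_list) or sorted_list[i] != item:
        sorted_list.insert(i, item)     # only insert if it isn't already there

words = []
for w in ["name", "hello", "name", "my"]:
    add_unique(words, w)
print(words)  # ['hello', 'my', 'name']
```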
>>> for item in a:
... print(item)
...
1
3
4
Why is this in the belongs to section? Well, sets can also store other sets.
Let's say we have a world leader who says a lot of things, but often claims "I've never said that before". How can we work out whether they have said it before?
We can build a set of sets, where each inner set is a quote the person has said.
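In Python, the inner sets must be hashable, so they become frozensets. A sketch of the idea (has_said_before is a hypothetical helper, and comparing word-sets ignores word order):

```python
quotes = set()  # a set of frozensets, one per quote

def has_said_before(quote: str) -> bool:
    words = frozenset(quote.lower().split())  # the inner "set" must be hashable
    if words in quotes:
        return True
    quotes.add(words)
    return False

print(has_said_before("I never said that"))  # False: first time we hear it
print(has_said_before("I never said that"))  # True: they have said it before
```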
You may also see claims that sets in many languages come out automatically sorted. In CPython, small integers often do appear in order, but this is an artifact of integers hashing to themselves, not real sorting. Take a look at this Python code:
>>> x = [9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> set(x)
{1, 2, 3, 4, 5, 6, 7, 8, 9}
And that's just the tip of the iceberg. Sets are implemented as hash maps in many programming languages. This means that we get O(1) average lookup time: 6 in x takes O(1) time when x is a set, versus O(n) for a list. For more on Big O notation, check out my other article.
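A quick (unscientific) way to feel that difference is to time membership checks on a list, which scans element by element, against a set, which hashes straight to the answer:

```python
import timeit

data = list(range(100_000))
as_set = set(data)

list_time = timeit.timeit(lambda: 99_999 in data, number=100)   # O(n) scan each call
set_time = timeit.timeit(lambda: 99_999 in as_set, number=100)  # O(1) hash lookup
print(set_time < list_time)  # True: the set lookup wins comfortably
```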
We have a text such as:
text = ["hello", "my", "name", "is", "lkmxjja", "brandon"]
And we want to find out how many words in text appear in another list, text_dict.
text_dict = ["hello", "hello", "my", "brandon", "name"]
The quickest way to do this without using sets is to sort both lists, and then walk through them one at a time, counting how many times an item in text appears in text_dict.
In a real world example, we could be checking to see if a text is English or not by counting how many times the words in the text appear in the English dictionary.
By using sets, we delete the repetitions in text_dict, and we get O(1) lookups when checking membership. 2 birds with 1 stone!
However, looping through both of them ourselves is unwieldy. We'd have to write a loop and keep track of it ourselves. Luckily, we can use set theory's intersection function to do this for us.
>>> text = ["hello", "my", "name", "is", "lkmxjja", "brandon"]
>>> text_dict = ["hello", "hello", "my", "brandon", "name"]
>>> text = set(text)
>>> text_dict = set(text_dict)
>>> text.intersection(text_dict)
{'brandon', 'name', 'hello', 'my'}
>>> len(text.intersection(text_dict))
4
Set theory is gnarly, and incredibly useful in not just competitive programming but all kinds of programming. We've not only learnt the theory behind sets, but also how programming languages such as Python have implemented them.
This is an informed opinion piece on designing effective peer to peer (P2P) networks. I've written extensively about peer to peer networks, and have even designed my own (albeit, bad) cryptocurrency.
By the end of the article, we should have a good understanding of the important questions to answer when designing effective P2P networks.
P2P networks are not the right tool for all networking; we need to use the right tool for the job. But quick, automatic scaling is their game: P2P networks improve with more users, they don't degrade.
If we are building a platform with a shared goal (i.e. shares files) and we can imagine the platform growing uncontrollably big, P2P networks are good.
Or if we do not want anyone to be able to shut down our network (in a client-server model, someone just has to turn off your servers), a P2P network is effective.
There are 2 types of P2P networks:
Full P2P networks, where every node is equal.
Semi P2P networks, where certain nodes have more power than others.
Let's walk through what all good peer to peer networks share.
All great peer to peer networks have reward systems. The user does some work, and they are rewarded.
Even if the reward isn't clearly defined, there is always a reward for taking part in the peer to peer network. And always a reward for giving back to the network.
Rewards are paramount to P2P networks and my favourite thing to talk about!
In Bitcoin, this is earning Bitcoin. BitTorrent rewards participants with faster download speeds. Tor rewards other Tor members with more privacy.
All of these rewards are given for contributing to the network.
The idea of a reward is to eliminate free-riders who partake in the network without giving anything back. There are 2 ways to discourage free-riders:
Tor is the former. The only reward that can be given is to become a guard node, which provides no obvious benefit for the user other than to feel good about themselves.
The latter is what most peer to peer networks use, rewarding users who give back.
In BitTorrent, those who seed (upload) are rewarded with higher download rates. The more you seed, the faster you can download.
In Bitcoin, users are rewarded for mining the blockchain with Bitcoin.
This leads us to look at peer to peer networks from 2 interesting perspectives. Human civilisations, or reinforcement learning machines.
Let's start with human civilisations.
We do work, and we get rewarded. Capitalism. In semi P2P networks (where some nodes have more power than others) we can almost perfectly replicate capitalism.
If we force everyone to do the same work for the same reward then it is communism.
Luxury communism uses machines to do the heavy lifting leaving humans to do other things and is a new 21st-century take. This more closely relates to P2P networks than traditional communism.
As P2P networks grow, if not governed properly they will eventually fall back into a client-server architecture. Why use Jim's computer to download `ubuntu.iso` when you can download it from LargeCorp, which has a much faster connection?
With planning, we can build a dictatorship of the proletariat where every ordinary node has power in the network. Not just those with higher rewards.
There will always be those with more power than others (perhaps they have more computers on the network) but the point is that everyone has some power over the network.
And the ordinary nodes will likely always have more power than larger clusters of nodes, as seen in Bitcoin.
Modelling P2P networks as political movements allows us to talk about P2P networks as more than just networks, as entire communities.
Eventually people with little to no knowledge about networking can discuss and change how the network operates using these abstractions.
We can use artificial intelligence to automatically optimise a P2P network. Humans are slow when it comes to making changes compared to computers.
P2P networks already use technology to find the shortest routes or preferred routes. As networks become more and more complicated, simple handwritten algorithms might not be enough.
Picture this nifty example.
In our network, we have multiple nodes all interconnected. Each node has a rating.
This rating signifies how good that node is, based on how much reward that node has accumulated. Whether that reward comes from speed, accuracy or trust is problem dependent.
Every time this node is used, its personal reward is increased. Much like Kohonen's competitive learning rules.
But why stop there? If the node is faster, trustworthy, contributes heavily to the network it should have a higher reward rating.
And our goal is to create the "best" path across the network, taking into account all of these numbers and to maximise our reward, whatever that may be.
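As a hedged sketch of such a reward-driven selection scheme (all node names, ratings and the reward increment here are invented for illustration, not part of any real protocol):

```python
import random

# Hypothetical nodes mapped to their reward ratings (names and values invented).
ratings = {"node_a": 5.0, "node_b": 1.0, "node_c": 2.0}

def pick_node(ratings):
    """Pick a node with probability proportional to its rating, so strong
    nodes are preferred but weak ones are still explored occasionally."""
    nodes = list(ratings)
    weights = [ratings[n] for n in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

chosen = pick_node(ratings)
ratings[chosen] += 0.1  # the node that was used gets its reward increased
```

The exploratory element (weak nodes still being chosen sometimes) is exactly what stops a few strong nodes from hogging all the resources.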
Tor already has a similar (albeit smaller) system in place like this. Guard nodes, bridge nodes, and more.
This seems like the perfect scenario for reinforcement learning: multiple inputs, and a reward to maximise. Exploratory choices, where a node with a "bad" rating is occasionally chosen, prevent some nodes from hogging all the resources.
Reinforcement learning is the closest computer learning to how humans learn, according to Sutton & Barto.
Humans will eventually optimise the network manually, so that good nodes get better jobs or are chosen more often than others. Reinforcement learning is simply much faster than any human could be.
We, as humans, naturally & manually improve the network over time. What if we could use AI to speed this process up?
An ideal P2P network will improve the more people that use it.
In Tor, the more people that use it the more privacy is given. I talk more about this in my Tor blog post.
A simple distinction, but one that must be made. As the network grows older, ideally, it reaches its global optimised state.
Unlike client-server, where the more users there are the harder it becomes to maintain the network.
Take the example of downloading a file. Client-Server starts out being incredibly fast, but over time as the file becomes more popular that download speed degrades.
With a P2P network, downloads start out slow (assuming the initial uploader is slow) but the speed increases over time. What's important is that over time, client-server gets worse with more users while P2P gets better.
Optimisation of the network as it grows is often done through the rewards system.
As Tor grows, the privacy of each user increases. Thus, the more that use Tor, the better the privacy.
Each user has their personal reward, but together this creates a positive reward for the whole network.
In Bitcoin, the more people that mine, the harder mining becomes, and the larger the incentive to use better computers. Better computers = faster mining = more money.
This creates competition amongst users. Who can contribute to the network the most? Users are fighting for their own personal reward, but this fight improves the network as a whole.
It is almost impossible to perform a 51% attack on the network when miners are hyper-competitive like this.
Users work towards their own personal rewards, which means the network gets better from not only more users but better rewards. The entire network scales majestically upward.
When choosing a reward, it is important to look at the local rewards (what the user gets) and to look at the greater rewards (what the network will get for participating).
I also recommend rewards which increase as the network grows: scalable rewards. If the price of Bitcoin did not scale with how many users it had, the network would degrade, or would improve much more slowly over time.
With a client-server architecture, it can be amazing from the start. But with a peer to peer network, it sucks for the first few users.
With Bitcoin, what was the incentive? You get this coin worth... nothing at all.
Tor. The incentive? Well, if there are only 3 users then there is no incentive, and no privacy. A 51% attack becomes a matter of whether someone loses WiFi on the train or not.
BitTorrent. The incentive? You can download this one person's files.
Peer to peer networks generally suck for the first couple of users and only become worthwhile over time.
So how then, do we build an effective peer to peer network that other people will want to use?
There are 2 methods to this.
You can force users to use the network. This is how Tor got its start. The US Naval Research Labs built Tor, and effectively they could force others in the Navy to use it.
The other option is incredible technology. In a peer to peer network, the first few users are often those who recognise how incredible the technology is.
Think about Bitcoin. The first few users fell in love with the technology behind it, and truly believed it could one day be great.
With a great rewards system in place, the first users of Bitcoin realised it could grow up to be something huge. Amazing technology attracts the first few users, and those users are often technologically minded.
In almost all peer to peer networks, jobs are assigned to nodes.
In Bitcoin, we have miners whose job it is to mine the blockchain.
In Tor, we have guard nodes.
If a node has proven to be trustworthy, then we make it a guard node. If a node wants to mine the blockchain, it becomes a miner.
These nodes do not have more rights than the other nodes, they simply have a job to do. And sometimes this job provides a reward, and sometimes it doesn't.
If we give everyone the same job, the network will tend towards global optimisation much slower.
Let's say we have a new file-sharing system. Our system will let you download files from others.
We have a file, `Ubuntu.iso`, which is being seeded by 10 people.
We choose a seed to download from at random. This seed has an upload speed of 1 Mbps, while all the other seeders have upload speeds of 200 Mbps.
It makes logical sense to use the uploaders with faster speeds: to choose from the subsection of people with high upload speeds. Effectively, we have given these people 'jobs'.
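A tiny sketch of that job assignment, assuming we simply pick the nodes with the highest upload speeds (the names and speeds below are made up):

```python
# Hypothetical seeders and their upload speeds in Mbps (values invented).
seeders = {"jim": 1, "alice": 200, "bob": 200, "carol": 200}

def fastest_seeders(seeders, k):
    """Return the k seeders with the highest upload speed --
    effectively the nodes we give the 'uploader' job to."""
    return sorted(seeders, key=seeders.get, reverse=True)[:k]

print(fastest_seeders(seeders, 3))  # ['alice', 'bob', 'carol']
```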
There are 2 types of jobs in peer to peer networks:
With rewarded jobs, the uploaders get a higher reward for working. This reward might be faster download speeds, such as in BitTorrent.
With unpaid jobs, there is no reward. There is no incentive to take the job; it is simply given to certain nodes. This is how guard nodes work in Tor.
If we have a job which people have to opt in to, but which doesn't provide a reward, it is unlikely that many people will opt in.
Deciding on what jobs to give, and whether it is rewarded or not is problem specific.
For example, say we have a job called "maintainer of the blockchain". This job means a person keeps a copy of the blockchain and maintains it. Only they do; no one else can. Any cryptocurrency nerd reading this may already see the problem.
To view the blockchain, you need to have this job. But not every citizen of the world has a spare 300 GB to store the chain. They cannot get the job, because they do not meet the required specification.
You also have to meet a plethora of other specifications. Most of them a normal home user cannot achieve.
This is another issue. Can we reject people if they do not meet the job specification? Again, problem specific.
With only a small per cent controlling the blockchain, it leaves it open to a 51% attack. A small few will hold all the power. It will not work well in this instance.
I propose that jobs are automatically assigned based on the specification of the user, but any user can request a job.
The only jobs they should be rejected for is high trust jobs, such as guard nodes. But, the specification of their computer should not be taken into account for trust jobs.
Humans picking jobs are too slow. The network needs to react fast and scale fast. Humans are not fast. So automatic assignment it is.
But we do not want to prevent humans from accessing resources because of their circumstances.
When assigning jobs, the jobs must work in harmony.
We should take inspiration from beehives or ant colonies here. Human jobs are simply too hard to model; focusing on a much smaller example is wise.
Should there be a punishment if someone takes a job for which they are unfit? Should a computer with 30 GB of storage take the punishment for trying to maintain a 300 GB file?
On the one hand, only a very small per cent of the population will ignore the assigned job and choose something for which they are unfit.
But, to achieve global optimisation we must make sure every single thing is accounted for.
Unfortunately, in this world, those that do jobs that they are not capable of should be punished. I would not say to reduce the resources available to them through the job, but perhaps reduce the reward of using the network in some way.
Every single effective peer to peer network uses data structures.
The blockchain is just a fancy linked list with some extra properties.
BitTorrent's magnet links work based on distributed hash tables.
Data structures are vital to peer to peer networks. I do not want to get into every data structure available, but knowledge of them is important.
We can design a network using a distributed data structure that already exists, or we can build a new data structure/algorithm based on pre-existing ones.
Throughout this article, I have talked about designing the network to be perfect from the get-go. Anyone knows this isn't possible. So how then, do we change the network when it's already running? How do we implement policy changes?
In our first instance, we have a dictatorship.
You, the creator decides on changes. You create these changes, and they go out to the network. We don't care what the participants think, only that the change has taken place.
This is the most effective for easily implementing policy changes, but it ruins the trust.
Good peer to peer networks reduce trust: we should not need to trust anyone. However, trust is important in some networks. It depends on the network itself.
If the network was for a digital currency, why would you trust this entire currency in the hands of a single person?
But for a privacy centred network, having this trust in one person or body of people can increase the trust throughout the network. It reduces the chance of a 51% attack backdoor in the early days, building trust from day 0.
As the network progresses, it strays further and further away from a 51% attack. This means we can drop the trust: we no longer need to trust that it is safe from a 51% attack once the network is large enough.
In the early days of a P2P network, it is very much a dictatorship. A few small people decide on changes to the network (think of the early days of Bitcoin) and this eventually moves towards democracy as the network grows.
But again, this brings problems. What if there is a critical failure in the code, and a change needs to be made during the network's prime time? This is for the designer to decide.
We can alternatively let the people decide.
From the get-go, no single person has full control. Every time a change is made, a % of the network (likely 51) will need to agree to the change. Complete control for the network. No trusting the developer involved.
At the start, this may be problematic. What if a node joined the network purely out of FOMO and doesn't care about voting on updates?
Humans are incredibly slow too. Imagine a critical bug in the code that had to be fixed right there and then. Humans may take weeks to decide. Many will not even respond to the decision.
The alternative is a representative democracy.
Give certain people more power than others: have the community vote in leaders, or have the leaders be pre-selected.
These leaders then decide on changes to the codebase. This reduces the trust required, especially if they are voted in.
But nothing is stopping them from teaming together to get the maximum personal reward while not doing anything for the many.
All in all, deciding on policy changes is a tricky subject. And one that should be decided before the network is made.
Although something to remember is trust. Will the users need to trust you, personally?
If so, try and work towards a goal where that trust is reduced. Diversify that trust over to others. Remove trust completely, make them only trust themselves.
Effective peer to peer networks are unlike client-server models, where one entity controls all the power. Effective P2P networks are for the people.
The Cathedral is waterfall development. Slow, well planned. The people working on it are cathedral builders. The result is presented and sold to the public as a finished product.
The Bazaar is the opposite. Work fast. Break things. Release early. Release often. Listen to your customers.
When building a full P2P network, we would build both the cathedral and the bazaar.
In an ideal full P2P network, eventually, the creator's power dwindles to almost nothing (they become just another normal node in the network).
This is what Satoshi Nakamoto did. They created the cathedral first, and then over time moved to the bazaar - eventually losing all power to change anything in the network without the permission of the network.
Our P2P network could explode on day 1 and we would have no control to finish building the cathedral.
For this reason, in a full P2P network where all nodes are equal, we should build a cathedral and then transition naturally to a bazaar.
In a semi P2P network, where some nodes have more power than others, the developer doesn't have to give up their child any time soon.
However, this requires the nodes to trust the developers. Something that not all P2P networks have the liberty of.
The main thing in P2P networks is removing trust. We don't necessarily want people to trust us when building a network, preferably we would want to remove as much trust as possible.
Closed source systems do not instil much trust, therefore we must choose an open-source system.
We now know the biggest questions to ask when designing effective P2P networks.
Often I'll hear about how you can optimise a for loop to be faster, or how switch statements are faster than if statements. Most computers have more than one core, with the ability to support multiple threads. Before worrying about optimising for loops or if statements, try to attack your problem from a different angle.
Divide and Conquer is one way to attack a problem from a different angle. Don't worry if you have zero experience or knowledge on the topic. This article is designed to be read by someone with very little programming knowledge.
I will explain this using 3 examples. The first will be a simple explanation. The second will be some code. The final will get into the mathematical core of divide and conquer techniques. (Don't worry, I hate maths too).
Divide and conquer is where you divide a large problem up into many smaller, much easier to solve problems. The rather small example below illustrates this.
We take the equation "3 + 6 + 2 + 4" and cut it down into the smallest set of equations, which is [3 + 6, 2 + 4]. It could also be [2 + 3, 4 + 6]. The order doesn't matter, as long as we turn this one long equation into many smaller equations.
Let's say we have 8 numbers:
$$4 + 6 + 3 + 2 + 8 + 7 + 5 + 1$$
We want to add them all together. We first divide the problem into 8 equal sub-problems. We do this by breaking the addition up into individual numbers.
$$4 6 3 2 8 7 5 1$$
We then add 2 numbers at a time: the 8 numbers become 4 sums, the 4 sums become 2, and those 2 become our final result.
Why do we break the problem down to individual numbers in stage 1? Why not just start from stage 2? Because although this list of numbers has an even length, if the list were odd you would need to break it down to individual numbers to handle it cleanly.
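The pairwise-adding idea above can be sketched recursively in Python (splitting in half rather than literally pairing, which is the same divide and conquer idea):

```python
def dc_sum(nums):
    """Add a list of numbers divide-and-conquer style:
    split in half, sum each half recursively, combine with +."""
    if len(nums) == 1:        # base case: a single number
        return nums[0]
    mid = len(nums) // 2
    return dc_sum(nums[:mid]) + dc_sum(nums[mid:])

print(dc_sum([4, 6, 3, 2, 8, 7, 5, 1]))  # 36
```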
A divide and conquer algorithm tries to break a problem down into as many little chunks as possible since it is easier to solve with little chunks. It does this with recursion.
Before we get into the rest of the article, let's learn about recursion first.
Recursion is when a function calls itself. It's a hard concept to understand if you've never heard of it before. This page provides a good explanation.
Matryoshka dolls are these cute little things:
We open up the bigger one, and inside is a slightly smaller one. Inside that one is another slightly small doll. Let's say, inside the last doll is a key. But we do not know how many dolls there are. How do we write a function that opens up the dolls until we find a key?
We could use a while loop, but recursion is preferred here.
To program this, we can write:
def getKey(doll):
    item = doll.open()
    if item == key:
        return key
    else:
        return getKey(item)

getKey(doll)
The function repeatedly calls itself until it finds a key, at which point it stops. The finding key point is called a break case or exit condition.
We always add a break case to a recursive function. If we didn't, it'd just be an infinite loop! Never ending.
Computer scientists love recursion. Because it's so hard for normal people to understand, we have a schadenfreude sensation watching people struggle. Haha just kidding!
We love recursion because it's used in maths all the time. Computer scientists are mathematicians first, coders second. Anything that brings code closer to real-life mathematics is good.
Not just because some people love maths, but because it makes it easier to implement. Need to calculate the Fibonacci numbers? The maths equation for this is:
$$F(n) = \begin{cases} n, \text{if n = 0 or 1} \\ F(n - 1) + F(n - 2), \; \text{if n > 1} \end{cases}$$
A natural recurrence in our formula! Instead of translating it into loops, we can just calculate it:
def F(n):
    if n == 0 or n == 1:
        return n
    else:
        return F(n-1) + F(n-2)
This is one of the reasons why functional programming is so cool.
Also, as you'll see throughout this article, recursion reads so much nicer than loops. And hey, maybe you can feel a little happier when your coworker doesn't understand recursion but you do ;)
The technique, as defined in the famous Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein, is:
1. Divide
If the problem is small, then solve it directly. Otherwise, divide the problem into smaller subsets of the same problem.
2. Conquer
Conquer the smaller problems by solving them recursively. If the sub-problems are small enough, recursion is not needed and you can solve them directly.
3. Combine
Take the solutions to the sub-problems and merge them into a solution to the original problem.
Let's look at another example, calculating the factorial of a number.
n = 6

def recur_factorial(n):
    if n == 1:
        return n
    else:
        return n * recur_factorial(n-1)

print(recur_factorial(n))
With the code from above, there are some important things to note. The Divide part is also the recursion part: we divide the problem up at `return n * recur_factorial(n-1)`. The `recur_factorial(n-1)` call is where we divide the problem up.
The conquer part is the recursion part too, but also the if statement. If the problem is small enough, we solve it directly (by returning n). Else, we perform `return n * recur_factorial(n-1)`.
Combine. We do this with the multiplication symbol. Eventually, we return the factorial of the number. If we didn't have the symbol there, and it was just `return recur_factorial(n-1)`, it wouldn't combine and wouldn't output anything like the factorial. (It would output 1, for those interested.)
We'll explore how divide and conquer works in some famous algorithms, Merge Sort and the solution to the Towers of Hanoi.
Merge Sort is a sorting algorithm. The algorithm works as follows:
In this image, we break down the 8 numbers into separate digits. Just like we did earlier. Once we've done this, we can begin the sorting process.
It compares 51 and 13. Since 13 is smaller, it puts it in the left-hand side. It does this for (10, 64), (34, 5), (32, 21).
It then merges (13, 51) with (10, 64). It knows that 13 is the smallest in the first list, and 10 is the smallest in the right list. 10 is smaller than 13, therefore we don't need to compare 13 to 64. We're comparing & merging two sorted lists.
In recursion we use the term base case to refer to the absolute smallest value we can deal with. With Merge Sort, the base case is 1. That means we split the list up until we get sub-lists of length 1. That's also why we go down all the way to 1 and not 2. If the base case was 2, we would stop at the 2 numbers.
If the length of the list (n) is larger than 1, then we divide the list and each sub-list by 2 until we get sub-lists of size 1. If n = 1, the list is already sorted so we do nothing.
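A minimal Python sketch of the merge sort just described, using the article's example numbers (not optimised):

```python
def merge_sort(lst):
    # Base case: a list of length 1 (or 0) is already sorted.
    if len(lst) <= 1:
        return lst
    mid = len(lst) // 2
    left = merge_sort(lst[:mid])    # divide & conquer each half
    right = merge_sort(lst[mid:])
    # Combine: merge two sorted lists by repeatedly taking the smaller head.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([51, 13, 10, 64, 34, 5, 32, 21]))
# [5, 10, 13, 21, 32, 34, 51, 64]
```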
Merge Sort is an example of a divide and conquer algorithm. Let's look at one more algorithm to understand how divide and conquer works.
The Towers of Hanoi is a mathematical problem which comprises 3 pegs and 3 discs. This problem is mostly used to teach recursion, but it has some real-world uses. The number of pegs & discs can change.
Each disc is a different size. We want to move all discs to peg C so that the largest is on the bottom, second largest on top of the largest, third largest (smallest) on top of all of them. There are some rules to this game:
We want to use the smallest number of moves possible. If we have 1 disc, we only need to move it once. With 2 discs, we need 3 moves.
The number of moves is a power of 2, minus 1. Say we have 4 discs: we calculate the minimum number of moves as $2^4 - 1 = 16 - 1 = 15$.
To solve the above example we want to store the smallest disc in a buffer peg (1 move). See below for a gif on solving Tower of Hanoi with 3 pegs and 3 discs.
Notice how we need to have a buffer to store the discs.
We can generalise this problem. If we have n discs: move n-1 from A to B recursively, move largest from A to C, move n-1 from B to C recursively.
If there is an even number of pieces the first move is always into the middle. If it is odd the first move is always to the other end.
Let's code the algorithm for ToH, in pseudocode.
function MoveTower(disk, source, dest, spare):
    if disk == 0, then:
        move disk from source to dest
We start with a base case, `disk == 0`. `source` is the peg you're starting at. `dest` is the final destination peg. `spare` is the spare peg.
FUNCTION MoveTower(disk, source, dest, spare):
    IF disk == 0, THEN:
        move disk from source to dest
    ELSE:
        MoveTower(disk - 1, source, spare, dest) // Step 1
        move disk from source to dest // Step 2
        MoveTower(disk - 1, spare, dest, source) // Step 3
    END IF
Notice that in step 1 we swap `dest` and `spare`, so the smaller tower is moved onto the spare peg. In step 3 we instead swap `source` and `spare`, moving that tower from the spare peg onto the destination.
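For the curious, the pseudocode translates almost line-for-line into runnable Python. Here each move is recorded in a list so we can inspect the solution (the `moves` parameter is my own addition):

```python
def move_tower(disk, source, dest, spare, moves):
    """Move discs 0..disk from source to dest using spare,
    recording each move as a (from, to) pair in `moves`."""
    if disk == 0:
        moves.append((source, dest))
    else:
        move_tower(disk - 1, source, spare, dest, moves)  # Step 1
        moves.append((source, dest))                      # Step 2
        move_tower(disk - 1, spare, dest, source, moves)  # Step 3

moves = []
move_tower(2, "A", "C", "B", moves)  # discs 0, 1, 2: three discs in total
print(len(moves))  # 7, i.e. 2^3 - 1 moves
```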
With recursion, we know 2 things:
The algorithm gets a little confusing with steps 1 and 3, since they both call the same function. Note that in Towers of Hanoi the steps depend on each other (step 2 needs step 1 to have finished), so they must run in order. But in divide and conquer problems whose sub-problems are independent, like the summing example earlier, this is where multi-threading comes in: you can run the two recursive calls on different threads, at the same time.
So far you've seen what the divide and conquer technique is. You should understand how it works and what the code looks like. Next, let's learn how to define a divide and conquer algorithm for a problem. This part is the most important: once you know it, it'll be easier to create divide and conquer algorithms of your own.
When we have a problem that looks similar to a famous divide & conquer algorithm (such as merge sort), it will be useful.
Most of the time, the algorithms we design will be most similar to merge sort. If we have an algorithm that takes a list and does something with each element of the list, it might be able to use divide & conquer.
For example, working out the largest item of a list. Given a list of words, how many times does the letter "e" appear?
If we have an algorithm that is slow and we would like to speed it up, one of our first options is divide and conquer.
There isn't any obvious tell-tale signs other than "similar to a famous example". But as we'll see in the next section, we can check if it is solvable using divide & conquer.
Now we know how divide and conquer algorithms work, we can build up our own solution. In this example, we'll walk through how to build a solution to the Fibonacci numbers.
We can find Fibonacci numbers in nature. The way rabbit populations grow follows the Fibonacci numbers: 2 rabbits make 3, 3 rabbits make 5, 5 rabbits make 8, and so on.
The numbers start at 0, and each next number is the current number plus the previous number. By mathematical definition, the first 2 numbers are 0 and 1.
Let's say we want to find the 5th Fibonacci number. We can do this:
# [0, 1]
0 + 1 = 1 # 3rd fib number
# [0, 1, 1]
1 + 1 = 2 # 4th fib number
# [0, 1, 1, 2]
2 + 1 = 3 # 5th fib number
# [0, 1, 1, 2, 3]
Now, the first thing when designing a divide and conquer algorithm is to design the recurrence. A recurrence is an equation which defines a function in terms of its smaller inputs; recurrence and recursion sound similar, and are similar. The recurrence always starts with a base case.
As we saw, our base case is the 2 numbers at the start.
def f(n):
    if n == 0 or n == 1:
        return n
To calculate the 4th Fibonacci number, we do F(4 - 1) + F(4 - 2). This means (the last number in the sequence) + (the number before the last). Or in other words:
The next number is the current number + the previous number.
If our number is not 0 or 1, we want to add the last 2 Fibonacci numbers together.
Let's take a look at our table quickly:
# [0, 1]
0 + 1 = 1
# [0, 1, 1]
1 + 1 = 2
# [0, 1, 1, 2]
2 + 1 = 3
# [0, 1, 1, 2, 3]
2 + 3 = 5
# [0, 1, 1, 2, 3, 5]
But what if we don't have this list stored? How do we calculate the 6th number without creating a list at all? Well we know that the 6th number is the 5th number + the 4th number. Okay, what are those? The 5th number is the 4th number + the 3rd number. The 4th number is the 3rd number + the second number.
We know that the second number is always 1, as we've reached a base case.
Eventually we break it down to the base cases. So: our code calls itself to calculate the Fibonacci numbers that came before the current one:
def f(n):
    if n == 0 or n == 1:
        return n
    else:
        f(n-1) f(n-2)
Okay, how do we merge the Fibonacci numbers at the end? As we saw, it is the last number added to the current number.
def f(n):
    if n == 0 or n == 1:
        return n
    else:
        return f(n-1) + f(n-2)
Now we've seen this, let's write it as a recurrence. Luckily for us, it's incredibly easy to go from a recurrence to code, or from code to a recurrence, as they are both recursive definitions!
$$F(n) = \begin{cases} n, \text{if n = 0 or 1} \\ F(n - 1) + F(n - 2), \; \text{if n > 1} \end{cases}$$
We often calculate the result of a recurrence using an execution tree. We saw this earlier when exploring how to build it in code. For F(4) it looks like this:
n is 4, and n is larger than 0 or 1, so we do f(n-1) + f(n-2). We ignore the addition for now. This results in 2 new nodes, 3 and 2. 3 is larger than 0 or 1, so we do the same; the same goes for 2. We do this until we get a set of leaf nodes which are all either 0 or 1.
We then add all the leaves together: 1 + 0 + 1 + 1 + 0 = 3, which is F(4).
Normally, if our algorithm follows a famous divide & conquer algorithm, we can infer our Big O from that. Otherwise, this is no different from calculating the Big O notation of our own algorithms.
Greedy vs Divide & Conquer vs Dynamic Programming

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking a problem down into simpler versions of itself and solving them with recursion (and possibly multi-threading) | Same as Divide & Conquer, but caches the answer to each sub-problem so no calculation is repeated |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but can be pointless on small datasets |
| Requires almost no memory | Requires some memory to remember recursive calls | Requires a lot of memory for memoisation / tabulation |
Once you've identified how to break a problem down into many smaller pieces, you can use concurrent programming to execute these pieces at the same time (on different threads) speeding up the whole algorithm.
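As an illustrative sketch of that idea (using nothing beyond Python's standard concurrent.futures), we can sum a list by conquering each half on its own thread:

```python
from concurrent.futures import ThreadPoolExecutor

def total(chunk):
    # each worker solves its own subproblem independently
    return sum(chunk)

data = list(range(1, 101))
# divide: split the input into two independent halves
halves = [data[:50], data[50:]]

# conquer each half on a separate thread, then combine the results
with ThreadPoolExecutor() as pool:
    result = sum(pool.map(total, halves))

print(result)  # 5050
```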
Divide and conquer algorithms are one of the fastest and perhaps easiest ways to increase the speed of an algorithm and are useful in everyday programming. Here are the most important topics we covered in this article:
The next step is to explore multi-threading. Choose your programming language of choice and Google, as an example, "Python multi-threading". Figure out how it works and see if you can attack any problems in your own code from this new angle.
You can also learn about how to solve recurrences (finding the asymptotic running time of a recurrence), which is the next article I'm going to write. If you don't want to miss it, or you liked this article, consider subscribing to my email list.
Greedy algorithms aim to make the optimal choice at that given moment. Each step it chooses the optimal choice, without knowing the future. It attempts to find the globally optimal way to solve the entire problem using this method.
We call algorithms greedy when they utilise the greedy property. The greedy property is:
At that exact moment in time, what is the optimal choice to make?
Greedy algorithms are greedy. They do not look into the future to decide the global optimal solution. They are only concerned with the optimal solution locally. This means that the overall optimal solution may differ from the solution the algorithm chooses.
They never look backwards at what they've done to see if they could optimise globally. This is the main difference between Greedy and Dynamic Programming.
To be extra clear, one of the most Googled questions about greedy algorithms is:
"What problem-solving strategies don't guarantee solutions but make efficient use of time?"
The answer is "Greedy algorithms". They don't guarantee solutions, but are very time efficient. However, in the next section we'll learn that sometimes Greedy solutions give us the optimal solutions.
Greedy algorithms are quick. A lot faster than the two other alternatives (Divide & Conquer, and Dynamic Programming). They're used because they're fast.
Sometimes, Greedy algorithms give the globally optimal solution every time. Some of these algorithms are:
These algorithms are Greedy, and their Greedy solution gives the optimal solution.
We're going to explore greedy algorithms using examples, learning how it all works along the way.
Your algorithm needs to follow this property:
At that exact moment in time, what is the optimal choice to make?
And that's it. There isn't much to it. Greedy algorithms are easier to code than Divide & Conquer or Dynamic Programming.
Imagine you're a vending machine. Someone gives you £1 and buys a drink for 70p. There's no 30p coin in pound sterling, so how do you calculate how much change to return?
For reference, this is the denomination of each coin in the UK:
1p, 2p, 5p, 10p, 20p, 50p, £1
The greedy algorithm starts from the highest denomination and works backwards. Our algorithm starts at £1. £1 is more than 30p, so it can't use it. The same goes for 50p. It reaches 20p. 20p < 30p, so it takes one 20p.
The algorithm still needs to return change of 10p. It tries 20p again, but 20p > 10p. It moves on to 10p. It chooses one 10p, and now the change owed is 0, so we stop the algorithm.
We return 1x20p and 1x10p.
This algorithm works well in real life. Let's use another example, this time with each denomination listed next to how many of that coin is in the machine, as (denomination, how many):
(1p, 10), (2p, 3), (5p, 1), (10p, 0), (20p, 1), (50p, 19), (100p, 16)
The algorithm is asked to return change of 30p again. 100p (£1) is too big. The same goes for 50p. 20p works, so we pick 1x 20p. We now need to return 10p. 20p has run out, so we move down one.
10p has run out, so we move down 1.
We have 5p, so we choose 1x5p. We now need to return 5p. 5p has run out, so we move down one.
We choose 1 2p coin. We now need to return 3p. We choose another 2p coin. We now need to return 1p. We move down one.
We choose 1x 1p coin.
Our algorithm selected these coins to return as change:
# (value, how many we return as change)
(20, 1)
(5, 1)
(2, 2)
(1, 1)
Let's code something. First, we need to define the problem. We'll start with the denominations.
denominations = [1, 2, 5, 10, 20, 50, 100]
# 100p is £1
Now onto the core function. Given denominations and an amount to give change, we want to return a list of how many times that coin was returned.
If our denominations
list is as above, [6, 3, 0, 0, 0, 0, 0]
represents taking 6 1p coins and 3 2p coins, but 0 of all other coins.
denominations = [1, 2, 5, 10, 20, 50, 100]
# 100p is £1

def returnChange(change, denominations):
    toGiveBack = [0] * len(denominations)
    for pos, coin in reversed(list(enumerate(denominations))):
We create a list, the size of denominations long and fill it with 0's.
We want to loop backwards, from largest to smallest. reversed(x) reverses x and lets us loop backwards. enumerate means "loop through this list, but keep the position of each item in another variable". We enumerate first and then reverse, so that pos keeps each coin's original index in denominations. In our example, when we start the loop, coin = 100 and pos = 6.
Our next step is choosing a coin for as long as we can use that coin. If we need to give change = 40, we want our algorithm to choose 20, then 20 again, until it can no longer use 20. We do this using a while loop.
denominations = [1, 2, 5, 10, 20, 50, 100]
# 100p is £1

def returnChange(change, denominations):
    # makes a list the length of denominations, filled with 0
    toGiveBack = [0] * len(denominations)
    # goes backwards through the denominations list
    # and keeps each coin's original position, pos
    for pos, coin in reversed(list(enumerate(denominations))):
        # while we can still use coin, use it until we can't
        while coin <= change:
While the coin still fits into change, add that coin to our return list, toGiveBack, and subtract it from change.
denominations = [1, 2, 5, 10, 20, 50, 100]
# 100p is £1

def returnChange(change, denominations):
    # makes a list the length of denominations, filled with 0
    toGiveBack = [0] * len(denominations)
    # goes backwards through the denominations list
    # and keeps each coin's original position, pos
    for pos, coin in reversed(list(enumerate(denominations))):
        # while we can still use coin, use it until we can't
        while coin <= change:
            change = change - coin
            toGiveBack[pos] += 1
    return toGiveBack

print(returnChange(30, denominations))
# returns [0, 0, 0, 1, 1, 0, 0]
# 1x 10p, 1x 20p
The runtime of this algorithm is dominated by the 2 loops, thus it is $O(n^2)$.
It is optimal locally, but sometimes it isn't optimal globally. In the change giving algorithm, we can force a point at which it isn't optimal globally.
To force this, we pick a set of coin denominations that doesn't play nicely with greedy. We'll pick 1, 15, 25.
We'll ask for change of 30. Now, let's see what our Greedy algorithm does.
[5, 0, 1]
It chooses 1x 25p and 5x 1p. The optimal solution is 2x 15p.
Our Greedy algorithm failed because it didn't look at 15p. It looked at 25p and thought "yup, that fits. Let's take it."
It then looked at 15p and thought "that doesn't fit, let's move on".
This is an example of where Greedy Algorithms fail.
To get around this, you would have to design a currency where greedy always works, brute-force the solution, or use Dynamic Programming.
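We can reproduce the failure in code. This sketch re-includes the greedy change function from earlier (with snake_case names) so it runs on its own:

```python
def return_change(change, denominations):
    to_give_back = [0] * len(denominations)
    # greedily take the largest coin that still fits
    for pos, coin in reversed(list(enumerate(denominations))):
        while coin <= change:
            change -= coin
            to_give_back[pos] += 1
    return to_give_back

# greedy picks 1x 25p and 5x 1p, missing the optimal 2x 15p
print(return_change(30, [1, 15, 25]))  # [5, 0, 1]
```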
Dijkstra's algorithm finds the shortest path from a node to every other node in the graph. In our example, we'll be using a weighted directed graph. Each edge has a direction, and each edge has a weight.
Dijkstra's algorithm has many uses. It can be very useful within road networks where you need to find the fastest route to a place. We also use the algorithm for:
The algorithm follows these rules:
Our first step is to pick the starting node. Let's choose A. All the distances start at infinity, as we don't know their distance until we reach a node that knows the distance.
We mark off A on our unvisited nodes list. The distance from A to A is 0. The distance from A to B is 4. The distance from A to C is 2. We updated our distance listing on the right-hand side.
We pick the smallest edge where the vertex hasn't been chosen. The smallest edge is A -> C, and we haven't chosen C yet. We visit C.
Notice how we're picking the smallest distance from our current node to a node we haven't visited yet. We're being greedy. Here, the greedy method is the global optimal solution.
We can get to B from C. We now need to pick a minimum. $min(4, 2 + 1) = 3$.
Since A -> C -> B is smaller than A -> B, we update B with this information. We then add in the distances from the other nodes we can now reach.
Our next smallest vertex with a node we haven't visited yet is B, with 3. We visit B.
We do the same for B. Then we pick the smallest vertex we haven't visited yet, D.
We don't update any of the distances this time. Our last node is then E.
There are no updates again. To find the shortest path from A to the other nodes, we walk back through our graph.
We pick A first, C second, B third. If you need to create the shortest path from A to every other node as a graph, you can run this algorithm using a table on the right-hand side.
Dijkstra's Table:

| Node | Distance from A | Previous node |
|---|---|---|
| A | 0 | N/A |
| B | 3 | C |
| C | 2 | A |
| D | 5 | B |
| E | 6 | B |
Using this table it is easy to draw out the shortest distance from A to every other node in the graph:
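A minimal sketch of Dijkstra's using a priority queue. The article's diagram isn't reproduced here, so the edge weights below are assumptions chosen to be consistent with the table above:

```python
import heapq

# assumed weighted directed graph matching the walkthrough table
graph = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 2, "E": 3},
    "C": {"B": 1},
    "D": {},
    "E": {},
}

def dijkstra(graph, start):
    # every distance starts at infinity, except the start node at 0
    dist = {node: float("inf") for node in graph}
    dist[start] = 0
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry, a shorter path was already found
        # greedily relax each neighbour of the closest reachable node
        for neighbour, weight in graph[node].items():
            if d + weight < dist[neighbour]:
                dist[neighbour] = d + weight
                heapq.heappush(queue, (dist[neighbour], neighbour))
    return dist

print(dijkstra(graph, "A"))
# {'A': 0, 'B': 3, 'C': 2, 'D': 5, 'E': 6}
```

Notice the distances match the table: B is reached via C (2 + 1 = 3), not directly (4).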
Prim's algorithm is a greedy algorithm that finds a minimum spanning tree for a weighted undirected graph: the subset of edges that connects every node with the minimum possible total edge weight.
With a small change to Dijkstra's algorithm, we can build a new algorithm - Prim's algorithm!
We informally describe the algorithm as:
We have this graph.
Our next step is to pick an arbitrary node.
We pick the node A. We then examine all the edges connecting A to other vertices. Prim's algorithm is greedy. That means it picks the shortest edge that connects to an unvisited vertex.
In our example, it picks B.
We now look at all nodes reachable from A and B. This is the distinction between Dijkstra's and Prim's. With Dijkstra's, we're looking for the shortest path from one node to every other node. With Prim's, we want the minimum spanning tree of the whole graph.
We have 3 edges with equal weights of 3. We pick 1 randomly.
It is helpful to highlight our graph as we go along, because it makes it easier to create the minimum spanning tree.
Now we look at all edges of A, B, and C. The shortest edge is C > E with a weight of 1.
And we repeat:
The edge B > E with a weight of 3 is the smallest edge. However, both vertices are already in our VISITED list, meaning we do not pick this edge. We instead choose C > F, as we have not visited F yet.
The only node left is G, so let's visit it.
Note that if the edge weights are distinct, the minimum spanning tree is unique.
We can add the edge weights to get the minimum spanning tree's total edge weight:
$$2 + 3 + 3 + 1 + 6 + 9 = 24$$
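A sketch of Prim's with a priority queue. The adjacency list below is a hypothetical graph whose weights I chose to be consistent with the walkthrough (A-B picked first, a three-way tie at weight 3, C-E at weight 1, C-F at 6, G last, total 24); the article's actual diagram isn't reproduced here:

```python
import heapq

# assumed undirected graph consistent with the walkthrough above
graph = {
    "A": {"B": 2, "C": 3},
    "B": {"A": 2, "C": 3, "D": 3, "E": 3},
    "C": {"A": 3, "B": 3, "E": 1, "F": 6},
    "D": {"B": 3},
    "E": {"B": 3, "C": 1},
    "F": {"C": 6, "G": 9},
    "G": {"F": 9},
}

def prim(graph, start):
    visited = {start}
    # all edges leaving the start node, keyed by weight
    edges = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(edges)
    total = 0
    while edges:
        weight, frm, to = heapq.heappop(edges)
        if to in visited:
            continue  # both endpoints already in the tree, skip this edge
        visited.add(to)
        total += weight
        # add the new node's edges to unvisited neighbours
        for neighbour, w in graph[to].items():
            if neighbour not in visited:
                heapq.heappush(edges, (w, to, neighbour))
    return total

print(prim(graph, "A"))  # 24
```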
Imagine you are a thief. You break into the house of Judy Holliday - 1951 Oscar winner for Best Actress. Judy is a hoarder of gems. Judy's house is lined to the brim with gems.
You brought with you a bag - a knapsack, if you will. This bag can hold a total weight of 7. You happen to have a listing of Judy's items, from some insurance paper. The items read as:
Judy's Items:

| Name | Value | Weight |
|---|---|---|
| Diamonds | 16 | 5 |
| Francium | 3 | 1 |
| Sapphire | 6 | 2 |
| Emerald | 2 | 1 |
The first step to solving the fractional knapsack problem is to calculate $\frac{value}{weight}$ for each item.
Judy's Items:

| Name | Value | Weight | Value / Weight |
|---|---|---|---|
| Diamonds | 16 | 5 | 3.2 |
| Francium | 3 | 1 | 3 |
| Sapphire | 6 | 2 | 3 |
| Emerald | 2 | 1 | 2 |
And now we greedily select the largest ones. To do this, we can sort them according to $\frac{value}{weight}$ in descending order. Luckily for us, they are already sorted. The largest one is 3.2.
knapsack value = 16
knapsack total weight = 5 (out of 7)
Then we select Francium (I know it's not a gem, but Judy is a bit strange).
knapsack value = 19
knapsack weight = 6
Now, we add Sapphire. But if we add Sapphire, our total weight will come to 8.
In the fractional knapsack problem, we can cut items up to take fractions of them. We have a weight of 1 left in the bag. Our sapphire is weight 2. We calculate the ratio of:
$$\frac{weight\;of\;knapsack\;left}{weight\;of\;item}$$
And then multiply this ratio by the value of the item to get how much value of that item we can take.
$$\frac{1}{2} * 6 = 3$$
knapsack value = 22
knapsack weight = 7
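A sketch of the greedy fractional knapsack on Judy's items (names and numbers from the tables above; the running total works out to 16 + 3 + 3 = 22):

```python
def fractional_knapsack(items, capacity):
    # sort by value/weight ratio, descending - the greedy property
    items = sorted(items, key=lambda item: item[1] / item[2], reverse=True)
    value = 0.0
    for name, item_value, weight in items:
        if capacity == 0:
            break
        if weight <= capacity:
            # the whole item fits, take all of it
            value += item_value
            capacity -= weight
        else:
            # take only the fraction of the item that still fits
            value += item_value * (capacity / weight)
            capacity = 0
    return value

items = [("Diamonds", 16, 5), ("Francium", 3, 1),
         ("Sapphire", 6, 2), ("Emerald", 2, 1)]
print(fractional_knapsack(items, 7))  # 22.0
```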
The greedy algorithm can optimally solve the fractional knapsack problem, but it cannot optimally solve the {0, 1} knapsack problem. In this problem instead of taking a fraction of an item, you either take it {1} or you don't {0}. To solve this, you need to use Dynamic Programming.
The runtime for this algorithm is O(n log n). Calculating $\frac{value}{weight}$ is O(1). Our main step is sorting from largest $\frac{value}{weight}$, which takes O(n log n) time.
Greedy vs Divide & Conquer vs Dynamic Programming:

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking a problem down into simpler versions of itself, using recursion (and optionally multi-threading) to solve them | Same as Divide & Conquer, but optimises by caching the answer to each subproblem so it is never calculated twice |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but can be overkill on small datasets |
| Requires almost no memory | Requires some memory to remember recursive calls | Requires a lot of memory for memoisation / tabulation |
To learn more about Divide & Conquer and Dynamic Programming, check out these 2 posts I wrote:
Greedy algorithms are very fast, but may not provide the optimal solution. They are also easier to code than their counterparts.
By the end of this article, you'll thoroughly understand Big O notation. You'll also know how to use it in the real world, and even the mathematics behind it!
In computer science, time complexity is the computational complexity that describes the amount of time it takes to run an algorithm.
Big O notation is a method for determining how fast an algorithm is. Using Big O notation, we can learn whether our algorithm is fast or slow. This knowledge lets us design better algorithms.
This article's examples are written in Python, but kept language-agnostic. That means it will be easy to port the Big O notation code over to Java or any other language. Where the code isn't agnostic, there's Java code accompanying it.
We could run an algorithm 10,000 times and measure the average time taken.
$ python3 -m timeit '[print(x) for x in range(100)]'
100 loops, best of 3: 11.1 msec per loop
$ python3 -m timeit '[print(x) for x in range(10)]'
1000 loops, best of 3: 1.09 msec per loop
# We can see that the time per loop changes depending on the input!
Say we have an algorithm that takes a shopping list and prints out every item on the shopping list. If the shopping list has 3 items, it'll execute quickly. If it has 10 billion items, it'll take a long time.
What is the "perfect" input size to get the "perfect" measure of how long the algorithm takes?
Other things we need to consider:
For this reason, we use Big O (pronounced Big Oh) notation.
Big O is a formal notation that describes the behaviour of a function when the argument tends towards the maximum input.
It was invented by Paul Bachmann, Edmund Landau and others between 1894 and the early 1900s, and popularised in the 1970s by Donald Knuth.
Big O takes the upper bound. The worst-case results in the worst execution of the algorithm. For our shopping list example, the worst-case is an infinite list.
Instead of saying the input is 10 billion, or infinite - we say the input is n size. The exact size of the input doesn't matter, only how our algorithm performs with the worst input. We can still work out Big O without knowing the exact size of an input.
Big O is easy to read once we learn this table:
| Constant | Logarithm | Linear | Polynomial | Exponential |
|---|---|---|---|---|
| $O(1)$ | $O(\log n)$ | $O(n)$ | $O(n^2)$, $O(n^3)$, $O(n^x)$ | $O(2^n)$ |
The further right in the table, the longer the algorithm takes. n is the size of the input. Big O notation uses these functions to describe algorithm efficiency.
In our shopping list example, in the worst-case of our algorithm it prints out every item in the list sequentially. Since there are n
items in the list, it takes $O(n)$ time to complete the algorithm.
Other asymptotic (time-measuring) notations are:
| Big Omega (lower bound) | Big Theta (tight bound) | Big O (upper bound) |
|---|---|---|
| $\Omega(n)$ | $\Theta(n)$ | $O(n)$ |
Informally this is:
Let's walk through every single column in our "The Big O Notation Table".
Constant algorithms do not scale with the input size, they are constant no matter how big the input. An example of this is addition. $1 + 2$ takes the same time as $500 + 700$. They may take more physical time, but we do not add more steps in the algorithm for addition of big numbers. The underlying algorithm doesn't change at all.
We often see constant as $O(1)$, but any number could be used and it would still be constant. We sometimes change the number to a 1, because it doesn't matter at all about how many steps it takes. What matters is that it takes a constant number of steps.
Constant time is the fastest of all Big O time complexities. The formal definition of constant time is:
It is upper-bounded by a constant
An example is:
def OddOrEven(n):
    return "Even" if n % 2 == 0 else "Odd"
Or in Java:
boolean isEven(double num) { return ((num % 2) == 0); }
In most programming languages, all integers have limits. Primitive operations (such as modulo, %
) are all upper-bounded by this limit. If we go over this limit, we get an overflow error.
Because of this upper-bound, it satisfies $O(1)$.
Here's a quick explainer of what a logarithm is.
$$\log_{3} 9$$
What is being asked here is "3 to what power gives us 9?" 3 to the power of 2 gives us 9, so the whole expression looks like:
$$\log_{3} 9 = 2$$
A logarithmic algorithm halves the list every time it's run.
Let's look at binary search. Given the below sorted list:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
We want to find the number "2".
We implement Binary Search as:
def binarySearch(alist, item):
    first = 0
    last = len(alist) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            found = True
        elif item < alist[midpoint]:
            last = midpoint - 1
        else:
            first = midpoint + 1
    return found
In English this is:
The algorithm halves the input every single time it iterates. Therefore it is logarithmic. Other examples include:
Linear time means every single element of the input is visited exactly once, so the algorithm takes O(n) time. As the size of the input, n, grows, our algorithm's run time scales exactly with the size of the input.
Remember our shopping list app from earlier? The algorithm ran in O(n) which is linear time!
Linear time is where every single item in a list is visited once, in a worst-case scenario.
To read out our shopping list, our algorithm has to read out each item. It can't halve the list, or skip any items. It has to list all n items, one at a time.
shopping_list = ["Bread", "Butter", "The Nacho Libre soundtrack from the 2006 film Nacho Libre", "Reusable Water Bottle"]

for item in shopping_list:
    print(item)
Let's look at another example.
Given the list:
a = [2, 16, 7, 9, 8, 23, 12]
How do we work out what the largest item is?
We need to program it like this:
a = [2, 16, 7, 9, 8, 23, 12]
max_item = a[0]

for item in a:
    if item > max_item:
        max_item = item
We have to go through every item in the list, 1 by 1.
Polynomial time is a polynomial function of the input. A polynomial function looks like $n^2$ or $n^3$ and so on.
If one loop through a list is $O(n)$, 2 nested loops must be $O(n^2)$. For each item in the list, we go over the entire list once, resulting in $n^2$ operations.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for i in a:
    for x in a:
        print("x")
Each extra nested loop over the same list adds 1 to the exponent.
So a triple nested loop is $O(n^3)$.
Bubblesort is a good example of an $O(n^2)$ algorithm. The sorting algorithm takes the first number and swaps it with the adjacent number if they are in the wrong order. It does this for each number, until all numbers are in the right order - and thus sorted.
def bubbleSort(arr):
    n = len(arr)
    # Traverse through all array elements
    for i in range(n):
        # Last i elements are already in place,
        # so traverse the array from 0 to n-i-1
        for j in range(0, n - i - 1):
            # Swap if the element found is greater
            # than the next element
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

# Driver code to test above
arr = [64, 34, 25, 12, 22, 11, 90]
bubbleSort(arr)
print(arr)
As a side note, my professor refers to any algorithm with a time of polynomial or above as:
A complete and utter disaster! This is a disaster! A catastrophe!
But the thing with large time complexities is that they often show us that something can be quickened.
For instance, a problem I had: given a sentence, how many of those words appear in the English dictionary? We can imagine the $O(n^2)$ method. One for loop through the sentence, another through the dictionary.
dictionary = ["a", "an"]  # imagine if this was the whole dictionary
sentence = "hello uu988j my nadjjrjejas is brandon nanndwifjasj banana".split(" ")
counter = 0

for word in sentence:
    for item in dictionary:
        if word == item:
            counter = counter + 1
$O(n^2)$! A disaster! But, knowing that this is a disaster means we can speed it up.
Dictionaries are sorted by default. What if we sorted the list of words in the sentence too, and checked each word that way? We only need to loop through the dictionary once: if the word we want to check is less than the dictionary word we're on, we switch to the next word in the sentence.
Now our algorithm is $O(n log n)$. We recognise that this isn't a disaster, so we can move on! Knowing time complexities isn't only useful in interviews. It's an essential tool to improve our algorithms.
We can hasten many polynomial algorithms we construct using knowledge of algorithmic design.
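As an illustrative sketch of that idea (the names here are my own), sort both lists and walk them together with two pointers; sorting dominates at $O(n \log n)$, and the walk is a single linear pass:

```python
def count_words_in_dictionary(sentence, dictionary):
    # sort both lists so we can walk them together in one pass
    words = sorted(sentence.split(" "))
    dictionary = sorted(dictionary)
    count = i = j = 0
    while i < len(words) and j < len(dictionary):
        if words[i] == dictionary[j]:
            count += 1
            i += 1  # keep the dictionary word: the sentence may repeat it
        elif words[i] < dictionary[j]:
            i += 1  # this word can't be in the dictionary, move on
        else:
            j += 1  # advance the dictionary to catch up
    return count

print(count_words_in_dictionary("an apple a day", ["a", "an", "apple", "day"]))
# 4
```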
Exponential time is $2^n$, where the base (here, 2) depends on how many choices there are at each step.
This class of algorithm is the slowest of them all. You saw how my professor reacted to polynomial algorithms. He would jump up and down in fury at exponential algorithms!
Say we have a password consisting only of numbers (10 digits, 0 through 9), and we want to crack a password of length n. To brute-force through every combination we'll have:
$$10^n$$
Combinations to work through.
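We can sanity-check that count with itertools.product, which enumerates every combination (a toy sketch, not a real password cracker):

```python
from itertools import product

digits = "0123456789"
n = 3  # password length

# every possible n-digit password: 10^n combinations
combinations = list(product(digits, repeat=n))
print(len(combinations))  # 1000
```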
One example of exponential time is to find all the subsets of a set.
>>> list(subsets([]))
[()]
>>> list(subsets(['x']))
[(), ('x',)]
>>> list(subsets(['a', 'b']))
[(), ('a',), ('b',), ('a', 'b')]
We can see that when we have an input size of 2, the output size is $2^2 = 4$.
Now, let's code up subsets
.
from itertools import chain, combinations

def subsets(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
Taken from the documentation for itertools. What's important here is to see that it exponentially grows depending on the input size. Java code can be found here.
Exponential algorithms are horrific, but like polynomial algorithms we can learn a thing or two. Let's say we have to calculate $10^4$. We need to do this:
$$10 * 10 * 10 * 10 = 10^2 * 10^2$$
We have to calculate $10^2$ twice! What if we store that value somewhere and use it later so we do not have to recalculate it? This is the principle of Dynamic Programming, which you can read about here.
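A sketch of that idea is exponentiation by squaring: compute the half-power once and reuse it, instead of recalculating it (the function name is my own):

```python
def fast_pow(base, exponent):
    # Base case: anything to the power 0 is 1
    if exponent == 0:
        return 1
    # Compute base^(exponent // 2) ONCE and reuse it,
    # instead of calculating it twice
    half = fast_pow(base, exponent // 2)
    if exponent % 2 == 0:
        return half * half
    return half * half * base

print(fast_pow(10, 4))  # 10000
```

Reusing the stored half-power drops the work from $O(n)$ multiplications to $O(\log n)$.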
When we see an exponential algorithm, dynamic programming can often be used to speed it up.
Again, knowing time complexities allows us to build better algorithms.
Here's our Big O notation graph where the numbers are reduced so we can see all the different lines.
You can find the code for this graph here.
Rarely will time complexity be as easy as counting how many for loops we have. What if our algorithm looks like $O(n + n^2)$? We can simplify our algorithm using these simple rules:
If we have an algorithm described as $O(2n)$, we drop the $2$ so it becomes $O(n)$.
$O(nÂ˛ + n)$ becomes $O(nÂ˛)$. Only keep the larger one in Big O.
If we have a sum such as $O(bÂ˛ + a)$ we canât drop either without knowledge of what b and a are.
Yup! The hardest part is figuring out what our program's complexity is first. Simplifying is the easy part! Just remember the golden rule of Big O notation:
"What is the worst-case scenario here?"
This is the fastest time possible for a comparison sort. We cannot get any faster unless we use some special property, like Radix sort. O(n log n) is the fastest comparison sort time.
It's rather famous, because Mergesort runs in O(n log n). Mergesort is a great algorithm not only because it sorts fast, but because the idea is used to build other algorithms.
Mergesort is used to teach divide & conquer algorithms. And for good reason, it's a fantastic sorting algorithm that has roots outside of sorting.
Mergesort works by breaking down the list of numbers into individual numbers:
And then sorting each list, before merging them:
def mergeSort(alist):
    print("Splitting ", alist)
    if len(alist) > 1:
        mid = len(alist) // 2
        lefthalf = alist[:mid]
        righthalf = alist[mid:]

        mergeSort(lefthalf)
        mergeSort(righthalf)

        i = 0
        j = 0
        k = 0
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] <= righthalf[j]:
                alist[k] = lefthalf[i]
                i = i + 1
            else:
                alist[k] = righthalf[j]
                j = j + 1
            k = k + 1

        while i < len(lefthalf):
            alist[k] = lefthalf[i]
            i = i + 1
            k = k + 1

        while j < len(righthalf):
            alist[k] = righthalf[j]
            j = j + 1
            k = k + 1
    print("Merging ", alist)

alist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
mergeSort(alist)
print(alist)
Notice the 1e10 at the top of the graph? This one is so large, it makes all other times look constant!
This time complexity is often used as a joke, referring to Bogo Sort. I have yet to find a real-life (not-a-joke) algorithm that runs in O(n!) that isn't just deliberately enumerating all n! permutations.
Our own algorithms will normally be based on some famous algorithm that already has a Big O notation. If it's not, do not worry! Working out the Big O of our algorithm is easy.
Just think:
"What is the absolute worst input for my program?"
Take, for instance, a sequential searching algorithm.
def search(listInput, toFind):
    for counter, item in enumerate(listInput):
        if toFind == item:
            return (counter, item)
    return "did not find the item!"
The best input would be:
search(["apples"], "apples")
But the worst input is if the item was at the end of a long list.
search(["apples", "oranges", "The soundtrack from the 2006 film Nacho Libre", "Shrek"], "Shrek")
The worst-case scenario is $O(n)$, because we have to go past every item in the list to find it.
What if our search algorithm was binary search? We learnt that binary search divides the list in half every time. This sounds like log n!
What if our binary search looks for an object, and then looks to find other similar objects?
# Here we want to find the film Shrek, find its IMDB rating, and then find
# other films with that rating. We use binary search, then sequential search.
toFind = {"title": "Shrek", "IMDBrating": None}
ret = search(toFind)
ret = search(ret["IMDBrating"])
We find Shrek with an IMDB score of 7.8. But we're only sorted on the title, not the IMDB rating. We have to use sequential search to find all other films with the same rating.
Binary search is $O(\log n)$ and sequential search is $O(n)$; running one after the other makes our algorithm $O(\log n + n) = O(n)$. This isn't a disaster, so we can be sure it's not a terrible algorithm.
Even in the instances where our algorithms are not strictly related to other algorithms, we can still compare them to things we know. $O(\log n)$ means halving. $O(n^2)$ means a nested for loop.
One last thing: we don't always deal with a single n. Take the algorithm below:
from itertools import cycle

x = [1, 2, 3, 4, 5]
# cycle repeats the y list forever, so next(y) never runs out
y = cycle([2, 6])

counter = 0
total = 0.0

while counter != len(x):
    # multiplies 1 by 2, then 2 by 6, then 3 by 2, and so on
    total = total + x[counter] * next(y)
    counter += 1

print(total)
We have 2 inputs, x and y. Our notation is then $O(x + y)$. Sometimes we cannot make our notation smaller without knowing more about the data.
I made this little infographic for you! The "add +1 for every nested for loop" rule depends on the for loop, as we saw earlier. But explaining that all over again would take up too much space.
Okay, this is where it gets hard. A lot of complaints against Big O notation are along the lines of:
"You didn't really teach it, to really understand it you have to understand the maths!"
And I kinda agree. The surface level knowledge above will be good for most interviews, but the stuff here is the stuff needed to master Big O notation.
Just as a reminder, we want to master asymptotic time complexity as it allows us to create better algorithms.
I'm going to be writing out the formal notation, and then explaining it simply. Over-simplification causes misinformation, so if you are studying for a test take my simplifications as generalities and not the truth.
The mathematics is the whole truth, and you would be better off studying the maths rather than my simplifications. As I once read on the internet:
Shut up and calculate.
First things first, when I said:
Big O notation is the worst-case
That's not true at all. It's a white lie designed to help you learn the basics. Often used to get us to know enough to just pass interviews, but not enough to use it in the real world.
The formal definition of Big O notation is:
The upper-bounded time limit of the algorithm
Now, this often means "the worst-case" but not always. We can put upper bounds on whatever we want. But more often than not, we put upper-bounds on the worst-case. In one of our examples, we'll come across a weird formula where "the worst-case" isn't necessarily the one we choose for Big O.
This is an important distinction to make, because some caveats will confuse us otherwise.
Given 2 positive functions, $f(n)$ and $g(n)$ we say $f(n)$ is $O(g(n))$, written $f(n) \in O(g(n))$, if there are constants $c$ and $n_0$ such that:
$$f(n) \le c \cdot g(n) \quad \forall \, n \geq n_0$$
Also, sometimes $n_0$ is called $k$. But $c$ stays the same.
This is saying that the function (algorithm) is a part of another function (the Big O notation used). Simplifying again: Our algorithm falls within the range of a Big O notation time complexity (O(n), O(log n), etc). So our algorithm is that time complexity (to simplify it).
Let's see an example.
$$7n - 4 \in O(n)$$
Here we are claiming that $7n - 4$ is in $O(n)$ time. In formal Big O notation, we don't say it is that time. We say it falls within the range of that time.
We need to find constants $c$ and $n_0$ such that $7n-4 \le cn$ for all $n \geq n_0$.
One choice is $c = 7$ and $n_0 = 1$, since $7n - 4 \le 7n$ for all $n \geq 1$ (subtracting 4 can only make the left-hand side smaller).
This is just one of the many choices, because any real number $c \geq 7$ and any integer $n_0 \geq 1$ would be okay.
Another way to rephrase this is:
$$7n - 4 \le 7n \quad \text{for all } n \geq 1$$
The left-hand side, $7n - 4$, is $f(n)$. Here $c = 7$ and $g(n) = n$. Therefore we can say $f(n)$ is $O(n)$, written $f(n) \in O(n)$.
All we have to do is substitute values into the formula until we find values for $c$ and $n_0$ that work. Let's work through some examples now.
$$f(n) = 4n^2 + 16n + 2$$
Is $f(n)$ $O(n^4)$?
We need to take this function:
$$f(n) = 4n^2 + 16n + 2$$
and say "is this less than some constant times $n^4$?" We need to find out if there is such a constant.
$$4n^2 + 16n + 2 \le n^4$$
Let's do a chart. If $n = 0$ we get:
$$4(0)^2 + 16(0) + 2 = 2 \le 0^4 = 0$$
This isn't true, so $n = 0$ doesn't work.
When $n = 1$:
$$4(1)^2 + 16(1) + 2 = 22 \le 1^4 = 1$$
This is not true either. Let's try again with $n = 2$.
$$50 \le 16$$
Not true, so let's try another one. $n = 3$.
$$86 \le 3^4 = 81$$
Not true. Looks like the next one should work as we are approaching the tipping point. $n = 4$.
$$ 130 \le 256$$
This is true. When $n = 4$ or greater, the inequality $4n^2 + 16n + 2 \le n^4$ holds. So it holds with $c = 1$ and $n_0 = 4$.
The answer to the question "is this function, $4n^2 + 16n + 2$, $O(n^4)$?" is yes, when $c = 1$ and $n \geq 4$.
Note: I'm saying $c=1$ but I'm not writing $cn$ every time. Later on, using c will become important. But for these starter examples we'll just assume $c = 1$ until said otherwise.
$$3n^2 + n + 16$$
Is this $O(n^2)$?
We know that $n \le n^2$ for all $n \geq 1$. Also, $16 \le n^2$ for $n \geq 4$.
So:
$$3n^2 + n + 16 \le 3n^2 + n^2 + n^2 = 5n^2$$
for all $n \geq 4$.
The constant $c$ is 5, and $n_0 = 4$.
$$13n^3 + 7n \log n + 3$$
is $O(n^3)$?
Because $\log n \le n$ for all $n \geq 1$ (so $7n \log n \le 7n^3$), and $3 \le n^3$ for all $n \geq 2$, we may conclude that:
$13n^3 + 7n \log n + 3 \le 13n^3 + 7n^3 + n^3 = 21n^3$ for all 'large enough' $n$ (here, $n \geq 2$).
In this instance, $c = 21$.
$$45n^5 - 10n^2 + 6n - 12$$
is $O(n^5)$?
Any polynomial $a_k n^k + ... + a_2 n^2 + a_1 n + a_0$ with $a_k > 0$ is $O(n^k)$.
Along the same lines, we can argue that any polynomial $a_k n^k + ... + a_2 n^2 + a_1 n + a_0$ with $a_k > 0$ is also $O(n^j)$ for all $j \geq k$.
Therefore $45n^5 - 10n^2 + 6n - 12$ is $O(n^5)$ (and is also $O(n^8)$, $O(n^9)$, or $O(n^j)$ for any $j \geq 5$).
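The general polynomial claim can be justified in one line. For $n \geq 1$, every power $n^i$ with $i \le k$ satisfies $n^i \le n^k$, so:

$$a_k n^k + \dots + a_1 n + a_0 \le (|a_k| + \dots + |a_1| + |a_0|)\, n^k \quad \text{for all } n \geq 1$$

so taking $c = |a_k| + \dots + |a_0|$ and $n_0 = 1$ always works as a witness.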
$$\sqrt{n}$$
is $O(n)$?
Yes, this holds: $\sqrt{n} = n^{1/2} \le n$ for all $n \geq 1$, so $\sqrt{n} \in O(n)$ with $c = 1$ and $n_0 = 1$. (The reverse, $n \in O(\sqrt{n})$, does not hold, since $n$ grows faster than $\sqrt{n}$.)
I hope you appreciate the easy example to break up the hard maths.
$$ 3 \;log \;n + \;log \;log \;n$$
is $O(log \;n)$?
First we have the inequality $\log n \le n$ for every $n \geq 1$. Substituting $\log n$ for $n$ on both sides gives $\log \log n \le \log n$: log log n is smaller than log n. So:
$$3 \;log \;n + \;log\;log \;n \le 4 \;log \;n$$
So:
$$c = 4, n_0 = 1$$
$$\log n$$
is $O(\sqrt{n})$?
$\log n$ grows slower than any positive power of $n$; for large enough $m$ we have:
$$\log m \le m^\epsilon$$
for every $\epsilon > 0$, no matter how small, as long as it is positive.
Plugging in $\epsilon = \frac{1}{2}$ gives $m^{\frac{1}{2}} = \sqrt{m}$.
Knowing that $\log m \le m^{\frac{1}{2}}$ for large enough $m$, we conclude that $\log n \in O(\sqrt{n})$.
$$2n + 3$$
What is the Big O of this?
$$2n + 3 \le 10n, \; n \geq 1$$
$$f(n) = O(n)$$.
This is because for $n \geq 1$, $10n$ is always at least as large as $f(n) = 2n + 3$. Therefore, with $g(n) = n$ and $c = 10$, we have $O(n)$.
$$2n + 3 \le 10n$$
We don't have to write 10, it can be whatever we want so long as the equation holds true.
$$2n + 3 \le 2n + 3n$$
$$2n + 3 \le 5n, n \geq 1$$
Therefore $f(n) = O(n)$.
Or we can write:
$$2n + 3 \le 5n^2 , n \geq 1$$
$f(n) = 2n + 3$
$c = 5$
$g(n) = n^2$
Can this same function be both $O(n)$ and $O(n^2)$? Yes, it can. This is where our definition of Big O comes into play: it's an upper bound. We can say the bound is $n$, $n^2$, $2^n$, or anything higher. But we cannot say it's lower.
When we write Big O, we usually want to use the closest (tightest) function. Otherwise we could say that every algorithm here has an upper bound of $O(2^n)$, which, while often technically true, tells us nothing useful.
Note: choosing the closest function is just convention for most courses. Any function that works as an upper bound is true.
There's a fantastic video on this strange concept here (and I took this example from there).
Big O represents how long an algorithm takes but sometimes we care about how much memory (space complexity) an algorithm takes too. If you're ever stuck, come back to this page and check out the infographics!
In this short 10 minute article, you'll learn what the functional paradigm is in Python. You'll also learn about list comprehensions.
In an imperative paradigm, we do things by giving the computer a sequence of tasks and then it executes them. While executing them, it can change states. For example, let's say we set A = 5, then we change the value of A. We have variables in the sense that the value inside the variable varies.
In a functional paradigm, we don't tell the computer what to do but we tell it what stuff is: what the greatest common divisor of two numbers is, and so on.
Variables cannot vary. Once we set a variable, it stays that way forever. Because of this, functions have no side effects in the functional paradigm. A side effect is where the function changes something outside of it. Let's look at an example:
a = 3

def some_func():
    global a
    a = 5

some_func()
print(a)
The output is 5. In the functional paradigm, changing variables is a big no-no and having functions affect things outside of their scope is also a big no-no. The only thing a function can do is calculate something and return it.
Now you might think "no variables, no side effects? Why is this good?". Good question, gnarly stranger reading this.
If a function is called twice with the same parameters, it's guaranteed to return the same result. If you've learnt about mathematical functions, you'll appreciate this benefit. We call this referential transparency. Because functions have no side effects, if we are building a program which computes things, we can speed up the program. If the program knows that func(2) equates to 3, we can store this in a table. This prevents the program from running the same function when we already know the answer.
Typically, in functional programming, we do not use loops. We use recursion. Recursion is a mathematical concept; it means "feeding into itself". With a recursive function, the function calls itself as a sub-function. Here's a nice example of a recursive function in Python:
def factorial_recursive(n):
    # Base case: 1! = 1
    if n == 1:
        return 1
    # Recursive case: n! = n * (n-1)!
    else:
        return n * factorial_recursive(n - 1)
Some programming languages are also lazy. This means they don't compute or do anything until the very last second. If we write some code to perform 2 + 2, a functional program will only calculate that when we need to use the result. We'll explore laziness in Python soon.
To understand map, let's first look at what iterables are. An iterable is anything we can iterate over. These are lists or arrays, but Python has many different iterables. We can even create our own iterable objects by implementing magic methods. A magic method is like an API that helps our objects become more Pythonic. We need to implement 2 magic methods to make an object an iterable:
class Counter:
    def __init__(self, low, high):
        # set class attributes inside the magic method __init__
        # (for "initialise")
        self.current = low
        self.high = high

    def __iter__(self):
        # first magic method to make this object iterable
        return self

    def __next__(self):
        # second magic method
        if self.current > self.high:
            raise StopIteration
        else:
            self.current += 1
            return self.current - 1
The first magic method, __iter__, or dunder iter (double underscore iter), returns the iterative object; we often use this at the start of a loop. Dunder next, __next__, returns what the next object is.
Let's check this out:
for c in Counter(3, 8):
    print(c)
This will print:
3
4
5
6
7
8
In Python, an iterable is an object with an __iter__ magic method; the iterator it returns is the object with the __next__ magic method, which actually produces the values one by one. Some objects, such as sets (talked about later in this article), are iterable but are not themselves iterators. For this article, we'll assume everything we touch is an iterable object.
Now we know what an iterable object is, let's go back to the map function.
The map function lets us apply a function to every item in an iterable. We want to apply a function to every item in a list, but know that it's possible for most iterables. The syntax for map takes 2 inputs, the function to apply and the iterable object.
map(function, iterable)
Say we have a list of numbers like:
[1, 2, 3, 4, 5]
And we want to square every number, we can write code like this:
x = [1, 2, 3, 4, 5]

def square(num):
    return num * num

print(list(map(square, x)))
Functional Python is lazy. If we didn't include the list(), the function would store the definition of the iterable, not the list itself. We need to tell Python "turn this into a list" for us to use this.
It's weird to go from non-lazy evaluation to lazy evaluation suddenly in Python. You'll get used to it if you think more in the functional mindset than an imperative mindset.
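To make the laziness concrete, here is a small sketch of what map hands back before list() forces it:

```python
x = [1, 2, 3, 4, 5]
lazy = map(lambda num: num * num, x)  # no squaring has happened yet

print(type(lazy))  # <class 'map'>, not a list
print(next(lazy))  # 1, values are produced one at a time, on demand
print(list(lazy))  # [4, 9, 16, 25], the rest, once we ask for them
```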
Now it's nice to write a normal function like square(num), but it doesn't look right. Do we have to define a whole function just to use it once in a map? Well, we can define a function in map using a lambda (anonymous) function.
Lambda functions are one-line functions, used for a short period of time. We often use them with higher-order functions such as filter, map, and reduce. This lambda expression squares a number given to it:
square = lambda x: x * x
Now letâs run this:
>>> square(3)
9
I hear you. "Brandon, where are the arguments? What the heck is this? That looks nothing like a function!"
Well, it's confusing but can be explained. We're assigning something to the variable square. This part:

lambda x:

tells Python that this is a lambda function, and the input is called x. Anything after the colon is what we do with the input, and whatever that evaluates to is returned.
To simplify our square program into one line we can do:
x = [1, 2, 3, 4, 5]
print(list(map(lambda num: num * num, x)))
In a lambda expression, all the arguments go on the left and the stuff we want to do with them goes on the right. It gets a little messy, no one can deny that. There's a certain pleasure in writing code that only other functional programmers can read. Also, it's super cool to take a function and turn it into a one-liner.
Reduce is a function that applies a given function on an iterable and returns one thing. Normally we'll perform a computation on a list to reduce it down to one number. Reduce looks like this:
reduce(function, list)
We can (and often will) use lambda expressions as the function.
The product of a list is every single number multiplied together. To program this:
product = 1
x = [1, 2, 3, 4]
for num in x:
    product = product * num
But with reduce we can write:
from functools import reduce
product = reduce((lambda x, y: x * y), [1, 2, 3, 4])
To get the same product. The code is shorter, and with knowledge of functional programming, it is neater.
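One detail worth knowing: reduce also accepts an optional third argument, an initial value, which seeds the computation and keeps an empty list from raising an error. A small sketch:

```python
from functools import reduce

# The initial value 1 seeds the running product.
print(reduce(lambda x, y: x * y, [1, 2, 3, 4], 1))  # 24

# Without an initial value, this would raise TypeError on an empty list.
print(reduce(lambda x, y: x * y, [], 1))  # 1
```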
The filter function takes an iterable and filters out all the things we don't want in that iterable.
Filter takes a function and a list. It applies the function to each item in the list; if the function returns True, the item is kept, and if it returns False, the item is removed from the list.
The syntax looks like:
filter(function, list)
Let's see a small example. Without filter we'll write:
x = range(-5, 5)
new_list = []
for num in x:
    if num < 0:
        new_list.append(num)
With filter, this becomes:
x = range(-5, 5)
all_less_than_zero = list(filter(lambda num: num < 0, x))
Higher order functions can take functions as parameters and return functions. An example is:
def summation(nums):
    return sum(nums)

def action(func, numbers):
    return func(numbers)

print(action(summation, [1, 2, 3]))
# Output is 6
Or a simple example of the second part of the definition, returning functions, is:
def rtnBrandon():
    return 'brandon'

def rtnJohn():
    return 'john'

def rtnPerson():
    age = int(input("What's your age?"))
    if age == 21:
        return rtnBrandon()
    else:
        return rtnJohn()
Higher-order functions make non-varying variables easier to work with. We need not store a variable anywhere if all we're doing is passing data through a long tunnel of functions.
All functions in Python are first-class objects. We define a first-class object as having one or more of these features:
It can be created at runtime.
It can be assigned to a variable.
It can be passed as an argument to a function.
It can be returned as the result of a function.
So all functions in Python are first-class and can be used as a higher-order function.
Partial application is weird but is cool. We can call a function without supplying all the arguments it requires. Let's see this in an example. We want to create a function which takes 2 arguments, a base and an exponent, and returns base to the power of the exponent, like so:
def power(base, exponent):
    return base ** exponent
Now we want to have a dedicated square function, to work out the square of a number using the power function:
def square(base):
    return power(base, 2)
This works, but what if we want a cube function? Or a function to the power of 4? Can we keep on writing them forever? Well, we could. But programmers are lazy. If we repeatedly do the same thing, it's a sign that there is a quicker way, one that lets us avoid repeating ourselves. We can use partial application here. Let's see an example of the square function using partial application:
from functools import partial
square = partial(power, exponent=2)
print(square(2))
# output is 4
Isn't that cool! We can call functions which require 2 arguments, using only 1 argument, by telling Python what the second argument is.
We can also use a loop to generate power functions that work from squares up to powers of 1000.
from functools import partial
powers = []
for x in range(2, 1001):
    powers.append(partial(power, exponent=x))

print(powers[0](3))
# output is 9
You might have noticed, but a lot of the things we want to do in functional programming revolve around lists. Other than the reduce function & partial application, all the functions we have seen generate lists. Guido (the inventor of Python) dislikes functional stuff in Python because Python already has its own way to generate lists.
If we write "import this" into a Python IDLE session, we'll get:
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
This is the Zen of Python. It's a poem about what being Pythonic means. The part we want to relate to here is:
There should be one-- and preferably only one --obvious way to do it.
In Python, map and filter can do the same things as a list comprehension (discussed next). This breaks one rule of the Zen of Python, so these parts of functional programming aren't considered "Pythonic".
Another talking point is Lambda. In Python, a lambda function is a normal function. Lambda is syntactic sugar. Both are equivalent:
foo = lambda a: 2
def foo(a):
    return 2
A regular function can do everything a lambda function can, but it doesn't work the other way around: a lambda is limited to a single expression.
This was a short argument about why functional programming doesn't fit into the whole Python ecosystem very well. You may have noticed I mentioned list comprehensions earlier; we'll discuss them now.
Earlier, I mentioned that anything we could do with map or filter, we could do with a list comprehension. This is the part where we'll learn about them.
A list comprehension is a way to generate lists in Python. The syntax is:
[function for item in iterable]
So letâs square every number in a list, as an example:
print([x * x for x in [1, 2, 3, 4]])
Okay, so we can see how we can apply a function to every item in a list. How do we go about applying a filter? Well, look at this code from earlier:
x = range(-5, 5)
all_less_than_zero = list(filter(lambda num: num < 0, x))
print(all_less_than_zero)
We can convert this into a list comprehension like so:
x = range(-5, 5)
all_less_than_zero = [num for num in x if num < 0]
List comprehensions support if statements like this. We no longer need to apply a million functions to something to get what we want. If we're trying to make some kind of list, chances are it'll look cleaner and easier using a list comprehension.
What if we want to square every number below 0 in a list? Well, with lambda, map and filter we'll write:
x = range(-5, 5)
all_less_than_zero = list(map(lambda num: num * num, list(filter(lambda num: num < 0, x))))
This is long and complicated. With a list comprehension it is:
x = range(-5, 5)
all_less_than_zero = [num * num for num in x if num < 0]
A list comprehension is only good for, well, lists. Map and filter work on any iterable, so what's up with that? We can use any comprehension for any iterable object we encounter.
We can generate any iterable using a comprehension. Since Python 2.7, we can even generate a dictionary (hashmap).
# Taken from page 70, chapter 3 of Fluent Python by Luciano Ramalho
DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan'),
]
>>> country_code = {country: code for code, country in DIAL_CODES}
>>> country_code
{'Brazil': 55, 'Indonesia': 62, 'Pakistan': 92, 'Russia': 7, 'China': 86, 'United States': 1, 'Japan': 81, 'India': 91, 'Nigeria': 234, 'Bangladesh': 880}
>>> {code: country.upper() for country, code in country_code.items() if code < 66}
{1: 'UNITED STATES', 7: 'RUSSIA', 62: 'INDONESIA', 55: 'BRAZIL'}
If it's an iterable, we can generate it. Let's look at one last example of sets. If you don't know what a set is, check out this other article I wrote. The TL;DR is:
# taken from page 87, chapter 3 of Fluent Python by Luciano Ramalho
>>> from unicodedata import name
>>> {chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}
{'×', '¥', '°', '£', '©', '#', '¬', '%', 'µ', '>', '¤', '±', '¶', '§', '<', '=', '®', '$', '÷', '¢', '+'}
You may notice that sets have the same curly braces as dictionaries. Python is smart. It'll know whether we're writing a dictionary comprehension or a set comprehension based on whether we provide the extra value for the dictionary or not. If you want to learn more about comprehensions, check out this visual guide.
Functional programming is beautiful and pure. Functional code can be clean, but it can also be messy. You should use what you want to use.
No talk about downloading things on BitTorrent. Or the best clients to do so.
Just a deep-dive into the technical side of it.
Anyone can read this article. It requires ZERO knowledge of networking or BitTorrent.
BitTorrent is one of the most common protocols for transferring large files. In February 2013, BitTorrent was responsible for 3.35% of all worldwide bandwidth, more than half of the 6% of total bandwidth dedicated to file sharing.
Let's dive right in.
Bram Cohen invented the BitTorrent protocol in 2001. Cohen wrote the first client implementation in Python.
Cohen collected free pornography to lure beta testers to use BitTorrent in the summer of 2002.
In traditional downloading, the server uploads the file, and the client downloads the file.
For popular files, this isn't very effective.
500 people downloading the same file will put the server under strain. This strain will cap the upload speed, so clients can not download the file fast.
Second, client-server costs a lot of money. The amount we pay increases with how popular a file is.
Third, it's centralised. If the server dies, the file no longer exists and no one can download it.
BitTorrent aims to solve these problems.
In a peer-to-peer network, every peer is connected to every other peer in the network.
Semi-centralised peer-to-peer networks possess one or more peers with higher authority than most peers.
BitTorrent is a way to share files. It's often used for large files. BitTorrent is an alternative to a single source sharing a file, such as a server. BitTorrent can productively work on lower bandwidth.
The first release of the BitTorrent client had no search engine and no peer exchange; users who wanted to upload a file had to create a small torrent descriptor file that they would upload to a torrent index site.
When a user wants to share a file, they seed their file. This user is called a seeder. They upload a torrent descriptor file to an exchange (we'll talk about this later). Anyone who wants to download that file will download this torrent descriptor.
We call those who download peers. Their torrent client will connect to a tracker (discussed later) and the tracker will send them a list of IP addresses of other seeds and peers in the swarm. The swarm is all PCs related to a certain torrent.
The torrent descriptor file contains a list of trackers and metadata on the file we're downloading.
A peer will connect to a seed and download parts of the file.
Once the peer completes a download, they can function as a seed. It is also possible, and very common, to function as a seed while still downloading.
Once the seed has shared the file to a peer, that peer will act as a seed. Instead of the client-server model where only 1 server exists to upload the file, in BitTorrent, multiple people can upload the same file.
BitTorrent splits the file up into chunks called pieces, each of a certain size. Sometimes itâs 256KB, sometimes itâs 1MB. As each peer receives a piece, they become a seed of that piece for other peers.
With BitTorrent, we do not have a single source to download from. We could download a few pieces from our home country, then download the pieces our home country doesn't have from a far-away country.
The protocol hashes the pieces to make sure no seed has tampered with the original file, and stores those hashes in the torrent descriptor.
This is how BitTorrent works at a very high level. We will now go into detail. We aim to answer these questions:
And much more.
It's a dictionary (or hashmap) file.
The file is described as:
The URL of the tracker. Remember earlier when we contacted the tracker server to find other peers using the same file? We found that tracker by using the announce key in the torrent descriptor file.
This maps to a dictionary whose keys depend on whether one or more files are being shared. The keys are:
Files (child of info, is a list)
Files only exists when multiple files are being shared. Files is a list of dictionaries, each corresponding to a file.
Each of these dictionaries has 2 keys.
Length - the size of the file in bytes.
Path - a list of strings corresponding to subdirectory names, the last of which is the actual file name.
The size of the file in bytes (only when one file is being shared)
Suggested filename. Or the suggested directory name.
The number of bytes per piece.
The piece's length must be a power of two and at least 16KiB.
This is $$2^8 \; KiB = 256 \; KiB = 262,144 \; B$$
A hash list.
A list of hashes calculated on various chunks of data. We split the data into pieces, calculate the hash of each piece, and store the hashes in a list.
BitTorrent uses SHA-1, which returns a 160-bit hash. Pieces will be a string whose length is a multiple of 20 bytes.
If the torrent contains multiple files, the pieces are formed by concatenating the files in the order they appear in the files list.
All pieces in the torrent are the full piece length except for the last piece which may be shorter.
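The "pieces" value described above is straightforward to compute with Python's hashlib. This is a sketch under the assumption that the whole payload fits in memory (real clients hash while streaming from disk), and the function name is made up:

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int) -> bytes:
    """Concatenated 20-byte SHA-1 digests, one per piece."""
    hashes = b""
    for offset in range(0, len(data), piece_length):
        piece = data[offset:offset + piece_length]  # last piece may be shorter
        hashes += hashlib.sha1(piece).digest()
    return hashes

# Two full pieces plus a shorter final piece -> three 20-byte hashes.
pieces = piece_hashes(b"x" * 600, piece_length=256)
print(len(pieces))  # 60
```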
Now, I can guess what you're thinking.
"SHA-1? What is this? The early 2000s?"
And I agree. BitTorrent is moving from SHA-1 to SHA-256.
Still confused? Not to worry! I designed this JSON file that describes what a torrent file looks like. Note: I've concatenated some things. This makes it easier to read and understand the general layout. I made the numbers up, following the rules of BitTorrent's torrent descriptor.
{
    "announce": "url of tracker",
    "info": {
        "files": [
            {
                "length": 16,
                "path": "/folder/to/path"
            },
            {
                "length": 193,
                "path": "/another/folder"
            }
        ],
        "length": 192,
        "name": "Ubuntu.iso",
        "piece length": 262144,
        "pieces": ["AAF4C61DDCC5E8A2DABEDE0F3B482CD9AEA9434D", "CFEA2496442C091FDDD1BA215D62A69EC34E94D0"]
    }
}
One of the largest questions in BitTorrent is "what pieces should I select to download?"
With a traditional client-server model, we download the whole file. But now, we get to pick what pieces to download.
The idea is to download the pieces that no one else has - the rare pieces. By downloading the rare pieces, we make them less rare by uploading them.
BitTorrent uses TCP, a transmission protocol for packets. TCP has a mechanism called slow start.
Slow start is a mechanism which balances the speed of a TCP network connection. It escalates the amount of data transmitted until it finds the network's maximum carrying capacity. cwnd stands for the Congestion Window.
TCP does this because if we send a lot of data at once, the network may not be used to the traffic, and congestion will happen on the network.
If weâre not regularly sending data, TCP may cap our network connection at a slower speed than normal.
BitTorrent makes sure to always send data by breaking pieces down into further sub-pieces.
Each sub-piece is about 16KB in size. The size for a piece is not fixed, but it is somewhere around 1MB.
The protocol always has some number of requests (five) for a sub-piece pipelined. When a new sub-piece is downloaded, the client sends a new request. This helps speed things up.
Sub-pieces can be downloaded from other peers.
Two core policies govern the Piece Selection Algorithm.
Once the BitTorrent client requests a sub-piece of a piece, any remaining sub-pieces of that piece are requested before any sub-pieces from other pieces.
In this image, it makes sense to download all the sub-pieces of this piece first rather than start downloading another piece.
The main policy in BitTorrent is to pick the rarest first. We want to download the piece which the fewest other peers own.
This is so we can make it "un-rare". If only one peer has a piece and they go offline, no one will get the complete file.
A plethora of benefits exist for this policy.
Growing the seed
Rarest first makes sure that we download only new pieces from the seed.
The seed will begin as a bottleneck. The one peer with the file.
A downloader can see what pieces their peers possess, and the rarest first policy will cause us to fetch the pieces from the seed which have not been uploaded by other peers.
Let's visualise this.
The list of nodes (peers) is inter-connected; I cannot draw every connection, as the diagram would be unreadable.
Each arrow towards a sub-piece shows what that peer has downloaded. We downloaded a sub-piece that no one else has, other than the seed. This means this sub-piece is rare.
Our upload rate is higher than that of the seed, so all peers will want to download from us. They also want to download the rarest pieces first, and we are one of only 2 holders of the rarest piece.
When everyone downloads from us, we can download faster from them. This is the tit-for-tat algorithm (discussed later).
Increased download speed
The more peers that hold the piece, the faster the download can happen. This is because we can download sub-pieces from other peers.
Enable uploading
A rare piece is the most wanted by other peers, and having a rare piece means peers will be interested in downloading it from us. As we will see later, the more we upload, the more we can download.
Most common last
It is sensible to leave the most common pieces to the end of the download. As many peers hold common pieces, the probability of being able to download them is much larger than that of rare pieces.
Prevent rarest piece missing
When the seed dies, all the different pieces of the file should be distributed somewhere among the remaining peers.
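Putting the policy together, here is a hedged sketch of rarest-first piece selection in Python. The data shapes (sets of piece indices) are a simplification of the bitfields real peers exchange, and the function name is my own:

```python
from collections import Counter

def rarest_first(have, peer_bitfields):
    """Pick the missing piece held by the fewest peers.
    have: set of piece indices we already own.
    peer_bitfields: one set of piece indices per known peer."""
    availability = Counter()
    for bitfield in peer_bitfields:
        availability.update(bitfield)
    candidates = [p for p in availability if p not in have]
    if not candidates:
        return None  # nothing left to request
    return min(candidates, key=lambda p: availability[p])

# Piece 2 is held by only one peer, so we request it first.
peers = [{0, 1, 2}, {0, 1}, {0}]
print(rarest_first(have={0}, peer_bitfields=peers))  # 2
```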
When we first start downloading, we have nothing to upload. We need the first piece, fast. The rarest first policy is slow here: rare pieces are downloaded slower, because we can download their sub-pieces from only a few peers.
Sometimes a peer with a slow transfer rate will try to give us a sub-piece, causing a delay in the download. To prevent this, there is "endgame mode".
Remember the pipe-lining principle? There are always several requests for sub-pieces pending.
When all the sub-pieces a peer lacks are requested, they broadcast this request to all peers. This helps us get the last chunk of the file.
If a peer has the missing sub-piece, they will send that back to our computer.
Once a sub-piece arrives, we send a cancel-message telling the other peers to ignore our request.
No centralised resource allocation in BitTorrent exists. Instead, every peer maximises their download rate.
A peer will download from whoever they can. To decide who to upload to, they will use a variant of the "tit-for-tat" algorithm.
The tit-for-tat strategy comes from game theory. The essence is:
"Do onto others as they do onto you"
Choking is a temporary refusal to upload to another peer, but we can still download from them.
To cooperate, peers upload; to not cooperate, they "choke" the connection to their peers. The principle is to upload to peers who have uploaded to us.
We want several bidirectional connections at the same time and to achieve Pareto Efficiency.
We consider an allocation Pareto Efficient if there is no other allocation in which some individual is better off and no individual is worse off.
Thus the big question is: how do we determine which peers to choke and which to unchoke?
A peer always unchokes a fixed number of its peers (the default is 4).
Current download rates decide which peers to unchoke; we use a 20-second average to decide this. Because of TCP slow start, rapidly choking and unchoking is bad, so this is recalculated every 10 seconds.
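That unchoke step can be sketched in a few lines of Python. The function and variable names are hypothetical, and a real client also mixes in the optimistic unchoke described next:

```python
def pick_unchoked(download_rates, slots=4):
    """Choose which peers to unchoke: the `slots` peers we are
    currently downloading from the fastest (a rolling ~20s average).
    download_rates: {peer_id: bytes per second}."""
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    return set(ranked[:slots])

# Recomputed every 10 seconds in the real protocol.
rates = {"A": 900, "B": 120, "C": 450, "D": 30, "E": 610}
print(pick_unchoked(rates))  # the four fastest: A, E, C, B (in some order)
```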
If our upload rate is high more peers will allow us to download from them. This means that we can get a higher download rate if we are a good uploader. This is the most important feature of the BitTorrent protocol.
The protocol discourages "free riders": peers who only download and don't upload.
For a peer-to-peer network to be efficient, all peers need to contribute to the network.
BitTorrent also allows an additional unchoked peer, where the download rate criteria aren't used.
We call this optimistic unchoking: it checks whether currently unused connections might be better than the ones in use.
We shift the optimistic unchoke every 30 seconds, enough time for the upload to reach full speed, and likewise for the download. If this new connection turns out to be better than one of the existing unchoked connections, it will replace it.
The optimistic unchoke is randomly selected.
This also allows peers who do not upload and only download to download the file, even if they refuse to cooperate. Albeit, they will download at a much slower speed.
What happens if all peers uploading to another peer decide to choke it? We then have to find new peers, but the optimistic unchoking mechanism only checks one unused connection every 30 seconds. To help the download rate recover more quickly, BitTorrent has snubbing.
If a client hasn't received anything from a particular peer for 60 seconds, it presumes it has been "snubbed".
Following the mentality of tit-for-tat, we retaliate and refuse to upload to that peer (except if they become an optimistic unchoke).
The peer will then increase the number of optimistic unchokes to find new connections quicker.
We see that with the choking algorithm implemented in BitTorrent, we favour peers who are kind to us: if we can download quickly from a peer, we let them download quickly from us.
What about no downloads? Then itâs impossible to know which peers to unchoke using this choking algorithm. When a download is completed, we use a new choking algorithm.
This new choking algorithm unchokes peers with the highest upload rate. This ensures that pieces get uploaded faster, and they get replicated faster.
It also favours peers with good upload rates that are not yet being served by others.
Trackers are special types of server that help in communication between peers.
Communication in BitTorrent is important. How do we learn what other peers exist?
The tracker knows which peers have the file, and how much of it each peer has.
Once a peer-to-peer download has started, communication can continue without a tracker.
Since the creation of the distributed hash table method for trackerless torrents, BitTorrent trackers are largely redundant.
Public trackers are trackers that anyone can use.
The Pirate Bay operated one of the most popular public trackers until disabling it in 2009, opting only for magnet links (discussed soon).
Private trackers restrict use by requiring users to register with the site. Registration is often controlled by an invitation system: to use the tracker, we need an invitation.
Multi-tracker torrents contain multiple trackers in a single torrent file. This provides redundancy: if one tracker fails, the other trackers can continue to maintain the swarm for the torrent.
With this configuration, it is possible to have multiple unconnected swarms for a single torrent - which is bad. Some users can connect to one specific tracker while being unable to connect to another. This can create a disjoint set which can impede the efficiency of a torrent to transfer the files it describes.
Earlier, I talked about how the Pirate Bay got rid of trackers and started using trackerless torrents.
When we download a torrent, we get a hash of that torrent. To download the torrent without a tracker, we need to find other peers also downloading the torrent. To do this, we need to use a distributed hash table.
Let's explore Distributed Hash Tables.
Distributed Hash Tables (DHT) give us a dictionary-like interface, but the nodes are distributed across a network. The trick with DHTs is that the node that gets to store a particular key is found by hashing that key.
In effect, each peer becomes a mini-tracker.
Each node (a client/server implementing the DHT protocol) has a unique identifier known as the "node ID". We choose node IDs at random from the same 160-bit space as BitTorrent infohashes.
Infohashes are the SHA-1 hash of the torrent's "info" dictionary: the part of the torrent file describing the file names, sizes and piece hashes.
We use a distance metric to compare two node IDs, or a node ID and an infohash, for "closeness".
Nodes must have a routing table containing the contact information for a few other nodes.
Nodes know about each other in the DHT. They know many nodes with IDs that are close to their own but few with far-away IDs.
The distance metric is XOR and is interpreted as an integer.
$$distance(A, B) = |A \oplus B |$$
Smaller values are closer.
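In Python, the metric is a one-liner. The demonstration below uses small 4-bit toy IDs instead of real 160-bit ones:

```python
def xor_distance(a: int, b: int) -> int:
    # XOR the two IDs and read the result as an unsigned integer.
    return a ^ b

# 4-bit toy IDs; real node IDs and infohashes are 160 bits.
nodes = [0b0000, 0b0111, 0b1000]
infohash = 0b0110

# Sort a routing table by closeness to the infohash:
# 0b0111 is distance 1, 0b0000 is distance 6, 0b1000 is distance 14.
closest_first = sorted(nodes, key=lambda n: xor_distance(n, infohash))
```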
When a node wants to find peers for a torrent, it uses the distance metric to compare the infohash of the torrent with the IDs of the nodes in its routing table.
It then contacts the nodes in its routing table closest to the infohash and asks them for the contact information of peers downloading the torrent.
If a contacted node knows about peers for the torrent, it returns their contact information with the response. Otherwise, it responds with the contact information of the nodes in its own routing table that are closest to the infohash of the torrent.
The original node queries nodes that are closer to the target infohash until it cannot find any closer nodes. After the node exhausts the search, the client then inserts the peer contact information for itself onto the responding nodes with IDs closest to the infohash of the torrent. In the future, other nodes can easily find us.
The return value for a query for peers includes an opaque value known as the "token". For a node to announce that its controlling peer is downloading a torrent, it must present the token received from the same queried node in a recent query for peers.
When a node attempts to "announce" a torrent, the queried node checks the token against the querying node's IP address. This is to prevent malicious hosts from signing up other hosts for torrents.
The querying node returns the token to the same node it received the token from. We must accept tokens for a reasonable amount of time after they have been distributed. The BitTorrent implementation uses the SHA-1 hash of the IP address concatenated onto a secret that changes every five minutes, and accepts tokens up to ten minutes old.
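A minimal sketch of that token scheme, assuming two rotating secrets (current and previous) so a token stays acceptable for up to ten minutes; the function names are mine, not from any BitTorrent client:

```python
import hashlib

def make_token(ip: str, secret: bytes) -> bytes:
    """Token = SHA-1(IP address || current secret)."""
    return hashlib.sha1(ip.encode() + secret).digest()

def token_valid(token: bytes, ip: str,
                current_secret: bytes, previous_secret: bytes) -> bool:
    """Accept tokens made with the current or the previous secret.
    With secrets rotated every five minutes, a token is therefore
    honoured for up to ten minutes."""
    return token in (make_token(ip, current_secret),
                     make_token(ip, previous_secret))
```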
Every node maintains a routing table of known good nodes. We use the routing table as the starting point for queries in the DHT, and we return nodes from it in response to queries from other nodes.
Not all nodes we learn about are equal. Some are "good" and some are not. Many nodes using the DHT can send queries and receive responses, but cannot respond to queries from other nodes. Each node's routing table must contain only known good nodes.
A good node is a node that has responded to one of our queries within the last 15 minutes. A node is also good if it has ever responded to one of our queries and has sent us a query within the last 15 minutes. After 15 minutes of inactivity, a node becomes questionable. Nodes become bad when they fail to respond to multiple queries in a row. Nodes known to be good are given priority over nodes with an unknown status.
The routing table covers the entire node ID space from 0 to $2^{160}$. We subdivide the routing table into "buckets" that each cover a portion of the space.
An empty table has one bucket with an ID space range of min=0, max=$2^{160}$.
An empty table has only one bucket, so any node must fit within it. Each bucket can only hold K nodes, currently eight, before becoming "full".
When a bucket is full of known good nodes, no more nodes may be added unless our own node ID falls within the bucket's range. In that case, the bucket is replaced by two buckets, each covering half of the old range, and the nodes from the old bucket are distributed among the new ones.
For a new table with only one bucket, we always split the full bucket into two new buckets covering the ranges $0..2^{159}$ and $2^{159}..2^{160}$.
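Here's a toy sketch of such a bucket split, representing buckets as plain dictionaries and node IDs as integers (both representations are mine, for illustration only):

```python
def split_bucket(bucket):
    """Split a full bucket covering [lo, hi) into two halves and
    redistribute its nodes by ID."""
    lo, hi = bucket["range"]
    mid = (lo + hi) // 2
    left = {"range": (lo, mid), "nodes": []}
    right = {"range": (mid, hi), "nodes": []}
    for node_id in bucket["nodes"]:
        (left if node_id < mid else right)["nodes"].append(node_id)
    return left, right

# A fresh table has one bucket spanning the whole 160-bit space;
# splitting it yields 0..2^159 and 2^159..2^160.
root = {"range": (0, 2**160), "nodes": [1, 2**159 + 5]}
left, right = split_bucket(root)
```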
When the bucket is full of good nodes, we simply discard the new node. When nodes in the bucket become bad (if they do) we replace them with a new node.
When a bucket contains questionable nodes (nodes not seen in the last 15 minutes), the least recently seen node is pinged. If it responds, we move on to the next least recently seen questionable node, and so on until we find one that fails to respond. If none fail, the bucket is considered good.
When we do find one that fails, we try one more time before discarding the node and replacing it with a new good node.
Each bucket should maintain a "last changed" property to show how "fresh" its contents are.
When a node in a bucket is pinged and responds, or a node is added to a bucket, or a node is replaced with another node, the bucket's last changed property is updated.
Buckets are refreshed if the last changed property has not been updated in the last 15 minutes.
Few attacks on the BitTorrent network exist. Everything is public: our IP address, what we're downloading, everything.
Why attack a completely open network?
Only 7 entries are listed on Exploit-DB - a database of known exploits against a service. And most of them relate to specific clients.
The principal attack on the BitTorrent network aims to stop piracy. We've gone this far without talking about piracy, but it is often synonymous with BitTorrent.
The main attack on BitTorrent is Torrent Poisoning.
This attack aims to get the IP addresses of peers pirating content or to poison the content in some way.
Madonna's American Life album release is an example of content poisoning. Before the release, tracks of similar length and file size were released. The tracks featured a clip of Madonna saying:
"What the fuck do you think you're doing?"
Followed by a few minutes of silence.
Here are some methods of poisoning a torrent.
The index allows users to locate the IP addresses of peers with the desired content. This method of attack makes searching for peers difficult.
The attacker inserts a large amount of invalid information into the index to prevent users from finding the correct information.
The idea is to slow down the download, by having the peer try to download pieces from an invalid peer.
They insert corrupted versions of a file into the network.
Imagine 500 copies of a file where only 2 of them are the real file: this deters pirates from finding the real one.
However, most websites with lists of torrents have a voting system. This deters the attack, as the top of the search results fills up with non-corrupted files.
Game Dev Tycoon is a clever example: the developers themselves released a corrupted copy before the initial upload to piracy sites. Unbeknownst to pirates, the file was corrupted so that winning the game is impossible in the pirated version. Everything else worked perfectly.
Most popular torrents are released by individuals or groups who have built up a reputation over many years. On private trackers, individuals can be held accountable: poisoned torrents are quickly labelled and the poster can be banned.
Or, on public trackers, downloading torrents made by trusted groups is preferable. After all, would you prefer to download Ubuntu from the Ubuntu team, or the user xxx-HACKER-ELITE-GHOST-PROTOCOL-xxx?
On public trackers, if a torrent is poisoned the torrent is reported and removed.
The simplest way to defend against a BitTorrent attack is to use an IP address not associated with you. Whether this is through a VPN or some other service.
Here are the things we've learnt:
Here are some things you may choose to do:
Hi! Bit of an unusual post. I'm writing about my first open source tool, Ciphey!
Ciphey answers the question:
"What does this encrypted text say?"
First, an important distinction.
I define encryption as:
Anything that you cannot read fluently
Whereas encryption is normally defined as:
Text that has an algorithm applied so no one apart from the intended recipients can read it
My tool, Ciphey, isn't made for cryptographers. It's made for your aunt Linda who has never heard of the word cryptography but works in tech, so knows a little about some things.
Because of my target audience and my personal definition of encryption, Ciphey can automatically decrypt all of these:
Now, I know that you're probably cringing. SHA-1 can't be decrypted! And decrypting binary? That's not encryption, it's encoding!
Refer back to my definition and target market.
Okay, so Ciphey is cool. You input encrypted text, and Ciphey decrypts it. How?
First, Ciphey is made up of 2 core components.
Language Checker aims to answer the question:
"Is this text English?"
(and in the future, other languages)
It does this by utilising two popular algorithms.
The first is Chi Squared.
Chi Squared answers:
"How close is the frequency distribution of this text to the frequency distribution of English?"
Check out this cool video on the general idea by VSauce:
Chi-squared is very fast, but the accuracy isn't that good.
Onto the next algorithm.
What better way to check whether something is English than to loop through an entire English dictionary and see how many dictionary words appear in the text?
The problem is that this is very slow. My dictionary is 500k words. You can't just loop through that every time you want to check whether something is English.
This is where I use both of these algorithms, combined.
Chi squared tells me when something looks like English, and the dictionary checker tells me when something consists primarily of English words.
Both together answer the question "is this English?" really well.
It's also a lot faster than running the dictionary check on everything.
Chi squared keeps a running average of all the scores it comes across. If it sees a score more than 1 standard deviation below that average, the text goes into stage 2 of the algorithm: the dictionary checker.
If 35% of the words in the string are English dictionary words, it's likely to be English.
The threshold is 35% because slang, passwords, usernames or program names can all exist in the text.
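A rough sketch of the two stages in Python. The frequency table is abbreviated and the function names are mine, not Ciphey's actual API:

```python
from collections import Counter

# Approximate English letter frequencies, in percent (abbreviated:
# a real implementation would cover all 26 letters).
ENGLISH_FREQ = {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0,
                "n": 6.7, "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3}

def chi_squared(text):
    """Stage 1: how far is this text's letter distribution from
    English? Lower scores look more like English."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1
    score = 0.0
    for letter, expected_pct in ENGLISH_FREQ.items():
        expected = expected_pct / 100 * total
        observed = counts.get(letter, 0)
        score += (observed - expected) ** 2 / expected
    return score

def dictionary_check(text, dictionary, threshold=0.35):
    """Stage 2: are at least 35% of the words dictionary words?"""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words) >= threshold
```

In the real tool, stage 2 only runs on the small fraction of candidates whose chi-squared score stands out from the running average.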
Is this just brute force? Yes, but I like to call it Brute Force Enhanced.
Ciphey uses a deep neural network (DNN) trained on Harry Potter to guess how likely a text is to be encrypted using a method.
As an example, the DNN might predict that the text is 81% likely to be SHA1, 1% likely to be Caesar and so on.
Ciphey then runs all of the decryption modules using multi-threading in the order of most likely to least likely.
If at any moment a decryption returns True, as it has found the plain-text, Ciphey stops and returns the answer.
This combination of enhanced brute force and the language checker means Ciphey is very fast.
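A simplified, single-threaded sketch of that ordering logic (the names are illustrative, not Ciphey's real interface; Ciphey itself runs the modules with multi-threading):

```python
def crack(ciphertext, predictions, modules):
    """Run decryption modules from most to least likely according
    to the classifier's predictions; stop at the first module that
    returns plaintext (None means 'not my cipher')."""
    for name, _prob in sorted(predictions.items(),
                              key=lambda kv: kv[1], reverse=True):
        result = modules[name](ciphertext)
        if result is not None:
            return name, result
    return None, None
```

So if the neural network says 81% SHA-1 and 1% Caesar, the SHA-1 module is tried first, and the Caesar module only runs if SHA-1 fails.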
Decryption modules don't just return True. I have an internal data packet that's passed around.
{
    "lc": self.lc,
    "IsPlaintext?": True,
    "Plaintext": translated,
    "Cipher": "Caesar",
    "Extra Information": "The rotation used is {counter}"
}
self.lc is the Language Checker. When a decryption module is done, it passes its Language Checker back to the parent. The parent then merges the LC it received into the LC it holds as an attribute. This means we can keep the running average, and therefore the overall program, accurate.
Sure, it's cool to encrypt text using 1 level of encryption. But what if someone uses 2 levels? Ciphey needs to solve this.
The way this would work is using files.
If the user enters the level command:
ciphey -l 3 "ha skjari isa"
Then Ciphey will run through 3 levels of decryption.
Every decryption Ciphey makes will be stored in a file. Ciphey will then have something like this:
for level in range(levels):
    with open('decryption_list.txt') as decryptions:
        for line in decryptions:
            one_level_of_decryption(line)
Ciphey supports a lot, but not enough for it to be considered super cool. I need to work on this.
Today, we're going to do a technical deep-dive into how Tor really works.
No mention of how to access Tor, no mention of what might be on Tor. This is how Tor works.
Without speculation and without exaggeration of what Tor is. Just a deep-dive into the technical stuff of how Tor works.
This article is designed to be read by anyone, with ZERO knowledge on networking or Tor.
Let's dive right in.
The United States Naval Research Laboratory developed The Onion Routing Protocol (Tor) to protect U.S. intelligence communications online. Ironically, Tor has seen widespread use by everyone, even those organisations which the U.S. Navy fights against.
You may know Tor as the hometown of online illegal activities, a place where you can buy any drug you want, a place for all things illegal. Tor is much larger than what the media makes it out to be. According to King's College London, much of Tor is legal.
When you normally visit a website, your computer makes a direct TCP connection with the website's server. Anyone monitoring your internet could read the TCP packet. They can find out what website you're visiting and your IP address. As well as what port you're connecting to.
If you're using HTTPS, no one will know what the message said. But, sometimes all an adversary needs to know is who you're connecting to.
Using Tor, your computer never communicates with the server directly. Tor creates a twisted path through 3 Tor nodes, and sends the data via that circuit.
The core principle of Tor is onion routing which is a technique for anonymous & secure communication over a public network. In onion routing messages are encapsulated in several layers of encryption.
"Onions have layers" - Shrek
So does a message going through Tor. Each layer in Tor is a layer of encryption: you add several layers of encryption to a Tor message, as opposed to just 1.
This is why it's called The Onion Routing Protocol: it adds layers at each stage.
The resulting onion (fully encapsulated message) is then transmitted through a series of computers in a network (called onion routers) with each computer peeling away a layer of the "onion". This series of computers is called a path. Each layer contains the next destination - the next router the packet has to go to. When the final layer is decrypted you get the plaintext (non-encrypted message).
The original author remains anonymous because each node in the network is only aware of the preceding and following nodes in the path (except the first node, which does know who the sender is but doesn't know the final destination).
This has led to attacks where large organisations with expansive resources run servers to attempt to be the first and last nodes in the network. If the organisation's server is the first node, it knows who sent the message. If the organisation server is the last node, it knows the final destination and what the message says.
Now we have a basic overview of Tor, let's start exploring how each part of Tor works. Don't worry if you're confused; every part of Tor will be explained using gnarly diagrams.
Onion Routing is a distributed overlay network designed to anonymise TCP-based applications like web browsing, secure shell and instant messaging.
Clients choose a path through the network and build a circuit where each onion router in the path knows the predecessor and the successor, but no other nodes in the circuit. Paths and circuits are synonyms.
The original author (the question mark on the far left) remains anonymous, unless you're the first node in the path, as you know who sent you the packet.
No one knows what data is being sent until it reaches the last node in the path, which knows the data but doesn't know who sent it. The second-to-last node in the path doesn't know what the data is; only the last node in the path does.
This has led to attacks whereby large organisations with expansive resources create Tor servers which aim to be the first and last onion routers in a path. If the organisation can do this, they get to know who sent the data and what data was sent, effectively breaking Tor.
It's incredibly hard to do this without being physically close to the organisation's servers; we'll explore this more later.
Throughout this article I'll be using Netflix as a normal service (Bob) and Amazon Prime Video as the adversary (Eve). In the real world, this is incredibly unlikely to be the case. I'm not here to speculate on what organisations might want to attack Tor, so I've used 2 unlikely examples to avoid the political side of it.
Each packet flows down the network in fixed-size cells. These cells have to be the same size so none of the data going through the Tor network looks suspiciously big.
These cells are unwrapped by a symmetric key at each router and then the cell is relayed further down the path. Let's go into Tor itself.
There is strength in numbers
Tor needs a lot of users to create anonymity. If Tor were hard to use, new users wouldn't adopt it so quickly; with fewer users, Tor becomes less anonymous. By this reasoning, usability isn't just a design choice of Tor but a security requirement.
If Tor isn't usable or designed nicely, it won't be used by many people. If it's not used by many people, it's less anonymous.
Tor has had to make some design choices that may not improve security but improve usability with the hopes that an improvement in usability is an improvement in security.
Tor is not a completely decentralised peer-to-peer system, as many people believe it to be. If it were completely peer-to-peer, it wouldn't be very usable. Tor requires a set of directory servers that manage and keep the state of the network at any given time.
Tor is not secure against end-to-end attacks. An end-to-end attack is where an entity controls both the first and last node in a path, as discussed earlier. This is a problem cyber security experts have yet to solve, so Tor does not have a solution to it.
Tor does not hide the identity of the sender.
In 2013, during the final exams period at Harvard, a student tried to delay an exam by sending in a fake bomb threat. The student used Tor and Guerrilla Mail (a service which allows people to make disposable email addresses) to send the bomb threat to school officials.
The student was caught, even though he took precautions to make sure he wasn't caught.
Guerrilla Mail sends an originating IP address header along with the email, so the receiver knows where the original email came from. With Tor, the student expected the IP address to be scrambled, but the authorities knew it came from a Tor exit node (Tor keeps a list of all nodes in the directory service), so they simply looked for people who were accessing Tor within the university at the time the email was sent.
Tor isn't an anonymising service, but it is a service that can encrypt all traffic from A to B (so long as an end-end attack isn't performed). Tor is also incredibly slow, so using it for Netflix isn't a good use case.
When you use a VPN, the VPN forwards all your internet traffic to the appropriate destination. When it does so, the VPN encrypts your traffic. All your internet service provider can see is encrypted traffic heading from your computer to the VPN.
They can't see inside your packets. They don't know who you're talking to, other than the VPN.
VPNs aren't private in the same way that Tor is. VPNs protect you against ISPs or local adversaries (ones monitoring your laptop's WiFi). But they don't protect you from themselves.
The VPN is the man in the middle. It knows who you are and who you're talking to. Depending on the traffic, the VPN also decrypts your packet, meaning it knows everything. With a VPN, you have to trust it. With Tor, you don't have to put much trust in anyone.
In Tor, one rogue node is survivable. If one of the nodes in our graph earlier was an adversary, they'd only know our IP address or our data packet, never both. VPNs expect you to trust them; Tor protects you from the Tor network itself and doesn't expect you to trust it.
No one, apart from you, should know the IP addresses of the origin and destination - and know the contents of the message.
Now that we have a good handle on what Tor is, let's explore onion routing.
Given the network above, we are going to simulate what Tor does. Your computer is the one on the far left, and you're sending a request to watch Stranger Things on Netflix (because what else is Tor used for?). This path of nodes is called a circuit. Later on, we're going to look into how circuits are made and how the encryption works. But for now we're trying to generalise how Tor works.
We start off with the message (we haven't sent it yet). We need to encrypt the message N times (where N is how many nodes are in the path). We encrypt it using AES, a symmetric-key crypto-system, with the key agreed using Diffie-Hellman. Don't worry, we'll discuss all of this later. There are 4 nodes in the path (minus your computer and Netflix), so we encrypt the message 4 times.
Our packet (onion) has 4 layers. Blue, purple, orange, and teal. Each colour represents one layer of encryption.
We send the onion to the first node in our path. That node then removes the first layer of encryption.
Each node in the path knows what the key to decrypt their layer is (via Diffie-Hellman). Node 1 removes the blue layer with their symmetric key (that you both agreed on).
Node 1 knows you sent the message, but the message is still encrypted by 3 layers of encryption, it has no idea what the message is.
As it travels down the path, more and more layers are stripped away. The next node does not know who sent the packet. All it knows is that Node 1 sent them the packet, and it's to be delivered to Node 3.
Now Node 3 strips away a layer.
The final node knows what the message is and where it's going, but it doesn't know who sent it. All it knows is that Node 3 sent it the message; it doesn't know about anyone else in the path. One of the key properties here is that once a node decrypts a layer, it cannot tell how many more layers there are to decrypt. It could be as few as 1 or 2 or as many as 200 layers of encryption.
Now there's no way Amazon can find out you watch Netflix! Netflix sends back a part of Stranger Things.
Let's see how it works in reverse.
Node 4 adds its layer of encryption now. It doesn't know who originally made the request, all it knows is that Node 3 sent the request to them so it sends the response message back to Node 3.
And so on for the next few nodes.
Now the response packet is fully encrypted.
Now the packet is fully encrypted, the only one who still knows what the message contains is Node 4. The only one who knows who made the message is Node 1. Now that we have the fully encrypted response back, we can use all the symmetric keys to decrypt it.
You might be thinking "I've seen snails faster than this", and you would be right. This protocol isn't designed for speed, but at the same time it has to care about speed.
The algorithm could be much slower but much more secure (using entirely public-key cryptography instead of symmetric-key cryptography), but the usability of the system matters. So yes, it's slow; no, it's not as slow as it could be. It's all a balancing act.
The encryption used is normally AES with the key being shared via Diffie-Hellman.
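To see the layering mechanics without real AES, here's a toy sketch that uses XOR with a repeating key as a stand-in symmetric cipher. XOR is its own inverse, so one function both adds and peels a layer; real Tor uses AES with keys agreed via Diffie-Hellman, and the key names below are made up:

```python
def xor_layer(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for AES: XOR with a repeating key.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# One symmetric key per node in the circuit.
keys = [b"node1key", b"node2key", b"node3key", b"node4key"]

# The sender wraps the message: the last node's layer goes on first
# (innermost), the first node's layer goes on last (outermost).
message = b"GET strangerthings"
onion = message
for key in reversed(keys):
    onion = xor_layer(onion, key)

# Each node along the path then peels exactly one layer, in order.
for key in keys:
    onion = xor_layer(onion, key)

assert onion == message  # fully peeled: the destination sees the request
```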
The paths Tor creates are called circuits. Let's explore how Tor chooses what nodes to use in a circuit.
Each machine, when it wants to create a circuit, chooses the exit node first, followed by the other nodes in the circuit. Tor circuits are always 3 nodes. Increasing the length of the circuit does not create better anonymity. If an attacker owns the first and last nodes in the network, you can have 1500 nodes in the circuit and it still wouldn't make you more secure.
When Tor selects the exit node, it selects it following these principles:
Does the user's torrc (the configuration file of Tor) have settings about which exit nodes not to choose? All paths in the circuit obey these rules:
If you choose the same node twice, it is guaranteed that the node will be either the guard node (the node you enter at) or the exit node, both dangerous positions, and it may even be both at once, which is more dangerous still. We want to avoid the entry/exit attacks.
Operators who run more than 1 Tor node can choose to signify their nodes as 'family'. This means that the nodes have all the same parent (the operator of their network). This is again a countermeasure against the entry / exit attacks, although operators do not have to declare family if they wish. If they want to become a guard node (discussed soon) it is recommended to declare family, although not required.
Subnets define networks. IPv4 addresses are made up of 4 octets of 8 bits each. As an example, Google's IP address in binary is:
01000000.11101001.10101001.01101010
The first 16 bits (the /16 subnet) are 01000000.11101001, which means that Tor does not choose any other nodes whose addresses start with the same 16 bits as this IP address. Again, a counter-measure to the entry/exit attacks.
If subnets sound confusing, I've written this Python code to help explain them:
# The IP address here is written in binary, not the usual base 10.
IP = "01000000.11101001.10101001.01101010"
subnet = 16

# this will store the subnet address once we find it
subnet_ip = []
counter = 0
for i in IP:
    # we want to end the loop once we have collected `subnet` bits
    if counter >= subnet:
        break
    # the IP address separates each octet of bits with full stops;
    # we don't want to count a full stop as a bit,
    # but we do want to include it in the final subnet
    if i == ".":
        subnet_ip.append(".")
    else:
        subnet_ip.append(i)
        counter = counter + 1

print("Subnet is " + ''.join(subnet_ip))
Non-running means the node currently isn't online; you don't want to pick nodes that aren't online. Non-valid means that some configuration in the node's torrc is wrong; you don't want to accept strange configurations in case they are trying to hack or break something.
A guard node is a privileged node because it sees the real IP of the user. It's "expensive" to become a guard node (you must maintain a high uptime for weeks and have good bandwidth).
This is possible for large companies who have 99.9% uptime and high bandwidth (such as Netflix). Tor has no way to stop a powerful adversary from registering a load of guard nodes. Right now, Tor is configured to stick with a single guard node for 12 weeks at a time, so you choose 4 new guard nodes a year.
This means that if you use Tor once to watch Amazon Prime Video, it is relatively unlikely for Netflix to be your guard node. Of course, the more guard nodes Netflix creates, the more likely it becomes. Although, if Netflix knows you are connecting to the Tor network to watch Amazon Prime Video, it could have to wait up to 12 weeks for its suspicions to be confirmed, unless it attacks the guard node and takes it over.
Becoming a guard node is relatively easy for a large organisation. Becoming the exit node is slightly harder, but still possible. We have to assume that the large organisation has infinite computational power to be able to do this. The solution is to make the attack highly expensive with a low rate of success.
The more regular users Tor has, the harder it is for a large organisation to attack it. If Netflix controls $\frac{50}{100}$ nodes in the network:
The chance of you choosing a guard node from Netflix is 50%.
If suddenly 50 more normal user nodes join then that's $\frac{50}{150}$, reducing the probability of Netflix owning a guard node (and thus, a potential attack) and making it even more expensive.
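The arithmetic from the example above:

```python
netflix_nodes = 50
total_nodes = 100

# Chance that a uniformly chosen guard node belongs to Netflix.
p_before = netflix_nodes / total_nodes

# 50 more honest nodes join: the same 50 hostile nodes are now a
# smaller fraction, so the attack gets more expensive.
p_after = netflix_nodes / (total_nodes + 50)
```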
There is strength in numbers within the Tor service.
When a Tor client starts up for the first time, it chooses a small & random set of guard nodes. For the next few months, it makes sure each circuit is using one of these pre-selected nodes as its guard node.
The official proposal from the Tor documentation states:
1. Introduction and motivation
   Tor uses entry guards to prevent an attacker who controls some
   fraction of the network from observing a fraction of every user's
   traffic. If users chose their entries and exits uniformly at
   random from the list of servers every time they build a circuit,
   then an adversary who had (k/N) of the network would deanonymize
   F=(k/N)^2 of all circuits... and after a given user had built C
   circuits, the attacker would see them at least once with
   probability 1-(1-F)^C. With large C, the attacker would get a
   sample of every user's traffic with probability 1.

   To prevent this from happening, Tor clients choose a small number
   of guard nodes (currently 3). These guard nodes are the only
   nodes that the client will connect to directly. If they are not
   compromised, the user's paths are not compromised.

   But attacks remain. Consider an attacker who can run a firewall
   between a target user and the Tor network, and make many of the
   guards they don't control appear to be unreachable. Or consider
   an attacker who can identify a user's guards, and mount
   denial-of-service attacks on them until the user picks a guard
   that the attacker controls.
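The formulas quoted above are easy to check numerically. This sketch computes the probability that at least one of a user's circuits has a compromised entry and exit:

```python
def p_compromised(k, n, circuits):
    # F = (k/n)^2: fraction of circuits whose entry AND exit
    # both belong to an attacker holding k of n relays.
    F = (k / n) ** 2
    # Probability at least one of C circuits is compromised.
    return 1 - (1 - F) ** circuits
```

With only 10% of the network, a single circuit is 1% likely to be compromised, but after 500 circuits the attacker has almost certainly seen the user at least once. Guard pinning exists precisely to break this accumulation.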
Guard node pinning is important because of Tor's threat model. Tor assumes that it may only take a single opening for an adversary to work out who you are talking to, or who you are. Since a single vulnerable circuit can destroy your anonymity, Tor tries to minimise the probability that we will ever construct one or more vulnerable circuits.
Tor guard nodes can be DoS'd, or an attacker could hold a majority share of guard nodes on the internet when you connect. Guard node pinning aims to make this much harder.
An attacker might work out your guard nodes and shut them down, forcing you onto guard nodes they control; or you might simply pick an adversary-controlled guard node by chance. Tor has algorithms in place to try to detect this, outlined here.
The state of the Tor network is tracked and publicised by a group of 9 trusted servers (as of 2019) known as directory nodes, each controlled by a different organisation.
Having each node run by a separate organisation provides redundancy and distributes trust. The integrity of the Tor network relies on the honesty and correctness of the directory nodes, so making the network resilient and distributing trust is critical.
Directory nodes maintain a list of currently running relays (publicly listed nodes in the Tor network). Once per hour, the directory nodes publish a consensus together. The consensus is a single document compiled and voted on by each directory node. It ensures that all clients have the same information about the relays that make up Tor.
When a Tor user (a client or a node) wants to know the current state of the network, it asks a directory node. As we'll see later, directory nodes are essential for all parts of Tor, especially in hidden services.
Relays keep the directory nodes up to date. They notify the directory nodes whenever they come online or are updated. Whenever a directory node receives a notification, it updates its personal opinion on the current state of the Tor network. All directory nodes then use these opinions to form a consensus of the network.
Let's now look at what happens when disagreements arise in the directory services when forming a consensus.
The first version of Tor took a simple approach to conflict resolution. Each directory node gave the state of the network as it personally saw it. Each client believed whichever directory node it had spoken to recently. There is no consensus here among all directory nodes.
In Tor, this is a disaster. There was nothing ensuring that directory nodes were telling the truth. If an adversary took over one directory node, they would be able to lie about the state of the network.
If a client asked this adversary-controlled directory for the state of the network, it'd return a list containing only nodes that the adversary controlled. The client would then connect to these adversary nodes.
The second version of the Tor directory system made this attack harder. Instead of asking a single directory node for its opinion, clients asked every directory node and combined their opinions into a consensus. But clients could form differing views of the network depending on when they had last spoken to each directory node. This leaked statistical information, not as bad as in Tor 1.0, but still a problem. Worse, every client had to talk to every directory node, which took time and was expensive.
The third and current version of the directory system moved the responsibility of calculating a consensus from clients to directory nodes.
I'm not sure if you noticed earlier, but I made a distinction between nodes that are part of the directory services and nodes that aren't.
If a repressive state wants to block Tor, it uses the directory nodes. Directory nodes keep up-to-date lists of Tor relay nodes and are publicly available for anyone to download.
The state can query a directory node for a list of active Tor relays, and censor all traffic to them.
Tor keeps an up-to-date listing of countries where it is possibly blocked (censored) if you're interested.
Tor helps its users circumvent this censorship by hiding the fact that they are using Tor. It does this through a proxy known as a bridge node. Tor users send their traffic to the bridge node, which forwards it on to the user's chosen guard nodes.
The full list of bridge nodes is never published, making it difficult for states to completely block Tor. You can view some bridge nodes here. If this doesn't work, Tor suggests:
Another way to get bridges is to send an email to bridges@torproject.org. Please note that you must send the email using an address from one of the following email providers: Riseup or Gmail.
It's possible to block Tor another way. Censoring states can use Deep Packet Inspection (DPI) to analyse the shape, volume, and feel of each packet. Using DPI, states can recognise Tor traffic even when it is encrypted and headed to unknown IP addresses.
To circumvent this, Tor developers have made Pluggable Transports (PT). These transform the Tor traffic flow between the client and the bridge. In the words of Tor's documentation:
This way, censors who monitor traffic between the client and the bridge will see innocent-looking transformed traffic instead of the actual Tor traffic. External programs can talk to Tor clients and Tor bridges using the pluggable transport API, to make it easier to build interoperable programs.
Ever heard those rumours? "There are websites on the dark web, on Tor, that when you visit them you'll see people doing nasty things, selling illegal things, or worse: watching The Hangover Part 3."
When people talk about these websites they are talking about Tor Hidden Services.
These are a wild concept and honestly deserve an entire blog post of their own. Hidden services are servers, like any normal computer server.
Except with a Tor hidden service, the user and the server can communicate without either knowing who the other is.
The device (the question mark) knows that it wants to access Netflix, but it doesn't know anything about the server, and the server doesn't know anything about the device that's asked to access it. This is quite confusing, but don't worry, I'm going to explain it all with cool diagrams. ✨
When a server is set up on Tor to act as a hidden service, the server sends a message to some selected Onion Routers asking if they want to be an introduction point to the server. It is entirely up to the server as to who gets chosen as an introduction point, although usually they ask 3 routers to be their introduction points.
The introduction points know that they are going to be introducing people to the server.
The server will then create something called a hidden service descriptor, which has a public key and the IP address of each introduction point. It then sends this hidden service descriptor to a distributed hash table, which means that every onion router (not just the introduction points) holds some part of the information about the hidden service.
If you try to look up a hidden service, the node responsible for it will give you the full hidden service descriptor, including the addresses of the hidden service's introduction points.
The key for this hash table is the onion address and the onion address is derived from the public key of the server.
The idea is that the onion address isn't publicised over the whole Tor network; instead, you find it another way, like a friend telling you or finding it on the internet (addresses ending in .onion).
The way that the distributed hash table is programmed means that the vast majority of the nodes won't know what the descriptor is for a given key.
So almost every single onion router will have minimal knowledge about the hidden service unless they explicitly want to find it.
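Conceptually, the lookup is an ordinary hash-table lookup where the key is derived from the service's public key. A minimal sketch, collapsing the distributed table into one dict, hashing v2-style with SHA-1 (the key bytes and IP addresses are invented):

```python
import base64
import hashlib

def onion_address(public_key: bytes) -> str:
    # v2-style address: first 80 bits of SHA-1(public key), base32-encoded
    digest = hashlib.sha1(public_key).digest()[:10]
    return base64.b32encode(digest).decode().lower() + ".onion"

# The distributed hash table, collapsed into a single dict for
# illustration. In reality the descriptors are spread over many onion
# routers, each holding only entries whose keys fall near its identity.
dht = {}

service_key = b"hypothetical public key bytes"
descriptor = {"introduction_points": ["203.0.113.7", "198.51.100.4"]}
dht[onion_address(service_key)] = descriptor

# A client who learned the address out-of-band can fetch the descriptor:
addr = onion_address(service_key)
print(addr, "->", dht[addr]["introduction_points"])
```

Because the address is a hash of the public key, the client can verify that the descriptor it receives really was signed by the key the address commits to.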
Let's say someone gave you the onion address. You request the descriptor from the hash table and get back the service's introduction points.
The descriptor has, let's say, 4 or 5 IP addresses of introduction nodes. You pick one at random; let's say the top one.
You're going to ask the introduction point to introduce you to the server. But instead of making a connection directly to the server, you pick a rendezvous point at random from a given set of onion routers.
You then build a circuit to that rendezvous point and send it a message asking if it can introduce you to the server via the introduction point you just chose. You also send the rendezvous point a one-time password (in this example, let's use 'Labrador').
The rendezvous point makes a circuit to the introduction point and sends it the word 'Labrador' and its IP address.
The introduction point sends the message to the server and the server can choose to accept it or do nothing.
If the server accepts the message it will then create a circuit to the rendezvous point.
The server sends the rendezvous point a message. The rendezvous point looks at both messages, from your computer and from the server, and says: "well, I've received a message from this computer saying it wants to connect to this service, and I've also received a message from the service asking to connect to a computer, so they must want to talk to each other".
The rendezvous point will then act as another hop on the circuit and connect them.
In short, a hidden service works like this, taken from here:
Tor protects its users from traffic analysis attacks, where the adversary wants to know who Alice is talking to. Yet Tor does not protect against confirmation attacks, in which the adversary aims to answer the question "Is Alice talking to Bob?"
Confirmation attacks are hard and need a lot of preparation and resources. The attacker needs to be able to monitor both ends of the circuit, either by directly tapping each device's internet connection or by controlling the guard and exit nodes.
If Alice sends a packet like this:
# (timestamp, size, port, protocol)
(17284812, 3, 21, SSH)
And Bob receives a matching packet, the attacker can tell the packets correspond, even though the attacker cannot read their contents, which are encrypted. Does Bob tend to receive packets at the same time that Alice sends them? Are they the same size? If so, it is reasonable to infer that Alice and Bob are communicating with each other.
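A hedged sketch of what such an attacker computes: given packet logs of (timestamp, size) pairs captured near Alice and near Bob, count how often a packet Alice sends is matched by a same-size packet arriving at Bob shortly afterwards (the log data below is invented):

```python
def correlation_score(sent, received, window=2):
    """Fraction of Alice's packets matched by a same-size packet
    arriving at Bob within `window` time units."""
    matched = 0
    for t_sent, size in sent:
        if any(t_sent <= t_recv <= t_sent + window and size == s
               for t_recv, s in received):
            matched += 1
    return matched / len(sent)

alice_sent = [(100, 512), (105, 512), (120, 1024)]
bob_recv   = [(101, 512), (106, 512), (121, 1024)]   # suspiciously similar
carol_recv = [(300, 256), (400, 512)]                # unrelated traffic

print(correlation_score(alice_sent, bob_recv))    # high: likely talking
print(correlation_score(alice_sent, carol_recv))  # low: likely not
```

Note that the attacker never decrypts anything; timing and size alone are enough, which is why the defences discussed next focus on padding and reordering.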
Tor breaks traffic into fixed-size cells for a reason: to try to prevent exactly this kind of analysis. Tor is also working on padding all traffic to make it harder still.
They're discussing adding packet order randomisation too, but this is too costly at the moment. The Tor Browser does add some extra defences, such as reordering packets.
If Alice sends the packets A, B, C and Bob receives them as B, A, C, it is harder to detect that they correspond. It's not foolproof, but it does make the attack harder.
An attack where the attacker tries to control both ends of the circuit is called a Sybil attack, named after the subject of the book Sybil by Flora Rheta Schreiber. We discussed some of this earlier, where an attacker controls both the guard and exit nodes.
Sybil attacks are not theoretical. In 2014 researchers at Carnegie Mellon University appeared to successfully carry out a Sybil Attack against the real-life Tor network.
When Lizard Squad, a group of hackers, tried to perform a Sybil attack, a detection system raised the alarm. Tor has built-in monitoring against these kinds of events and is working on more sophisticated monitoring against Sybil attacks.
In 2007, Dan Egerstad, a Swedish security consultant, revealed he had intercepted usernames and passwords sent through Tor by running an exit node. At the time, these were not TLS or SSL encrypted.
Interestingly, Dan Egerstad had this to say on the Tor nodes:
If you actually look into where these Tor nodes are hosted and how big they are, some of these nodes cost thousands of dollars each month just to host because they're using lots of bandwidth, they're heavy-duty servers and so on. Who would pay for this and be anonymous?
Tor does not normally hide the fact that you are using Tor. Many websites (such as BBC's iPlayer) and services (such as editing Wikipedia) block you when you are using a known Tor node.
Some applications, when run over Tor, reveal your true IP address. One such application is BitTorrent.
Jansen et al. described an attack where they DDoS exit nodes. By degrading the network (removing exit nodes), an attacker increases the chance of their own exit nodes being picked.
Tor users who visit a site twice, once on Tor and once off, can be tracked. The way you move your mouse is unique. There is a JavaScript time measurement bug report on the Tor project that shows how it's possible to monitor the mouse locations on a site (even when on Tor). Once you fingerprint someone twice, you know they're the same person.
It should be noted that the Tor Browser offers 3 levels of security (located in the settings). The highest security level disables JavaScript, some images (as they can be used to track you), and some fonts too. The lesson is: if you want high-security Tor, use the highest security level.
Now, all these attacks sound cool, but that's not how most Tor users are caught. Most Tor users make mistakes and are caught because of their own errors.
Take Dread Pirate Roberts, founder of the Silk Road dark marketplace. He gave himself away by posting about it on social media.
Most Tor users are caught (if they're doing illegal things) by bad operational security, and not normally because of a security issue with Tor.
It's worth repeating this story, which we saw earlier.
In 2013 during the Final Exams period at Harvard, a student tried to delay the exam by sending in a fake bomb threat. The student used Tor and Guerrilla Mail (a service which allows people to make disposable email addresses) to send the bomb threat to school officials.
The student was caught, even though he took precautions to make sure he wasn't caught.
Guerrilla Mail sends an originating IP address header along with each email, so the receiver knows where the original email came from. With Tor, the student expected the IP address to be scrambled, but the authorities knew it came from a Tor exit node (Tor keeps a list of all nodes in the directory service), so they looked for people who were accessing Tor within the university at the time the email was sent.
If this person had gone to a coffee shop or somewhere similar, he probably would have been fine.
There's a fantastic talk at DEF CON 22 about how Tor users got caught. None of the stories mentioned were caused by Tor itself, but rather by bad OpSec.
Tor is a fascinating protocol full of algorithms that have been refined over the years. I've come to appreciate Tor, and I hope you have too. Here is a list of things we've covered:
If you want to learn more, check out the paper on Tor titled "Tor: The Second-Generation Onion Router".
If you liked this article and want more like it, sign up to my email list below. ✨ I'll only send you an email when I have something new, which is every month or two.
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time. This blog is my full-time job, so any and all donations are appreciated.
Dynamic programming is breaking down a problem into smaller sub-problems, solving each sub-problem, and storing the solutions in an array (or similar data structure) so each sub-problem is only calculated once.
It is both a mathematical optimisation method and a computer programming method.
Optimisation problems seek a maximum or minimum solution. A general rule of thumb: if the naive algorithm for a problem runs in O(2^n) time, the problem is a good candidate for dynamic programming.
Richard Bellman invented DP in the 1950s. Bellman named it Dynamic Programming because at the time, RAND (his employer), disliked mathematical research and didn't want to fund it. He named it Dynamic Programming to hide the fact he was really doing mathematical research.
Bellman explains the reasoning behind the term Dynamic Programming in his autobiography, Eye of the Hurricane: An Autobiography (1984, page 159). He explains:
"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. Iâm not using the term lightly; Iâm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word âprogrammingâ. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
Sub-problems are smaller versions of the original problem. Let's see an example. With the equation below:
$$1 + 2 + 3 + 4$$
We can break this down to:
$$1 + 2$$
$$3 + 4$$
Once we solve these two smaller problems, we can add the solutions to these sub-problems to find the solution to the overall problem.
Notice how these sub-problems break the original problem into components that build up the solution. This is a small example, but it illustrates the beauty of dynamic programming well. If we expand the problem to adding hundreds of numbers, it becomes clearer why we need to store the answers to sub-problems. Take this example:
$$6 + 5 + 3 + 3 + 2 + 4 + 6 + 5$$
We have $6 + 5$ twice. The first time we see it, we work out $6 + 5$. When we see it the second time we think to ourselves:
"Ah, 6 + 5. I've seen this before. It's 11!"
In Dynamic Programming we store the solution to the problem so we do not need to recalculate it. By finding the solutions for every single sub-problem, we can tackle the original problem itself.
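The repeated $6 + 5$ above can be captured with a tiny cache. A minimal, purely illustrative sketch:

```python
cache = {}

def add_pair(a, b):
    # Store the answer keyed by the sub-problem, so a repeated
    # sub-problem is looked up rather than recomputed.
    if (a, b) not in cache:
        cache[(a, b)] = a + b
    return cache[(a, b)]

# 6 + 5 + 3 + 3 + 2 + 4 + 6 + 5, paired up:
pairs = [(6, 5), (3, 3), (2, 4), (6, 5)]
total = sum(add_pair(a, b) for a, b in pairs)
print(total)        # 34
print(len(cache))   # 3: the repeated (6, 5) was only computed once
```

The dictionary here is the memoisation table; for addition the saving is trivial, but the same pattern pays off enormously when each sub-problem is expensive.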
Memoisation is the act of storing a solution.
Let's see why storing answers to sub-problems makes sense. We're going to look at a famous problem: the Fibonacci sequence. This problem is normally solved with divide and conquer.
There are 3 main parts to divide and conquer:

1. Divide the problem into smaller sub-problems.
2. Conquer each sub-problem by solving it recursively.
3. Combine the sub-problem solutions into a solution for the original problem.

Dynamic programming adds one extra step to step 2: memoisation.
The Fibonacci sequence is a sequence of numbers in which each number is the sum of the two before it. We start with 0 and 1.
$$1 + 0 = 1$$
$$1 + 1 = 2$$
$$2 + 1 = 3$$
$$3 + 2 = 5$$
$$5 + 3 = 8$$
In Python, this is:
def F(n):
    # Base cases: F(0) = 0, F(1) = 1
    if n == 0 or n == 1:
        return n
    else:
        return F(n - 1) + F(n - 2)
If you're not familiar with recursion I have a blog post written for you that you should read first.
Let's calculate F(4). In an execution tree, this looks like:
We calculate F(2) twice. On bigger inputs (such as F(10)) the repetition builds up. The purpose of dynamic programming is to not calculate the same thing twice.
Instead of calculating F(2) twice, we store the solution somewhere and only calculate it once.
We'll store the solution in an array. F[2] = 1. Below is some Python code to calculate the Fibonacci sequence using Dynamic Programming.
def fibonacciVal(n):
    if n < 2:
        return n
    memo = [0] * (n + 1)      # table of solved sub-problems
    memo[0], memo[1] = 0, 1
    for i in range(2, n + 1):
        memo[i] = memo[i - 1] + memo[i - 2]
    return memo[n]
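The same memoisation can also be done top-down, keeping the recursive shape of F while caching results. Python's standard-library functools.lru_cache does the bookkeeping for us:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Identical to the naive recursion, but each fib(k) is computed
    # once and then served from the cache on every later call.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # 55
print(fib(50))  # 12586269025, instant, unlike the uncached version
```

Bottom-up (the loop above) and top-down (this version) compute the same table; the difference is only in whether you fill it in a fixed order or on demand.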
In theory, Dynamic Programming can solve every problem. The question is then:
"When should I solve this problem with dynamic programming?"
We should use dynamic programming for problems that sit between tractable and intractable.
Tractable problems are those that can be solved in polynomial time. That's a fancy way of saying we can solve them quickly; binary search and sorting are fast. Intractable problems run in exponential time. They're slow, and in the worst case can only be solved by brute-forcing through every single combination (NP-hard problems are the classic example).
When we see terms like:
"shortest/longest, minimized/maximized, least/most, fewest/greatest, "biggest/smallest"
We know it's an optimisation problem.
The proof of correctness for a dynamic programming algorithm is usually self-evident. Other algorithmic strategies are often much harder to prove correct, and thus more error-prone.
When we see these kinds of terms, the problem may ask for a specific number ("find the minimum number of edit operations") or it may ask for a result ("find the longest common subsequence"). The latter type of problem is harder to recognize as a dynamic programming problem. If something sounds like optimisation, dynamic programming can solve it.
Imagine we've found a problem that's an optimisation problem, but we're not sure if it can be solved with Dynamic Programming. First, identify what we're optimising for. Once we realize what we're optimising for, we have to decide how easy it is to perform that optimisation. Sometimes, the greedy approach is enough for an optimal solution.
Dynamic programming refines the brute force approach: it identifies repeated work and eliminates the repetition.
Before we even start to plan the problem as a dynamic programming problem, think about what the brute force solution might look like. Are sub-steps repeated in the brute-force solution? If so, we try to frame the problem as a dynamic programming problem.
Mastering dynamic programming is all about understanding the problem. List all the inputs that can affect the answers. Once we've identified all the inputs and outputs, try to identify whether the problem can be broken into subproblems. If we can identify subproblems, we can probably use Dynamic Programming.
Then, figure out what the recurrence is and solve it. When we're trying to figure out the recurrence, remember that whatever recurrence we write has to help us find the answer. Sometimes the answer will be the result of the recurrence, and sometimes we will have to get the result by looking at a few results from the recurrence.
Dynamic Programming can solve many problems, but that does not mean there isn't a more efficient solution out there. Solving a problem with Dynamic Programming feels like magic, but remember that dynamic programming is merely a clever brute force. Sometimes it pays off well, and sometimes it helps only a little.
Now we have an understanding of what dynamic programming is and how it generally works. Let's look at how to create a dynamic programming solution to a problem. We're going to explore the process using the Weighted Interval Scheduling problem.
Pretend you're the owner of a dry cleaner. You have n customers come in and give you clothes to clean. You can only clean one customer's pile of clothes (PoC) at a time. Each pile of clothes, i, must be cleaned at some pre-determined start time $s_i$ and some predetermined finish time $f_i$.
Each pile of clothes has an associated value, $v_i$, based on how important it is to your business. For example, some customers may pay more to have their clothes cleaned faster. Or some may be repeating customers and you want them to be happy.
As the owner of this dry cleaner, you must determine the optimal schedule of clothes that maximises the total value for the day. This problem is a re-wording of the Weighted Interval Scheduling problem.
You will now see 4 steps to solving a Dynamic Programming problem. Sometimes, you can skip a step. Sometimes, your problem is already well defined and you don't need to worry about the first few steps.
Grab a piece of paper. Write out:
In the dry cleaner problem, let's put down into words the subproblems. What we want to determine is the maximum value schedule for each pile of clothes such that the clothes are sorted by start time.
Why sort by start time? Good question! We want to keep track of processes which are currently running. Sorting by finish time doesn't make much sense in our heads: we could have 2 piles with similar finish times but different start times. Time moves in a linear fashion, from start to finish. If we have a pile of clothes that starts at 1 pm, we know to put it on when it reaches 1 pm. If we have a pile of clothes that finishes at 3 pm, we might have needed to put it on at 12 pm, but it's 1 pm now.
We can find the maximum value schedule for piles $n - 1$ through to n. And then for $n - 2$ through to n. And so on. By finding the solution to every single sub-problem, we can tackle the original problem itself. The maximum value schedule for piles 1 through n. Sub-problems can be used to solve the original problem, since they are smaller versions of the original problem.
With the interval scheduling problem, the only way we can solve it is by brute-forcing all subsets of the problem until we find an optimal one. What we're saying is that instead of brute-forcing one by one, we divide it up. We brute force from $n-1$ through to n. Then we do the same for $n - 2$ through to n. Finally, we have loads of smaller problems, which we can solve dynamically. We want to build the solutions to our sub-problems such that each sub-problem builds on the previous problems.
I know, mathematics sucks. If you'll bear with me here, you'll find that this isn't that hard. Mathematical recurrences are used to define the running time of a divide and conquer (dynamic programming) technique.
Recurrences are also used to define problems. If it's difficult to turn your subproblems into maths, then it may be the wrong subproblem.
There are 2 steps to creating a mathematical recurrence:

1. Define the base case.
2. Define the recursive step: what decision do we make at step n, given the answers to smaller steps?

Base cases are the smallest possible denomination of a problem.
When creating a recurrence, ask yourself these questions:
"What decision do I make at step 0?"
It doesn't have to be 0. The base case is the smallest possible denomination of a problem. We saw this with the Fibonacci sequence: the base case was n == 0 or n == 1, where we simply return n.

It's important to know where the base case lies, so we can create the recurrence. In our problem, we have one decision to make for each pile of clothes:

Run that pile of clothes,

or

Don't run that pile of clothes.

If n is 0, that is, if we have 0 piles of clothes, then we do nothing. Our base case is:

if n == 0, return 0
Now that we know the base case, what do we do at step n? Consider each pile of clothes that is compatible with the schedule so far. Compatible means that its start time is after the finish time of the pile of clothes currently being washed. For each such pile, the algorithm has 2 options: run it, or skip it.
We know what happens at the base case, and what happens else. We now need to find out what information the algorithm needs to go backwards (or forwards).
"If my algorithm is at step i, what information would it need to decide what to do in step i+1?"
To decide between the two options, the algorithm needs to know the next compatible PoC (pile of clothes). The next compatible PoC for a given pile, p, is the PoC, n, such that $s_n$ (the start time for PoC n) happens after $f_p$ (the finish time for PoC p). The difference between $s_n$ and $f_p$ should be minimised.
In English, imagine we have one washing machine. We put in a pile of clothes at 13:00, and that wash finishes at 14:00. Another pile starts at 13:01; we can't open the washing machine mid-cycle and put it in. Our next compatible pile of clothes is the one that starts after the finish time of the one currently being washed.
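Since the piles are sorted by start time, finding the next compatible pile is just a search for the first pile whose start time is at or after the current pile's finish time. With Python's standard-library bisect module this is nearly a one-liner (the pile data is invented):

```python
import bisect

def next_compatible(piles, i):
    """Index of the first pile starting at or after piles[i]'s finish
    time, or len(piles) if none exists. piles must be sorted by start
    time; each pile is a (start, finish, value) tuple."""
    starts = [s for s, _, _ in piles]
    return bisect.bisect_left(starts, piles[i][1])

piles = [(1, 3, 5), (2, 5, 6), (4, 6, 5), (6, 7, 4)]
print(next_compatible(piles, 0))  # 2: pile (4, 6, 5) starts after finish time 3
print(next_compatible(piles, 1))  # 3: the first start time >= 5 is 6
```

A linear scan would also work; the binary search just brings the lookup down to O(log n), which matters once there are many piles.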
"If my algorithm is at step i, what information did it need to decide what to do in step i-1?"
The algorithm needs to know about future decisions: the ones made for piles i through n, in order to decide whether or not to run pile i-1.
Now that we've answered these questions, we've started to form a recurring mathematical decision in our mind. If not, that's also okay; it becomes easier to write recurrences as we get exposed to more problems.
Here's our recurrence:
$$OPT(i) = \begin{cases} 0, \quad \text{if } i = 0 \\ \max\{v_i + OPT(next[i]),\; OPT(i+1)\}, \quad \text{if } i > 0 \end{cases}$$
Let's explore in detail what makes this mathematical recurrence. OPT(i) represents the maximum value schedule for PoC i through to n such that PoC is sorted by start times. OPT(i) is our subproblem from earlier.
We start with the base case. All recurrences need somewhere to stop. If we call OPT(0) we'll be returned with 0.
To determine the value of OPT(i), there are two options. We want to take the maximum of these options to meet our goal. Our goal is the maximum value schedule for all piles of clothes. Once we choose the option that gives the maximum result at step i, we memoize its value as OPT(i).
Mathematically, the two options - run or not run PoC i, are represented as:
$$v_i + OPT(next[i])$$

This represents the decision to run PoC i. It adds the value gained from PoC i to OPT(next[i]), where next[i] is the next compatible pile of clothing following PoC i. When we add these two values together, we get the maximum value schedule from i through to n, such that the piles are sorted by start time, if i runs.

The schedule stays sorted by start time here because next[i] is the first compatible pile after PoC i, so by default the piles remain in start-time order.
$$OPT(i + 1)$$
If we decide not to run i, our value is then OPT(i + 1). The value is not gained. OPT(i + 1) gives the maximum value schedule for i+1 through to n, such that they are sorted by start times.
The solution to our Dynamic Programming problem is OPT(1). We can write out the solution as the maximum value schedule for PoC 1 through n such that PoC is sorted by start time. This goes hand in hand with "maximum value schedule for PoC i through to n".
From step 2:
$$OPT(1) = max(v_1 + OPT(next[1]), OPT(2))$$
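The recurrence can be transcribed into code almost literally. A minimal top-down sketch with memoisation, using 0-based indices and a linear scan for the next compatible pile to keep it short (the pile data is invented):

```python
def max_value_schedule(piles):
    """piles: list of (start, finish, value) tuples, sorted by start
    time. Returns the maximum total value of a compatible schedule,
    following OPT(i) = max(v_i + OPT(next[i]), OPT(i + 1))."""
    n = len(piles)
    memo = {}

    def nxt(i):
        # First pile whose start time is >= piles[i]'s finish time.
        finish = piles[i][1]
        for j in range(i + 1, n):
            if piles[j][0] >= finish:
                return j
        return n  # no compatible pile remains

    def opt(i):
        if i >= n:          # past the last pile: nothing left to schedule
            return 0
        if i not in memo:
            run = piles[i][2] + opt(nxt(i))   # option 1: run pile i
            skip = opt(i + 1)                 # option 2: skip pile i
            memo[i] = max(run, skip)
        return memo[i]

    return opt(0)

piles = [(1, 3, 5), (2, 5, 6), (4, 6, 5), (6, 7, 4)]
print(max_value_schedule(piles))  # 14: run piles (1,3,5), (4,6,5), (6,7,4)
```

Each opt(i) is computed once and cached, exactly as the memoisation table in the next section does bottom-up.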
Going back to our Fibonacci numbers earlier, our Dynamic Programming solution relied on the fact that the Fibonacci numbers for 0 through to n - 1 were already memoised. That is, to find F(5) we already memoised F(0), F(1), F(2), F(3), F(4). We want to do the same thing here.
The problem we have is figuring out how to fill out a memoisation table. In the scheduling problem, we know that OPT(1) relies on the solutions to OPT(2) and OPT(next[1]). PoC 2 and next[1] have start times after PoC 1 due to sorting. We need to fill our memoisation table from OPT(n) to OPT(1).
We can see our array is one dimensional, from 1 to n. But, if we couldn't see that we can work it out another way. The dimensions of the array are equal to the number and size of the variables on which OPT(x) relies. In our algorithm, we have OPT(i) - one variable, i. This means our array will be 1-dimensional and its size will be n, as there are n piles of clothes.
If we know that n = 5, then our memoisation array might look like this:
memo = [0, OPT(1), OPT(2), OPT(3), OPT(4), OPT(5)]
0 is also the base case. memo[0] = 0, per our recurrence from earlier.
When I am coding a Dynamic Programming solution, I like to read the recurrence and try to recreate it. Our first step is to initialise the array to size (n + 1). In Python, we don't need to do this. But you may need to do it if you're using a different language.
Our second step is to set the base case.
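These two steps look like this in Python (n = 5 is an invented example size, matching the memo array shown earlier):

```python
n = 5                   # number of piles of clothes, for illustration
table = [0] * (n + 1)   # step 1: one slot for each of OPT(0)..OPT(n)
table[0] = 0            # step 2: the base case, OPT(0) = 0
print(table)            # [0, 0, 0, 0, 0, 0]
```

In Python initialising with zeros already covers the base case, but writing it explicitly keeps the code aligned with the recurrence.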
To find the profit with the inclusion of job[i], we need to find the latest job that doesn't conflict with job[i]. The idea is to use binary search to find the latest non-conflicting job. I've copied the code from here, with edits.
First, let's define what a "job" is. As we saw, a job consists of 3 things:
# Class to represent a job
class Job:
    def __init__(self, start, finish, profit):
        self.start = start
        self.finish = finish
        self.profit = profit
Start time, finish time, and the total profit (benefit) of running that job.
The next step we want to program is the schedule.
# The main function that returns the maximum possible
# profit from given array of jobs
def schedule(job):
    # Sort jobs according to start time
    job = sorted(job, key=lambda j: j.start)

    # Create an array to store solutions of subproblems. table[i]
    # stores the profit for jobs up to job[i] (including job[i])
    n = len(job)
    table = [0 for _ in range(n)]
    table[0] = job[0].profit
Earlier, we learnt that the table is 1 dimensional. We sort the jobs by start time, create this empty table and set table[0] to be the profit of job[0]. Since we've sorted by start times, the first compatible job is always job[0].
Our next step is to fill in the entries using the recurrence we learnt earlier. To find the next compatible job, we're using Binary Search. In the full code posted later, it'll include this. For now, let's worry about understanding the algorithm.
If the next compatible job returns -1, that means all jobs before index i conflict with it (so none can be used). inclProf means we're including that item in the maximum value set. We then store it in table[i], so we can use this calculation again later.
    # Fill entries in table[] using the recursive property
    for i in range(1, n):
        # Find profit including the current job
        inclProf = job[i].profit
        l = binarySearch(job, i)
        if l != -1:
            inclProf += table[l]

        # Store maximum of including and excluding
        table[i] = max(inclProf, table[i - 1])
Our final step is then to return the profit of all items up to n-1.
    return table[n - 1]
The full code can be seen below:
# Python program for weighted job scheduling using Dynamic
# Programming and Binary Search

# Class to represent a job
class Job:
    def __init__(self, start, finish, profit):
        self.start = start
        self.finish = finish
        self.profit = profit

# A Binary Search based function to find the latest job
# (before the current job) that doesn't conflict with the
# current job. "start_index" is the index of the current job.
# This function returns -1 if all jobs before it conflict with it.
def binarySearch(job, start_index):
    # https://en.wikipedia.org/wiki/Binary_search_algorithm
    # Initialise 'lo' and 'hi' for Binary Search
    lo = 0
    hi = start_index - 1

    # Perform Binary Search iteratively
    while lo <= hi:
        mid = (lo + hi) // 2
        if job[mid].finish <= job[start_index].start:
            if job[mid + 1].finish <= job[start_index].start:
                lo = mid + 1
            else:
                return mid
        else:
            hi = mid - 1
    return -1

# The main function that returns the maximum possible
# profit from a given array of jobs
def schedule(job):
    # Sort jobs according to start time
    job = sorted(job, key=lambda j: j.start)

    # Create an array to store solutions of subproblems. table[i]
    # stores the profit for jobs up to job[i] (including job[i])
    n = len(job)
    table = [0 for _ in range(n)]
    table[0] = job[0].profit

    # Fill entries in table[] using the recursive property
    for i in range(1, n):
        # Find profit including the current job
        inclProf = job[i].profit
        l = binarySearch(job, i)
        if l != -1:
            inclProf += table[l]

        # Store maximum of including and excluding
        table[i] = max(inclProf, table[i - 1])
    return table[n - 1]

# Driver code to test the function above
job = [Job(1, 2, 50), Job(3, 5, 20),
       Job(6, 19, 100), Job(2, 100, 200)]
print("Optimal profit is", schedule(job))
Congrats! 🥳 We've just written our first dynamic program! Now that we've got our feet wet, let's walk through a different type of dynamic programming problem.
Imagine you are a criminal. Dastardly smart. You break into Bill Gates's mansion. Wow, okay!?!? How many rooms is this? His washing machine room is larger than my entire house??? Ok, time to stop getting distracted. You brought a small bag with you. A knapsack - if you will.
You can only fit so much into it. Let's give this an arbitrary number. The bag will support weight 15, but no more. What we want to do is maximise how much money we'll make, $b$.
The greedy approach is to pick the item with the highest value which can fit into the bag. Let's try that. We're going to steal Bill Gates's TV. £4000? Nice. But his TV weighs 15. So... we leave with £4000.
TV = (£4000, 15)
# (value, weight)
Bill Gates has a lot of watches. Let's say he has 2 watches. Each watch weighs 5 and each one is worth £2250. When we steal both, we get £4500 with a weight of 10.
watch1 = (£2250, 5)
watch2 = (£2250, 5)
watch1 + watch2
>>> (£4500, 10)
In the greedy approach, we wouldn't choose these watches first. But to us as humans, it makes sense to go for smaller items with higher values. The Greedy approach cannot optimally solve the {0,1} Knapsack problem. The {0, 1} means we either take the whole item {1} or we don't take it at all {0}. However, Dynamic Programming can optimally solve the {0, 1} knapsack problem.
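We can check this intuition with a quick sketch. The item list reuses the TV-and-watches numbers from above; the "greedy" step is simplified to grabbing the single most valuable item that fits, and the optimum is found by brute force over every subset (fine for 3 items):

```python
from itertools import combinations

items = [(4000, 15), (2250, 5), (2250, 5)]  # (value, weight): TV and two watches
capacity = 15

# Simplified greedy: grab the single most valuable item that fits.
# Once the TV (weight 15) is taken, the bag is full.
greedy = max(v for v, w in items if w <= capacity)

# Brute force: check every subset that fits, keep the best total value
best = max(
    sum(v for v, w in subset)
    for r in range(len(items) + 1)
    for subset in combinations(items, r)
    if sum(w for v, w in subset) <= capacity
)

print(greedy, best)  # 4000 4500 — greedy takes the TV, the optimum is both watches
```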
The simple solution to this problem is to consider all the subsets of all items. For every single combination of Bill Gates's stuff, we calculate the total weight and value of that combination.
Only those with weight less than $W_{max}$ are considered. We then pick the combination with the highest value. This is a disaster! How long would this take? Bill Gates would come back home far before you're even a third of the way through! In Big O, this brute-force algorithm takes $O(2^n)$ time, since there are $2^n$ subsets to check.
You can see we already have a rough idea of the solution and what the problem is, without having to write it down in maths!
Imagine we had a listing of every single thing in Bill Gates's house. We stole it from some insurance papers. Now, think about the future. What is the optimal solution to this problem?
We have a subset, L, which is the optimal solution. L is a subset of S, the set containing all of Bill Gates's stuff.
Let's pick a random item, N. L either contains N or it doesn't. If L doesn't contain N, the optimal solution for the problem is the same as the optimal solution for the items ${1, 2, ..., N-1}$.
Suppose the optimum of the original problem did not contain the optimum of this sub-problem. If we could swap in a better solution for the smaller problem, we would improve the whole solution - a contradiction, since we assumed we already had an optimum of the whole problem.
If L contains N, then the rest of L must be an optimal solution for the items ${1, 2, 3, ..., N-1}$. We know item N is in, so to complete the computation we focus on the remaining items and find the optimal solution for them.
But, we now have a new maximum allowed weight of $W_{max} - W_n$. If item N is in the solution, the remaining capacity is the maximum weight minus the weight of item N (which is already in the knapsack).
These are the 2 cases. Either item N is in the optimal solution or it isn't.
If the weight of item N is greater than $W_{max}$, then it cannot be included so case 1 is the only possibility.
To better define this recursive solution, let $S_k = \{1, 2, ..., k\}$ and $S_0 = \emptyset$.
Let B[k, w] be the maximum total benefit obtained using a subset of $S_k$ with total weight at most w.
Then we define B[0, w] = 0 for each $w \le W_{max}$.
Our desired solution is then B[n, $W_{max}$].
$$B[k, w] = \begin{cases} B[k - 1, w] & \text{if } w < w_k \\ \max\{B[k-1, w],\; b_k + B[k - 1, w - w_k]\} & \text{otherwise} \end{cases}$$
Okay, pull out some pen and paper. No, really. Things are about to get confusing real fast. This memoisation table is 2-dimensional. We have these items:
(1, 1), (4, 3), (5, 4), (7, 5)
Where the tuples are (value, weight).
We have 2 variables, so our array is 2-dimensional. The first dimension is the total weight, from 0 to 7. The second dimension is the items.
And we want a weight of 7 with maximum benefit.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) |   |   |   |   |   |   |   |   |
| (4, 3) |   |   |   |   |   |   |   |   |
| (5, 4) |   |   |   |   |   |   |   |   |
| (7, 5) |   |   |   |   |   |   |   |   |
The weight is 7. We start counting at 0. We put each tuple on the left-hand side. Ok. Now to fill out the table!
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 |   |   |   |   |   |   |   |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
The columns are weight. At weight 0, we have a total weight of 0. At weight 1, we have a total weight of 1. Obvious, I know. But this is an important distinction to make which will be useful later on.
When our weight is 0, we can't carry anything no matter what. The total weight of everything at 0 is 0.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 |   |   |   |   |   |   |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If our total weight is 1, the best item we can take is (1, 1). As we go down through this array, we can take more items. At the row for (4, 3) we can either take (1, 1) or (4, 3). But for now, we can only take (1, 1). Our maximum benefit for this row then is 1.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 |   |   |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If our total weight is 2, the best we can do is 1. We only have 1 of each item. We cannot duplicate items. So no matter where we are in row 1, the absolute best we can do is (1, 1).
Let's start using (4, 3) now. Its weight is 3, so while the total weight is only 1 or 2 we cannot take the item; we have to wait until the total weight is at least 3.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 |   |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we have a weight of 3. Let's compare some things. We want to take the max of:
$$\max(4 + T[0][0], 1)$$
If we're at row 2, column 3, we can either take the value from the last row or use the item on this row. When we use the item, we go up one row and count back 3 columns (since the weight of this item is 3).
The column we land on is whatever weight remains after subtracting the weight of the item on this row. The weight of (4, 3) is 3 and we're at weight 3. 3 - 3 = 0, therefore we look at T[0][0]. In general: T[previous row's number][current total weight - item weight].
$$\max(4 + T[0][0], 1)$$
The 1 comes from the previous row. The max here is 4.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 |   |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
$$\max(4 + T[0][1], 1)$$
Total weight is 4, item weight is 3. 4 - 3 = 1, and the previous row is row 0, so we look at T[0][1]. The max here is 5.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 |   |   |   |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
I won't bore you with the rest of this row, as nothing exciting happens. We have 2 items. And we've used both of them to make 5. Since there are no new items, the maximum value is 5.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 |   |   |   |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Onto our next row:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 |   |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Here's a little secret. Our tuples are ordered by weight! That means we can copy the previous row's values up to the point where the new item's weight kicks in. We know that 4 is already the maximum there, so we can fill those cells in. This is where memoisation comes into play - we already have the data, why bother re-calculating it?
At weight 4 we can finally use the new item. We go up one row and head 4 steps back (the weight of the new item). That gives us:
$$\max(5 + T[2][0], 5) = 5$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 |   |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we calculate it for total weight 5.
$$max(5 + T[2][1], 5) = 6$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 |   |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
We do the same thing again:
$$max(5 + T[2][2], 5) = 6$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 |   |
| (7, 5) | 0 |   |   |   |   |   |   |   |
Now we have total weight 7. We choose the max of:
$$max(5 + T[2][3], 5) = max(5 + 4, 5) = 9$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 |   |   |   |   |   |   |   |
If we had total weight 7 and we had the 3 items (1, 1), (4, 3), (5, 4) the best we can do is 9.
Since our new item starts at weight 5, we can copy from the previous row until we get to weight 5.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 |   |   |   |
We then do another max.
Total weight - new item's weight. This is $5 - 5 = 0$. We want the previous row at position 0.
$$max(7 + T[3][0], 6)$$
The 6 comes from the best on the previous row for that total weight.
$$max(7 + 0, 6) = 7$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 |   |   |
$$max(7 + T[3][1], 6) = 8$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 |   |
$$max(7+T[3][2], 9) = 9$$
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
9 is the maximum value we can get by picking items from the set of items such that the total weight is $\le 7$.
Now, what items do we actually pick for the optimal set? We start with this item:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
We want to know where the 9 comes from. It's coming from the top because the number directly above 9 on the 4th row is 9. Since it's coming from the top, the item (7, 5) is not used in the optimal set.
Where does this 9 come from?
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
This 9 is not coming from the row above it. Item (5, 4) must be in the optimal set.
We now go up one row, and go back 4 steps. 4 steps because the item, (5, 4), has weight 4.
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
4 does not come from the row above. The item (4, 3) must be in the optimal set.
The weight of item (4, 3) is 3. We go up and we go back 3 steps and reach:
|        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|---|
| (1, 1) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (4, 3) | 0 | 1 | 1 | 4 | 5 | 5 | 5 | 5 |
| (5, 4) | 0 | 1 | 1 | 4 | 5 | 6 | 6 | 9 |
| (7, 5) | 0 | 1 | 1 | 4 | 5 | 7 | 8 | 9 |
As soon as we reach a point where the weight is 0, we're done. Our two selected items are (5, 4) and (4, 3). The total weight is 7 and our total benefit is 9. We add the two tuples together to find this out.
Let's begin coding this.
Now we know how it works, and we've derived the recurrence for it - it shouldn't be too hard to code. If our two-dimensional array is indexed by i (row) and j (column), then: if our weight j is less than the weight of item i (so item i cannot contribute to j), we carry the value down from the row above:

if j < wt[i]:
    T[i][j] = T[i - 1][j]

Otherwise (the weight of item i is at most j), we take the maximum of including and excluding the item:

else:
    # previous row, subtracting the weight of the item from the
    # total weight, or the previous row without including this item
    T[i][j] = max(val[i] + T[i - 1][j - wt[i]], T[i - 1][j])
This is what the core of the program does. I've copied some code from here to help explain this. I'm not going to explain this code much, as there isn't much more to it than what I've already explained. If you're confused by it, leave a comment below or email me.
# Returns the maximum value that can be put in a knapsack of
# capacity W
def knapSack(W, wt, val, n):
    # Base Case
    if n == 0 or W == 0:
        return 0

    # If the weight of the nth item is more than the knapsack
    # capacity W, then this item cannot be included in the
    # optimal solution
    if wt[n - 1] > W:
        return knapSack(W, wt, val, n - 1)

    # Return the maximum of two cases:
    # (1) nth item included
    # (2) not included
    else:
        return max(val[n - 1] + knapSack(W - wt[n - 1], wt, val, n - 1),
                   knapSack(W, wt, val, n - 1))

# To test the function above
val = [60, 100, 120]
wt = [10, 20, 30]
W = 50
n = len(val)
print(knapSack(W, wt, val, n))
# output: 220
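The code above is the plain recursive formulation. As a companion, here is a sketch of the bottom-up, table-filling version we just walked through, including the walk back through the table to recover the chosen items. The function name `knapsack_table` is my own; the items and capacity match the worked example:

```python
def knapsack_table(items, W):
    """Bottom-up 0/1 knapsack. items is a list of (value, weight) tuples."""
    n = len(items)
    # T[i][j] = best value using the first i items with capacity j
    T = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = items[i - 1]
        for j in range(W + 1):
            if weight > j:
                T[i][j] = T[i - 1][j]  # item cannot fit, copy the row above
            else:
                T[i][j] = max(T[i - 1][j], value + T[i - 1][j - weight])

    # Walk back through the table to recover the chosen items
    chosen, j = [], W
    for i in range(n, 0, -1):
        if T[i][j] != T[i - 1][j]:      # value changed, so item i was taken
            chosen.append(items[i - 1])
            j -= items[i - 1][1]
    return T[n][W], chosen

best, picks = knapsack_table([(1, 1), (4, 3), (5, 4), (7, 5)], 7)
print(best, picks)  # 9 [(5, 4), (4, 3)]
```

This reproduces the table trace exactly: a maximum benefit of 9 from items (5, 4) and (4, 3).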
Time complexity is calculated in Dynamic Programming as:
$$\text{number of unique states} \times \text{time taken per state}$$
For our original problem, the Weighted Interval Scheduling Problem, we had n piles of clothes. Each pile of clothes is solved in constant time. The time complexity is:
$$O(n) \times O(1) = O(n)$$
I've written a post about Big O notation if you want to learn more about time complexities.
With our Knapsack problem, we had n items, and the table grows with the total capacity of the knapsack, so our time complexity is:
$$O(nw)$$
Where n is the number of items, and w is the capacity of the knapsack.
I'm going to let you in on a little secret. It's possible to work out the time complexity of an algorithm from its recurrence, using something called the Master Theorem.
Now, I'll be honest. The master theorem deserves a blog post of its own. For now, I've found this video to be excellent:
Dynamic Programming & Divide and Conquer are similar. Dynamic Programming is based on Divide and Conquer, except we memoise the results.
But, Greedy is different. It aims to optimise by making the best choice at that moment. Sometimes, this doesn't optimise for the whole problem. Take this question as an example. We have 3 coins:
1p, 15p, 25p
And someone wants us to give change of 30p. Greedy works from largest to smallest: at 25p the locally best choice is to pick the 25p coin, then five 1p coins, for a total of 6 coins. The optimal solution is two 15p coins.
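A minimal Dynamic Programming sketch finds that optimum; the helper `min_coins` is hypothetical, not from any library:

```python
def min_coins(coins, amount):
    # dp[a] = fewest coins needed to make amount a (inf if impossible)
    INF = float("inf")
    dp = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount]

print(min_coins([1, 15, 25], 30))  # 2 — two 15p coins, where greedy used 6
```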
Greedy vs Divide & Conquer vs Dynamic Programming

| Greedy | Divide & Conquer | Dynamic Programming |
|---|---|---|
| Optimises by making the best choice at the moment | Optimises by breaking down a problem into simpler versions of itself and solving them recursively | Same as Divide & Conquer, but optimises by caching the answer to each subproblem so no calculation is repeated |
| Doesn't always find the optimal solution, but is very fast | Always finds the optimal solution, but is slower than Greedy | Always finds the optimal solution, but could be pointless on small datasets |
| Requires almost no memory | Requires some memory to remember recursive calls | Requires a lot of memory for memoisation / tabulation |
There are 2 types of dynamic programming. Tabulation and Memoisation.
With memoisation, we don't compute the subproblems in a predetermined order. We perform a recursive call from the root and compute each subproblem as it's needed. Memoisation ensures you never recompute a subproblem, because we cache the results; duplicate sub-trees are not recomputed.
From our Fibonacci sequence earlier, we start at the root node. The subtree F(2) isn't calculated twice.
This starts at the top of the tree and evaluates the subproblems from the leaves/subtrees back up towards the root. Memoisation is a top-down approach.
We've also seen Dynamic Programming being used as a 'table-filling' algorithm. Usually, this table is multidimensional. This is like memoisation, but with one major difference: we have to pick the exact order in which we do our computations. In the knapsack problem we saw, we filled in the table from left to right, top to bottom. We knew the exact order in which to fill the table.
Sometimes the 'table' is not like the tables we've seen. It can be a more complicated structure such as trees. Or specific to the problem domain, such as cities within flying distance on a map.
Generally speaking, memoisation is easier to code than tabulation. We can write a 'memoiser' wrapper function that automatically does it for us. With tabulation, we have to come up with an ordering.
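In Python, for example, `functools.lru_cache` can act as that memoiser wrapper, turning the naive exponential Fibonacci into a linear one:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every (n -> fib(n)) result automatically
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025 — instant, since each subproblem is computed once
```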
Memoisation has memory concerns: if we're computing something large such as F(10^8), every intermediate result has to be stored, and the cache grows very quickly.
Either approach may not be time-optimal if the order in which we happen (or try) to visit subproblems isn't optimal. Memoisation also tends to add its time complexity on to its space complexity. With tabulation, we have more liberty to throw away calculations we no longer need - tabulating Fibonacci lets us use O(1) space, whereas memoised Fibonacci uses O(n) cache and stack space.
Memoisation vs Tabulation

|  | Tabulation | Memoisation |
|---|---|---|
| Code | Harder to code, as you have to know the evaluation order | Easier to code, as functions may already exist to memoise |
| Speed | Fast, as you already know the order and dimensions of the table | Slower, as entries are created on the fly |
| Table completeness | The table is fully computed | The table does not have to be fully computed |
Most of the problems you'll encounter within Dynamic Programming already exist in one shape or another. Often, your problem will build on from the answers for previous problems. Here's a list of common problems that use Dynamic Programming.
I hope that whenever you encounter a problem, you think to yourself "can this problem be solved with Dynamic Programming?" and try it.
An AVL tree (named after inventors Adelson-Velsky and Landis) is a self-balancing binary search tree.
Before we delve into AVL trees, we need to learn a few things.
The height of a tree is the length of the longest path from the root node (the top node) to some node. In the image above, the tree has a height of 3.
This tree is called balanced. Balance is an important topic in AVL trees. First, let's look at why balance is important: searching a balanced tree takes $\Theta(\log n)$ time, whereas searching an unbalanced tree can take $\Theta(n)$ time.
This tree isn't balanced - it's a sad tree 😢. Formally, balance is defined as:
The tree is considered balanced if its height is $\Theta(\log n)$.
This sucks as a definition. So, every time you make a tree, you have to check this weird property? Balance can be explained better.
Balance is where each node has one left and one right child. In the above example, no nodes had a left child so it is unbalanced.
The depth of a node is a distance from it to the root. The height of a node is the length of the longest path from the node to a leaf.
Here is a tree with heights of each node listed next to them.
Each leaf has its own pointers, both pointing towards NULL (with a height of -1).
We often do not include the null pointers in diagrams, but they're important as they make our lives easier later on.
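A minimal sketch of computing heights this way, where a null pointer has height -1 as in the diagrams (the `Node` class is an assumption for illustration):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    # A null pointer has height -1, so a leaf ends up with height 0
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

root = Node(2, Node(1), Node(4, Node(3)))
print(height(root))  # 2 — the longest root-to-leaf path is 2 edges
```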
AVL trees require the heights of the left and right children of every node to differ by at most 1. In fact, anything in the range $-1 \le x \le 1$ is fine. Preferably, we'd like them to differ by 0 - perfectly matched on left and right - but this isn't realistic, and is impossible for an odd number of nodes in a tree. If R is the height of the right child and S the height of the left child, we need:
$$R - S \le 1$$
Now, we know we can't make the difference 0 for every node, and it's tiresome to calculate $R - S$ for every single pair of children. So we just make sure the right is at most 1 higher than the left. A very short proof of this working is:
$$left = n$$
$$right = n + 1$$
$$H = right - left = (n + 1) - n = 1$$
Let's start out with this tree. If you look at the root node, the right side has height 2. The left side has height 3. 2 - 3 = -1, which is fine.
Now, let's insert 23.
Right is 2, left is 4. 2 - 4 = -2. Uh oh! It's a sad tree now because it's not balanced. We can turn this tree into a balanced tree by rotating it.
Our tree is right heavy. We want to rotate it left. We can do it like this:
Rotations take O(1) time.
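A left rotation is just a few pointer swaps, which is why it takes constant time. Here's a sketch (the `Node` class and the 23/26/29 example tree are assumptions for illustration, mirroring the figures):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_left(x):
    """Left rotation around x; returns the new subtree root. O(1) pointer swaps."""
    y = x.right
    x.right = y.left   # y's left subtree is re-parented under x
    y.left = x         # x becomes y's left child
    return y

# 23 -> 26 -> 29 is right-heavy; after rotating left, 26 sits on top
root = rotate_left(Node(23, right=Node(26, right=Node(29))))
print(root.key, root.left.key, root.right.key)  # 26 23 29
```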
We rotate the tree we had above, now it is a happy tree. A colleague told me that 26 is between 23 and 29, therefore 26 goes on top. While this isn't official advice in any textbooks, it seems to work for most of the problems I'm going to present in this article.
We insert(55). Now it's a sad tree again, what a disaster! What we have here is a zig-zag:
If we rotate the zig-zag we get:
That's... no better than before. Let's quickly rotate our tree to match this zig-zag pattern:
Now, the solution to the zig-zag pattern is to rotate again. A double rotation. 55 is in between 50 and 65, so it should go on top.
Now it's a happy tree!
Now we understand rotations, let's take a look at AVL insertions one more time.
Suppose x is the lowest node in the now non-AVL tree that violates the balance property, and assume right(x) is the higher (heavier) side.
AVL deletion is the same as insertion, except instead of inserting we delete; the same two rebalancing principles apply.
Diffie-Hellman-Merkle is a way to share a secret key with someone (or something) without actually sending them the key. Before we look into how we share keys let's first look into what keys are and why we would want to invent a method to share keys without giving the other person the key.
Your front door is usually locked by a key. This key unlocks & locks your front door. You have one key which you use to unlock and lock things.
Only people with the key or a copy of the key can unlock the door. Now, imagine you're going to be on holiday Friday, Saturday, Sunday in Bali. You want to invite your friend around to look after your cat while you're on the beautiful beaches.
Your only friend is unfortunately on holiday Wednesday, Thursday and Friday. They get back right as you leave for holiday. You can't be there to physically give them the key, but hiding the key under a rock outside your door seems insecure. Anyone could lift up that rock and find the key, but you just want your friend to have the key.
This is where Diffie-Hellman comes in. Well, with Diffie-Hellman you're not exchanging physical keys but rather digital keys. Let's explore some basic cryptography to understand why digital key exchange sucks just as much as real life key exchange.
Julius Caesar used a cipher to send messages that no one else could read other than the intended recipient. Mainly because no one could read back in 100 BC, and those that could wouldn't understand a random string of letters. That's the whole point of cryptography. To create ways to communicate without third parties understanding the message. This cipher is Caesar's Cipher. Given an alphabet and a key (the key is an integer between 1 and 25), shift all of the alphabet letters by key.
With a shift of 3, as seen in the image above, A becomes D, B becomes E and so on until it wraps around with X = A. The original message is called the plaintext and the encrypted message is called the ciphertext.
The easiest way to perform Caesar's Cipher is to turn all of the letters into numbers, a = 1, b = 2, c = 3 and so on.
To encrypt, E, you calculate this for every letter (where s is the shift):
$$E_{s}(letter) = letter + s$$
To decrypt Caesar's cipher, D, you calculate this for every letter:
$$D_{s}(letter) = letter - s$$
Something important to note is that this version of the cipher doesn't support wraparound (for brevity).
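Here's a sketch of Caesar's cipher in Python. Unlike the formulas above, this version does include the wraparound, by working mod 26:

```python
def caesar(text, s):
    # Shift each letter by s, wrapping around the alphabet (mod 26).
    # Non-letters (spaces, punctuation) are passed through unchanged.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + s) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

ciphertext = caesar("attack at dawn", 3)
print(ciphertext)               # dwwdfn dw gdzq
print(caesar(ciphertext, -3))   # decryption is just shifting back: attack at dawn
```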
As you can tell, it's not very secure. With 25 total shifts you just have to shift the text 25 times until you find the decrypted code, this is called a brute force attack. You take the encrypted text and shift it all 25 times until you find the decrypted text. But let's imagine for a second that this was a hard cipher - that brute force isn't feasible.
The shift is the key of Caesar's cipher. But the problem persists.
How do you tell your friend you're using a shift of 9, for example? You have to communicate it to them somehow. Any and all forms of communication can be listened in on - whether that's writing a letter or going to a hidden forest in Switzerland 30 miles from the nearest town and telling your friend.
The problem becomes even more apparent when you realise that communicating parties over the internet normally have no prior knowledge about each other and are thousands of miles apart. This is where the magic of Diffie-Hellman-Merkle key exchange comes in.
Diffie-Hellman is a way to securely exchange keys in public. It was conceptualised by Ralph Merkle, and named after Whitfield Diffie and Martin Hellman. I have chosen to include Merkle's name in the title because he put in just as much work as Diffie and Hellman, yet his name rarely appears when this algorithm is talked about.
U.S. Patent 4,200,770, from 1977, is now expired and describes the now-public-domain algorithm. It credits Hellman, Diffie, and Merkle as inventors.
Let's go through how this algorithm works.
For this algorithm, we will also walk through the colour mixing method for explaining how it works.
1. Alice and Bob publicly agree to use a modulus p = 23 and a base g = 5 (which is a primitive root modulo 23, explained later). Modulus is just the remainder of a division. Note: this example comes from Wikipedia.
It's hard to describe the paint method in text, so if you want to know about this method I suggest watching this video:
We'll colour G yellow. We have 2 copies of G (yellow) as seen above.
When Alice and Bob agree on these numbers, Eve knows they are using these numbers.
2. Alice needs to calculate a private key.
She does this by picking a secret number (a). She computes $g^a \; mod \; p$ and sends that result to Bob.
Alice chooses a secret, random integer a = 4.
Alice computes $A = 5^4 \; mod \; 23 = 4$ and sends the number 4 to Bob.
She colours this private key reddish-brown.
Eve doesn't know Alice's secret number is 4, only that the result of this equation is 4. It's not feasible for Eve to work out Alice's secret number from that result.
3. Bob makes his own private key. Its colour is dark green.
He calculates this by picking a secret number (b) and computing $g^b \; mod \; p$. He then sends the result to Alice. Bob creates a random private key, for this example we'll use 3.
Then Bob calculates $B = 5^3 \; mod \; 23 = 10$ and sends 10 to Alice.
4. Now Bob takes the number Alice sent him (A) and computes $A^b \; mod \; p$.
In the colour analogy, this is taking Alice's paint colour and adding it to Bob's paint colour.
Bob computes $s = 4^3 \; mod \; 23 = 18$
Bob doesn't send this to Alice.
5. Alice computes $B^a \; mod \; p$.
In the paint analogy, this is Alice adding Bob's paint (that Bob sent her) to her paint.
Alice calculates $s = 10^4 \; mod \; 23 = 18$
The magic is that Alice and Bob now have the same number, or the same paint colour.
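The whole exchange, with the numbers from this example, fits in a few lines of Python (the built-in three-argument `pow(g, a, p)` computes $g^a \; mod \; p$ efficiently):

```python
p, g = 23, 5        # public: modulus and base, agreed in the open
a, b = 4, 3         # private: Alice's and Bob's secret numbers

A = pow(g, a, p)    # Alice sends 4 to Bob
B = pow(g, b, p)    # Bob sends 10 to Alice

s_alice = pow(B, a, p)  # Alice combines Bob's number with her secret
s_bob = pow(A, b, p)    # Bob combines Alice's number with his secret
print(A, B, s_alice, s_bob)  # 4 10 18 18 — the shared secret matches
```

Eve sees p, g, A, and B, but never a, b, or the shared 18.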
Let's discuss in detail the mathematics behind this cool algorithm.
Diffie-Hellman-Merkle works because of a cool modulus exponent principle. First, let's explain what modulus is before we try to understand this principle.
Imagine a finite range of numbers, for example, 1 to 12. These numbers are arranged in a circle, much like a clock (modular arithmetic is sometimes called clock arithmetic because of this).
Count 13 around this clock. You get to 12 and then you need to count 1 more - so you go back to 1. Modular arithmetic is still defined as the remainder of division, however it can also be defined (and is more commonly defined) as a clock.
Functions using modular arithmetic tend to perform erratically, which in turn sometimes makes them one-way functions. Let's see this with an example by taking a regular function and seeing how it works when it becomes a modular arithmetic function.
$$3^x$$
When we insert 2 into this function, we get $3^2 = 9$. Insert 3 and we get $3^3 = 27$.
This function is easy to reverse. If we're given 27, we can tell that the function had an input of 3, because $3^3 = 27$.
However, with modular arithmetic added, it doesn't behave sensibly.
Imagine we had this formula:
$$ 3^x \; mod \; 7 = 1$$
How would you find out what x is? You can't put the mod on the other side, because there isn't really an inverse of modular arithmetic. What about guessing? Let's input 5:
$$ 3^5 \; mod \; 7 = 5$$
Okay, that was too big. You might want to go lower, maybe 4 or 3 but actually this is the wrong direction. When x is 6, it is equal to 1.
In normal arithmetic, we can test numbers and get a feel for whether we are getting warmer or colder, but this isn't the case with modular arithmetic.
Often the easiest way to reverse modular arithmetic is to compile a table for all values of x until the right answer is found. Although this may work for smaller numbers, it is computationally infeasible to do for much larger numbers. This is often why modular arithmetic is known as a one-way function.
If I gave you a number such as 5787 and told you to find the function for it, it would be infeasible. In fact, if I gave you the ability to input any number into the function it would still be hard. It took me a mere few seconds to make this function, but it'll take you hours or maybe even days to work out what x is.
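Compiling that table is easy to sketch in Python. It is a brute-force discrete logarithm: a toy, not an efficient algorithm, and the loop below is exactly why large moduli make reversal infeasible.

```python
def brute_force_dlog(base, target, mod):
    """Find the smallest positive x with base^x mod mod == target, by trying every x."""
    for x in range(1, mod):
        if pow(base, x, mod) == target:
            return x
    return None  # target is never hit

print(brute_force_dlog(3, 1, 7))  # 6, matching the example above
print(brute_force_dlog(3, 5, 7))  # 5
```

For a 2048-bit modulus, this loop would have to check on the order of $2^{2048}$ values, which is hopeless.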
Diffie-Hellman-Merkle relies on a one-way function. While it is relatively easy to carry out the exponentiation, it is computationally infeasible to reverse it and recover the secret exponents. (RSA, by contrast, has a trap door: its function can be reversed if you know certain numbers, such as the factors of N.)
A primitive root of a prime number, p, is a number, a, such that all the numbers:
$$a \; mod \; p, \; a^2 \; mod \; p, \; a^3 \; mod \; p, \; a^4 \; mod \; p, \; ...$$
are different. There is a formula for counting what the indices are, but I think it's far more intuitive to think "the second one is to the power of 2, the third one is to the power of 3" and so on.
Let's see an example where $p = 7$. Let's set $a_1 = 2$ and $a_2 = 3$.
$$2^0 = 1 \; (mod \; 7) = 1$$
$$2^1 = 2 \; (mod \; 7) = 2$$
$$2^2 = 4 \; (mod \; 7) = 4$$
$$2^3 = 8 \; (mod \; 7) = 1$$
Uh oh! $2^0$ is the same as $2^3$. This means that 2 is not a primitive root of 7. Let's try again with 3.
$$3^0 = 1 \; (mod \; 7) = 1$$
$$3^1 = 3 \; (mod \; 7) = 3$$
$$3^2 = 9 \; (mod \; 7) = 2$$
$$3^3 = 27 \; (mod \; 7) = 6$$
$$3^4 = 81 \; (mod \; 7) = 4$$
$$3^5 = 243 \; (mod \; 7) = 5$$
$$3^6 = 729 \; (mod \; 7) = 1$$
Writing just the residues, the powers of 3 modulo 7 are:
$$3^0 = 1$$
$$3^1 = 3$$
$$3^2 = 2$$
$$3^3 = 6$$
$$3^4 = 4$$
$$3^5 = 5$$
$$3^6 = 1$$
Now we've reached a cycle in these powers.
$3^6 = 1$, just like $3^0 = 1$. Because we are working modulo 7, the powers repeat from here, so we can stop. Unlike before, where $2^3$ cycled after only three steps, it's fine that the cycle appears here: for any prime number p and any number a such that $a \ne 0 \; mod \; p$ and $a \ne 1 \; mod \; p$, the consecutive powers of $a$ can cover at most p - 1 values modulo p. That is, they range over $1, ..., p - 1$. When p is 7, the consecutive powers cover at most 6 values, and 3 achieves all of them.
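Checking whether a number is a primitive root follows directly from the definition; this sketch just counts the distinct powers:

```python
def is_primitive_root(a, p):
    # a is a primitive root of prime p if a^1 ... a^(p-1)
    # hit all p - 1 nonzero residues modulo p
    return len({pow(a, k, p) for k in range(1, p)}) == p - 1

print(is_primitive_root(2, 7))  # False: powers of 2 repeat after 2^3
print(is_primitive_root(3, 7))  # True: powers of 3 cover 1..6
```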
$$a^b = c \; mod \; n$$
We can rewrite an equation like this using logarithms:
$$log_a c = b \; mod \; n$$
Logarithms are the inverse of exponentiation; we've just inverted the equation here.
Since this is now a well-defined function, we can say in discrete terms that $log_3 5 = 5 \; (mod \; 7)$ (looking at the table above).
If you use a non-primitive root, reversal becomes easier, as there is a smaller number of possible outcomes (because the powers repeat earlier), as seen below.
$$2^0 = 1 \; (mod \; 7) = 1$$
$$2^1 = 2 \; (mod \; 7) = 2$$
$$2^2 = 4 \; (mod \; 7) = 4$$
$$2^3 = 8 \; (mod \; 7) = 1$$
By using a primitive root, we get a much larger set of possible outcomes, which makes reversing it harder.
$$3^0 = 1 \; (mod \; 7) = 1$$
$$3^1 = 3 \; (mod \; 7) = 3$$
$$3^2 = 9 \; (mod \; 7) = 2$$
$$3^3 = 27 \; (mod \; 7) = 6$$
$$3^4 = 81 \; (mod \; 7) = 4$$
$$3^5 = 243 \; (mod \; 7) = 5$$
$$3^6 = 729 \; (mod \; 7) = 1$$
It is relatively easy to calculate exponentials modulo a prime: that is, given $a$, $i$, and $p$, calculate $a^i \; mod \; p$.
Exponentiation is a cheap operation; you can do it even for very large numbers. The logarithm is a much more difficult function to calculate for large numbers.
Exponentiation goes from left to right in the table below: given the exponent, you compute the answer.
$$3^0 = 1$$
$$3^1 = 3$$
$$3^2 = 2$$
$$3^3 = 6$$
$$3^4 = 4$$
$$3^5 = 5$$
The logarithm is how you go back, from right to left: given the answer, find the exponent. Logarithms are much harder to compute than exponentiation.
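The reason exponentiation stays cheap even for enormous numbers is the square-and-multiply method, which needs only about log2(exponent) multiplications. Python's built-in three-argument `pow` already does this; the sketch below shows the idea:

```python
def modexp(base, exponent, mod):
    # square-and-multiply: process the exponent bit by bit
    result = 1
    base %= mod
    while exponent:
        if exponent & 1:              # this bit is set: multiply it in
            result = result * base % mod
        base = base * base % mod      # square for the next bit
        exponent >>= 1
    return result

print(modexp(3, 5, 7))  # 5, same as the table above
```

A 2048-bit exponent needs only around 2048 squarings this way, while the brute-force logarithm would need astronomically many trials.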
Let's go back to seeing how Diffie-Hellman worked, but this time with a lot more knowledge of how the mathematics works.
We have 2 people, Alice and Bob. They agree in advance on a prime number q (publicly known) and a primitive root $\alpha$ of q (also publicly known).
We randomly select the prime number 7 with the primitive root 3.
Suppose Alice and Bob wish to exchange a key. They do the following:
$$3^0 = 1$$
$$3^1 = 3$$
$$3^2 = 2$$
$$3^3 = 6$$
$$3^4 = 4$$
$$3^5 = 5$$
and they each choose one of the exponents at random and keep it secret. Now Alice computes $y_a = \alpha^{x_a} \; mod \; q$ and sends it to Bob.
This example isn't very impressive (sometimes, as with $3^5 = 5$, the output even equals the input), but with much larger numbers everything changes. The idea is close to RSA encryption, though not quite the same: this is key exchange, not encryption.
Bob then does the same as Alice. Both Alice and Bob are now capable of calculating the shared key.
Alice calculates $k = (y_b)^{x_a} \; mod \; q$
Bob calculates $k = (y_a)^{x_b} \; mod \; q$
Now they have the same numbers, k is the common secret key.
$$(\alpha ^ {x_b})^{x_a} = (\alpha ^ {x_a})^{x_b}$$
This equation above is what makes it all work. Alice computes one side of the equation, Bob computes the other, and they end up with the same number.
This really is the equation that puts it all together. Most of this blog post led up to this equation.
$x_a$ and $x_b$ are secret, and without these numbers there is no easy way to repeat these computations, because to do so you need to know the secrets.
The above formula shows that the two methods are exactly equal. If you do the left equation, you get the same result as the right equation.
Diffie-Hellman-Merkle is a fascinating way of sharing a secret over an unsecured communications medium, by not sharing it at all over that medium.
If you want to learn about the different attacks that have been attempted on Diffie-Hellman, and one attack which might be a very big problem, sign up to my email list below and I'll send you the PDF.
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time, so any and all donations are appreciated.
Public key cryptography seems magical to everyone, even those who understand it. In this post, I'm going to explain public key cryptography. Public key cryptography is based on asymmetric cryptography, so first let us talk about symmetric cryptography.
Your front door is usually locked by a key. This key unlocks & locks your front door. With symmetric cryptography, you have one key which you use to unlock and lock things.
Only people with the key or a copy of the key can unlock the door. Now, imagine you're on holiday in Bali. You want to invite your friend around to look after your cat while you're on the beautiful beaches.
Before the holiday, you give your friend the key to your door. Your friend is then robbed, so someone else has your front door key now. Or your friend leaves it laying around and someone clones it. The problem with symmetric key cryptography is that this one key is easy to clone, which leaves your house open to attack in many different ways.
Let's take this from an analogy to a real-life example of symmetric cryptography.
Julius Caesar used a cipher to send messages that no one other than the intended recipient could read. Mainly because hardly anyone could read back in 100 BC, and those who could wouldn't understand a random string of letters. That's the whole point of cryptography: to create ways to communicate without third parties listening in. This cipher is Caesar's cipher. Given an alphabet and a key (an integer between 1 and 25), shift all of the letters of the alphabet by the key.
With a shift of 3, as seen in the image above, A becomes D, B becomes E and so on until it wraps around with X = A. The original message is called the plaintext and the encrypted message is called the ciphertext.
The easiest way to perform Caesar's Cipher is to turn all of the letters into numbers, a = 1, b = 2, c = 3 and so on.
To encrypt, E, you calculate this for every letter (where s is the shift):
$$ E_{s}(letter) = (letter + s) \; mod \; 26$$
This is called a function. You put an input into it, and an output comes out. A lot of functions are known as two-way functions. You can encrypt by using the function above, and it makes sense that to decrypt you just do the opposite. Given a function that doubles a number, if you have a doubled number and you want to reverse the function do the opposite of multiplying by 2, divide the number by 2.
mod is the modulus operator: the remainder after division. We use the modulus because there isn't a 27th letter in the alphabet; you just wrap around from "z" back to "a". We'll talk more about modular arithmetic later on in this article. Look at this small example below:
$$ 4 \; mod \; 3 = 1 $$
Because 4 divided by 3 has a remainder of 1.
To decrypt Caesar's cipher, D, you calculate this for every letter:
$$ D_{s}(letter) = (letter - s) \; mod \; 26$$
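Those two formulas translate almost directly into code. A minimal sketch (letters only, case preserved, anything else passed through unchanged; internally it numbers letters from 0, which is the same formula up to relabeling):

```python
def caesar(text, shift):
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            # (letter + shift) mod 26, with letters numbered from 0
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

print(caesar("cats", 3))   # "fdwv"
print(caesar("fdwv", -3))  # "cats"
```

Decryption is just encryption with the negative shift, mirroring $D_s$ above.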
As you can tell, it's not very secure. With only 25 possible shifts, you just shift the encrypted text 25 times until you find the decrypted text; this is called a brute force attack. But let's imagine for a second that this was a hard cipher, and brute force isn't feasible.
How do you tell your friend you're using a shift of 9, for example? You have to communicate it to them somehow. Any and all forms of communication can be listened in on - whether that's writing a letter or going to a hidden forest in Switzerland 30 miles from the nearest town and telling your friend.
The latter isn't very feasible, but it is a lot more secure than telling your friend in Times Square, New York, what the shift is.
Now, imagine you brought your lunch to work in a special lunchbox - the same you've had since nursery school. Someone steals your food and your lunchbox. You don't mind losing the food, but you do want the lunchbox back. You want a way for them to securely return your lunchbox without you knowing who took it - because that takes the pressure off of them.
You place a box in the staff room with a lock & key. You give copies of keys to everyone in the office and hope for the best - that someone will return the lunchbox by placing it in the box.
Unfortunately, the keys everyone has also unlocks the box as well as locks it. This means that someone could unlock the box and re-steal your lunchbox.
We need to find a way to get rid of this idea of sharing keys, get rid of the idea of 'any key can lock and unlock', and this is where asymmetric cryptography comes in.
You install an extraordinary lock on this box, one that has two separate keys. The first key can only turn clockwise, from A (locked) to B (unlocked) to C (locked).
The second key can only turn anti-clockwise, from C to B to A. You pick the first key and keep it to yourself. This is called a private key. The second key is called the public key. This key is given out to everyone in the office. You want everyone to have this key.
When someone returns your prized lunchbox, they can leave it in this box. All the public keys can do is lock the box. Your private key is the only one that can open it.
This is public key cryptography. Everyone knows that if they put something in the box and lock it, only you can open it with your private key.
With symmetric cryptography, everyone could open your box if they had the key. Now, no one apart from you can open the box.
Public key cryptography was first formulated by Whitfield Diffie or James Ellis (Ellis discovered it first, but he didn't publish it; Diffie published first). Both Ellis and Diffie saw that public key cryptography could work in theory, but neither initially managed to figure out how it would work in practice.
The production of a working public key encryption system is attributed to Rivest-Shamir-Adleman (RSA) or Clifford Cocks. Like above, Cocks discovered it first, but he didn't publish it. Rivest, Shamir, and Adleman published first.
Let's look at how this works mathematically.
While the box analogy was something physical, we're going to go back to encrypting messages much like we did with Caeser Cipher.
$$ K^-_b(K^+_b(m)) = m$$
When you apply the public key ($K^+$) to the message, and then the private key ($K^-$) to the result, you get the plaintext message back. We are also looking for these attributes:
It is computationally easy to:
Generate a pair of keys.
Encrypt a message using the public key.
Decrypt a message using the private key.
But it is also computationally infeasible to:
Work out the private key from the public key.
Recover the message from the ciphertext without the private key.
We want to turn a message into numbers. Previously we assigned a number to each letter, A = 1 and so on. The American Standard Code for Information Interchange (ASCII) is a table of all English letters and most symbols along with their associated ASCII code & Binary output.
When you press a key on the keyboard, the keyboard converts this to ASCII, as numbers are easier for a computer to work with than letters. If you want to learn more about ASCII, check out this video.
Let's encrypt the word "cats". In binary, according to ASCII, this is:
$$c = 01100011$$
$$a = 01100001$$
$$t=01110100$$
$$s = 01110011$$
If you take the decimal ASCII value of each letter (99, 97, 116, 115) and add them all together, you get 427. For our example, we're going to look at how Rivest-Shamir-Adleman (RSA), a public key cipher, calculates public & private keys. We're also going to use much smaller numbers, so the maths isn't as hard to read.
Below is a calculator I created for turning ASCII into binary.
# https://stackoverflow.com/a/40949538
def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

# edit the text below to see how this works
text = 'cats'
bits = string2bits(text)
for x in bits:
    print(x)
1. Pick two prime numbers, p and q. Prime numbers are numbers that only have 2 factors: 1 and themselves. We're going to pick 5 & 7; not large primes, but small for brevity.
2. Compute n = pq, z = (p-1)(q-1)
$$n = 5 * 7 = 35$$
$$z = (5 - 1)(7-1) = 4 * 6 = 24$$
3. Choose e (with e < z) such that e has no common factors with z.
$$ e = 5$$
5 has no common factors with 24, and it is smaller than 24.
4. Choose d such that $ed - 1$ is exactly divisible by z.
The easiest way to do this is to loop over possible values of d in code. The code below is written in Python, but the language and paradigm don't matter.
Since we're using such small numbers, we get an overlap: both e and d come out as 5. Let's set d to 29 instead ($5 \times 29 - 1 = 144$, which is exactly divisible by 24), just so we don't have this overlap.
$$ d = 29 $$
5. The public key is (n, e). The private key is (n, d)
$$key_{public} = (35, 5)$$
$$key_{private} = (35, 29)$$
Below is code to generate RSA keys. Note that we have overlap on d with p = 5 and q = 7, as discussed above.
from math import gcd

def rsa(p, q):
    n = p * q
    z = (p - 1) * (q - 1)
    # choose e such that 1 < e < z and e has no common factors with z
    e = next(i for i in range(2, z) if gcd(i, z) == 1)
    # choose d such that (e * d) - 1 is exactly divisible by z
    d = next(x for x in range(2, z * z) if (x * e - 1) % z == 0)
    return {"Public key": [n, e], "Private key": [n, d]}

# change p and q here to any prime numbers of your choice
p = 5
q = 7
print(rsa(p, q))
To send an encrypted message, Bob computes $c = m^e \; mod \; n$ for message m and public key (n, e). To decrypt the message, Alice computes $m = c^d \; mod \; n$ with her private key (n, d).
Encrypting "cats" (m = 427) gives us $c = 427^5 \; mod \; 35 = 7$.
Decrypting gives us $7^{29} \; mod \; 35 = 7$. Notice we get back 7, not 427: because our toy n = 35 is smaller than the message, only $m \; mod \; n$ survives. Real RSA uses an n far larger than the message, or splits the message into blocks smaller than n.
Prime factorisation is what makes RSA hard to reverse. It's easy to multiply two prime numbers together, but incredibly hard to find out which primes were multiplied to make a given number. You can easily multiply these two together:
$$ 23,719 * 41,843 = 992,474,117$$
But if I gave you 992,474,117 and told you to find the prime numbers used to make it, it would take you a long time. With the very, very large primes used in practice, it becomes computationally infeasible.
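Trial division recovers those factors quickly at this size, which is exactly the point: the work grows with the square root of n, so doubling the number of digits in the primes makes the search astronomically longer. A sketch:

```python
def smallest_prime_factor(n):
    # try every candidate divisor up to sqrt(n)
    i = 2
    while i * i <= n:
        if n % i == 0:
            return i
        i += 1
    return n  # n itself is prime

p = smallest_prime_factor(992474117)
print(p, 992474117 // p)  # 23719 41843
```

This loop is instant for a 9-digit number, but for a 617-digit RSA modulus it would need more steps than there are atoms in the observable universe.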
This is known as a trap-door function or a one-way function. While it is easy to go one way, it is computationally infeasible to go the other way. Boiling an egg is a one-way function: it is easy to boil an egg, but not possible to un-boil one. Let's go deeper into the mathematics and explore modular arithmetic.
Imagine a finite range of numbers, for example, 1 to 12. These numbers are arranged in a circle, much like a clock (modular arithmetic is sometimes called clock arithmetic because of this).
Count 13 around this clock. You get to 12 and then you need to count 1 more - so you go back to 1. Modular arithmetic is still defined as the remainder of division, however it can also be defined (and is more commonly defined) as a clock.
Functions using modular arithmetic tend to perform erratically, which in turn sometimes makes them one-way functions. Let's see this with an example by taking a regular function and seeing how it works when it becomes a modular arithmetic function.
$$3^x$$
When we insert 2 into this function, we get $3^2 = 9$. Insert 3 and we get $3^3 = 27$.
This function is easy to reverse. If we're given 27, we can tell that the function had an input of 3, because $3^3 = 27$.
However, with modular arithmetic added, it doesn't behave sensibly.
Imagine we had this formula:
$$ 3^x \; mod \; 7 = 1$$
How would you find out what x is? You can't put the mod on the other side, because there isn't really an inverse of modular arithmetic. What about guessing? Let's input 5:
$$ 3^5 \; mod \; 7 = 5$$
Okay, that was too big. You might want to go lower, maybe 4 or 3 but actually this is the wrong direction. When x is 6, it is equal to 1.
In normal arithmetic, we can test numbers and get a feel for whether we are getting warmer or colder, but this isn't the case with modular arithmetic.
Often the easiest way to reverse modular arithmetic is to compile a table for all values of x until the right answer is found. Although this may work for smaller numbers, it is computationally infeasible to do for much larger numbers. This is often why modular arithmetic is known as a one-way function.
If I gave you a number such as 5787 and told you to find the function for it, it would be infeasible. In fact, if I gave you the ability to input any number into the function it would still be hard. It took me a mere few seconds to make this function, but it'll take you hours or maybe even days to work out what x is.
RSA is a one-way function. While it is relatively easy to carry out this function, it is computationally infeasible to do the reverse of the function and find out what the keys are. Although, it is possible to reverse an RSA encryption if you know some numbers such as N.
Recall earlier where I detailed how the RSA algorithm works. There was one number, $n$. This $n$ is special because, for whoever knows how it was made, it makes the one-way function reversible.
$n$ is a product of 2 prime numbers. As we saw earlier, if we take $5$ and $7$ and multiply them together, we get:
$$n = 35$$
In order for Bob to send Alice a message, he encrypts the message using Alice's public key. Now that the message is encrypted, there has to be some way for Alice to decrypt it, but only for Alice. You can't have Eve or Niamh or Hannah reversing it, because that defeats the point of encrypting it.
RSA is designed so the person who knows P and Q (the two prime numbers that are multiplied together to give N) can decrypt the message.
Although Alice has told the world her public key is $n = 35$, no one apart from Alice knows that $p = 5, q = 7$. Note that the prime numbers are intentionally small for brevity.
You may be thinking "it's easy to guess that 35's prime factors are 5 and 7" and you would be right. But, with large enough numbers it is virtually impossible to find $p$ and $q$.
In fact, with large enough numbers, multiplying p and q is essentially a one-way function. It is possible that in the future, perhaps the near future (with the invention of quantum computers), factoring numbers will become easy. Mathematicians have tried and failed for thousands of years to find an efficient way to factor numbers, so for now it is considered secure.
Okay, let's look at how modulus works in all of this. You understand why multiplication works, and how modulus works. But what about the other equations? What are they for?
Let's demonstrate the deciphering algorithm using an identity due to Euler and Fermat:
For any integer M that is relatively prime to n:
$$M^{\phi(n)} \equiv 1 \; (mod \; n)$$
Here, $\phi (n)$ is the Euler totient function: the number of positive integers less than n which are relatively prime to n. Relatively prime means two numbers share only the factor $1$ with each other. Modern implementations use Carmichael's function rather than Euler's, as Euler's function can produce larger numbers than necessary. However, we're using Euler's totient function, as it is what the original RSA paper used.
This sounds confusing, but let's break it down. By elementary properties of the totient function:
$$\phi (n) = \phi (p) * \phi (q)$$
$$ = (p - 1) * (q - 1)$$
$$ = n - (p + q) + 1$$
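These identities are easy to verify numerically for our toy n = 35. The count below applies the definition directly, using `math.gcd` to test relative primality:

```python
from math import gcd

p, q = 5, 7
n = p * q

# phi(n) by brute force: count 1 <= k < n with gcd(k, n) == 1
phi = sum(1 for k in range(1, n) if gcd(k, n) == 1)

print(phi)                # 24
print((p - 1) * (q - 1))  # 24
print(n - (p + q) + 1)    # 24
```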
Since d is relatively prime to $\phi (n)$, it has a multiplicative inverse $e$ in the ring of integers modulo $\phi (n)$. What this means is that the formula used in RSA can be reversed (the trap door can be opened) given some knowledge of the numbers used.
Without this special mathematical property, it wouldn't be possible to reverse the encryption and recover the plaintext, even knowing some of the numbers used.
The modular multiplicative inverse of the encryption algorithm $c = m^e \; mod \; n$ is $m = c^d \; mod \; n$. All of this maths has built up to this. Modular arithmetic and one-way functions are heavily involved here. In order to encrypt, you calculate c. In order to decrypt, you calculate m. Both of these require knowledge of $n$, which is the special number we talked about earlier.
If you want to learn more about the maths of RSA, I highly recommend the readable, original RSA paper.
How do you prove that a message sent by Bob was actually sent by Bob, and not by Eve? You need a way to authenticate him. In the real world, we authenticate using signatures, although these can be forged. You can authenticate using a biometric scanner, but fingerprints can be lifted and copied.
You can use a passcode, but again, much like Caesar's cipher and its single key, authentication methods that use single keys are flawed.
$$ K^-_b(K^+_b(m)) = m = K^+_b(K^-_b(m)) $$
Let's say Bob wants to prove to Alice that he wrote the message he sent her. Bob sends his original message along with a version of the message encrypted with his private key ($K^-$). Alice applies Bob's public key ($K^+$), which, by the formula above, turns the encrypted message back into the original. Then Alice compares the message Bob sent with the one she got from decrypting. If they match, she can be sure that someone with Bob's private key (probably Bob) sent it.
This method is poor for encryption, because if Bob encrypts his message with his private key, anyone can read it with his public key. It's also computationally expensive to sign an entire message. This is why we create a digest of the message and encrypt that instead to verify Bob. This digest is made using a hash function.
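The sign-then-verify flow can be sketched with our toy keys. The "hash" here is a stand-in (a sum of character codes mod n), purely for illustration, not a real hash function:

```python
n, e, d = 35, 5, 29  # toy RSA keys from earlier

def toy_digest(message):
    # stand-in for a real hash function, just for illustration
    return sum(ord(ch) for ch in message) % n

message = "cats"
signature = pow(toy_digest(message), d, n)  # Bob signs with his PRIVATE key

# Alice verifies with Bob's PUBLIC key: recover the digest and compare
assert pow(signature, e, n) == toy_digest(message)
```

Note the keys are used in the opposite direction from encryption: private to sign, public to verify.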
To learn more about hash functions, I wrote a sister article which explains them here.
By encrypting the hash of the message we speed up the process of encrypting it, which makes authentication a lot faster. Now, let's play a prank on Bob.
We create an e-mail order to a pizza shop asking for 4 pepperoni pizzas. We sign this email with our private key. We send the pizza store our public key, but we tell them that Bob's phone is dead and that our public key is actually Bob's public key.
The pizza store verifies the signature and sends 4 pepperoni pizzas to Bob. The worst part is, Bob doesn't even like pepperoni. This is where a certificate authority comes into play.
Certificate authorities (CA) bind a public key to a specific entity. This entity provides proof of identity to the CA, the CA then creates a certificate binding the entity to its public key. The idea is to take the trust out of trusting an individual for public keys. You still have to trust an organisation, but many people find trusting an organisation is better than trusting an individual.
The certificate containing the entity's public key is digitally signed by the CA. This signature is the CA saying "this is the entity's public key".
When Alice wants Bob's public key, she gets Bob's certificate. She then applies the CA's public key to the signature on Bob's certificate to verify it and extract Bob's public key.
Cloudflare has an amazing article on certificate authorities here.
Phil Zimmermann invented Pretty Good Privacy (PGP), the de facto standard for email encryption. Zimmermann used RSA in PGP. RSA was patented, and he did not have permission from RSA Inc. (the company that held the patent) to publish a cipher using it.
Zimmermann was also the target of a 3-year U.S. federal investigation, because at the time cryptography programs were considered munitions under U.S. law. When asked whether all of the trouble was worth it to publish PGP, he said he had "no regrets". Let's look at how this once-illegal-to-export algorithm works.
When Alice wants to send a confidential email to Bob, she:
Generates a random symmetric key and encrypts the message with it.
Encrypts that symmetric key with Bob's public key.
Signs a hash of the message with her private key.
In total, Alice uses three keys. Her private key, Bob's public key, and the newly created symmetric key.
This idea of encrypting a symmetric key with a public key is called a Hybrid Cryptosystem. Some email messages can be incredibly large, encrypting these with a public key system would take a very long time.
Use a symmetric key system such as AES, which is incredibly hard to break. Encrypt the AES key (and only the key, not the whole email) with the public key. This way, the receiver can apply their private key, recover the AES symmetric key, and decrypt the email.
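The hybrid idea can be sketched with the toy keys from earlier: one cheap public-key operation wraps the symmetric key, and the (arbitrarily long) message is encrypted symmetrically. A single-byte XOR stands in for AES here, purely for illustration:

```python
import secrets

n, e, d = 35, 5, 29             # toy RSA keys from earlier

sym_key = secrets.randbelow(n)  # random symmetric key (tiny, to fit the toy n)
wrapped = pow(sym_key, e, n)    # one RSA operation, regardless of message size

message = b"meet me at noon"
ciphertext = bytes(b ^ sym_key for b in message)  # 'AES' stand-in: XOR each byte

# receiver: unwrap the symmetric key with the private key, then decrypt
key = pow(wrapped, d, n)
plaintext = bytes(b ^ key for b in ciphertext)

print(plaintext)  # b'meet me at noon'
```

The message could be megabytes long; the expensive public-key step still only touches the small symmetric key.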
Not many people use PGP because of how difficult it is to set up. At a minimum, you need to download a program you trust to correctly implement PGP. In 2018 it was shown that email clients such as Apple Mail, Thunderbird, and Outlook, which have settings to enable PGP, could be tricked into revealing the non-encrypted versions of encrypted messages.
Not to mention how suspicious it looks for one person to send encrypted emails on a network of non-encrypted emails. The only email client (and address provider) which enables PGP by default is ProtonMail, but even then it's only for Proton-to-Proton emails and you have to trust the company to implement it correctly.
Most of them do a good job, but we understand your point. We built ProtonMail to make PGP encryption accessible to non-technical people. We will make sure this goal is reached 100%. ;) Thanks again!
— ProtonMail (@ProtonMail) January 21, 2018
Cryptography has been used for thousands of years, almost as long as mankind has held secrets. In our constant effort to keep our secrets hidden from all but a select few, we've found this magical algorithm that works pretty well. No doubt in 300 or 400 years it will have been broken, much like how Caesar thought his cipher would never be broken.
Hey! Want to subscribe to my blog and stay up to date with posts similar to this one? Subscribe to my email list below. I won't spam you. I will only send you posts similar to this one.
If you're feeling extra generous, I have a PayPal and even a Patreon. I'm a university student who writes these articles in my spare time, so any and all donations are appreciated.