- Andrew Jones
Password Cracking with a Genetic Algorithm (Part 1)
What are Genetic Algorithms?
Genetic Algorithms (GA) are a really interesting approach to solving optimisation problems. They're built around the same evolutionary mechanisms we consider in Darwin's “Survival of the Fittest” concept - and can be very effective at quickly evolving strong solutions.
There have been some very cool applications for Genetic Algorithms, the one I like the most is probably this here in which they used a GA in conjunction with a Neural Network to teach a computer to play a level of Super Mario World.
In the world of Machine Learning, they’ve also been used for feature selection, as well as hyper-parameter optimisation. They've also been applied to the famous ‘Travelling Salesman’ problem.
I've been interested in GA’s for some time, but haven't had the chance to really get my hands dirty. I found a few examples online but wanted to code one up on my own to really try and understand how they work.
I've gone for a fairly simple application - cracking a numeric passcode. I use the term 'cracking' loosely as each attempt at solving the passcode will be compared against the actual one and scored based on how many digits are correct. Nonetheless, I felt it was an interesting application that I could at least wrap my wee brain around while getting more comfortable with the algorithm as a whole!
In my example, we’ll start with a passcode of length 10, with each digit in the passcode being an integer between 0 and 9.
An example passcode would be:
[1, 4, 4, 8, 6, 0, 4, 6, 4, 9]
Let’s now run through what we’re going to be dealing with - I've emboldened all the keywords, I'm sure you'll get annoyed by it by the second paragraph...
In the language of GA’s - the passcode itself (or any attempted solution for it) would be called a Chromosome. Each digit in our passcode would be considered a Gene.
A batch of attempted solutions for the passcode would be a Population. The very first Population is made up of randomly created Chromosomes, and subsequent Populations are made from the evolutionary rules that I’ll describe below… Each iteration of a new Population is called a Generation.
For each Chromosome in our Population we’re going to test how well it matches with our actual passcode. Our metric for this is called Fitness (as in, Survival of the Fittest…). In our specific case, each attempted solution will be compared against the actual passcode to see how many digits are correct (and in the correct position).
Once we have Fitness scores for all Chromosomes in the current Population, we decide that we only want to keep a few of the best ones (i.e. the ones with the highest Fitness scores) for use in future Generations.
The ones we decide to keep are known as Parents. These Parents then breed in a process called Crossover to create Children. There are lots of ways this can be done, in our example we’ll take a randomly chosen section of Genes from one Parent and swap them with that same section of Genes from another Parent. Each time we do this we are creating new Children.
These Children form the Population for the next Generation.
Something else we can do in the breeding process called Elitism. This is where we move a small number of the highest scoring Parents to the next Generation, in their current state (i.e. without any breeding). This ensures that some of the best possible solutions persist through Generations.
Another useful step in the process is called Mutation. This helps avoid local convergence by introducing some level of randomness to a select number of solutions. This is often only applied on a probabilistic basis, i.e. only 10% of Children are altered. In our example, for 1 in every 10 Children, we’ll change a randomly selected Gene to a new, randomly selecting integer.
So…the above process is iterated through either N times, or until a certain threshold is met. In our case we’ll iterate through until we have a passcode solution that matches our actual passcode.
Time to run through some code!
Not many packages are required, the last two are really just for measuring how long our solution takes, and plotting progress.
First we need to set up some parameters for our passcode cracking shenanigans:
We’ll choose to set up a 10 digit passcode, and it will contain digits between 0 and 9.
Now we want to set up some parameters for each generation, namely; how many potential solutions do we want to try, how many parents we want to select from each population to breed from, and how many elite chromosomes will be passed through as children without any breeding.
Now we’ve got our parameters all in place, we can generate the secret passcode that we’ll attempt to crack.
Then we have to create our initial population. As we’ve set our population size to 10, this will be 10 randomly generated passcodes of length 10.
Now we can create the necessary functions for the evolutionary stages, namely:
Breeding to create new Children
Here, for each Chromosome (potential solution) in the Population, we’re checking the number of correct digits that appear in the correct position. Each Chromosome just gets a single number to represent this score. The output is a list of paired Chromosomes and their scores.
In this step, we’re sorting the fitness scores in descending order and selecting the top N of them, dictated by our num_parents parameter above. We’re then finding the specific chromosomes that achieved those scores and appending those to a new list.
Breeding to create new Children
The first function is the logic for breeding. It takes two parents and randomly selects start and end cut point positions. These positions are used to combine the Genes of the two parents to create a new child.
I've visualised this below, each time breeding happens the cut points will be different as they're randomly selected meaning that the influence of Parent A and B will vary.
The second function is used to select the parents to be passed to the breeding function to create enough children for a new population. We also incorporate elitism here
Mutating children is not acceptable subject matter in most places, and I'm sure this sentence means I'm now on a watch-list somewhere - but it's essentially what we're doing here.
This function runs through each child in our new population, and for 1 in every 10 it will swap out one randomly selected gene for a randomly selected new digit. The output from this is our new population, ready to run through the whole process again.
The image below would represent one of the children that has been randomly selected for mutation. The position of the digit to be mutated is random, as is the digit that replaces it.
Now we’ve got all our functions in place, we can create a loop to run through them all until a solution to our passcode is found.
We use a while loop as we want it to continue until it finds a solution rather than be limited to a certain number of generations.
We set up an empty list called success, to which we will append the best solution (based on Fitness Score) at each generation. We also set up a counter for the number of generations it takes us to crack the passcode.
Alright, into the loop itself.
Firstly we score each chromosome in the population with our fitness function, we then append the best fitness score for the generation to our success list.
At this point, we want to put a break clause in - if our fitness score = the number of digits in our passcode then we’ve correctly cracked it – and we want the loop to stop.
If this is the case, we print out a string showing how many generations it took as well as confirmation of the secret passcode and the passcode we think is correct.
If we’re yet to crack it, we continue the evolutionary process.
We select the best chromosomes as parents and then create children. We then apply the probabilistic mutation process and create our population for the next generation. At this point we increment our generation counter by one and start the process all over again!
Let’s run it and see how we get on!
For the secret passcode: [6, 8, 4, 1, 8, 9, 0, 8, 1, 4] - we got this output:
Cracked in 191 generations, and 1.6231245 seconds!
Secret passcode = [6, 8, 4, 1, 8, 9, 0, 8, 1, 4]
Discovered passcode = [6, 8, 4, 1, 8, 9, 0, 8, 1, 4]
It worked! 191 iterations isn't too bad either, when you consider there are 10 billion (10^10) possible combinations for that passcode.
Note: If you run this exact code, your passcode will be different as it's randomly generated
We can also plot the max Fitness Score by generation with the following code:
How about something bigger...
Just for fun I thought I’d try it on a longer passcode – let’s try say 500 digits.
The number of possible combinations here is very large (10^500).
This took quite a while longer to crack but not as long as I thought it might initially! I've just left the population size, parents, and elite as is. In reality I should probably increase those in this case (it'll therefore probably need less iterations, and be faster to solve).
I won’t print out the actual list as it’s really long, but here is the summary and chart:
Cracked in 36,250 generations, and 59.37182 seconds!
I really enjoyed learning about GA's and hope you do too. It's such an interesting concept, I really want to find some better applications.
Please do share with anyone you think might like this.
Andrew Jones has over 10 years experience in Analytics, Data Science, and Machine Learning within the Digital, Retail, Telecoms and Travel industries - most recently at Amazon, & Sony PlayStation