Description

In this challenge you’re given the Vigenère encryption of a long string consisting of concatenated English words. The goal is to recover the secret key.

Solution

The key in this challenge is the knowledge about the plaintext to consist of english words. This results in an inhomogenous distribution of the letters in the plaintext, as letters aren’t equally distributed in english language.

We can look up a general distribution of letters in the english language:

'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
'Z': 0.074

Obviously E appears way more often than Z e.g.

A fast and cheap approach would include only using the numbers of E as a measure of equality to the english language and hence to the plaintext distribution, but there is the chance, that E is actually not the most frequent character and e.g. T is. This would result in doing mistakes at some indices of the key.

Vigenere

The Vigenere Chiffre is like the Ceasar chiffre only a shift of the characters, but in this case the text is splitted into blocks of the lengths of the key and each index in these blocks is shifted by a different value, but by the same for the same index through all blocks. This gives us keylength different ceasar chiffre encryptions, that need to be decrypted. In theory to iterate through all possible decryptions we end up with \(26^\text{keylength}\) but we get an advantage, cause we found a way to measure which of these 26 possible ceasar shifts is the most likely.

Keylength

In this case the keylength is assumed to be less than 50, but not known. Note that this method only works, as the ciphertext is way longer than the keylength and hence we can iterate for each index or position of the key enough characters to have a representative distribution.

So how do we find the keylength now? While there are much faster approaches like only using E’s frequency as described above or searching for specific sequences in the ciphertext, that are appearing several times and could correspond to repeating sequences like ‘ing’ in the plaintext, these approaches are more vulnerable to mistakes, as these sequence could appear in the ciphertext randomly without the plaintext being ‘ing’.

So we just iterate through all possible keylengths and return the one, that includes after shifting each position “best possible” a distribution as close as possible to the english one.

Another small note is, that if we didn’t have such an upper bound on the keylength, or the keylength was let’s say 10, just taking the lowest avg chi-squared wouldn’t work. Obviously multiples of the key cannot make the test statistic worse, as in worst case they can use the original key repeated several times. So they are more flexible but have at least the same test-statistic. So mutliples of the real keylength will have a smaller statistic and selecting them would not result in the correct decryption. In this case this problem didn’t appear, as there are no mutliples of 37 (which is in this task the keylength) below 50.

Difference between current distribution and english language letter distribution

To compute the difference between two discrete distributions, there are many different metrices. Only considering the E as described above can be seen as only comparing the distributions peaks, which obviously doesn’t compare them very well, as only a fraction of their information is used.

Other well known metrices are KL-divergence, Shannon divergence or chi-squared. The first two unfortunately don’t behave very well when there are 0 entries, as they use log and hence the statistic evaluates to infitiny. So if e.g. Z doesn’t appear at all in our distribution the statistic concludes that both distributions can’t be “similar”. This could be solved through adding a small positive frequency offset to all frequencies ( and normalize again), but this still gives rare events a big influence or hat-value. Hence I decided to use the chi squared test statistic to compare the distributions. If the statistic is small and close to 0 this means, that both distributions are fairly close to each other and we might have chosen the correct shift (key character) for this position.

Key

XERKBKMLWERKQVTKIFUEEKMDEGASQNFTBZYPV