A Beginners Guide To Codebreaking
Written by Prof. Graham A. Niblo
Edited by Dr. Claire Swabey
Version 1.2
29th September 2019
Substitution ciphers
Caesar shift ciphers
The easiest method of enciphering a text message is to replace each letter by
another, using a fixed rule, so for example every letter a may be replaced by D, and
every letter b by the letter E and so on.
Applying this rule to the previous paragraph produces the text
Note the convention in these notes that cipher-text is written in capital letters, while
plaintext is usually lowercase.
Such a cipher is known as a shift cipher since the letters of the alphabet are shifted
round by a fixed amount, and as a Caesar shift since such ciphers were used by
Julius Caesar.
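The rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name caesar_encrypt and the choice to leave spaces and punctuation untouched are assumptions made here, not part of the original notes. It follows the convention that plaintext is lowercase and cipher-text is in capitals.

```python
import string

def caesar_encrypt(plaintext, shift=3):
    """Shift each letter forward by `shift` places; cipher-text is in capitals."""
    out = []
    for ch in plaintext.lower():
        if ch in string.ascii_lowercase:
            pos = (string.ascii_lowercase.index(ch) + shift) % 26
            out.append(string.ascii_uppercase[pos])
        else:
            out.append(ch)  # assumption: spaces and punctuation are left as they are
    return "".join(out)

print(caesar_encrypt("the easiest method"))  # WKH HDVLHVW PHWKRG
```

With the default shift of 3 every a becomes D and every b becomes E, as in the rule above.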
There are only 26 Caesar shift ciphers (and one of them does nothing to the text) so
it is not too hard to decipher the text by brute force. We can try each of the shifts in
Just because we can use brute force to solve the cipher doesn’t mean we have to. If
that was all there was to codebreaking it would be entirely the province of
computer scientists and engineers who are very smart at speeding up that sort of
computation. At the cutting edge of cryptography it is the interaction of those
disciplines with mathematics which enables governments (and criminal hackers)
to read poorly encrypted communications, and we can begin to see where
mathematics comes into the picture even when considering a simple cipher like
the Caesar shift.
Notice that in order to know which shift cipher has been used it is enough to work
out where one of the letters has been shifted. That tells us the amount of shift and
therefore the entire cipher. This can be done, for example, by discovering which
character has replaced the plaintext letter e. The letter e has been chosen here for
a reason, it is the single most common letter to be found in English text (curiously,
largely because the word the is one of the most common words - we will come
back to that point in a minute). It is an interesting exercise to choose some text and
analyse the letter frequencies to confirm that for yourself, and you can find a useful
online text analyser at
http://www.dcode.fr/frequency-analysis
to speed that up.
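If you would rather count letters yourself than use the online analyser, a minimal sketch along the following lines should do. The function names are hypothetical, and the guess that the most common cipher letter stands for e is exactly the assumption discussed above; it can fail on short or unusual texts.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: percentage} for the letters A-Z found in text."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    return {ch: 100 * n / len(letters) for ch, n in counts.most_common()}

def shift_from_e(ciphertext):
    """Guess the Caesar shift, assuming the most common cipher letter stands for e."""
    freqs = letter_frequencies(ciphertext)
    most_common = max(freqs, key=freqs.get)
    return (ord(most_common) - ord("E")) % 26

sample = "WKH HDVLHVW PHWKRG"   # "the easiest method" shifted by 3
print(letter_frequencies(sample)["H"])  # 25.0
print(shift_from_e(sample))             # 3
```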
Running the cipher text above through the analyser we see that the letter H
appears more than twice as frequently as any other letter, more than 20% of the
time. We can compare the frequencies with those for the lead story from the BBC
news site this morning which has been run through the same checker. The
For a more sophisticated cipher like the affine shift cipher or the keyword cipher
we will need to know more than just one letter to break it, so let’s look again at our
first example. The spaces in the text imply that the encryption has left the word
structure intact. So we might guess that the three letters starting the sentence
form a 3 letter word, and as remarked above the most common 3 letter word in
English is “the”. This fits with our frequency count which suggests (correctly) that
e has been replaced by H. A quick check shows that the Caesar shift by 3 does
indeed encode the word the as WKH, and it is then easy to complete the decryption.
plain   a b c d e f g h i j k l m
cipher  S I M P O N Q R T U V W X

plain   n o p q r s t u v w x y z
cipher  Y Z A B C D E F G H J K L
Here we have chosen the keyword SIMPSONS, so we continue the alphabet from
the first unused letter after the last used letter, N, which is Q. Of course if the key
phrase is carefully chosen (for example “The quick brown fox jumps over the lazy
dog”) there might not be any letters left to use up but such a choice is not
necessary. If instead of using a genuine word or phrase we allow ourselves to use any
ordering of the letters in our cipher-text alphabet then the number of such ciphers
is 26!, or approximately 4 × 10^26, and brute force cannot be used to attack the problem.
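The construction of a keyword alphabet can be sketched as follows. The function name keyword_alphabet is an assumption for illustration. The rule implemented is the one described above: write the keyword without repeated letters, then carry on through the alphabet from the letter after the last one used, skipping anything already taken.

```python
import string

ALPHABET = string.ascii_uppercase

def keyword_alphabet(keyword):
    """Build the 26-letter cipher alphabet for a keyword substitution cipher."""
    key = []
    for ch in keyword.upper():
        if ch in ALPHABET and ch not in key:
            key.append(ch)                      # keep first occurrence only
    # Continue from the letter after the last keyword letter (A if no keyword).
    start = ALPHABET.index(key[-1]) + 1 if key else 0
    rest = [ALPHABET[(start + i) % 26] for i in range(26)]
    tail = [ch for ch in rest if ch not in key]
    return "".join(key + tail)

print(keyword_alphabet("SIMPSONS"))  # SIMPONQRTUVWXYZABCDEFGHJKL
```

The output is read against the plain alphabet a to z, so a is enciphered as S, b as I, and so on.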
In practice a fully random encryption table would be impossible for an agent to
reliably memorise (they work under conditions of extreme stress after all) so a
genuine word or phrase will have been used which reduces the number of ciphers
considerably. According to the Oxford English Dictionary authorities “there are, at
the very least, a quarter of a million distinct English words”, which would still make
a brute force attack impossible without the aid of a computer, but frequency
analysis still works, especially if we can see the word shapes.
As before we notice that the first word has three letters and, since it occurs several
times, may well be the word “the”. This gives a strong hint that the letter e is
Reading carefully we see the single letter word H, and the four letter word th_t
circled above, and guess that H is a vowel, almost certainly the letter a. Making
that replacement we get the following, where we are using our upper/lowercase
convention to save space:
Now the two 2 letter words ending with “t” are “at” and “it” so the word Ft
circled above is one of these, and since it is followed by another two letter word,
FU, beginning with the same letter we probably have “it is” here, meaning that
F enciphers i and U enciphers s.
Hence we get:
Before reading on it is worth looking at this to see if you can spot any other likely
substitutions of your own.
tM = to, so M = o
haXe = have, so X = v
easC = easy, so C = y
As we identify more letters it gets easier to guess even more and we can decipher
the text to get the following extract from Simon Singh’s excellent history of codes
and ciphers, The Code Book:
Frequency analysis
We have already seen how frequency analysis can help us to identify common
letters and common words. We can go further with this analysis, comparing the
number of occurrences of each character in the cipher text with an expected
frequency for the standard English alphabet. In the plain text above a character
count gives us the following table of occurrences.
Letter  A  B  C  D  E  F  G  H  I  K  L  M  N  O  P  R  S  T  U  V  W  Y
Count  32  7 14 11 55  5  2 26 27  6  9 11 20 18 16 17 17 35  4  4  4 12
The consonants h,s,t are relatively common in English plaintext as are the
vowels a,e,i and o. The vowel u is much less common and any occurrence of q is
Usually the length of the text groups doesn’t matter. However, in analysing a
Vigenère cipher (see below) a carelessly chosen block length may make the length
of the keyword more apparent, since it can reveal the repetitions more easily.
To attack cipher text that has been grouped in this way we have to work with
letters not words. To do so we use the frequency analysis described above,
together with a little judgement (or luck!). The process can be hard, but wars have
been won or lost on the back of it, and so have fortunes. As remarked by Jericho,
the lead character in Robert Harris’s novel “Enigma”,
plain   a b c d e f g h i j k l m
cipher  H K N Q T W Z C F I L O R

plain   n o p q r s t u v w x y z
cipher  U X A D G J M P S V Y B E
The affine shift ciphers can also be written in shorthand form x → ax+b and the
Caesar shift ciphers are special cases of the affine shift ciphers with a=1. We think
of the pair of numbers (a,b) as the key to the cipher. It is an interesting question,
that we will consider later, how many possible keys there are!
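As a rough sketch, the shorthand x → ax+b can be turned into code like this. The function name affine_encrypt is hypothetical. The numbering a=1, ..., z=26 is the convention used in these notes, with a result of 0 mod 26 read as 26, that is Z.

```python
import string

def affine_encrypt(plaintext, a, b):
    """Encipher with x -> a*x + b mod 26, numbering a=1, ..., z=26."""
    out = []
    for ch in plaintext.lower():
        if ch in string.ascii_lowercase:
            x = string.ascii_lowercase.index(ch) + 1   # a=1, ..., z=26
            n = (a * x + b) % 26                       # 0 stands for 26, i.e. Z
            out.append(string.ascii_uppercase[(n - 1) % 26])
        else:
            out.append(ch)  # assumption: non-letters pass through unchanged
    return "".join(out)

print(affine_encrypt(string.ascii_lowercase, 3, 5))  # HKNQTWZCFILORUXADGJMPSVYBE
print(affine_encrypt("the", 1, 3))                   # WKH, the Caesar shift by 3
```

Setting a=1 recovers the Caesar shifts, which is why they are special cases of the affine shifts.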
Notice that in both the Caesar shift x → x+3 and the affine shift x → 3x+5 the letter y
is enciphered as B, since 25+3 = 28 = 26+2, and 3×25+5 = 80 = 3×26+2. It follows that
two different affine shift ciphers can encrypt a letter in the same way, so it is no
longer sufficient to discover the letter substituting for e in order to decipher the
message. Mathematicians would say that there are “two degrees of freedom” in
our choice of cipher so we might hope that deciphering two letters is sufficient.
Luckily this is true, since if we know two values of the expression ax+b we can solve
the two corresponding simultaneous equations to find the integers a and b.
We may be more familiar with this exercise when solving pairs of equations over
the real numbers, but the same method works for modular arithmetic, with one
important caveat.
We will look more carefully at when division is allowed in a minute. For now it is
worth noting that this caveat has an interpretation in cryptography. In order for the
rule x → ax+b to define a cipher it had better be the case that each of the numbers 1
… 26 appears exactly once in the list of numbers ax+b as x ranges from 1 to 26. It
doesn’t matter which value we choose for the addition term b, but if we choose the
multiplication factor a carelessly (so that we can’t divide by a mod 26) this might
not be the case.
Let’s try an example. Suppose that we have been given a cipher text which we
believe to be encrypted with an affine shift cipher, and that the two most common
letters in the cipher text are S and L, appearing, respectively, roughly 12% and 9%
of the time. We guess that this means e is encrypted as S and t as L. In terms of
the modular arithmetic this tells us that
5a + b = 19 mod 26
20a + b = 12 mod 26.
As with ordinary simultaneous equations we can take the difference to deduce that
15a = −7 = 19 mod 26.
It is tempting to solve this by dividing both sides by 15, to get a=19/15, but this
won’t work as a has to be an integer. What we really have to do is find the
multiplicative inverse for 15 mod 26. This fancy phrase just means we need to find
a number a’ so that 15a’ = 1 mod 26, or in other words find a’ so that 15a’ is 1 plus a
multiple of 26. This would then allow us to deduce that a=(15a’)a=(15a)a’=19a’, so
multiplying 19 by a’ will give us a.
We could do this by trial and error. For each a the number a’ will have to be one of
the twelve odd numbers other than 13, and there are clever ways to try to solve for
a’, but to speed things up, here is a table of multiplicative inverses mod 26 for the
twelve numbers that have them:
a 1 3 5 7 9 11 15 17 19 21 23 25
a’ 1 9 21 15 3 19 7 23 11 5 17 25
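The whole calculation can be sketched in code. The function names here are assumptions for illustration, and the inverse is found by plain trial and error rather than by any clever method. The example uses the guess from the text that e (5) is encrypted as S (19) and t (20) as L (12).

```python
def inverse_mod26(n):
    """Return n' with n * n' = 1 mod 26, or raise ValueError if none exists."""
    for candidate in range(1, 26):
        if (n * candidate) % 26 == 1:
            return candidate
    raise ValueError(f"{n} has no multiplicative inverse mod 26")

def solve_affine_key(x1, y1, x2, y2):
    """Find (a, b) with a*x1 + b = y1 and a*x2 + b = y2 mod 26."""
    a = ((y2 - y1) * inverse_mod26((x2 - x1) % 26)) % 26
    b = (y1 - a * x1) % 26
    return a, b

print(inverse_mod26(15))                # 7, as in the table
print(solve_affine_key(5, 19, 20, 12))  # (3, 4)
```

Under that guess the key would be (3, 4), since 19 × 7 = 133 = 3 mod 26 and then b = 19 − 5 × 3 = 4. Whether the guess is right can only be judged by trying the decryption on the cipher text.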
We write the keyword at the head of a table with three columns, then enter the
plain-text in the boxes below. The last, empty, box is padded with an X (usually -
there is no fixed rule for which character is used) so that all the boxes are full. Next
we rearrange the columns so that the letters in the keyword are now in alphabetic
order, ABD, and read off the rows grouping the letters in blocks of 5 for easy and
accurate transmission:
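The procedure just described might be sketched as follows. The keyword BAD and the message are made up for this illustration (the notes only tell us that the sorted keyword letters are ABD), and the function name is hypothetical.

```python
def transpose_rows(plaintext, keyword, pad="X", group=5):
    """Write text in rows under the keyword, sort the columns, read off the rows."""
    letters = [ch for ch in plaintext.upper() if ch.isalpha()]
    width = len(keyword)
    while len(letters) % width:
        letters.append(pad)                   # fill the empty boxes
    # Column order that puts the keyword letters into alphabetical order.
    order = sorted(range(width), key=lambda i: keyword[i])
    out = []
    for start in range(0, len(letters), width):
        row = letters[start:start + width]
        out.extend(row[i] for i in order)     # rearranged row
    text = "".join(out)
    return " ".join(text[i:i + group] for i in range(0, len(text), group))

print(transpose_rows("attack at ten", "BAD"))  # TATCA KTATN EX
```

Here the rows att, ack, att, enX become tat, cak, tat, neX once the columns are put in the order A, B, D.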
Height  1  2  3  5  6  7 10 14 15 21 30 35 42 70 105
Count  60 64 26 15 27  6 12  5  7  6  9  4  6  3   3
Column heights of 1, 2, 3, 5, 6, 7 and 10 all seem unlikely given that the keyword or
phrase would then have to have at least 21 letters in it. On the other hand a column
height of 30 would correspond to a keyword of length 7, which is quite feasible,
and gives rise to a good number (9) of TH adjacencies, as marked in green in the
corresponding 30×7 grid.
Notice that in three cases, rows 8, 9 and 16 the T and H appear in columns 2 and 7
respectively. This suggests that whatever order the columns should be in we
should end up with column 2 next to (and to the left of) column 7. In two of the
rows, 9 and 16, there is an E in the fourth entry so we are led to try putting these
three columns together in the order 2,7,4.
Assuming this is not an Olde English text, ruling out “Twas” as a word, these three
columns are not likely to be the first three, so we need something to the left and
the possibilities for that put S,H, I or A to the left of the string TWA in the first row.
Trying each in turn we get STWA, HTWA, ITWA or ATWA and the first two seem
One possibility for the remaining columns reads AHITWAS but then the next row
reads GNRDGOI which is clearly wrong. There is one other way to rearrange the
columns to get this first row, but that is also unlikely as it gives the second row
ONRDGGI. On the other hand these same letters might suggest the word GOING in
row 2 and a rearrangement and further experimentation gives the final
arrangement, which you might recognise from earlier:
“It was hard going, but Jericho didn’t mind. He was taking action, that was the
point. It was the same as code-breaking. However hopeless the situation, the rule
was always to do something. No cryptogram, Alan Turing used to say, was ever
solved by simply staring at it.”
We stared pretty hard at this, but there was nothing simple about breaking it. I
think Jericho, and maybe even Alan Turing, would approve.
frequency distributions, and we see that the cipher-text distribution, while not
uniform, is much flatter and lacks the distinctive spike at the left, suggesting that
the frequency distribution of letters is not a good match to standard English.
From this we conclude that the text is not encrypted with a transposition
or a monoalphabetic substitution cipher like one of the shifts, or the keyword
substitution we studied above.
So we guess that the text has been encrypted with a polyalphabetic cipher, and
since we only know about the Vigenère cipher we will assume that is what we have
here.
The first step is to try to find the likely keyword length, which we will denote k,
which is at least 2 since we are not considering a monoalphabetic substitution. To
do this we will compute the index of coincidence for sequences of letters spaced k
apart in the cipher-text. Start by taking k=2. We consider the sequence of every
k   ioc
2   0.04695
3   0.05435
4   0.04616
5   0.047614
6   0.069209
7   0.046907
8   0.047228
9   0.04809
Notice that for k=6 we obtain an ioc of 0.069209, which is very close to the
expected value of 0.0668 for English text, whereas the other values of k give a much
lower value, which suggests that the key length is 6.
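A computation of this kind could be sketched as below. Two assumptions are made here that the notes may not share. First, the index of coincidence is taken to be the usual chance that two letters picked at random from the sequence are the same. Second, for each k the value is averaged over all k sequences of letters spaced k apart. The function names are hypothetical.

```python
from collections import Counter

def index_of_coincidence(text):
    """Chance that two letters drawn at random from text are equal."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def ioc_for_key_length(ciphertext, k):
    """Average ioc over the k sequences of letters spaced k apart."""
    letters = "".join(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    sequences = [letters[i::k] for i in range(k)]
    return sum(index_of_coincidence(s) for s in sequences) / k

# Toy check: a text that repeats with period 2 scores 1.0 at k=2.
print(ioc_for_key_length("ABABAB", 2))  # 1.0
```

Running ioc_for_key_length on a real cipher text for k from 2 to 9 gives a table of the kind shown above, and a value of k whose ioc stands out near 0.0668 is a good candidate for the key length.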
The next step is to split the text into blocks of 6 and to carry out frequency analysis
on each of the 6 columns this gives us. Here is the first of the six sequences:
XMDFPHXBFHAEHGTMGHXXMKMRULAGHHGMXRXLYGXHTBWXGMHMLBXMBKEGV
BMAXKYRLLRBUTFMBFKHGXTFLAXTLYTLMFBKGOFHHTELFGTWKKGXMBALKB
TWKHOGNMVWZPAXXGXKTATTXKGFHBEFLKNXTRTVTEKXWMXZNXVMAXVALLO
NNXXMAFHGMNXXLGLKKGFTTBXGELLXXVR
The most common letter by far is X, so we deduce that e has been encrypted by X
in this sequence, and since the Vigenère cipher uses Caesar shift ciphers this gives
a decrypt of
etkmwoeimohlonatnoeetrtybshnoonteyesfneoaidentotsietirlnc
itherfyssyibamtimroneamsheasfastmirnvmooalsmnadrrnetihsri
adrovnutcdgwheenerahaaernmoilmsrueayacalredteguecthechssv
uueethmontueesnsrrnmaaienlsseecy
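The splitting into columns and the decryption of one column can be sketched like this. The function names are assumptions, and the shift is counted so that e to X is a shift of 19. The guess that the most common letter in a column stands for e is only a starting point and may need revising.

```python
from collections import Counter

def split_columns(ciphertext, key_length):
    """Return the sequences of letters spaced key_length apart."""
    letters = "".join(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    return [letters[i::key_length] for i in range(key_length)]

def shift_assuming_e(column):
    """Shift that sends e to the most common letter of the column."""
    most_common = Counter(column).most_common(1)[0][0]
    return (ord(most_common) - ord("E")) % 26

def undo_shift(column, shift):
    """Decrypt a column of capitals back to lowercase plaintext."""
    return "".join(chr((ord(ch) - ord("A") - shift) % 26 + ord("a")) for ch in column)

# First ten letters of the first sequence above, with e encrypted as X.
print(undo_shift("XMDFPHXBFH", 19))  # etkmwoeimo
```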
To do so we look for the pattern t_e across the first three columns after
decrypting columns 1 and 3. We find this pattern in rows 23, 48, 109, 164 and 176
where the encryption string is MRH, MGH, MYH, MMH and MBH, so if any of these is
an encryption of “the” then h must be encrypted as R, G, Y, M or B. These
correspond to the Caesar shifts mapping e to O, D, V, J or Y. We have already seen
from our frequency analysis that the most likely encryption of e is either to map it
to J or to X, and putting this together with the list we just produced makes the
mapping to J more likely, so we assume that our second column is enciphered
using a shift mapping e to J and make that substitution.
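The step from "h is encrypted as R, G, Y, M or B" to "e is encrypted as O, D, V, J or Y" is a small piece of arithmetic that can be checked with a sketch like this (the function name is hypothetical).

```python
def image_of_e(cipher_for_h):
    """If a shift sends h to the given cipher letter, return where it sends e."""
    shift = (ord(cipher_for_h) - ord("H")) % 26
    return chr((ord("E") - ord("A") + shift) % 26 + ord("A"))

candidates = ["MRH", "MGH", "MYH", "MMH", "MBH"]
# The middle letter of each string is the possible encryption of h.
print([image_of_e(s[1]) for s in candidates])  # ['O', 'D', 'V', 'J', 'Y']
```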
[Figure: code wheel encryptor]

[...] word substitution. Choose a codeword or phrase that only you and your
friends know and write it down, missing out any letters when they repeat. For
example if your code phrase was [...] ESCAPROMNQTUVWXYZBDFGHIJKL. [...] it in
the spaces in the inner wheel of this code wheel encryptor. Now use it instead
of [...] accomplished code breakers) can crack.

To apply a Caesar shift turning a into D (the shift x+3), rotate the inner wheel
until the “D” on its outer ring lines up with “a”, then read your message one
letter at a time from the outer wheel to the inner.

To apply the affine shift cor- [...]
Now we are getting somewhere. We know this is an article about encryption, and
right at the start of the text we see the pattern enc_ _ _tio_. This corresponds
to the cipher-text XSFJD JMNRF, suggesting that ryp in positions 4, 5 and 6 has been
enciphered as JDJ, using shifts mapping r to J, y to D and p to J in turn. Applying the
J to r shift as the decrypt on the fourth column, the D to y shift on the fifth and the J to
p shift on the sixth gives us the following
encryptionmakesthemodernworldgoroundeverytimeyoumakeamobi
lephonecallbuysomethingwithacreditcardinashoporonthewebor
evengetcashfromanatmencryptionbestowsuponthattransactiont
heconfidentialityandsecuritytomakeitpossibleifyouconsider
electronictransactionsandonlinepaymentsallthosewouldnotbe
possiblewithoutencryptionsaiddrmarkmanulisaseniorlecturer
incryptographyattheuniversityofsurreyatitssimplestencrypt
ionisallabouttransformingintelligiblenumbersortextsoundsa
ndimagesintoastreamofnonsensetherearemanymanywaystoperfor
mthattransformationsomestraightforwardandsomeverycomplexm
ostinvolveswappinglettersfornumbersandusemathstodothetran
sformationhowevernomatterwhichmethodisusedtheresultingscr
ambleddatastreamshouldgivenohintsabouthowitwasencrypteddu
ringworldwariithealliesscoredsomenotablevictoriesagainstt
hegermansbecausetheirencryptionsystemsdidnotsufficientlys
cramblemessagesrigorousmathematicalanalysisbyalliedcodecr
ackerslaidbarepatternshiddenwithinthemessagesandusedthemt
orecreatethemachineusedtoencryptthemthosecodesrevolvedaro
undtheuseofsecretkeysthatweresharedamongthosewhoneededtoc
ommunicatesecurelytheseareknownassymmetricencryptionsyste
msandhaveaweaknessinthateveryoneinvolvedhastopossessthesa
mesetofsecretkeys
The shift ciphers used have therefore been shifts by 19, 5, 3, 18, 5 and 20 respectively.
How would the spies have remembered this sequence? It might have been chosen
as the lottery numbers one week, but actually it spells out the word SECRET, with
our usual convention a=1, b=2, c=3 and so on.
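With the keyword in hand the whole decryption can be sketched in a few lines. The function name is an assumption. The shifts are taken from the keyword letters with the convention a=1, b=2, c=3 used above, so S gives a shift of 19.

```python
def vigenere_decrypt(ciphertext, keyword):
    """Undo a Vigenere cipher whose keyword letters give shifts a=1, ..., z=26."""
    shifts = [ord(k) - ord("A") + 1 for k in keyword.upper()]
    out = []
    i = 0  # counts letters only, so spaces do not use up a keyword letter
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            shift = shifts[i % len(shifts)]
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("a")))
            i += 1
    return "".join(out)

print(vigenere_decrypt("XSFJD JMNRF", "SECRET"))  # encryption
```

The ten letters used here are the start of the cipher text quoted above, and they decrypt to the first word of the article.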
This was a far from easy exercise, and it used everything we know about letter
frequencies, common patterns, cribs and the index of coincidence. Combining
them has allowed us to decipher a message that would have defeated all but the
best cryptographers in the past.
As Turing said,