What is Morse Code?
Morse code is a method of encoding information into a form that's friendly for transmission over the telegraph. The need for it arose from Samuel Morse's tragic personal history, which could have benefited from faster long-distance communication. Both the telegraph and the code were concocted mostly by Samuel Morse, with the code itself being refined over time before it became what it is today. It has become a clever method of combining a series of binary (from the Latin 'binarius': "consisting of two") signals to form a large range of meaning.
If we look up to the diagram at the top of the page, we can see how a binary signal (left and right) can be combined until a character is reached. One of the more popular examples of binary encoding, "SOS", would be transmitted with an easy-to-remember sequence of three dots, three dashes, and three dots: . . . - - - . . . (left left left, right right right, left left left)
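If you want to play with this encoding yourself, here's a minimal Python sketch (the lookup table is only a handful of letters for illustration, not the full International Morse alphabet):

```python
# A tiny slice of the International Morse table - just enough for "SOS".
MORSE = {
    "S": "...",
    "O": "---",
    "E": ".",
    "T": "-",
}

def to_morse(text):
    """Encode a string, separating each character's pattern with a space."""
    return " ".join(MORSE[ch] for ch in text.upper())

print(to_morse("SOS"))  # ... --- ...
```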
At this point something may have occurred to you: "What's with the organisation of the characters? Why isn't a dot = A, a dash = B, dot dot = C, and so on? Why doesn't the pattern mirror the one English speakers are used to seeing in an alphabet?"
To which my reply is "That's actually a really great question but to help connect this to the point I want to make, I'm first going to discuss a field that we can leverage with the answer. That field is Cryptography."
Encryption is a funky word used to describe the process of concealing information in the guise of seemingly arbitrary garbage. Conversely, the process of reversing that garbage back to legible information is appropriately labelled 'Decryption'.
Cryptography is a two-way street, in that information that has been hidden can be revealed again by reverse engineering the cipher (the method of its creation). There's a related process called 'hashing', which is a one-way method of concealing data mathematically (often with the use of modulo operations); because it can't be reversed, the only way to recover the original data is through brute force - guessing inputs until one produces the same hash.
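To see that one-way nature in practice, here's a quick sketch using Python's built-in hashlib (SHA-256 is used here purely as a convenient, modern example of a hash):

```python
import hashlib

message = "hello there"

# Hashing is one-way: the digest reveals nothing directly usable about the
# input, and there is no "unhash" function to call.
digest = hashlib.sha256(message.encode()).hexdigest()
print(digest)

# The only way "back" is to guess inputs and compare digests (brute force).
guess = "hello there"
print(hashlib.sha256(guess.encode()).hexdigest() == digest)  # True
```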
There exist many different ciphers, but there are two easily digestible categories that don't require huge computational power: the transposition cipher and the substitution cipher.
Transposition ciphers encapsulate the process of rearranging (transposing) the characters, such as turning "hello there" into "ereht olleh" by reversal - an anagram is another example, so long as it was produced with a repeatable method. This one is a little rudimentary to rely on, which is probably why you'd spot it as a fun quiz night question down at the pub.
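As a tiny illustration, here's that reversal in Python (a minimal sketch; real transposition ciphers usually rearrange by columns or some keyed order rather than simply flipping the text):

```python
def reverse_cipher(text):
    # The simplest possible transposition: read the characters back to front.
    return text[::-1]

ciphertext = reverse_cipher("hello there")
print(ciphertext)                  # "ereht olleh"
print(reverse_cipher(ciphertext))  # reversing again decrypts it
```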
A substitution cipher, on the other hand, is a method of substituting each character for another one entirely - and a more classic example of the spirit and fun of cryptography. The Caesar cipher is a simple example of substitution whereby each character is shifted along the alphabet by an offset. For example, "Cat" encrypted with an offset of 1 would shift each letter along the alphabet once to create "Dbu". As mentioned before, we can decrypt by reverse-engineering the function and shifting backwards by the "secret" offset.
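Here's a small Python sketch of that idea (only letters are shifted; spaces and punctuation pass through untouched):

```python
def caesar(text, offset):
    """Shift every letter along the alphabet by `offset`, wrapping at 'z'."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + offset) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation alone
    return "".join(result)

print(caesar("Cat", 1))   # Dbu
print(caesar("Dbu", -1))  # Cat - decryption is just the negative offset
```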
If you're interested, a more amusing and effective cipher is the Vigenère cipher: a polyalphabetic cipher that uses a keyword instead of a single offset and creates a very believable illusion of gibberish.
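For the curious, here's a rough Python sketch of the Vigenère idea - each letter of the keyword supplies its own Caesar-style offset, and we cycle through the keyword as we go:

```python
def vigenere(text, keyword, decrypt=False):
    """Shift each letter by the offset of the corresponding keyword letter."""
    result = []
    key_offsets = [ord(k) - ord("a") for k in keyword.lower()]
    i = 0  # position within the keyword (only advanced on letters)
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shift = key_offsets[i % len(key_offsets)]
            if decrypt:
                shift = -shift
            result.append(chr((ord(ch) - base + shift) % 26 + base))
            i += 1
        else:
            result.append(ch)
    return "".join(result)

secret = vigenere("attack at dawn", "lemon")
print(secret)                                   # "lxfopv ef rnhr"
print(vigenere(secret, "lemon", decrypt=True))  # back to "attack at dawn"
```

Notice how the two 't's in "attack" come out as different letters ('x' and 'f') - exactly the property that makes naive frequency analysis struggle, as we'll see later.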
What is Cryptography?
What does the arrangement of Morse code have to do with uncovering encrypted information?
The key to this question is simply to look at what a dot means, what a dash means, and why a dot and a dash at all. We know a dot means E and a dash means T. Why E and T? And why is E bound to the dot and T to the dash? It turns out the answer to both is very practical design. Let's get right into it!
What's displayed below is what is referred to as the English frequency analysis chart:
This chart tells us which letters of the English language are used most often. As we can see, 'e' is easily dominant, with 't' taking a close second ahead of 'a' (which is a bit odd, since 'a' is dot-dash and I'd consider two dots a bit faster to communicate).
Okay so what? How is this useful? Let's consider the below message:
"Everyone should go watch The Imitation Game if you are interested in learning a little bit about the Enigma cryptography machine and how one of the pioneers to computer science, Alan Turing, helped to reverse engineer the machine and make a significant impact during the world war."
Using the earlier mentioned Caesar cipher with an offset of 2, we get the following result:
"Gxgtaqpg ujqwnf iq ycvej Vjg Kokvcvkqp Icog kh aqw ctg kpvgtguvgf kp ngctpkpi c nkvvng dkv cdqwv vjg Gpkioc etarvqitcrja ocejkpg cpf jqy qpg qh vjg rkqpggtu vq eqorwvgt uekgpeg, Cncp Vwtkpi, jgnrgf vq tgxgtug gpikpggt vjg ocejkpg cpf ocmg c ukipkhkecpv korcev fwtkpi vjg yqtnf yct."
Looks pretty good! A nice bunch of garbage for us to crack!
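If you're following along with the earlier Caesar sketch, the block above is just that same shift applied to the whole paragraph (this snippet assumes the caesar() helper from before is in scope and that the plaintext is stored in a variable called message):

```python
# Assumes the caesar() helper sketched earlier and the plaintext in `message`.
ciphertext = caesar(message, 2)     # produces the garbled paragraph above
recovered = caesar(ciphertext, -2)  # knowing the secret offset, decryption is trivial
assert recovered == message
```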
Breaking Down The Problem
We are now going to refer to the English frequency analysis chart to see what we can learn about the jumbled text. If you're a programmer, it's a bit of fun to automate the analytics yourself; and if you're not, my condolences. :(
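For the programmers, here's one way to sketch those analytics with nothing but the standard library - a Counter over the letters of the ciphertext:

```python
from collections import Counter

ciphertext = (
    "Gxgtaqpg ujqwnf iq ycvej Vjg Kokvcvkqp Icog kh aqw ctg kpvgtguvgf kp "
    "ngctpkpi c nkvvng dkv cdqwv vjg Gpkioc etarvqitcrja ocejkpg cpf jqy qpg "
    "qh vjg rkqpggtu vq eqorwvgt uekgpeg, Cncp Vwtkpi, jgnrgf vq tgxgtug "
    "gpikpggt vjg ocejkpg cpf ocmg c ukipkhkecpv korcev fwtkpi vjg yqtnf yct."
)

# Count only the letters, ignoring case, spaces and punctuation.
counts = Counter(ch for ch in ciphertext.lower() if ch.isalpha())

# Show the most common ciphertext letters and how often they appear.
for letter, count in counts.most_common(10):
    print(letter, count)
```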


Chopping together some quick analytics, we can see that 'g' is the most used character, with 'p', 'v', and 'k' tied for second, and 'c' third. Based on this information, we'd be correct in assuming 'g' corresponds to 'e'. As the encryptors, we also know that 'v' is the equivalent of 't' and that 'c' is 'a'. As expected, our findings reflect the frequency analysis chart, and after some trial and error we could create the following paragraph by substituting every character with a frequency over 11:
"exeraone uhownf io yateh the ioitation iaoe ih aow are intereutef in nearnini a nittne dit adowt the eniioa erartoirarha oaehine anf hoy one oh the rioneeru to eoorwter ueienee, anan twrini, henref to rexerue eniineer the oaehine anf oame a uiinihieant ioraet fwrini the yotnf yar."
Thinking Like A Cryptographer
It's a lot more legible than what it was, but there's still work to be done. The key is in a combination of statistics and educated guesses. "anf" looks like "and", and "one oh the" looks like "one of the", and "oaehine" looks like "machine". This reveals four more letters.
If we were lucky enough to be working with capital letters whose placement didn't look random, we could also deduce that they mark proper nouns or titles.
"The" is generally regarded as the most used word in the English language, so observing for frequent occurrences of the same three letters could have alluded to that earlier. It is in fact exactly by noticing a similar pattern that protagonist Alan Turing came to his revelations during the Imitation Game.
We're now given a glimpse into the idea that cryptography isn't so much an art of legible text, as it is an art of patterns. After using frequency analysis to break down the defences of encrypted text, we can then return to the values that got us there to find meaning.
We could eventually notice that 'g' is 2 letters along from 'e', and that of 'p', 'v', and 'k', it is 'v' that continues the trend with 't' (the second-place letter in our investigation). Had we noticed this pattern, not only could we logically deduce the value of 't' (which always trumps guessing), but we could have eventually realised that this encryption revolved around the number 2, effectively giving the cipher away.
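Once you suspect a constant shift, the offset falls out of simple letter arithmetic - a quick sketch, assuming 'g' really does stand in for 'e':

```python
# If ciphertext 'g' corresponds to plaintext 'e', the shift is just their distance.
offset = (ord("g") - ord("e")) % 26
print(offset)  # 2

# Sanity check: the same shift also explains 'v' -> 't',
# so the whole cipher gives itself away.
print((ord("v") - ord("t")) % 26)  # 2
```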
What Could Possibly Stop Me Now?
If you did your homework and have a keen eye, you probably noted that even though this is a genius way to attack encrypted text, it would be useless on something like the Vigenère cipher, since with that cipher the character 'e' is constantly being replaced by a different character each time! ... Or so you'd think.
Well, at the time of writing this we're reaching the extent of my knowledge of frequency analysis, and the solution would be to explore more computational methods - which is outside the scope and spirit of this article. Regardless, I'll try to loosely explain :)
To figure this out, we first need to understand what makes this problem harder: the frequency of a character is no longer a reliable element; BUT the length of words and their separations remains the same, and repeated fragments of plaintext still produce repeated fragments of ciphertext whenever they happen to line up with the same part of the key. What we want to do is essentially use some logical shenanigans with n-grams and the gcd (greatest common divisor) to study the spacing between those repeats in order to understand how long the key 'could' be. In tandem with frequency analysis and a dash of brute forcing, instead of putting our focus on the text, we would instead aim to uncover the encryption key itself.
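To make that a little more concrete, here's a very rough Python sketch of the idea (often called the Kasiski examination): find repeated chunks of ciphertext, measure the distances between their occurrences, and use the gcd of those distances as a hint for how long the key could be. This is a simplified illustration, not a full attack:

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def likely_key_length(ciphertext, n=3):
    """Collect distances between repeated n-grams and reduce them with gcd."""
    text = "".join(ch for ch in ciphertext.lower() if ch.isalpha())

    # Record every position at which each n-gram occurs.
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)

    # Distances between repeats tend to be multiples of the key length.
    distances = []
    for spots in positions.values():
        if len(spots) > 1:
            distances.extend(b - a for a, b in zip(spots, spots[1:]))

    if not distances:
        return None
    # A real attack would weigh the common divisors of these distances;
    # taking a single gcd keeps the sketch short.
    return reduce(gcd, distances)

# Toy example: the repeated block sits 10 characters apart, so this prints 10 -
# you'd then test 10 and its divisors (5, 2) as candidate key lengths.
print(likely_key_length("abcxyzqrstabcxyzqrst"))
```

From there, running frequency analysis on every nth letter (one "slice" per key position), plus a dash of brute force, recovers each letter of the key in turn.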