Caesar Cipher / Affine Cipher cracking

Recently I started doing a course on encryption (which due to unforseen earthquakes I had to drop again).  What I did before I left the course is got quite interested in the process of cracking codes (actually, I had an interest in that already - this course just gave me an excuse to delve into it).

I was given some text to decrypt, and figured that I could do it the easy way or the hard way... and being me - I did it the hard way.  I spent a couple of hours putting the following code together to help decode monoalphabetic substitution ciphers (that is, the ciphers where one letter is exchanged for another based on an algorithm).

The simplest is a Caesar Cipher, where you simply shift the alphabet by a certain amount (ie: you choose a shift of 3, so 'a' in a normal word becomes 'd' in the encoded version.  The word 'hello' becomes 'khoor'.  The fastest way to crack this is to simply print all 25 possibilities (not 26, because our encrypted version is the 26th).  Then just pick the one that makes sense.

An Affine Cipher uses a slightly more complex version of shift, where:

 Encrypted(x) = (a.x + b) mod 26 

You need 2 values to encrypt - 'a' and 'b'.  These both need to be chosen carefully as there are number-pairs that produce odd results, like encrypting two different plaintext letters to the same encrypted one - making decrypting a lot more difficult.

Anyway - I wrote a caesar cypher encoder a while back, and modified it to produce all possible solutions for the english alphabet.  This worked on the second of the three encoded texts I was given (and its hard-coded so you can test it).


# Caesar Cypher Decoder
# -------------
# Simple example of how it can be implemented and cracked
# By Steve
# 17 Nov 2010
for i in range (1,26):
    offset = i
    s = 'pu aol aolvyf vm ubtilyz pa ohwwluz yhaoly mylxbluasf aoha if zvtl bulewljalk sbjr aol tvza lslnhua ild aybaoz zwypun bw if pukbjapvu'
    t = '' # encrypted string
    # create a list of lower case characters
    x = [chr(x) for x in range (97, 97+26)] # chr(97) is 'a'
    # offset the list
    for i in range (offset): # pop the last item and insert it at the front
        x.insert(0,x.pop()) # 'offset' times
    for i in s: # iterate through the unencrypted string
        i = ord(i.lower())-97 # get the lower case value of each character referenced to 0
        if i == -65: t += ' ' # if its a space, leave it alone
        else: t += x[i] # otherwise select the matching 'offset' character and add i
    print "Shifted:", offset, " : ", t

This is all well-and-good, but it won't solve an Affine Cipher.  The shift is not linear, so we need to be a little craftier.

The first step is to understand that the english language has some heavily used letters, some hardly used ones - and a good mix inbetween.  Realising this, you can say that the most used letter in an encoded piece of text will most likely be the most used letter in the english language (which happens to be 'e', by a good margin).  Other useful ideas are that a single letter will probably be an 'i' or an 'a', and double letters at the ends of words will probably be 'll' or 'ss' (ie: fall, or pass).  Actually, since I wrote this code I've come to realise that if you have two different single-letter words in an affine cipher then you can proceed straight to some solving of simple simultaneous equations and crack it very quickly... even one mixed with some guesswork would allow you to solve quite quickly.  Anyway, I wasn't really looking for simple solutions - I wasnted to write some code to see if I could do it automagically Cool

This code is not finished - like a lot of my code snippets, I get it working then need to move on (mainly due to the workload I have in my degree).  I would love to add the single-letter solving ideas, and the double - but at this stage its implemented manually after the first pass finishes.  I would also like to implement a dictionary lookup to see if I can make multiple passes and narrow down a solution (something that might also be made to work better than guessing if someone took the spaces out of the encoded text).  These things are on my list for a rainy day...

I did implement a simple generator for creating english language letter-use statistics, and I simply filled the text file (CipherSamplePlainEnglish.txt) with copy-pasted large text tracts from the internet (news sites, blog sites and short story sites).  This seemed to work well, and the more text I analysed, the closer the final result was (up to a point anyway).  By default it will read up to 100,000 lines of text to sample.  The code prints the list of letters in order of frequency once its done analysing.  It also has a fallback if the file is not present, or empty, to use a sample I was provided.

Ok, so the code is as follows, and I also used the guesser/solver here to finish it (another step I want to include eventually):


# Name:        Cypher Cracker
# Purpose:     To see if I can...
# Author:      Steve
# Created:     31/03/2011
# Copyright:   (c) Steve 2011
# Licence:     Beerware - if you like it, buy me a beer :)
#!/usr/bin/env python
# Samples to crack, copy/paste these in when asked:
# di hvangnkfkjs jdgj ei mgs pgkf kt g eavjds ugqci kc taj cqppkukitj rqcjkpkugjkat pav aqv vipqcktw ja cqhhavj kj
# pu aol aolvyf vm ubtilyz pa ohwwluz yhaoly mylxbluasf aoha if zvtl bulewljalk sbjr aol tvza lslnhua ild aybaoz zwypun bw if pukbjapvu
# le ulk lmc kzie wz lwc dwpe efverweziet nle hkq fp ciweznwpwi iremnwkz uwdd zejer pkraen wn le uwdd xe dkzawza nk rezeu wn
def main():
    linesToRead = 100000
    alphabetList = [chr(i) for i in xrange(ord('a'), ord('z')+1)]
    DefaultEnglish  = {'a':73, 'b':9, 'c':30, 'd':44, 'e':130, 'f':28, 'g':16, \
                'h':35, 'i':74, 'j':2, 'k':3, 'l':35, 'm':25, 'n':78, \
                'o':74, 'p':27, 'q':3, 'r':77, 's':63, 't':93, 'u':27, \
                'v':13, 'w':16, 'x':5, 'y':19, 'z':1}
        file = open("CipherSamplePlainEnglish.txt")
        readText = file.readlines(linesToRead) # Read the first 100,000 lines of the file
        sampleText = "".join(readText)
        sampleText.replace('\n', '') # had problem with some text including line breaks
        sampleText = ''
    # Process the text.  Still need to check for no text, as the file could have been empty
    if sampleText == '':
        sortedEnglishKeys = sortDict(DefaultEnglish)
        generatedEnglish = dict([(n, 0) for n in alphabetList])
        for i in sampleText:
            if i < 'a' or i > 'z':
                generatedEnglish[i] += 1
        sortedEnglishKeys = sortDict(generatedEnglish)
        print"\nSORTED ENGLISH"
        print sortedEnglishKeys
    cypherDict = dict([(n, 0) for n in alphabetList])
    cypherText = raw_input('Enter the encyphered text: ')
    for i in cypherText:
        if i < 'a' or i > 'z':
            cypherDict[i] += 1
    sortedCypherKeys = sortDict(cypherDict)
    print"\nSORTED CYPHER"
    print sortedCypherKeys
    print "\nDECRYPTED TEXT (first pass)"
    decrypted = ""
    for i in cypherText:
        if i == ' ':
            decrypted += ' '
        elif i > 'a' or i < 'z':
            decrypted += sortedEnglishKeys[sortedCypherKeys.index(i)]
    print decrypted
    print"Swap phase, try to improve decryption"
        swap1 = raw_input("enter first letter to swap : ")
        if swap1 == '':
        swap2 = raw_input("enter second letter to swap: ")
        tmp1 = decrypted.replace(swap1, '-')
        tmp2 = tmp1.replace(swap2, swap1)
        decrypted = tmp2.replace('-', swap2)
        print decrypted
# Sorts the dictionary by value
def sortDict(dictToSort = {}):
    items = [[v, k] for k, v in dictToSort.items()]
    return [items[i][1] for i in range(len(items))]
if __name__ == '__main__':

So, there you have it.  My complete work on code cracking... such as it is.  I got a good deal more enjoyment out of this than I probably should have - and spent a good deal more time on it than was probably sensible... and the first-pass results are somewhat painful to interpret... but I managed to crack one with a few swaps, and the other with some guesswork (the three encrypted quotes are commented in the code - can you get them?).
Enjoy Cool