Hash Tables - How cool are they...

I wrote this after finishing my degree... partly because I wanted to see if I could and partly to brush up on my coding skills (which had taken a back seat to my cramming for exams skills).   We'd studied hash tables back in 1st year, but that was the last time I'd had much to do with them.

The idea is that you can store anything using the value you are storing to generate the location you store it in - which means you can do the same calculation you did to store it in order to find it.

That might not seem that clever, but typically in computing things end up in lists... and to find them you have to go through the list until you find it.  If your list is 100000000 items long, and you need to find 100 items then you have to go through the list (or, on average, half of the list - still 50000000 items) 100 times.  Thats going to be around 5000000000 comparisons to find your 100 items.

Using the hash table approach, you do 100 calculations and just grab the item at the resulting location... well, almost.  There are some gotchas, but typically your search is going to be VERY fast compared to the linear lookup of a list.

I'll add more in time, but here is my (very rough) implementation.

 

#-------------------------------------------------------------------------------
# Name:        Hashtable
# Purpose:     To see if I could...
#
# Author:      Steve
#
# Created:     11/11/2012
# Copyright:   (c) admin 2012
# Licence:     Beerware - if you like it, buy me a beer
#-------------------------------------------------------------------------------

from sys import exit


"""Hash Table implementation for testing purposes only - it does not have any
dynamic resizing ability.
usage:
    a = Hashtable()
    a.insert("Hello")
    print a.array
    print a.loadfactor()
    etc
"""
class Hashtable(object):
    def __init__(self):
        self._hash_multiplier = 123456791
        self._hash_bitshift = 32
        self.comparisons = 0
        self.linear_probes = 0
        self.max_size = 600
        self.empty = None
        self.array = [self.empty for x in range(self.max_size)]


    def __str__(self):
        retval = "["
        for item in self.array:
            item = '' if item == self.empty else item
            retval += (item + ", ")
        retval += "]"
        return retval


    """Generate and return the load factor for this hashtable
    The value will be a float between 0 and 1.0
    """
    def get_loadfactor(self):
        return 1.0 - ((float)(self.array.count(self.empty)) / self.max_size)


    def generate_hash(self, data, double_hashing = False):
        value = ""
        for item in data:
            value += str(ord(item))
        value = ((long(value) * self._hash_multiplier) >> self._hash_bitshift)
        value = value ** 2 if double_hashing else value
        return value % self.max_size


    def insert(self, data):
        secondary_hash = False
        success = False
        self.comparisons += 1
        while success == False:
            ptr = self.generate_hash(data, secondary_hash)
            self.comparisons += 1
            if self.array[ptr] == self.empty or self.array[ptr] == data:
                self.array[ptr] = data
                success = True
                if secondary_hash:
                    print"Double-hashing succeeded in finding a free slot\n"
            elif not secondary_hash:
                secondary_hash = True
                print"Tried to insert \'{0}\', but space filled by \'{1}\'" \
                        " - Trying double-hashing".format(data, self.array[ptr])
            else:
                print"Double-hashing failed, trying linear probing" \
                        " from first generated location"
                ptr = self.generate_hash(data, False) # Reset to primary hash
                ttl = self.max_size # Set a time-to-live - only scan array once
                while ttl >= 0 and success == False:
                    ptr = (ptr + 1) % self.max_size
                    self.linear_probes += 1
                    ttl -= 1
                    self.comparisons += 1
                    if self.array[ptr] == self.empty or self.array[ptr] == data:
                        self.array[ptr] = data
                        success = True
                        print"Finally made room for it by shifting right " \
                            "{0} places\n".format(self.max_size - ttl)
                        return
                print
                print self.array
                print
                print"CRASH! Linear probing failed as the array is full, " \
                        "there is no more room!"
                exit("Exiting Hashtable, please code up some dynamic resizing")


    def get(self, data):
        secondary_hash = False
        found = False
        count = 0
        while found == False:
            count += 1
            ptr = self.generate_hash(data, secondary_hash)
            if self.array[ptr] == data:
                found = True
            elif not secondary_hash:
                secondary_hash = True
            else: # linear probing time
                ptr = self.generate_hash(data, False)
                ttl = self.max_size
                while ttl >= 0 and found == False:
                    ptr = (ptr + 1) % self.max_size
                    ttl -= 1
                    count += 1
                    found = True if self.array[ptr] == data else False
        print"Required {0} comparisons to find \'{1}\'".format(count, data)



def main():
    string = """Personal information and bank account details for hundreds of teachers has been leaked in another Education Ministry Novopay blunder.
        And no assurance has been offered that money was not stolen since the payroll system was first installed.
        In the past month staff at Auckland's Marshall Laing Primary School were twice given access to unrecognised staff information through Novopay.
        The security breach could have allowed them to alter phone numbers, addresses and change bank account details for payments.
        Last week it was revealed a teacher in Manawatu was able to have her pay details changed by a school where she used to work.
        However, no one at her current school authorised the changes, which required sign-off by two payroll-authorised staff members.
        Education Minister Hekia Parata would not comment on the security breach as Novopay responsibilities had been passed on to Associate Minister Craig Foss.
        He said every Novopay issue was taken seriously.
        "As with all major projects, there will be a review looking into the implementation and development of Novopay," he said.
        "But there are no plans to implement another payroll system."
        Marshall Laing Primary School associate principal Norma Ebsworth said theoretically it would be easy to divert money from a teacher's bank account through the breach.
        "We were horrified to think if that is happening to another school and we're getting their rights, is someone getting rights to ours?" she said.
        "If someone got it who was not honourable, they could do something like that."
        Foss said he had been advised it was "not possible" to confirm that funds had not been diverted.
        Ebsworth said the school had been lucky that major payroll problems began only when a new office manager needed full access to the system.
        "When our office manager finally got full admin rights she got admin rights to people from a different school,
        a school we think is a secondary school because it had 100-odd staff and we didn't recognise any staff," she said.
        "Obviously we didn't go into an unknown school to see what we could do.
        "The principal then contacted Novopay who took those admin rights away for that school and gave her admin rights for another school, again."
        Although Novopay removed both unwanted administration rights within a day, the school's new office manager still doesn't have full administration rights to Marshall Laing staff.
        Novopay business owner Rebecca Elvy said the system was rigorously tested to ensure it is a secure online service.
        "We take the security of payroll data and employees' personal information very seriously," she said.
        "We have escalated the issue at Marshall Laing Primary school to Talent2 to urgently investigate and resolve."""

    split = string.translate(None, "?\"\',.")
    split = split.lower().split()
    print "Number of non-unique items in the data: " + str(len(split))
    table = Hashtable()
    for i in split:
        table.insert(i)

    print table
    print "%i comparisons made for %i items" % (table.comparisons, len(split))
    print "%i shifts made while linear probing" % table.linear_probes
    print "{0} items stored, in a space of {1} elements".format
            (table.max_size - table.array.count(table.empty), table.max_size)
    print "Load factor: {0:.2f}".format(table.get_loadfactor())
    print
    table.get("urgently")
    table.get("rigorously")
    table.get("office")


if __name__ == "__main__":
    main()