Hashing in Python

A hashing algorithm characterizes a complex piece of data with something simpler, and preferably unique. The most common example is to find a number that could represent a character string. A hashing algorithm has to be fast, be­cause it often needs to convert a string into an index to a list or tuple. Consider the string “while.” There are five characters (bytes) here. How can this string be used as an index into a tuple?

Any numerical operation on the codes used to represent the character might work, but some result in codes that are too large. Simply adding the codes would give a value of 537, which could work, but also might be too large. Imagine the application is to look up Python key words; there are 33 of them. The value result­ing from the hash should be an index between 0 and 32, so take the hash mod 33. If that is tried, the result is that half of the 33 entries will be empty, and half will have two or more strings that have the same index. The results are as follows:

When two items hash to the same value, it is said to be a collision. In this case, the collisions are as follows:

Two values cannot occupy the same location in a tuple, so something must be done. The simplest way to deal with collisions is to have extra space in the list or tuple. If the size of the tuple is specified as 145, then all strings hash to distinct values. Of course, now 112 tuple entries are empty, but does that really matter? The alternative to a table indexed by hashing (a hash table) would be a list that has to be searched, and hashing is much faster.

1. DJB2

This algorithm starts with a predefined seed for a hash value, multiplies it by 33 and adds the next character from the string, multiplies that by 33, adds the next character, and so on. The code is

def djb2 (s, size):

sum = 5381

for i in range (0, len(s)):

sum = sum*33 + ord(s[i])

sum = sum%size

return sum

Why multiply by 33? It works well, and nobody knows why. The seed of 5381 can be changed to see how different values work. With the configuration given here, there will need to be 112 elements in the tuple to avoid collisions. If the program is changed slightly so that an exclusive OR replaces the sum, the size decreases to 105. That is

sum = sum*33 ^ ord(s[i])

2. SDBM

This is a method devised for scrambling bits, but makes for a good hashing function. The iteration is hash(i) = hash(i – 1) * 65599 + str[i]. The number 65599 is arbitrary, but it happens to be prime. A function to implement this is

def sdbm (s, size):

hash = 0

for i in range (0, len(s)):

hash = ord(s[i]) * 65599 + hash

return hash%size

There are many other hashing methods (see Knuth). The idea is an important one. It is, for example, a way to implement Python dictionaries: hash the key to an integer and use that to access the value.

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.

Leave a Reply

Your email address will not be published. Required fields are marked *