CISC 235 Lecture Notes - Lecture 12: Self-Balancing Binary Search Tree, Binary Tree, Null Pointer
CISC 235 – Lecture 12
Hash Tables and Hash Functions
We have seen that with a balanced binary tree we can guarantee
O(log n) worst-case time for insert, search and delete
operations. Our challenge now is to try to improve on that ... if we
Suppose we have data objects that consist of pairs - each pair
having a key value that is unique to the object, and some "satellite"
data that may or may not be be unique. For example, the data
might consist of "phone number, name" pairs in which each phone
number must be unique (the key), but there might be two people
both named "Cholmondeley Arbuthnot" (duplicate satellite data).
Assuming for the moment that the keys are integers, the simplest
method to store the data is in an array that has an address for each
possible key - this is called direct addressing. Then insert, search
and delete all take O(1) time - can't beat that!
The problem of course is that the set of possible keys may be
immense - with 10-digit phone numbers there are 10^10 possible
combinations, and most people have at most a couple of hundred
phone numbers in their contact list. Even if you have every person
in Canada in your contact list, creating an array with 10^10
elements and only using 35,000,000 of them is not very practical.
However, the idea of using a simple array to store our data is very
appealing and as we shall see, with a little care and attention we
can get good average-case complexity for our three essential
operations, even though we may not have optimal worst-case
Since we are introducing average-case complexity we should
spend a moment looking at balanced binary trees in this way. In a
complete binary tree (ie one in which there are no missing children
- all the leaves are at the bottom level) - almost exactly half the
vertices are at maximum distance from the root. This implies that
the average insert/search/delete time is going to be close to log
n. We can make the same argument for Red-Black trees.
Since we contemplating using an array that is smaller than the set
of all possible keys, we clearly need some way to map key values
onto array addresses. For example if our array (which we will call
T) has index values 0, 1, ..., 9 and our key value is 34, we need to
decide which array address to use to store the data. We call this
mapping a hash function. We call the array T a hash table.
For the rest of this discussion, we will use the following notation
m : the size of T. T has addresses 0, 1, ..., m-1
n : the number of data objects we need to be able to store in T. If
this is not known precisely, we should at least be able to put an
upper limit on it.
h(k) : a function that takes a key value as its argument and returns
a value in the range [0 .. m-1]
We will spend some time later talking about how to choose h(k),
but for now we will assume the keys are non-negative integers and
we will use
h(k) = k mod m
as our hash function. So continuing the previous example, with m
= 10 and k = 34, we get h(k) = 4
The problem is that since the number of possible keys exceeds m,
we may get collisions - two or more keys in our set that hash to the
same address. To deal with collisions we need to define a
collision resolution method.
Note that due to collisions, we must store the key value as well as
its satellite data - otherwise we cannot distinguish between the data
associated with different keys that have the same value of h(k)
Collision Resolution Methods
The very bad method: If we are trying to insert a value k and the
address h(k) is already occupied, we simply reject the new
data. This has the advantage of making insert/search/delete all
O(1) worst case - but it has the downside that we are frequently
unable to successfully insert new values even though there may be
a lot of empty space in T.
Another bad method: If we are trying to insert a value k and the
address h(k) is already occupied, we overwrite h(k) with the new
data. This has O(1) complexity for all operations, and we are
always able to insert a new value. Alas, we are likely to lose a lot
Challenge: come up with a situation where you can argue that one
of the above methods is useful.
A good method: Chaining
In a chained hash table, T is not used to store the data
directly. Each element of T is a pointer to the head of a linked list
of objects that have all hashed to the same location in T. If no key
values currently in the set have hash value = i, then T[i] is a null
pointer. Each data pair is implemented as an object that contains
the key value, the satellite data, and a pointer variable that will be
used to connect to the next object in the list, if any.