
# CISC 235 Lecture Notes - Lecture 12: Self-Balancing Binary Search Tree, Binary Tree, Null Pointer

by OC330671


Hash Tables and Hash Functions

We have seen that with a balanced binary tree we can guarantee O(log n) worst-case time for the insert, search and delete operations. Our challenge now is to try to improve on that ... if we can.

Suppose we have data objects that consist of pairs, each pair having a key value that is unique to the object and some "satellite" data that may or may not be unique. For example, the data might consist of "phone number, name" pairs in which each phone number must be unique (the key), but there might be two people both named "Cholmondeley Arbuthnot" (duplicate satellite data).

Assuming for the moment that the keys are integers, the simplest method to store the data is in an array that has an address for each possible key - this is called direct addressing. Then insert, search and delete all take O(1) time - can't beat that!
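A direct-address table can be sketched in a few lines (a minimal illustration; the class and method names are chosen here, not taken from the course):

```python
# A minimal sketch of direct addressing: one array slot per possible key.
# All three operations are a single array access, so each is O(1).

class DirectAddressTable:
    def __init__(self, universe_size):
        # One slot for every possible key - wasteful if the universe is large.
        self.slots = [None] * universe_size

    def insert(self, key, data):   # O(1)
        self.slots[key] = data

    def search(self, key):         # O(1); None means "not present"
        return self.slots[key]

    def delete(self, key):         # O(1)
        self.slots[key] = None

t = DirectAddressTable(100)
t.insert(34, "Cholmondeley Arbuthnot")
print(t.search(34))   # -> Cholmondeley Arbuthnot
```

The cost, as the next paragraph points out, is that the array must be as large as the whole key universe, not just as large as the data set.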

The problem of course is that the set of possible keys may be immense - with 10-digit phone numbers there are 10^10 possible combinations, and most people have at most a couple of hundred phone numbers in their contact list. Even if you have every person in Canada in your contact list, creating an array with 10^10 elements and only using 35,000,000 of them is not very practical.

However, the idea of using a simple array to store our data is very appealing and as we shall see, with a little care and attention we can get good average-case complexity for our three essential operations, even though we may not have optimal worst-case complexity.

Since we are introducing average-case complexity, we should spend a moment looking at balanced binary trees in this way. In a complete binary tree (i.e. one in which there are no missing children - all the leaves are at the bottom level), almost exactly half the vertices are at maximum distance from the root. This implies that the average insert/search/delete time is going to be close to log n. We can make the same argument for Red-Black trees.
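That claim about complete binary trees can be checked with a quick computation (a sketch; the heights below are arbitrary):

```python
# In a complete binary tree of height h there are 2^h leaves
# (vertices at maximum distance from the root) out of 2^(h+1) - 1
# vertices in total, so the bottom level holds just over half.
for h in (5, 10, 20):
    leaves = 2 ** h
    total = 2 ** (h + 1) - 1
    print(h, leaves / total)   # slightly more than 0.5 every time
```

Since most vertices are at or near depth log n, the average search depth is close to the worst-case depth.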

Since we are contemplating using an array that is smaller than the set of all possible keys, we clearly need some way to map key values onto array addresses. For example, if our array (which we will call T) has index values 0, 1, ..., 9 and our key value is 34, we need to decide which array address to use to store the data. We call this mapping a hash function, and we call the array T a hash table.

For the rest of this discussion, we will use the following notation consistently:

m : the size of T. T has addresses 0, 1, ..., m-1

n : the number of data objects we need to be able to store in T. If this is not known precisely, we should at least be able to put an upper limit on it.

h(k) : a function that takes a key value as its argument and returns a value in the range [0 .. m-1]

We will spend some time later talking about how to choose h(k), but for now we will assume the keys are non-negative integers and we will use

h(k) = k mod m

as our hash function. So continuing the previous example, with m = 10 and k = 34, we get h(k) = 4.

The problem is that since the number of possible keys exceeds m, we may get collisions - two or more keys in our set that hash to the same address. To deal with collisions we need to define a collision resolution method.
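A collision is easy to produce with the division-method hash above (a quick sketch; the particular key values are chosen for illustration):

```python
m = 10                 # table size from the running example

def h(k):
    return k % m       # h(k) = k mod m

print(h(34))           # -> 4, as in the example above
print(h(74), h(1234))  # -> 4 4: two more distinct keys hashing to address 4
```

With m = 10, every key ending in the digit 4 collides at the same address, so collisions are unavoidable whenever we have more possible keys than slots.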

Note that due to collisions, we must store the key value as well as its satellite data - otherwise we cannot distinguish between the data associated with different keys that have the same value of h(k).

Collision Resolution Methods

The very bad method: If we are trying to insert a value k and the address h(k) is already occupied, we simply reject the new data. This has the advantage of making insert/search/delete all O(1) worst case - but it has the downside that we are frequently unable to successfully insert new values even though there may be a lot of empty space in T.

Another bad method: If we are trying to insert a value k and the address h(k) is already occupied, we overwrite the contents of T[h(k)] with the new data. This has O(1) complexity for all operations, and we are always able to insert a new value. Alas, we are likely to lose a lot of data.

Challenge: come up with a situation where you can argue that one of the above methods is useful.

A good method: Chaining

In a chained hash table, T is not used to store the data directly. Each element of T is a pointer to the head of a linked list of objects that have all hashed to the same location in T. If no key values currently in the set have hash value = i, then T[i] is a null pointer. Each data pair is implemented as an object that contains the key value, the satellite data, and a pointer variable that will be used to connect to the next object in the list, if any.
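The structure just described can be sketched as follows (a minimal illustration assuming non-negative integer keys and h(k) = k mod m; the class names are chosen here, not taken from the course):

```python
# A minimal chained hash table. Each slot of T holds the head of a
# singly linked list of (key, satellite data) nodes; empty slots are None.

class Node:
    def __init__(self, key, data, nxt=None):
        self.key = key    # the key is stored alongside the data (see above)
        self.data = data
        self.next = nxt   # pointer to the next object in the chain, if any

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [None] * m          # T[i] is None if no key hashes to i

    def _h(self, k):
        return k % self.m            # h(k) = k mod m

    def insert(self, key, data):
        i = self._h(key)
        self.T[i] = Node(key, data, self.T[i])   # prepend to chain: O(1)

    def search(self, key):
        node = self.T[self._h(key)]
        while node is not None:      # walk the chain at this address
            if node.key == key:
                return node.data
            node = node.next
        return None

    def delete(self, key):
        i = self._h(key)
        prev, node = None, self.T[i]
        while node is not None:      # unlink the matching node, if found
            if node.key == key:
                if prev is None:
                    self.T[i] = node.next
                else:
                    prev.next = node.next
                return
            prev, node = node, node.next

table = ChainedHashTable(10)
table.insert(34, "Alice")
table.insert(74, "Bob")        # collides with 34; both live in slot 4's chain
print(table.search(34))        # -> Alice
print(table.search(74))        # -> Bob
```

Note that insert is O(1) because new nodes are prepended, while search and delete must walk the chain at the hashed address, so their cost depends on how long the chains get.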
