
CISC 235 Lecture Notes - Lecture 21: Bit Array, The Algorithm, Bloom Filter


Department: Computing
Course Code: CISC 235
Professor: Robin W Dawes
Lecture: 21

Lecture 21: Bloom Filters
We discussed a hypothetical algorithm for creating short URLs
that can be used in place of long URLs. For any given long URL
we need to create a unique short URL (we then need to set up a
structure that will allow us to associate the short URL with the
original long URL, but we are not focusing on that part of the
problem right now). The question we address is how to quickly
and efficiently create those unique short URLs.
We could try to design a function that would take a long URL as
the input value and produce a (hopefully unique) short URL as the
output. Then we would have to search the set of previously used
short URLs to see if the newly generated one had been used
before. If it is unused, then everything is fine. But if it has already
been used for some other long URL, we need to come up with
another short URL ... for which we will presumably need another
long-to-short function. And if that one fails to create a short URL
that had not been used, we will need a third function, and so on.
Another issue to consider is that if our short URL service is
popular, we will eventually have generated millions of short
URLs. The logistics of searching such a set (which may be too
large to hold in internal memory) can be very time consuming.
If we consider the operations we need to do on the set of short
URLs in this part of the algorithm, we see there are only two:
- add a new value to the set
- test a value to see if it is already in the set
In 1970, Bloom suggested a new data structure that supports these
two operations very efficiently in terms of time (running time) and
space (memory). The trade-off is that it sometimes gives the
wrong answer. Another somewhat unusual feature of the data
structure is that the values that are stored in the structure are not
recoverable - you can't use the structure to list all the values in the
set, or find the largest value, or even precisely determine how
many values are in the set.
With respect to giving the wrong answer, it sometimes gives a
false positive - for some values, it will say "Yes, that value is in
the set" when the value is not actually in the set.
However, it never gives a false negative - so if it says "No, that
value is not in the set", that answer is always correct.
Why is this useful? Well, in our URL program this works very
well. We can structure the algorithm for selecting a short URL to
go with a long URL like this:
1. Generate a completely random short URL - the short URL is
not based on the long URL at all.
2. Check to see if it is in the set already. If so, repeat Step 1. If
not, use it.
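The two steps above can be sketched in Python. This is a minimal illustration, not code from the lecture: the 7-character length and base-62 alphabet are assumptions, and a plain Python set stands in for the Bloom Filter's membership test.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # base-62 characters (an assumption)

def random_short_url(length=7):
    # Step 1: a completely random key, not derived from the long URL at all.
    return "".join(random.choice(ALPHABET) for _ in range(length))

def new_short_url(used):
    # Step 2: test membership and repeat Step 1 on a collision.
    # 'used' is an ordinary set standing in for the Bloom Filter.
    while True:
        candidate = random_short_url()
        if candidate not in used:
            used.add(candidate)
            return candidate
```

With a real Bloom Filter in place of the set, an occasional false positive just forces one extra trip around the loop, which is exactly the cheap failure mode described below.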
Note that this avoids the problem of needing a sequence of
functions to create different short URLs from a single long URL -
we just use a random number generator to crank out new short
URLs as needed. Also note that if the "Is it in the set?" question is
answered incorrectly due to a false positive, it really doesn't cost us
too much - we just generate a new (random) short URL and test
that one. As long as the number of false positives is small, this is
not significant.
The data structure that performs this magic is called a Bloom
Filter. It consists of an m-length bit-vector (i.e. a one-
dimensional array of bits - so we need to implement this in a
language that lets us have direct access to individual bits in
memory) and k different hashing functions that produce values in
the range [0 .. m-1].
How long should the bit-vector be? How many hashing functions
do we need? These questions will be answered in the fullness of
time. For now we will assume that we have chosen m and k
wisely. Let the hashing functions be h1(), h2() ... hk(), and let the
bit-vector be B.
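The lecture does not fix a particular construction for h1() .. hk(). One common practical choice, assumed here purely for illustration, is double hashing: derive two base values a(v) and b(v) from a single strong hash, and let hi(v) = (a(v) + i*b(v)) mod m.

```python
import hashlib

def make_hashes(m, k):
    # Build k hashing functions with outputs in [0, m-1] by double
    # hashing: hi(v) = (a(v) + i*b(v)) mod m.  This construction is an
    # assumption; any k independent hash functions with the right range
    # would serve.
    def hi(i, v):
        digest = hashlib.sha256(v.encode("utf-8")).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big")
        return (a + i * b) % m
    return [lambda v, i=i: hi(i, v) for i in range(1, k + 1)]
```

Each returned function is deterministic, so inserting and later testing the same value always touches the same k bit positions.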
The algorithm to add a new value v to the set looks like this:
insert(v):
    for i in 1 .. k:
        B[hi(v)] = 1
In other words, each of the hashing functions is used to set one bit
in B to 1. This means that v is represented in B by k bits being set
to 1.
The check for membership algorithm is equally simple:
member(v):
    for i in 1 .. k:
        if B[hi(v)] == 0:
            return False
    return True
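Both operations can be combined into a small runnable sketch. The bit-packing and the double-hashing index scheme are implementation assumptions; the lecture only requires an m-bit vector B and k hashing functions with range [0 .. m-1].

```python
import hashlib

class BloomFilter:
    # Sketch of the structure described above: an m-bit vector B and
    # k hashing functions with range [0, m-1].
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits packed into bytes

    def _indexes(self, v):
        # k indexes via double hashing (an assumption; any k independent
        # hashing functions with range [0, m-1] would do).
        digest = hashlib.sha256(v.encode("utf-8")).digest()
        a = int.from_bytes(digest[:8], "big")
        b = int.from_bytes(digest[8:16], "big")
        return [(a + i * b) % self.m for i in range(1, self.k + 1)]

    def insert(self, v):
        # Set one bit per hashing function, so v is represented by k 1-bits.
        for idx in self._indexes(v):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def member(self, v):
        # A 0 bit means v was definitely never inserted (no false negatives);
        # all-1 bits means "probably present" (false positives possible).
        return all((self.bits[idx // 8] >> (idx % 8)) & 1
                   for idx in self._indexes(v))
```

Note that, as described above, the filter cannot list or count its members: inserting a value only flips bits, and nothing records which value flipped them.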