# CS240 Lecture Notes - Suffix Tree, Quadtree, Binary Search Algorithm

68 views1 pages

16 Oct 2011

School

Department

Course

Professor

Dictionary as: Unsorted array:

Claim: sorting items by their probabilities of

access in non-increasing order minimizes the cost

of Search.

Proof Idea: If not in the right order, then if we

exchange two items which are in the wrong

order, then the cost decreases.

Move-To-Front heuristic: Upon a successful

search, bring the element forward to the front.

(used when not knowing prob ahead of time)The

cost of shift is already dealt with in the search,

works well in practice

Sorted Array:

Interpolation Search:

Idea: take advantage of the value of the key to

determine its location

Steps: - search(x) in array in index range[L,R]

-compute ration t = (A[R]-x)/(A[R]- A[L])

- compare result with key at floor(t*L + (1-t)*R)

and decide the part to recurse in

Run Time: O(n) worst case but if keys are random

then O(loglogn) search cost

Gallop Search:

Good for large arrays The number of comparison

is still O (logn) (we have at most doubled the

number of comparisons compared to binary

search.

Ordered Lists:

Insert/delete: O(1) , search O(n) ( no BS)

Skip Lists:

Idea: each key (node) has more than one link ( we

make binary search work for lists , Nodes are

ordered in increasing order and have multiple

varied links . Expected time for all operations is

Equal(log n). In practice, performs better

than balanced BSTs. P[#links =i] = 1/ 2^I , lower

or increase the dummy node height when needed

Multi-Dimensional data

Storage:

-Combine multi-dimensional key into one key:

impossible to search by one aspect of keys

- Use dictionary for each dimension: wastes space

-Use dictionaries for multi-dimensional data:

quad, k-d, range

Running Time 1D: Equal(n) report all keys ,

O(k+logn) where k is # of reported elements

Quadtrees: Consider a square that encloses all

points .Break it into four quadrants.Break each

quadrant into quadrants recursively until every

square contains at most one point. Worst Case

when dividing quadrants but not efficiently

[x.x+x’/2] x [y,y+y’/2] , Range search Equal(n)

even if empty. very easy to compute & handle ,

space wasteful , they don’t reflect the nodes, they

just divide into quadrants

KdTrees: kd-trees split rectangle into two part of

median of the point in it. Build kd time

Equal(nlogn) height Equal(logn) assume no two

points with same x,y coordinates, fix by breaking

ties according to the other coordinate, costly after

insert & delete causes unbalance, so do periodic

rebalancing. Range search O(k+G) & G is the # of

regions unsuccessfully & not fully in R. # of

region that intersect a line satisfies recurrence

relation Q(n)=2*Q(n/4)+1 , and solves to

Q(n)=O(root(n))

Complexity RSKD O(k+root(n)), not that easy but

no bad cases like quadtree

Range Search: Each item will be stored

repeatedly O(logn) times. Use BST inside BS,

space O(nlogn)

Insert point by x, from leaf walk back up & insert

point in subtrees of the nodes. Problem

rebalancing at most liner time . Range search

O(k+log^2(n))

Orthogonal Range Search : we specify a range for

certain aspects of the key and return accordingly

Dictionaries for text strings

Binary Tries: time O(# of bits in x) space O( sum

of the bits in all keys) Insert: 1- leaf: no action is

required 2- internal: add nodes or ( - no bits left :

nothing - bits left :add )

Fix prefix issue: - allow interior nodes to store

keys – add $ at end of every word

Tries for String: alphabet sigma, store multiway

tree, each node has up to sigma + 1 children,

might have unnecessary node

Compressed Tries: eliminate nodes with only one

child , & nodes store an index indicating the

character it splits by

String Matching

Applications: information retrieval, compression, data-

mining, classification

Naïve alg (brute Force) : worst input P= (a^m-1)*b , T = a^n

, worst case performance O (n-m+1)*m , start with first char

in pattern and skip by one in Tif miss match

Boyer-Moore ( Matching P backwards ) alg: heuristics :

- Character jump: check char occurrence in pattern & shift

- Suffix shift: check suffix occurrence in pattern & shift

No improvement in worst case to naïve P=ba..a T=aa.aa

Worst case running time O(m*(n-m+1) + |alphabet| )

On typical English text the alg probes 30% of chars in T

KMP alg: Failing fxn run time O(m) , KMP run time : 1-

increase I & j by one, 2- decrease j by atleast one, 3- increase

I by one .(1) and (3) execute in time at most n, and (2)

executes no more often than (1),hence a total of 2n

comparison. Hence total run time O(m+n)

Suffix tree: idea: Preprocess text T rather than pattern P.

Observation: P is a substring of T if and only if P is a prefix

of some suffix of T. build a compressed trie that stores all

suffixes of text T. -insert suffixes in decreasing order of

length.- if a suffix is a prefix of another suffix, we do not

insert it. Store two indexes l , r on each node v (both

internal nodes and leaves) where node v corresponds to

substring T[l ..r ]. How: goes throu the tree looking for

prefix matching P , if node reached is less than M then

unsuccessful, only check for at least M for successful search

& check if first M chars match, the time of operation depend

on the length of the word in query, not # items in dictionary

Naïve Boyer-Moore KMP Suffix tree

Preprocess P - O(m+|sigma|) O(m) -

Preprocess T - - - O(n^2)

Search P in T O(n*m) O(n*m) O(n+m) O(m)

Extra space - O(m+|sigma|) O(m) O(n)

Text compression

Huffman codes: To store (or transmit) a given text, we want

to transform it to be as short as possible. First approach:

fixed length , works best with random text( but English not

random ). Use variable length encoding ( popular is shorter

) Encoding tree has prefix code : the prefix of one code

cannot represent another symbol. Build tree: determine

frequency of char, build bottom up using priority queues,

weight of a leaf is f(c), & weight of internal node is the sum

of the leaves. Tree not unique , worst case run time O(

|sigma| lg |sigma| ) length of the encoding generated by

each tree: sum over c( (# of occurrences of c)* length code

of c ) = f(c) * depth(c) . the value is called weighted path

length (WPL) for the tree. Huffman tree minimizes this

value. Trie must be transmitted with decoder. Two passes:

one for computing the frequencies, one for the encoding.

Lempel Ziv compression:

Makes use of patterns in text files ( underlying patterns in

data . Lempel Ziv Welch : - fixed width encoding using k bits,

- store a dictionary of 2^k entries ( typical k = 12 ) first

ASCII chars , remaining entries involve multiple chars

Huffman best model, but LZW better in practice

BuildHuffTree(input)

for each unique character c do

Create a tree, T, with c as the only node

PQ.insert( f (c), T ) /* use f (c) as priority */

end for

while PQ.size() > 1 do

T1 PQ.extractMin() — (with priority w1)

T2 PQ.extractMin() — (with priority w2)

Create a tree, T, with empty root and T1, T2 as

children

PQ.insert( w1 + w2, T )

end while

return PQ.extractMin()

LZW-Encode(input)

Initialize dictionary D with all

single chars

s first char of input

n Code( s )

Output n

while input has more chars do

t s

s longest prefix from input in D

n Code( s )

Output n

c first character of s

Insert tc into D with next code

number

end while

LZ-Decode input An input stream of k bit

codes An output stream of ASCII chars

Initialize dictionary D with all single chars

n first k bits of input

s Decode( n )

Output s

while input has more codes do

t s

n next k bits of input

s Decode( n )

Output s

c first character of s

Insert tc into D with next code number

end while

KMP( P[0..m – 1 ],T[0..n − 1])

f KMPFailure( P )

i 0

j 0

while ( i < n ) do

if T[i ] = P[ j ] then

if (j = m − 1) then

return i − j

end if

i i + 1

j j + 1

else if ( j > 0 ) then

j f [j − 1]

else

I i + 1

end if

KMPFailure(P[0..m − 1])

f [0] 0

i 1

j 0

while ( i < m ) do

if ( P[i ] = P[j ] ) then

f [i ] j + 1

i i + 1

j j + 1

else if ( j > 0 ) then

j f [j − 1]

else

f [i ] 0

i i + 1

end if

end while

Brute-Force

Naive(P[0..m −1],T[0..n − 1])

for i 0 to n − m do

for j 0 to m − 1 do

if T[i + j ] = P[j ] then

if j = m − 1 then

return i

BST-RangeSearch(X, I):

if (X == null) return;

if ( key(X) < I )

BST-RangeSearch(left-child(X),I);

else if ( I < key(X) )

BST-RangeSearch(right-child(X), I);

else if ( key(X) ∈ I )

BST-RangeSearch(left-child(X), I);

report(key(X))

BST-RangeSearch(right-child(X),I);

Procedure BuildKDtree(P, depth)

if P contains only one point

return a leaf containing this point

else if depth is even

split P vertically into P1 and P2

else

split P horizontally into P1 and P2

left = BuildKDtree (P1, depth+1)

right = BuildKDtree (P2, depth+1)

return tree with subtrees left and

right

BTree-Search(node x, key k):

Find i such that i − 1-th key ≤ k < i -th key

if k == i − 1-th key then

return this node

else if x is a leaf

return ―k not in the tree‖

else

BTree-Search(i -th child of x, k)

If x = Null the return result

Kx ={k0,k1..km,km+1}

L = search lo in Kx

R = search hi in Kx

For I = L to R+1 do

Child = ith child of X

RangeSearch( child, I , Result )

If I != R+1 then add ki to result