CS240 Lecture Notes - Suffix Tree, Quadtree, Binary Search Algorithm

68 views1 pages
16 Oct 2011
School
Course
Professor
Dictionary as: Unsorted array:
Claim: sorting items by their probabilities of
access in non-increasing order minimizes the cost
of Search.
Proof Idea: If not in the right order, then if we
exchange two items which are in the wrong
order, then the cost decreases.
Move-To-Front heuristic: Upon a successful
search, bring the element forward to the front.
(used when not knowing prob ahead of time)The
cost of shift is already dealt with in the search,
works well in practice
Sorted Array:
Interpolation Search:
Idea: take advantage of the value of the key to
determine its location
Steps: - search(x) in array in index range[L,R]
-compute ration t = (A[R]-x)/(A[R]- A[L])
- compare result with key at floor(t*L + (1-t)*R)
and decide the part to recurse in
Run Time: O(n) worst case but if keys are random
then O(loglogn) search cost
Gallop Search:
Good for large arrays The number of comparison
is still O (logn) (we have at most doubled the
number of comparisons compared to binary
search.
Ordered Lists:
Insert/delete: O(1) , search O(n) ( no BS)
Skip Lists:
Idea: each key (node) has more than one link ( we
make binary search work for lists , Nodes are
ordered in increasing order and have multiple
varied links . Expected time for all operations is
Equal(log n). In practice, performs better
than balanced BSTs. P[#links =i] = 1/ 2^I , lower
or increase the dummy node height when needed
Multi-Dimensional data
Storage:
-Combine multi-dimensional key into one key:
impossible to search by one aspect of keys
- Use dictionary for each dimension: wastes space
-Use dictionaries for multi-dimensional data:
Running Time 1D: Equal(n) report all keys ,
O(k+logn) where k is # of reported elements
Quadtrees: Consider a square that encloses all
points .Break it into four quadrants.Break each
square contains at most one point. Worst Case
when dividing quadrants but not efficiently
[x.x+x’/2] x [y,y+y’/2] , Range search Equal(n)
even if empty. very easy to compute & handle ,
space wasteful , they don’t reflect the nodes, they
KdTrees: kd-trees split rectangle into two part of
median of the point in it. Build kd time
Equal(nlogn) height Equal(logn) assume no two
points with same x,y coordinates, fix by breaking
ties according to the other coordinate, costly after
insert & delete causes unbalance, so do periodic
rebalancing. Range search O(k+G) & G is the # of
regions unsuccessfully & not fully in R. # of
region that intersect a line satisfies recurrence
relation Q(n)=2*Q(n/4)+1 , and solves to
Q(n)=O(root(n))
Complexity RSKD O(k+root(n)), not that easy but
Range Search: Each item will be stored
repeatedly O(logn) times. Use BST inside BS,
space O(nlogn)
Insert point by x, from leaf walk back up & insert
point in subtrees of the nodes. Problem
rebalancing at most liner time . Range search
O(k+log^2(n))
Orthogonal Range Search : we specify a range for
certain aspects of the key and return accordingly
Dictionaries for text strings
Binary Tries: time O(# of bits in x) space O( sum
of the bits in all keys) Insert: 1- leaf: no action is
required 2- internal: add nodes or ( - no bits left :
nothing - bits left :add )
Fix prefix issue: - allow interior nodes to store
keys add \$ at end of every word
Tries for String: alphabet sigma, store multiway
tree, each node has up to sigma + 1 children,
might have unnecessary node
Compressed Tries: eliminate nodes with only one
child , & nodes store an index indicating the
character it splits by
String Matching
Applications: information retrieval, compression, data-
mining, classification
Naïve alg (brute Force) : worst input P= (a^m-1)*b , T = a^n
in pattern and skip by one in Tif miss match
Boyer-Moore ( Matching P backwards ) alg: heuristics :
- Character jump: check char occurrence in pattern & shift
- Suffix shift: check suffix occurrence in pattern & shift
No improvement in worst case to naïve P=ba..a T=aa.aa
Worst case running time O(m*(n-m+1) + |alphabet| )
On typical English text the alg probes 30% of chars in T
KMP alg: Failing fxn run time O(m) , KMP run time : 1-
increase I & j by one, 2- decrease j by atleast one, 3- increase
I by one .(1) and (3) execute in time at most n, and (2)
executes no more often than (1),hence a total of 2n
comparison. Hence total run time O(m+n)
Suffix tree: idea: Preprocess text T rather than pattern P.
Observation: P is a substring of T if and only if P is a prefix
of some suffix of T. build a compressed trie that stores all
suffixes of text T. -insert suffixes in decreasing order of
length.- if a suffix is a prefix of another suffix, we do not
insert it. Store two indexes l , r on each node v (both
internal nodes and leaves) where node v corresponds to
substring T[l ..r ]. How: goes throu the tree looking for
prefix matching P , if node reached is less than M then
unsuccessful, only check for at least M for successful search
& check if first M chars match, the time of operation depend
on the length of the word in query, not # items in dictionary
Naïve Boyer-Moore KMP Suffix tree
Preprocess P - O(m+|sigma|) O(m) -
Preprocess T - - - O(n^2)
Search P in T O(n*m) O(n*m) O(n+m) O(m)
Extra space - O(m+|sigma|) O(m) O(n)
Text compression
Huffman codes: To store (or transmit) a given text, we want
to transform it to be as short as possible. First approach:
fixed length , works best with random text( but English not
random ). Use variable length encoding ( popular is shorter
) Encoding tree has prefix code : the prefix of one code
cannot represent another symbol. Build tree: determine
frequency of char, build bottom up using priority queues,
weight of a leaf is f(c), & weight of internal node is the sum
of the leaves. Tree not unique , worst case run time O(
|sigma| lg |sigma| ) length of the encoding generated by
each tree: sum over c( (# of occurrences of c)* length code
of c ) = f(c) * depth(c) . the value is called weighted path
length (WPL) for the tree. Huffman tree minimizes this
value. Trie must be transmitted with decoder. Two passes:
one for computing the frequencies, one for the encoding.
Lempel Ziv compression:
Makes use of patterns in text files ( underlying patterns in
data . Lempel Ziv Welch : - fixed width encoding using k bits,
- store a dictionary of 2^k entries ( typical k = 12 ) first
ASCII chars , remaining entries involve multiple chars
Huffman best model, but LZW better in practice
BuildHuffTree(input)
for each unique character c do
Create a tree, T, with c as the only node
PQ.insert( f (c), T ) /* use f (c) as priority */
end for
while PQ.size() > 1 do
T1 PQ.extractMin() (with priority w1)
T2 PQ.extractMin() (with priority w2)
Create a tree, T, with empty root and T1, T2 as
children
PQ.insert( w1 + w2, T )
end while
return PQ.extractMin()
LZW-Encode(input)
Initialize dictionary D with all
single chars
s first char of input
n Code( s )
Output n
while input has more chars do
t s
s longest prefix from input in D
n Code( s )
Output n
c first character of s
Insert tc into D with next code
number
end while
LZ-Decode input An input stream of k bit
codes An output stream of ASCII chars
Initialize dictionary D with all single chars
n first k bits of input
s Decode( n )
Output s
while input has more codes do
t s
n next k bits of input
s Decode( n )
Output s
c first character of s
Insert tc into D with next code number
end while
KMPFailure(P[0..m − 1])
f [0] 0
i 1
j 0
while ( i < m ) do
if ( P[i ] = P[j ] ) then
f [i ] j + 1
i i + 1
j j + 1
else if ( j > 0 ) then
j f [j − 1]
else
f [i ] 0
i i + 1
end if
end while
Brute-Force
Naive(P[0..m −1],T[0..n − 1])
for i 0 to n − m do
for j 0 to m − 1 do
if T[i + j ] = P[j ] then
if j = m − 1 then
return i
BST-RangeSearch(X, I):
if (X == null) return;
if ( key(X) < I )
BST-RangeSearch(left-child(X),I);
else if ( I < key(X) )
BST-RangeSearch(right-child(X), I);
else if ( key(X) I )
BST-RangeSearch(left-child(X), I);
report(key(X))
BST-RangeSearch(right-child(X),I);
Procedure BuildKDtree(P, depth)
if P contains only one point
return a leaf containing this point
else if depth is even
split P vertically into P1 and P2
else
split P horizontally into P1 and P2
left = BuildKDtree (P1, depth+1)
right = BuildKDtree (P2, depth+1)
return tree with subtrees left and
right
BTree-Search(node x, key k):
Find i such that i − 1-th key ≤ k < i -th key
if k == i − 1-th key then
return this node
else if x is a leaf
return ―k not in the tree‖
else
BTree-Search(i -th child of x, k)
If x = Null the return result
Kx ={k0,k1..km,km+1}
L = search lo in Kx
R = search hi in Kx
For I = L to R+1 do
Child = ith child of X
RangeSearch( child, I , Result )
If I != R+1 then add ki to result
Unlock document

This preview shows half of the first page of the document.
Unlock all 1 pages and 3 million more documents.

Get access

\$10 USD/m
Billed \$120 USD annually
Homework Help
Class Notes
Textbook Notes