UNIT-4 PPT New
Strings
Brute-force algorithm
Boyer-Moore algorithm
Knuth-Morris-Pratt algorithm
Brute-force algorithm
• The brute-force algorithm compares the pattern to the text, one
character at a time, until a mismatch is found
• The algorithm can be designed to stop on either the first occurrence of
the pattern, or upon reaching the end of the text.
• Compared characters are italicized.
• Correct matches are in boldface type.
Brute-force algorithm
• The Brute Force algorithm compares the pattern P with the text T for
each possible shift of P relative to T, until either
• a match is found, or
• all placements of the pattern are tried
Brute-force pattern matching runs in time O(nm), where n is the length of the text T and m is the length of the pattern P
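The shifting procedure described above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation; the function name `brute_force_match` is an assumption made for this example, and it stops at the first occurrence.

```python
def brute_force_match(text, pattern):
    """Return the lowest index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for shift in range(n - m + 1):          # every possible placement of P in T
        j = 0
        # compare one character at a time until a mismatch is found
        while j < m and text[shift + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            return shift
    return -1                               # all placements tried, no match
```

The two nested loops make the O(nm) bound visible: at most n - m + 1 shifts, each costing at most m comparisons.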
Boyer-Moore algorithm
Preprocessing of Pattern:
• Construct a bad match table by preprocessing the pattern
• This table never has elements smaller than 1
• Compare the pattern to the text starting from the rightmost
character of the pattern
• When a mismatch occurs, shift the pattern to the right by the
value given in the bad match table
• This lets the algorithm skip several characters at a time, unlike
brute-force search, so it runs faster
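A bad match table of this kind can be sketched as below. The slide does not say which Boyer-Moore variant it means; this sketch assumes the Horspool simplification, where the shift is looked up for the text character aligned with the last pattern position. The names `bad_match_table` and `horspool_search` are chosen for this example only.

```python
def bad_match_table(pattern):
    """Map each character (except the last one) to its distance from the end."""
    m = len(pattern)
    table = {}
    for i, ch in enumerate(pattern[:-1]):
        table[ch] = m - 1 - i               # always >= 1; later positions overwrite
    return table


def horspool_search(text, pattern):
    """Return the first index of pattern in text, or -1 (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    table = bad_match_table(pattern)
    shift = 0
    while shift <= n - m:
        j = m - 1
        # compare right to left, starting at the rightmost pattern character
        while j >= 0 and text[shift + j] == pattern[j]:
            j -= 1
        if j < 0:
            return shift
        # characters missing from the table allow a full shift of m
        shift += table.get(text[shift + m - 1], m)
    return -1
```

For the pattern "abacab" the table is {a: 1, b: 4, c: 2}, and every other character shifts the pattern by its full length 6.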
Boyer-Moore algorithm
[Figure: the pattern p1 . . . pr placed under the text t1 . . . tj . . ., with pr aligned to tj and p1 aligned to tj-r+1, shown at successive shifts to the right]
Knuth-Morris-Pratt algorithm
• The Knuth-Morris-Pratt algorithm compares the pattern to the text
left-to-right, but shifts the pattern more intelligently than the
brute-force algorithm.
• When a mismatch occurs, what is the most we can shift the pattern
so as to avoid redundant comparisons?
• Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
[Figure: text . . a b a a b x . . . . . with pattern a b a a b a mismatching at pattern index j; after the shift there is no need to repeat the comparisons on the matched prefix, and comparing resumes at the mismatched text character]
Knuth-Morris-Pratt algorithm
• KMP failure function, or LPS (Longest Prefix that is also a Suffix) array: entry j is the length of the largest prefix of P[0..j] that is also a suffix of P[1..j]
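One common way to compute the failure function and use it during matching is sketched below. The function names `failure_function` and `kmp_search` are assumptions for this example, and the search returns only the first occurrence.

```python
def failure_function(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it (the LPS array)."""
    m = len(pattern)
    fail = [0] * m
    k = 0                                   # length of the current matched prefix
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]                 # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the first index of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    fail = failure_function(pattern)
    k = 0                                   # number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]                 # shift the pattern, keep the text index
        if ch == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1
```

The text index never moves backwards, which is why matching takes O(n + m) time once the table is built.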
Dr.L.Lakshmi 34
Tries
• A trie is a tree-based data structure for representing a set of strings,
such as all the words in a text
• A trie supports pattern matching queries in time proportional to the pattern size
(or)
• A trie is a tree-based data structure for storing strings in order to
support fast pattern matching.
[Figure: trie whose edges are labeled e, i, mi, nimize, ze]
Tries
• Standard Tries
• Compressed Tries
• Suffix Tries
Standard Tries
• The standard trie for a set of strings S is an ordered tree such that:
• each node but the root is labeled with a character
• the children of a node are alphabetically ordered
• the paths from the root to the external nodes yield the strings of S
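A standard trie along these lines can be sketched as follows. The class names `TrieNode` and `StandardTrie` are chosen for this example. One assumption goes beyond the slide: an `is_end` flag marks the end of a string, so that a string which is a prefix of another can still be stored.

```python
class TrieNode:
    def __init__(self):
        self.children = {}                  # character -> child node
        self.is_end = False                 # True if a stored string ends here


class StandardTrie:
    def __init__(self, words=()):
        self.root = TrieNode()              # the root carries no character
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    def words(self):
        """List the stored strings, visiting children in alphabetical order."""
        out = []

        def walk(node, prefix):
            if node.is_end:
                out.append(prefix)
            for ch in sorted(node.children):
                walk(node.children[ch], prefix + ch)

        walk(self.root, "")
        return out
```

Each root-to-leaf path spells one stored string, and visiting children in alphabetical order returns the strings sorted.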
Standard Tries
Word Matching with a Trie
• A standard trie supports the following operations on a preprocessed text in time O(m), where m = |X|
• word matching: find the first occurrence of word X in the text
• prefix matching: find the first occurrence of the longest prefix of word X in the text
• Each operation is performed by tracing a path in the trie starting at the root
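The two operations can be sketched on a small word index as below. This is an illustration under stated assumptions: words are separated by whitespace, every node records the start index of the first word that passes through it, and the names `Node`, `build_index`, `word_match`, and `prefix_match` are invented for this example.

```python
class Node:
    def __init__(self, first):
        self.children = {}
        self.first = first                  # start of the first word through this node
        self.word_first = None              # start of the first word ending here


def build_index(text):
    """Preprocess the text into a trie of its words."""
    root = Node(None)
    pos = 0
    for word in text.split():
        start = text.index(word, pos)       # position of this word in the text
        pos = start + len(word)
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node(start)
            node = node.children[ch]
        if node.word_first is None:
            node.word_first = start
    return root


def word_match(root, x):
    """Index of the first occurrence of word x, or -1."""
    node = root
    for ch in x:
        node = node.children.get(ch)
        if node is None:
            return -1
    return -1 if node.word_first is None else node.word_first


def prefix_match(root, x):
    """(length, index) for the longest prefix of x that begins a word."""
    node, length = root, 0
    for ch in x:
        nxt = node.children.get(ch)
        if nxt is None:
            break
        node, length = nxt, length + 1
    return length, (node.first if length else -1)
```

Both queries trace a single path from the root, so each takes a number of steps proportional to |X| regardless of the text length.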
Compressed Tries
• A compressed trie is a trie in which every internal node has at least two children
• It is obtained from a standard trie by compressing chains of redundant (single-child) nodes
[Figure: a standard trie and the compressed trie obtained from it]
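The compression step can be sketched on a trie stored as nested dictionaries. This is a minimal illustration; the terminator character "$" (assumed not to occur in the words) and the names `build_standard` and `compress` are assumptions made for this example.

```python
END = "$"   # assumed terminator, so that every word ends at a leaf


def build_standard(words):
    """Standard trie as nested dicts: character -> subtree."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a copy in which each chain of single-child nodes is one edge."""
    out = {}
    for label, child in node.items():
        # absorb the child into the edge label while it has exactly one child
        while len(child) == 1:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out
```

For the words bear, bell, bid, bull, buy, sell, stock, stop, the chain s-t-o collapses into a single edge labeled "to" below "s", and every remaining internal node has at least two children.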
Compact Storage of Compressed Tries
• A compressed trie can be stored in space O(s), where s = |S| is the number of strings in S, by storing at
each node an O(1)-space range of indices instead of the substring itself
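The index-range idea can be sketched as follows: each edge label is a triple (i, j, k) standing for the substring S[i][j:k]. The half-open Python slice convention, the assumption that no string in S is a prefix of another, and the names `build_compact` and `labels` are all choices made for this example.

```python
def build_compact(S, ids=None, depth=0):
    """Return the children of a node as a list of ((i, j, k), subtree),
    where the edge label is S[i][j:k]."""
    if ids is None:
        ids = list(range(len(S)))
    groups = {}
    for i in ids:                           # group strings by their next character
        if depth < len(S[i]):
            groups.setdefault(S[i][depth], []).append(i)
    children = []
    for ch in sorted(groups):
        group = groups[ch]
        first = S[group[0]]
        end = depth + 1
        # extend the label while every string in the group agrees
        while all(end < len(S[i]) and S[i][end] == first[end] for i in group):
            end += 1
        children.append(((group[0], depth, end), build_compact(S, group, end)))
    return children


def labels(S, children):
    """Decode the index triples back into substrings, for inspection."""
    return [(S[i][j:k], labels(S, sub)) for (i, j, k), sub in children]
```

Each node holds three integers no matter how long its label is, which is what keeps the total space proportional to the number of strings.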
Insertion and Deletion into/from a Compressed Trie
Suffix Trie
• The suffix trie of a string X is the compressed trie of all the suffixes of X
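Example string X with its character indices:
m i n i m i z e
0 1 2 3 4 5 6 7
[Figure: suffix trie of X; the edges leaving the root are labeled e, i, mi, nimize, ze]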
Properties of Suffix Tries
• The suffix trie for a text X of size n from an alphabet of size d
• stores all the n suffixes of X, of total length n(n+1)/2, in O(n) space
• supports arbitrary pattern matching and prefix matching queries in O(dm)
time, where m is the length of the pattern
• can be constructed in O(dn) time
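Pattern matching on the suffixes of X can be illustrated with the naive sketch below. It is deliberately not the compact structure with the bounds listed above: it inserts every suffix into an uncompressed trie, which takes O(n^2) time and space, and it is meant only to show why any substring of X is found by walking from the root. The names `SuffixNode`, `build_suffix_trie`, and `find` are assumptions for this example.

```python
class SuffixNode:
    def __init__(self, start):
        self.children = {}
        self.start = start                  # leftmost suffix passing through this node


def build_suffix_trie(x):
    """Naive, uncompressed trie of all suffixes of x."""
    root = SuffixNode(0)
    for start in range(len(x)):             # suffixes in left-to-right order
        node = root
        for ch in x[start:]:
            if ch not in node.children:
                node.children[ch] = SuffixNode(start)
            node = node.children[ch]
    return root


def find(root, pattern):
    """Index of the leftmost occurrence of pattern in x, or -1."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return -1
    return node.start
```

Every substring of X is a prefix of some suffix, so a query only needs to trace the pattern from the root, in a number of steps proportional to its length.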
Application of Tries
• The index of a search engine (collection of all searchable words) is stored in a
compressed trie
• Each leaf of the trie is associated with a word and has a list of pages (URLs)
containing that word, called occurrence list
• The trie is kept in internal memory
• The occurrence lists are kept in external memory and are ranked by relevance
• Boolean queries for sets of words (e.g., Java and coffee) correspond to set operations
(e.g., intersection) on the occurrence lists
• Additional information retrieval techniques are used, such as
• stopword elimination (e.g., ignore “the” “a” “is”)
• stemming (e.g., identify “add” “adding” “added”)
• link analysis (recognize authoritative pages)
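The Boolean query idea can be sketched as an intersection of occurrence lists. Several simplifications are assumed here: a plain dictionary stands in for the trie, pages are identified by made-up integer ids, and each occurrence list is kept sorted by page id rather than by relevance. The names `intersect` and `query_and` are invented for this example.

```python
def intersect(a, b):
    """Intersect two sorted occurrence lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def query_and(index, first, *rest):
    """Pages that contain every one of the given words."""
    result = index.get(first, [])
    for word in rest:
        result = intersect(result, index.get(word, []))
    return result


# Hypothetical occurrence lists (page ids are made up for illustration)
occurrence = {"java": [1, 3, 4, 7], "coffee": [2, 3, 7, 9]}
```

With these lists, the query "Java and coffee" becomes `query_and(occurrence, "java", "coffee")`, which keeps only the pages present in both lists.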
Tries and Internet Routers
• Computers on the internet (hosts) are identified by a unique 32-bit IP (internet
protocol) address, usually written in “dotted-quad-decimal” notation
• E.g., www.cs.brown.edu is 128.148.32.110
• Use nslookup on Unix to find out IP addresses
• An organization uses a subset of IP addresses with the same prefix, e.g., Brown
uses 128.148.*.*, Yale uses 130.132.*.*
• Data is sent to a host by fragmenting it into packets. Each packet carries the IP
address of its destination.
• The internet can be viewed as a graph whose nodes are routers and whose edges are communication links.
• A router forwards packets to its neighbors using IP prefix matching rules. E.g., a
packet with IP prefix 128.148. should be forwarded to the Brown gateway router.
• Routers use tries on the alphabet {0, 1} to do prefix matching.
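Prefix matching on a binary trie can be sketched as below. This is a simplified illustration rather than how production routers are built; the class name `RouteTrie`, the helper `ip_to_bits`, and the next-hop names in the tests and usage note are hypothetical.

```python
def ip_to_bits(ip):
    """Convert dotted-quad notation to a 32-character string of 0s and 1s."""
    return "".join(f"{int(part):08b}" for part in ip.split("."))


class RouteTrie:
    def __init__(self):
        self.root = {}                      # keys "0" and "1" lead to children

    def add(self, prefix_ip, prefix_len, next_hop):
        """Register a rule for the first prefix_len bits of prefix_ip."""
        node = self.root
        for bit in ip_to_bits(prefix_ip)[:prefix_len]:
            node = node.setdefault(bit, {})
        node["hop"] = next_hop              # "hop" cannot clash with "0" or "1"

    def lookup(self, ip):
        """Return the next hop of the longest matching prefix, or None."""
        node = self.root
        best = node.get("hop")
        for bit in ip_to_bits(ip):
            node = node.get(bit)
            if node is None:
                break
            best = node.get("hop", best)    # remember the deepest rule seen so far
        return best
```

A rule for 128.148.*.* is added as `add("128.148.0.0", 16, "brown-gateway")`; a lookup for 128.148.32.110 then follows at most 32 bits down the trie and returns the next hop of the longest prefix that matches.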