Unit 3-Pattern Matching
String Matching Problem
Motivations: text-editing, pattern matching in DNA sequences
Overlapping Suffix Lemma
String Matching Algorithms
Naive Algorithm
Naive String Matching
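The naive method tries every possible shift and compares the pattern with the text window at that shift. A minimal Python sketch is given here for orientation; the function name naive_match and the 0-based shift convention are assumptions made for this illustration, not part of the original slides.

```python
def naive_match(text, pattern):
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible alignments of the pattern.
    for s in range(n - m + 1):
        # Compare the window text[s:s+m] with the pattern, symbol by symbol.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("gtgatcagatcact", "tca"))
```

Each of the n - m + 1 windows may cost up to m symbol comparisons, so the worst case is O((n - m + 1) m).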
String Matching Algorithms
Rabin-Karp
Rabin-Karp Algorithm
• The Rabin-Karp string-searching algorithm computes a numerical (hash) value for the
pattern p and for each m-character substring of the text t.
• It then compares these numerical values instead of comparing the actual
symbols.
• The algorithm slides the pattern along the text one position at a time and compares
the hash value of the pattern with the hash value of the current substring.
• If the hash values match, it compares the pattern with the substring symbol by
symbol (the naive approach).
• Otherwise it shifts to the next substring of t to compare with p.
• Hashing converts each string to a numeric value, so most windows are rejected with a
single number comparison, which speeds up matching.
• The algorithm exploits the fact that if two strings are equal then their hash values
are also equal.
• Thus, the string matching is reduced to computing the hash value of the search
pattern and then looking for substring with that hash value.
Rabin-Karp (1987)
• Consider (sub)strings as numbers. Characters in a string correspond to digits in a
number written in radix-d notation (where d = |Σ|).
Compute the remaining values $t_1, \dots, t_{n-m}$ in O(n-m) time:
$t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
If the pattern were 1000 characters long (with d = 10), we would need to multiply by
$d^{m-1} = 10^{999}$, a number far too large for a machine word (integer overflow).
Therefore, reduce every value modulo a prime number q (e.g. q = 113); each hash value
is then always less than 113.
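The update rule and the effect of the modulus can be checked on a small case. The sketch below assumes decimal digits (d = 10), a window of m = 5 digits, and q = 13; the digit string "314152" and all variable names are illustrative choices, and the list index T[0] plays the role of T[1] in the slides' 1-based notation.

```python
d, m, q = 10, 5, 13               # radix, window length, prime modulus (assumed values)
T = [int(c) for c in "314152"]    # text as a list of digits; T[0] is the slides' T[1]

# Exact value of the first window, built digit by digit (Horner's rule).
t0 = 0
for digit in T[:m]:
    t0 = d * t0 + digit           # ends at 31415

# Exact update: remove the high-order digit, shift, append the next digit.
t1 = d * (t0 - d ** (m - 1) * T[0]) + T[m]    # 14152

# The same update carried out modulo q keeps every value below q.
h = pow(d, m - 1, q)              # d^(m-1) mod q
t0_mod = t0 % q
t1_mod = (d * (t0_mod - h * T[0]) + T[m]) % q  # Python's % never returns a negative value

print(t0, t1)            # exact window values
print(t0_mod, t1_mod)    # reduced values, both less than q
```

The reduced update gives the same result as reducing the exact value, which is why only small numbers ever need to be stored.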
Rabin-Karp
Example
• Example (1):
• Input: T = gtgatcagatcact, P = tca
• Output: Yes. P occurs at shifts 4 and 9: gtga[tca]ga[tca]ct
• Example (2):
• Input: T = 189342670893, P = 1673
• Output: No.
Rabin-Karp Algorithm
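• Assume each character is a digit in radix-d notation (e.g. d = 10).
• p = decimal value of the pattern P[1..m]
• $t_s$ = decimal value of the substring T[s+1..s+m], for s = 0, 1, ..., n-m
• Strategy: compare p and $t_s$ modulo a suitable number q instead of as full numbers.
• Example: 31415 mod 13 = 7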
• New recurrence formula:
• $t_{s+1} = (d\,(t_s - h\,T[s+1]) + T[s+m+1]) \bmod q$,
• where $h = d^{m-1} \bmod q$.
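• q is chosen to be a prime, typically one for which d·q still fits in a machine word,
so that every intermediate value stays small; a prime modulus also helps keep spurious hits rare.
• The comparison is not perfect: $t_s \equiv p \pmod q$ does not imply $t_s = p$, so a
spurious hit is possible (see next slide).
• So, whenever the values agree modulo q, we still need a naive symbol-by-symbol
comparison to confirm the match.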
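The points above can be put together in a short Python sketch. It is one possible arrangement rather than the slides' own pseudocode: the function name rabin_karp, the use of ord() to turn characters into digits, the radix d = 256, and the 0-based shifts are all assumptions; q = 113 is the example modulus mentioned earlier.

```python
def rabin_karp(text, pattern, d=256, q=113):
    """Return all 0-based shifts where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)              # d^(m-1) mod q
    p = t = 0
    for i in range(m):                # hash of the pattern and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only give a candidate; verify to rule out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                 # roll the hash forward to the next window
            t = (d * (t - h * ord(text[s])) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("gtgatcagatcact", "tca"))
print(rabin_karp("189342670893", "1673"))
```

Because every hash hit is verified symbol by symbol, the result is correct for any choice of q; a poor choice only causes more spurious hits and therefore more verification work.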
Rabin-Karp Algorithm (continued)
$t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
The comparison is not perfect and may produce a spurious hit (see example below),
so a symbol-by-symbol check is still needed whenever the values agree modulo q.
[Figure: worked example with p = 31415, marking a spurious hit]
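A spurious hit can be reproduced with a few lines of arithmetic. The sketch below uses p = 31415 and q = 13; the second window value, 67399, is an illustrative number chosen here because it has the same residue as p, and it is not taken from the slide text.

```python
q = 13
p = 31415
candidates = [31415, 67399]   # 67399 is an assumed illustrative window value

for t in candidates:
    same_hash = (t % q == p % q)      # what the hash comparison sees
    same_string = (t == p)            # what the verification step sees
    if same_hash and not same_string:
        kind = "spurious hit"
    elif same_hash:
        kind = "valid match"
    else:
        kind = "no hit"
    print(t, t % q, kind)
```

Both windows reduce to 7 modulo 13, but only the first equals the pattern, which is why the verification step cannot be skipped.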
Rabin-Karp Algorithm (continued)
Preprocessing
Finite Automata
Strategy: Build automaton for pattern, then examine each text character once.
The automaton accepts strings ending in P.
At each step, the automaton keeps track of the longest prefix of the pattern that is a suffix of what has been read so far.
source: 91.503 textbook Cormen et al.
String-Matching Automaton
Simulate behavior of string-matching automaton that finds
occurrences of pattern P of length m in T[1..n]
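The simulation can be sketched in Python as follows. This is a simple illustration, not the textbook's pseudocode: the names build_transitions and automaton_match, the dictionary-based transition table, the alphabet taken from the characters that actually occur, and the 0-based shifts are all assumptions. The state is the length of the longest pattern prefix that is a suffix of the text read so far.

```python
def build_transitions(pattern, alphabet):
    """delta[state][ch] = length of the longest prefix of pattern that is a
    suffix of pattern[:state] + ch."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(m, state + 1)
            # Shrink k until pattern[:k] is a suffix of what has been seen.
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text, pattern):
    """Return all 0-based shifts where pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []                     # convention chosen here for the empty pattern
    delta = build_transitions(pattern, set(text) | set(pattern))
    shifts, state = [], 0
    for i, ch in enumerate(text):     # each text character is examined once
        state = delta[state][ch]
        if state == m:                # accepting state: text[:i+1] ends in pattern
            shifts.append(i - m + 1)
    return shifts


print(automaton_match("gtgatcagatcact", "tca"))
```

Matching takes O(n) time once the table exists; this straightforward table construction is slow (roughly O(m^3 |Σ|)) and is kept only because it mirrors the definition directly.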
String-Matching Automaton (continued)
Correctness of matching procedure...