
Unit 3-Pattern Matching

The document discusses various string matching algorithms, including the Naive Algorithm, Rabin-Karp, Finite Automaton, and Knuth-Morris-Pratt, highlighting their preprocessing and matching times. It explains the mechanics of the Rabin-Karp algorithm, which utilizes hashing for efficient substring comparison, and outlines the automaton-based approach for pattern matching. The document concludes with the time complexities associated with each algorithm, emphasizing their efficiency in different scenarios.

Uploaded by

ruthmp.cs22
Copyright
© All Rights Reserved

String Matching Algorithms

Unit 3
String Matching Problem
Motivations: text editing, pattern matching in DNA sequences

[Figure 32.1, Cormen et al.]

Text: array T[1...n]
Pattern: array P[1...m], with m ≤ n
Array element: character from a finite alphabet Σ
Pattern P occurs with shift s in T (0 ≤ s ≤ n-m) if P[1...m] = T[s+1...s+m]
String Matching Algorithms
• Divide running time into preprocessing time and matching time.
• Preprocessing: set up a data structure based on the pattern P.
• Matching: perform the actual matching by comparing characters of T
with P, using the precomputed data structure.
• Naive Algorithm
– Worst-case running time in O((n-m+1) m)
• Rabin-Karp
– Worst-case running time in O((n-m+1) m)
– Better than this on average and in practice
• Finite Automaton-Based
– Worst-case running time in O(n + m|Σ|)
• Knuth-Morris-Pratt
– Worst-case running time in O(n + m)
Notation & Terminology
• Σ* = set of all finite-length strings formed using
characters from alphabet Σ
• Empty string: ε
• |x| = length of string x
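• w is a prefix of x, written w ⊏ x, if x = wy for some string y ∈ Σ*. Example: ab ⊏ abcca
• w is a suffix of x, written w ⊐ x, if x = yw for some string y ∈ Σ*. Example: cca ⊐ abcca
• The prefix and suffix relations are transitive
Overlapping Suffix Lemma
(Lemma 32.1, illustrated by Figure 32.3 in Cormen et al.)
Suppose x, y, and z are strings such that x ⊐ z and y ⊐ z. If |x| ≤ |y|, then x ⊐ y; if |x| ≥ |y|, then y ⊐ x; if |x| = |y|, then x = y.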
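The prefix and suffix relations can be tried out directly on strings. The following is a minimal sketch; the helper names `is_prefix` and `is_suffix` are illustrative and not part of the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """True if x = w + y for some (possibly empty) string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True if x = y + w for some (possibly empty) string y."""
    return x.endswith(w)


x = "abcca"
print(is_prefix("ab", x))    # True
print(is_suffix("cca", x))   # True
print(is_prefix("", x))      # True: the empty string is a prefix of every string

# Transitivity: "a" is a prefix of "ab", and "ab" is a prefix of x.
print(is_prefix("a", "ab") and is_prefix("ab", x) and is_prefix("a", x))  # True

# Overlapping suffixes: "cca" and "ca" are both suffixes of x,
# and the shorter one is a suffix of the longer one.
print(is_suffix("ca", "cca"))  # True
```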
String Matching Algorithms

Naive Algorithm
Naive String Matching

Try every shift s = 0, 1, ..., n-m and compare P[1...m] with T[s+1...s+m] character by character.

Worst-case running time is in Θ((n-m+1)m)

[Figure 32.4, Cormen et al.]
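A minimal sketch of the naive approach, assuming 0-based Python strings, so that a returned shift s means `text[s:s+m] == pattern` (the slides' T[s+1...s+m]). The function name is illustrative.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible shifts.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("gtgatcagatcact", "tca"))  # [4, 9]
```

The worst case arises with inputs such as a text of n copies of `a` and a pattern of m-1 copies of `a` followed by `b`: every shift then needs m comparisons before it fails.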
String Matching Algorithms

Rabin-Karp
Rabin-Karp Algorithm
• The Rabin-Karp string searching algorithm calculates a numerical (hash) value for the
pattern P and for each m-character substring of the text T.
• It then compares the numerical values instead of comparing the actual
symbols.
• The algorithm slides the pattern along the text one position at a time and compares
the pattern's hash value with that of the current substring.
• If the hash values match, it compares the pattern with the substring character by
character (the naive approach), because unequal strings can share a hash value.
• Otherwise it shifts to the next substring of T.
• Hashing converts each string to a numeric value, so most non-matching positions are
rejected with a single numeric comparison.
• The algorithm exploits the fact that if two strings are equal, then their hash values
are also equal.
• Thus string matching is reduced to computing the hash value of the search
pattern and then looking for substrings with that hash value.
Rabin-Karp (1987)
• Consider (sub)strings as numbers. Characters in a string correspond to digits in a
number written in radix-d notation (where d = |Σ|).
• Assume each character is a digit in radix-d notation (e.g. d = 10)
• p = decimal value of pattern P[1..m]
• t_s = decimal value of substring T[s+1..s+m] for s = 0, 1, ..., n-m
• Compute the remaining t_s values in O(n-m) time:
t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]

Check out: "fedc"

We can encode the position of each character by multiplying it by a constant raised to
the power that corresponds to its position (e.g. 10^(m-1) for the first of m digits).
Now H(1234) ≠ H(4321), and likewise for any other permutation of the digits.
Rabin-Karp

If the pattern had 1000 characters, the leading digit would have to be multiplied by 10^999,
a number far too large for a machine integer (integer overflow).
Therefore, reduce the value modulo a prime number (e.g. 113; the hash value is then always
less than 113).
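The two ideas on these slides, position-dependent weights and reduction modulo a prime, can be sketched as follows. The function names are illustrative, and q = 113 is simply the example prime from the slide. Python integers do not overflow, so the modulus here only keeps the values small; in a fixed-width language it is what prevents overflow.

```python
def digit_sum_hash(s: str) -> int:
    """Order-insensitive hash: just add up the digits."""
    return sum(int(ch) for ch in s)


def positional_hash(s: str, d: int = 10) -> int:
    """Weight each digit by a power of d according to its position."""
    m = len(s)
    return sum(int(ch) * d ** (m - 1 - i) for i, ch in enumerate(s))


def positional_hash_mod(s: str, d: int = 10, q: int = 113) -> int:
    """Same weighting, reduced modulo the prime q after every step."""
    value = 0
    for ch in s:
        value = (value * d + int(ch)) % q
    return value


print(digit_sum_hash("1234") == digit_sum_hash("4321"))    # True: permutations collide
print(positional_hash("1234") == positional_hash("4321"))  # False
print(positional_hash_mod("1234"), positional_hash_mod("4321"))  # 104 27
```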
Rabin-Karp
Example

• Example (1):
• Input: T = gtgatcagatcact, P = tca
• Output: Yes. P occurs at gtga[tca]ga[tca]ct, shift = 4, 9

• Example (2):
• Input: T = 189342670893, P = 1673
• Output: No.
Rabin-Karp Algorithm
• Strategy:
– compute p in O(m) time (which is in O(n))
– compute all t_s values in a total of O(n) time
– find all valid shifts s in O(n) time by comparing p with each t_s
Rabin-Karp scheme
• Problem: each number (p and t_s) may be too large to compare in constant time.
• Solution: hash using modular arithmetic with respect to a prime q.

• Example: 31415 mod 13 = 7
• New recurrence formula:
• t_{s+1} = (d (t_s - h T[s+1]) + T[s+m+1]) mod q,
• where h = d^{m-1} mod q.
• q is typically chosen as a prime such that d·q fits in one computer word, so every step stays in single-precision arithmetic.
• The comparison is not perfect: t_s ≡ p (mod q) does not imply t_s = p, so a spurious hit is possible.
• So we need a naive character-by-character check whenever the comparison succeeds in
modular arithmetic.
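The modular recurrence can be checked against direct recomputation. This is a sketch under assumptions: the digit string and window length m = 5 are chosen only for illustration, with d = 10 and q = 13 as in the 31415 mod 13 example.

```python
def window_hash(digits: str, d: int, q: int) -> int:
    """Hash of a whole window, computed digit by digit modulo q."""
    value = 0
    for ch in digits:
        value = (d * value + int(ch)) % q
    return value


def roll(t_s: int, old: int, new: int, h: int, d: int, q: int) -> int:
    """t_{s+1} = (d (t_s - h * old) + new) mod q, with h = d^(m-1) mod q."""
    # Python's % always returns a non-negative result for a positive modulus,
    # so the subtraction needs no extra "+ q" correction here.
    return (d * (t_s - h * old) + new) % q


text, m, d, q = "31415926535", 5, 10, 13
h = pow(d, m - 1, q)            # weight of the high-order digit, modulo q
t = window_hash(text[:m], d, q)
print(t)                        # 7, since 31415 mod 13 = 7

for s in range(len(text) - m):
    t = roll(t, int(text[s]), int(text[s + m]), h, d, q)
    # The rolled value agrees with hashing the new window from scratch.
    assert t == int(text[s + 1:s + 1 + m]) % q
```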
Rabin-Karp Algorithm (continued)
t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]

The comparison is not perfect and may produce a spurious hit.
So we need a naive character-by-character check whenever the comparison succeeds in modular arithmetic.

[Figure: Rabin-Karp example with pattern p = 31415, with one window marked as a spurious hit. Source: 91.503 textbook, Cormen et al.]
Rabin-Karp Algorithm
• Compute p in O(m) time using Horner's rule:
– p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + d P[1]) ... ))
• Compute t_0 similarly from T[1..m] in O(m) time
• Compute the remaining t_s values in O(n-m) time:
– t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]

• Advantage: each new value reuses the previous result instead of being recomputed from scratch.

• Consider the decimal text 43592...: the window 4359 slides one position to the right and becomes 3592.
• 3592 = (4359 - 4*1000)*10 + 2
= 359*10 + 2
= 3590 + 2
= 3592
• General formula: t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1] in
radix d, where t_s is the number corresponding to the
substring T[s+1..s+m]. Note that m is the length of P.
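Horner's rule and the window update can be sketched with exact (non-modular) integers, reusing the decimal example 4359 → 3592. The names `horner_value` and `slide` are illustrative.

```python
def horner_value(digits: list, d: int) -> int:
    """Evaluate the digits as a radix-d number using Horner's rule (O(m) steps)."""
    value = 0
    for digit in digits:
        value = d * value + digit
    return value


def slide(t_s: int, leading: int, incoming: int, d: int, m: int) -> int:
    """t_{s+1} = d (t_s - d^(m-1) * leading) + incoming."""
    return d * (t_s - d ** (m - 1) * leading) + incoming


text = [4, 3, 5, 9, 2]
m, d = 4, 10
t0 = horner_value(text[:m], d)           # value of the first window
t1 = slide(t0, text[0], text[m], d, m)   # drop the leading 4, append the 2
print(t0, t1)  # 4359 3592
```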
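Rabin-Karp Algorithm (continued)
[Pseudocode figure: RABIN-KARP-MATCHER, Cormen et al. Annotations from the slide:]
• d is the radix; q is the modulus
• h = d^{m-1} mod q is the value of the high-order digit position for an m-digit window
• Preprocessing: compute h, p, and t_0
• Matching loop invariant: whenever line 10 of the pseudocode is executed, t_s = T[s+1..s+m] mod q
• On a hash match, compare P[1..m] with T[s+1..s+m] to rule out a spurious hit

Worst-case running time is in Θ((n-m+1)m); average-case running time is in O(n+m)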
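The annotated steps can be put together as follows. This is a sketch under assumptions rather than the textbook pseudocode: characters are mapped to digits with `ord`, the radix defaults to d = 256, the modulus q = 113 is the example prime mentioned earlier in the slides, and shifts are 0-based.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 113):
    """Return (valid_shifts, number_of_spurious_hits)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0

    h = pow(d, m - 1, q)   # weight of the high-order digit, modulo q
    p = t = 0
    # Preprocessing: hash the pattern and the first window with Horner's rule.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    shifts, spurious = [], 0
    for s in range(n - m + 1):
        # Invariant: t is the hash of text[s:s+m].
        if p == t:
            if text[s:s + m] == pattern:   # explicit check rules out spurious hits
                shifts.append(s)
            else:
                spurious += 1
        if s < n - m:
            # Roll the window: remove text[s], append text[s+m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts, spurious


print(rabin_karp_matcher("gtgatcagatcact", "tca")[0])   # [4, 9]
print(rabin_karp_matcher("189342670893", "1673")[0])    # []
```

A larger prime q makes spurious hits rarer, at the cost of larger intermediate values.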


Find the number of spurious hits that occur
during the following pattern matching process
using the Rabin-Karp string matching approach
with modulus 11.
TEXT: 31415926535
PATTERN: 26
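One way to check an answer to this exercise is to hash every window directly, without the rolling update. A sketch, assuming decimal digits (d = 10) and 0-based shifts; the function name is illustrative.

```python
def classify_hits(text: str, pattern: str, q: int):
    """Return (valid, spurious): windows whose value mod q equals the pattern's."""
    m = len(pattern)
    p = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:
            # Equal hash values: a real match only if the digits agree too.
            (valid if window == pattern else spurious).append((s, window))
    return valid, spurious


valid, spurious = classify_hits("31415926535", "26", 11)
print(26 % 11)    # 4
print(valid)      # [(6, '26')]
print(spurious)   # [(3, '15'), (4, '59'), (5, '92')]
```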
String Matching Algorithms

Finite Automata
Finite Automata

[Figure 32.6, Cormen et al.]

Strategy: build an automaton for the pattern, then examine each text character exactly once.

Worst-case running time is in Θ(n) plus the automaton creation time.


String-Matching Automaton
Pattern P = ababaca

The automaton accepts exactly the strings ending in P.

[Figure 32.7: state-transition diagram of the automaton for P. Source: 91.503 textbook, Cormen et al.]


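String-Matching Automaton
Suffix function for P:
σ(x) = length of the longest prefix of P that is a suffix of x

Transition function for state q and input character a:
δ(q, a) = σ(P_q a), where P_q = P[1..q] is the length-q prefix of P

At each step, the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far.
source: 91.503 textbook Cormen et al.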
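The suffix function can be computed by brute force, trying the longest candidate prefix first. A minimal sketch using the pattern ababaca from the slides; the function name and the sample inputs are illustrative.

```python
def suffix_function(pattern: str, x: str) -> int:
    """sigma(x): length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached: k = 0 (the empty prefix) always matches


P = "ababaca"
print(suffix_function(P, ""))        # 0
print(suffix_function(P, "ccaca"))   # 1  ("a")
print(suffix_function(P, "ccab"))    # 2  ("ab")
print(suffix_function(P, "ababab"))  # 4  ("abab")
```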
String-Matching Automaton
Simulate behavior of string-matching automaton that finds
occurrences of pattern P of length m in T[1..n]

assuming automaton has already been created...

worst-case running time of matching is in Θ(n)

source: 91.503 textbook Cormen et al.
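A sketch of the matching phase, assuming the transition table already exists. Here the table is a hypothetical hand-written dictionary for the tiny pattern `ab` over the alphabet {a, b}, and characters missing from the table are assumed to send the automaton back to state 0.

```python
def finite_automaton_matcher(text: str, delta: dict, m: int) -> list:
    """Scan text once; report each 0-based shift at which state m is reached."""
    state = 0
    shifts = []
    for i, ch in enumerate(text):
        # One table lookup per text character, hence Θ(n) matching time.
        state = delta.get((state, ch), 0)
        if state == m:
            shifts.append(i - m + 1)
    return shifts


# Hand-built automaton for the pattern "ab" (states 0, 1, 2; state 2 accepts).
delta_ab = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
print(finite_automaton_matcher("abbabab", delta_ab, 2))  # [0, 3, 5]
```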


String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Worst-case running time of the entire string-matching strategy is in O(m |Σ|) + O(n):
O(m |Σ|) for automaton creation plus O(n) for pattern matching.


String-Matching Automaton
Automaton's operational invariant: at each step, the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far.
source: 91.503 textbook Cormen et al.
String-Matching Automaton (continued)
Correctness of matching procedure (P_q = P[1..q] and T_i = T[1..i] denote prefixes):
• Lemma 32.2 (suffix-function inequality): for any string x and character a, σ(xa) ≤ σ(x) + 1. [Figure 32.8]
• Lemma 32.3 (suffix-function recursion): for any string x and character a, if q = σ(x), then σ(xa) = σ(P_q a). [Figure 32.9; the proof uses Lemmas 32.1 and 32.2]
• Theorem 32.4: if φ is the final-state function of the string-matching automaton for P, then φ(T_i) = σ(T_i) for i = 0, 1, ..., n. [The proof uses Lemma 32.3]
source: 91.503 textbook Cormen et al.


String-Matching Automaton (continued)

source: 91.503 textbook Cormen et al.

Worst-case running time of automaton creation with the straightforward procedure is in O(m^3 |Σ|)

This can be improved to O(m |Σ|), which gives the O(m |Σ|) + O(n) bound
for the entire string-matching strategy.
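A sketch of the straightforward construction: for each state q and character a, find the longest prefix of P that is a suffix of P[1..q] followed by a. The three-letter alphabet {a, b, c} is an assumption made for the example; the pattern ababaca is the one used in the slides.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] for all states q = 0..m and characters a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):            # m + 1 states
        for a in alphabet:            # |alphabet| characters per state
            k = min(m, q + 1)
            # Shrink k until pattern[:k] is a suffix of pattern[:q] + a.
            # Up to m + 1 tries, each comparing up to m characters.
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


delta = compute_transition_function("ababaca", "abc")
print(delta[(5, "b")])  # 4: after "ababa", reading "b" falls back to "abab"
print(delta[(5, "c")])  # 6: after "ababa", reading "c" advances to "ababac"
print(delta[(7, "b")])  # 2: after a full match, reading "b" leaves "ab"
```

The nested loops plus the suffix test are where the cubic factor in m comes from; the improved construction avoids retesting suffixes from scratch.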


The Knuth-Morris-Pratt algorithm
Time complexity: O(m + n)
m: length of the pattern; constructing the π (prefix) table takes O(m) time
n: length of the text; scanning the text takes O(n) time
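A compact sketch of Knuth-Morris-Pratt, assuming 0-based indexing, so `pi[q]` is the length of the longest proper prefix of `pattern[:q+1]` that is also a suffix of it. The function names are illustrative.

```python
def compute_prefix_function(pattern: str) -> list:
    """Build the pi table in O(m) time."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                  # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                  # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list:
    """Return all 0-based shifts of pattern in text in O(n) scanning time."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                  # number of pattern characters matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]                  # reuse what is already known to match
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]                  # keep looking for overlapping matches
    return shifts


print(compute_prefix_function("ababaca"))     # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("gtgatcagatcact", "tca"))   # [4, 9]
```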
