G5 Advanced String Algorithms Lecture (No Code)
String
Algorithms
Substring search
Lecture Outline
● Prerequisites
● Substring Search (The Naive Way)
● Rabin-Karp Algorithm
● Knuth-Morris-Pratt Algorithm
● Applications of Rabin-Karp and Knuth-Morris-Pratt Algorithm
● Additional String Algorithms
● Quote of the Day
Prerequisites
● Math II
● String manipulation in Python
● Time and Space complexity analysis
What is a substring search?
Naive Method
i
String : a b c d a b c d f
1 2 3 4 5 6 7 8 9
j
Pattern : abcdf
1 2 3 4 5
Okay, let’s try again
i
String : a b c d a b c d f
1 2 3 4 5 6 7 8 9
j
Pattern : abcdf
1 2 3 4 5
Failed yet again.
AGAIN !
i
String : a b c d a b c d f
1 2 3 4 5 6 7 8 9
j
Pattern : abcdf
1 2 3 4 5
AGAINNN !!!
i
String : a b c d a b c d f
1 2 3 4 5 6 7 8 9
j
Pattern : abcdf
1 2 3 4 5
Hmm :/
Again ?
i
String : a b c d a b c d f
1 2 3 4 5 6 7 8 9
j
Pattern : abcdf
1 2 3 4 5
Okay, we got somewhere, but how long
did it take us?
O(n*m), where n is the length of the string and m is the length of the pattern
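The walkthrough above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `naive_search` is made up for this example, and it returns a 0-based index (the slides number positions from 1).

```python
def naive_search(text, pattern):
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting position in the text.
    for start in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


print(naive_search("abcdabcdf", "abcdf"))  # 4
```

Every failed attempt throws away all the comparisons already made and restarts one position further, which is where the O(n*m) worst case comes from.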
Practice Problem
For s = “abcad”,
`a` * 26^4 + `b` * 26^3 + `c` * 26^2 + `a` * 26^1 + `d` * 26^0
We need to pick a numeric value to
represent each of the letters above.
Any ideas?
Encoding Strings
`a` = 0
`b` = 1
`c` = 2
`d` = 3
.
.
`z` = 25
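This encoding causes an edge case: with `a` = 0, leading `a`s
contribute nothing, so “b”, “ab” and “aab” all get the same value.
Starting from `a` = 1 (so `z` = 26) avoids this.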
Operation: addLast
let 𝞪 = 26 + 1
“abc” + “x” = ?
“abc” => (1 * 𝞪^2 + 2 * 𝞪^1 + 3 * 𝞪^0)
“x” => (24 * 𝞪^0)
“abc” + “x” => (1 * 𝞪^2 + 2 * 𝞪^1 + 3 * 𝞪^0) * 𝞪 + (24 * 𝞪^0)
“abcx” => 1 * 𝞪^3 + 2 * 𝞪^2 + 3 * 𝞪^1 + 24 * 𝞪^0
Operation: pollFirst
let 𝞪 = 26 + 1
“abcx” => 1 * 𝞪^3 + 2 * 𝞪^2 + 3 * 𝞪^1 + 24 * 𝞪^0
Subtract the first letter's term, 1 * 𝞪^3:
“bcx” => 2 * 𝞪^2 + 3 * 𝞪^1 + 24 * 𝞪^0
For Rabin-Karp, the above two
operations suffice for most cases
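The two operations can be sketched in Python as below. The helper names (`value`, `poly_hash`, `add_last`, `poll_first`) are made up for this illustration, and the sketch assumes lowercase letters encoded as `a` = 1 ... `z` = 26 with base 27. No modulus is applied yet, so the numbers grow without bound.

```python
ALPHA = 27  # base: 26 letters + 1


def value(ch):
    """Map 'a'..'z' to 1..26 so that no letter encodes to 0."""
    return ord(ch) - ord("a") + 1


def poly_hash(s):
    """Polynomial hash of s, most significant character first."""
    h = 0
    for ch in s:
        h = h * ALPHA + value(ch)
    return h


def add_last(h, ch):
    """Hash of s + ch, given h = hash of s."""
    return h * ALPHA + value(ch)


def poll_first(h, first_ch, length):
    """Hash of s without its first character, given h = hash of s and length = len(s)."""
    return h - value(first_ch) * ALPHA ** (length - 1)


h = poly_hash("abc")
h = add_last(h, "x")
print(h == poly_hash("abcx"))  # True
h = poll_first(h, "a", 4)
print(h == poly_hash("bcx"))  # True
```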
Operation: addFirst
let 𝞪 = 26 + 1
“x” + “abc” = ?
“x” => (24 * 𝞪^0)
“abc” => (1 * 𝞪^2 + 2 * 𝞪^1 + 3 * 𝞪^0)
“x” + “abc” => (24 * 𝞪^0) * 𝞪^3 + (1 * 𝞪^2 + 2 * 𝞪^1 + 3 * 𝞪^0)
“xabc” => 24 * 𝞪^3 + 1 * 𝞪^2 + 2 * 𝞪^1 + 3 * 𝞪^0
Most of the time the hash values are very large numbers,
so we need to work with them under mod.
As a result, the last operation is trickier than we made it
look, since it involves division under mod.
TIP: Precompute all 𝞪^k
TIP: Pick a Prime number for modulus.
Typically, 10 ** 9 + 7
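A hedged sketch of how the two tips might look in Python. All names here are invented for the example. It also assumes that the operation needing division under mod is removing the last character (called `poll_last` here, a hypothetical name), and it computes the modular inverse with Fermat's little theorem, which works because the modulus is prime.

```python
MOD = 10**9 + 7  # prime modulus
ALPHA = 27       # base: 26 letters + 1


def value(ch):
    """Map 'a'..'z' to 1..26."""
    return ord(ch) - ord("a") + 1


def precompute_powers(limit):
    """Return [ALPHA^0, ALPHA^1, ..., ALPHA^limit], each reduced mod MOD."""
    powers = [1] * (limit + 1)
    for k in range(1, limit + 1):
        powers[k] = powers[k - 1] * ALPHA % MOD
    return powers


def hash_mod(s):
    """Polynomial hash of s under MOD."""
    h = 0
    for ch in s:
        h = (h * ALPHA + value(ch)) % MOD
    return h


def add_first(h, ch, length, powers):
    """Hash of ch + s, given h = hash of s and length = len(s)."""
    return (value(ch) * powers[length] + h) % MOD


def poll_last(h, last_ch):
    """Hash of s without its last character (assumed operation, needs division)."""
    # Dividing by ALPHA under a prime MOD means multiplying by ALPHA^(MOD-2).
    inv_alpha = pow(ALPHA, MOD - 2, MOD)
    return (h - value(last_ch)) * inv_alpha % MOD


powers = precompute_powers(10)
print(add_first(hash_mod("abc"), "x", 3, powers) == hash_mod("xabc"))  # True
print(poll_last(hash_mod("abcx"), "x") == hash_mod("abc"))  # True
```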
String: abacdabazxywp
pattern: abaz
Rabin-Karp: Demonstration
pattern: abaz
String: abacdabazxywp
First window “abac” => (1 * 𝞪^3 + 2 * 𝞪^2 + 1 * 𝞪^1 + 3 * 𝞪^0)
Rabin-Karp: Demonstration
pattern: abaz
String: abacdabazxywp
pollFirst, then addLast
Practice Problem
Find the index of the first occurrence in a string
Note: If you have to do things under mod given your constraints,
a hash match doesn’t necessarily mean you found the string.
Note: You have to do a string equality check just to be sure.
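Putting the pieces together, a Rabin-Karp search might look like the sketch below. It is an illustration under assumptions, not the lecture's reference solution: the function name is invented, input is assumed to be lowercase letters, and the base 27 and modulus 10 ** 9 + 7 follow the earlier slides. The explicit slice comparison is the string equality check from the note above.

```python
MOD = 10**9 + 7
ALPHA = 27


def rabin_karp(text, pattern):
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    def value(ch):
        return ord(ch) - ord("a") + 1

    high = pow(ALPHA, m - 1, MOD)  # weight of the window's first character
    target = window = 0
    for k in range(m):
        target = (target * ALPHA + value(pattern[k])) % MOD
        window = (window * ALPHA + value(text[k])) % MOD

    for start in range(n - m + 1):
        # A hash match is only a candidate; confirm with a real comparison.
        if window == target and text[start:start + m] == pattern:
            return start
        if start + m < n:
            window = (window - value(text[start]) * high) % MOD       # pollFirst
            window = (window * ALPHA + value(text[start + m])) % MOD  # addLast
    return -1


print(rabin_karp("abacdabazxywp", "abaz"))  # 5
```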
Most people don’t feel confident after writing a
probabilistic algorithm such as Rabin-Karp.
The way to see it: if you can bring the probability of your
algorithm getting it wrong below the probability of the
hardware failing while running your code,
you should be able to submit and still sleep at
night.
Knuth–Morris–Pratt
algorithm
Guaranteed O(n + m) Time
This algorithm was invented by Donald Knuth and Vaughan Pratt,
and independently by James Morris
Key Idea : Take advantage of the successful comparisons
we make between the string and the pattern.
Example
S = adsgwadsdsgwadsgz
P = dsgwadsgz
The KMP algorithm avoids going back in the string S
and reverting our progress in matching the pattern.
So it looks for the longest proper suffix that is also a prefix
in the matched substring before the mismatch
dsgwads
We know the substring `ds` exists in our string S before
the mismatch. Due to this fact, the algorithm finds out
how far it needs to go back in the string P to continue
matching without reverting the progress that was made
In our example, we will jump back to `g` in the string P
and we will not go back in our string S.
dsgwads
Example
S = adsgwadsdsgwadsgz
P = dsgwadsgz
Since no proper suffix of the substring `ds` is also a prefix of it,
we will now go back to the beginning of P
Example
S = adsgwadsdsgwadsgz
P = dsgwadsgz
The algorithm mainly has two parts to achieve this
efficiently.
1. Preprocessing
2. Matching
1. Preprocessing
Prefix: A substring that starts at the beginning of the string. The empty string ("") is a prefix of every string.
● "", "a", "ab", "aba", "abac", "abaca", "abacab" are prefixes of "abacab"
● "", "a", "ab", "aba", "abab", "ababa", "ababab", "abababa" are prefixes of "abababa"
Suffix: A substring that ends at the end of the string. The empty string ("") is a suffix of every string.
● "abacab", "bacab", "acab", "cab", "ab", "b", "" are suffixes of "abacab"
● "abababa", "bababa", "ababa", "baba", "aba", "ba", "a", "" are suffixes of "abababa"
1. Preprocessing
Proper prefix / proper suffix: A prefix or suffix that is not the whole string.
● "", "a", "ab", "aba", "abac", "abaca" are proper prefixes of "abacab"
● "", "a", "ab", "aba", "abab", "ababa", "ababab" are proper prefixes of "abababa"
● "bacab", "acab", "cab", "ab", "b", "" are proper suffixes of "abacab"
● "bababa", "ababa", "baba", "aba", "ba", "a", "" are proper suffixes of "abababa"
Border: A substring that is both a proper prefix and a proper suffix of the string. The length of a border is
sometimes called its width, although that term is rarely used.
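To make the definition concrete, here is a small brute-force sketch that lists every border of a string, longest first. The function name `borders` is made up for this example, and the approach is deliberately simple rather than efficient.

```python
def borders(s):
    """All borders of s, longest first.

    A border is a proper prefix of s that is also a suffix of s,
    so candidate lengths run from len(s) - 1 down to 0.
    """
    return [s[:k] for k in range(len(s) - 1, -1, -1) if s.endswith(s[:k])]


print(borders("abacab"))   # ['ab', '']
print(borders("abababa"))  # ['ababa', 'aba', 'a', '']
```

The length of the first entry is the value the LPS table stores for the whole string.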
The Longest Border Array (also called LPS, π-table, or Prefix Table) is used in multiple algorithms. The
naïve approach to building it takes O(m^3): following the definition directly, for every index we
search for the longest proper prefix that is also a suffix.
for i = 1 to m-1
    for k = 0 to i
        if needle[0..k-1] == needle[i-(k-1)..i]
            longest_border[i] = k
However, by following a greedy approach we can build it in linear time.
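One common way to write the linear-time construction is sketched below. The function name `build_lps` and the variable `k` are choices made for this illustration; the idea is to reuse the border length computed for the previous index instead of searching from scratch.

```python
def build_lps(pattern):
    """lps[i] = length of the longest border of pattern[0..i]."""
    m = len(pattern)
    lps = [0] * m
    k = 0  # length of the longest border of the prefix processed so far
    for i in range(1, m):
        # On a mismatch, fall back to the next shorter border.
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]
        # On a match, the border grows by one character.
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps


print(build_lps("dsgwadsgz"))  # [0, 0, 0, 0, 0, 1, 2, 3, 0]
```

Although there is a nested loop, `k` increases at most once per index and every fallback decreases it, so the total work is O(m).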
1. Preprocessing
d s g w a d s g z
LPS
LPS 0
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0 0
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0 0 0
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0 0 0 0 1
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0 0 0 0 1 2
1. Preprocessing
i j
d s g w a d s g z
LPS 0 0 0 0 0 1 2 3 0
LPS 0 1 2 0 1 2 3
LPS 0 1 2 0 1 2 3 3
d s g w a d s g z
LPS 0 0 0 0 0 1 2 3 0
i
S = adsgwadsdsgwadsgz
j
2. Matching
d s g w a d s g z
LPS 0 0 0 0 0 1 2 3 0
i
S = adsgwadsdsgwadsgz
j
i = LPS[i - 1]
2. Matching
d s g w a d s g z
LPS 0 0 0 0 0 1 2 3 0
i
S = adsgwadsdsgwadsgz
j
MATCH
What is the time complexity of this
Matching process?
Once again it’s linear.
O(length of the text)
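The matching phase can be sketched as below. This is an illustrative version, not the lecture's own code. Following the `i = LPS[i - 1]` step on the slides, `i` is taken to be the pattern index and `j` the text index, which is an assumption about the slide layout. `build_lps` is repeated so the block runs on its own.

```python
def build_lps(pattern):
    """lps[i] = length of the longest border of pattern[0..i]."""
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps


def kmp_search(text, pattern):
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    lps = build_lps(pattern)
    i = 0  # number of pattern characters matched so far
    for j, ch in enumerate(text):  # j only ever moves forward
        # On a mismatch, shorten the match using the LPS table.
        while i > 0 and ch != pattern[i]:
            i = lps[i - 1]
        if ch == pattern[i]:
            i += 1
        if i == len(pattern):
            return j - i + 1
    return -1


print(kmp_search("adsgwadsdsgwadsgz", "dsgwadsgz"))  # 8
```

The text pointer never moves backwards, and each fallback undoes an earlier increment of `i`, which is why the matching phase is linear in the length of the text.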
Rotate String
Efficiency of the KMP algorithm
● Since the two portions of the algorithm have, respectively, complexities
of O(m) and O(n), the complexity of the overall algorithm is O(m + n).
● These complexities are the same, no matter how many repetitive
patterns are in P or S.
Applications of RK and KMP
● Spell Checker
● Plagiarism Detection
● Text Editors
● Spam Filters
● Digital Forensics
● Matching DNA Sequences
● Intrusion Detection
● Search Engines
● Bioinformatics and Cheminformatics
● Information Retrieval System
● Language Syntax Checker
Additional String
Algorithms
Z Algorithm
Manacher's Algorithm
● can be used to count all pairs (i, j) such that substring s[i…j] is a palindrome in linear time.
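Counting palindromic substrings in linear time is commonly done with Manacher's algorithm. The sketch below is one standard formulation, not taken from the lecture. The function name is invented, and `#` is used as a separator on the assumption that it does not occur in the input.

```python
def count_palindromic_substrings(s):
    """Count pairs (i, j) with i <= j such that s[i..j] is a palindrome."""
    # Interleave separators so that even and odd palindromes look alike.
    t = "#" + "#".join(s) + "#"
    n = len(t)
    radius = [0] * n  # radius[c] = palindrome radius around center c in t
    center = right = 0  # rightmost palindrome found so far is t[..right]
    for c in range(n):
        if c < right:
            # Reuse the radius of the mirrored center, clipped to the known range.
            radius[c] = min(right - c, radius[2 * center - c])
        # Expand as far as the characters keep matching.
        while (c - radius[c] - 1 >= 0 and c + radius[c] + 1 < n
               and t[c - radius[c] - 1] == t[c + radius[c] + 1]):
            radius[c] += 1
        if c + radius[c] > right:
            center, right = c, c + radius[c]
    # A radius r in t corresponds to (r + 1) // 2 palindromes of s at that center.
    return sum((r + 1) // 2 for r in radius)


print(count_palindromic_substrings("aaa"))   # 6
print(count_palindromic_substrings("abba"))  # 6
```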
Suffix Array