Text Processing (Complete)
Pattern/String Matching
Teh Je Sen (2018)
Learning Outcomes
• In this topic we will learn about
  – Pattern/string matching
    • Brute force
    • Boyer-Moore
    • Knuth-Morris-Pratt
Substring:
Any string that occurs in a larger string
A;wiejr;ajeonnv;aknsdg;aijwe;rija;dsfad
Prefix:
A substring at the beginning of a string
Suffix:
A substring at the end of a string
Knuth-Morris-Pratt Algorithm
Trie Data Structure
Brute-force matching example (n = 11, m = 3)
Text ("_" marks the space at index 5):
index  0 1 2 3 4 5 6 7 8 9 10
char   H e l l o _ t h e r e
Pattern: t h e
i = 0, k = 0: H vs t, no match
i = 1, k = 0: e vs t, no match
i = 2, k = 0: l vs t, no match
i = 3, 4, 5 (k = 0): no match
i = 6, k = 0: t vs t, match
i = 6, k = 1: h vs h, match
i = 6, k = 2: e vs e, match
The pattern is found at i = 6.
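The walkthrough above can be sketched in Python. This is a minimal illustration, not the course's reference code; the function name find_brute is an assumption, while i, k, n and m follow the slide's variables.

```python
def find_brute(text: str, pattern: str) -> int:
    """Return the lowest index i at which pattern starts in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):      # every candidate start position in the text
        k = 0                       # index into the pattern
        while k < m and text[i + k] == pattern[k]:
            k += 1                  # characters agree, compare the next pair
        if k == m:                  # all m pattern characters matched
            return i
    return -1                       # no alignment matched


print(find_brute("Hello there", "the"))  # 6
```

In the worst case every one of the n - m + 1 alignments compares up to m characters, which is what the later algorithms try to avoid.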
Boyer-Moore uses two heuristics that work as a team:
Looking-Glass Heuristic
Character-Jump Heuristic
Last-occurrence table (exercise):
char   a   b   c   d   e
last
Solution:
char   a   b   c   d   e
last   3   5   4   6   -1
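A last-occurrence table like the one in the solution can be built in a single pass over the pattern. The pattern itself is not recoverable from the slide, so the string "abaacbd" below is a hypothetical 7-character pattern chosen only because it reproduces the solution values; the function name last_occurrence is also an assumption.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict:
    """Map each character to the index of its last occurrence in pattern (-1 if absent)."""
    last = {c: -1 for c in alphabet}        # characters not in the pattern stay at -1
    for index, c in enumerate(pattern):
        last[c] = index                     # later occurrences overwrite earlier ones
    return last


pattern = "abaacbd"                         # hypothetical pattern, not taken from the slides
table = last_occurrence(pattern, "abcde")
print(table)  # {'a': 3, 'b': 5, 'c': 4, 'd': 6, 'e': -1}
```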
What are the corresponding variables?
n =
m =
i =
k =
Boyer-Moore trace (n = 15, m = 7):
Start (compare from the end of the pattern): i = 6, k = 6
Match, move left: i = 5, k = 5
Match, move left: i = 4, k = 4
Mismatch at i = 4, k = 4 (the text character is e):
  i = i + m − min(k, 1 + last.get(e))
    = 4 + 7 − min(4, 1 + (−1))
    = 11
  k = m − 1 = 6
Resume comparing at i = 11, k = 6
Complete the rest of it yourself to verify the correctness of the algorithm.
Continue search here
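The jump rule used in the trace, i = i + m − min(k, 1 + last.get(c)), can be put into a small Python sketch of the simplified Boyer-Moore search (looking-glass comparison plus character jump). The function name find_boyer_moore is an assumption; i, k, n, m and last follow the slide's names.

```python
def find_boyer_moore(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern starts in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = {}                               # last occurrence of each pattern character
    for index, c in enumerate(pattern):
        last[c] = index
    i = m - 1                               # index into the text
    k = m - 1                               # index into the pattern (start at its end)
    while i < n:
        if text[i] == pattern[k]:
            if k == 0:
                return i                    # whole pattern matched, starting at i
            i -= 1                          # looking-glass: keep comparing right to left
            k -= 1
        else:
            j = last.get(text[i], -1)       # -1 when the character is not in the pattern
            i += m - min(k, j + 1)          # character jump
            k = m - 1                       # restart at the end of the pattern
    return -1


print(find_boyer_moore("Hello there", "the"))  # 6
```

With i = 4, k = 4, m = 7 and a mismatched character that does not occur in the pattern, the jump line evaluates to 4 + 7 − min(4, 0) = 11, which agrees with the trace.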
Example string: abacab
l           0 1 2 3 4 5
pattern[l]  a b a c a b

Rule on a match (pattern[j] = pattern[k]): fail(j) = k + 1; j++; k++
Rule on a mismatch: if k > 0, k = fail(k − 1); else j++

Start with fail(0) = 0, j = 1, k = 0.

Step  j  k  Comparison         Action                      fail(0..5) after the step
1     1  0  b vs a, mismatch   k = 0, so j++               0 0 0 0 0 0
2     2  0  a vs a, match      fail(2) = 1; j++; k++       0 0 1 0 0 0
3     3  1  c vs b, mismatch   k > 0, so k = fail(0) = 0   0 0 1 0 0 0
4     3  0  c vs a, mismatch   k = 0, so j++               0 0 1 0 0 0
5     4  0  a vs a, match      fail(4) = 1; j++; k++       0 0 1 0 1 0
6     5  1  b vs b, match      fail(5) = 2; j++; k++       0 0 1 0 1 2
Recall:
𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 1 2
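The two update rules for the failure function translate almost line for line into Python. The sketch below also shows one common way the table is then used during matching; the function names compute_fail and find_kmp are assumptions, and the matching loop is a standard formulation rather than something spelled out on the slides.

```python
def compute_fail(pattern: str) -> list:
    """fail[j] = length of the longest proper prefix of pattern[0..j] that is also its suffix."""
    m = len(pattern)
    fail = [0] * m
    j, k = 1, 0
    while j < m:
        if pattern[j] == pattern[k]:    # match: extend the current prefix
            fail[j] = k + 1
            j += 1
            k += 1
        elif k > 0:                     # mismatch: fall back to a shorter prefix
            k = fail[k - 1]
        else:                           # mismatch with nothing to fall back to
            j += 1
    return fail


def find_kmp(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern starts in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    fail = compute_fail(pattern)
    j = k = 0                           # j indexes the text, k the pattern
    while j < n:
        if text[j] == pattern[k]:
            if k == m - 1:
                return j - m + 1        # match ends at j, so it starts m - 1 earlier
            j += 1
            k += 1
        elif k > 0:
            k = fail[k - 1]             # reuse the prefix that is already matched
        else:
            j += 1
    return -1


print(compute_fail("abacab"))  # [0, 0, 1, 0, 1, 2]
```

The text index j never moves backwards, which is the main difference from the brute-force approach.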
Trie example (figure): node labels use the characters m, i, n, z, e
Height of tree = 5
8 leaves
Compress this trie!
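Compressing a trie means merging every chain of single-child nodes into one edge labelled with a substring. The words stored in the slide's trie are not recoverable, so the sketch below uses a hypothetical word list; it also assumes, as a standard trie does, that no word is a prefix of another. All function names are assumptions.

```python
def build_trie(words):
    """Standard trie as nested dicts: one edge per character, a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # reuse the child if it exists, else create it
    return root


def compress(node):
    """Merge each chain of single-child nodes into one edge with a substring label."""
    compressed = {}
    for label, child in node.items():
        while len(child) == 1:               # exactly one way forward: absorb it
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


def count_leaves(node):
    return 1 if not node else sum(count_leaves(c) for c in node.values())


def height(node):
    return 0 if not node else 1 + max(height(c) for c in node.values())


words = ["mine", "mini", "zen"]              # hypothetical words, not the ones on the slide
trie = build_trie(words)
print(height(trie), count_leaves(trie))      # 4 3
print(compress(trie))                        # {'min': {'e': {}, 'i': {}}, 'zen': {}}
```

Compression leaves the number of leaves unchanged (one per word) and only removes internal nodes that have a single child.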
Example
Definition: Coding
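Figure: binary code tree with edges labelled 0 and 1 and leaves h, e, l, o, w, a, r, y, u, _ ("_" is the space)
The edge labels on the path from the root to a leaf give that character's code, for example h = 0000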
Fixed Code - Example
String: “hello how are you”
Number of characters: 17

Char  ASCII    Code  Freq
h     1101000  0000  2
e     1100101  0001  2
l     1101100  0010  2
o     1101111  0011  3
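The fixed-length code in the table can be reproduced with a short sketch. It assumes 7-bit ASCII, as in the ASCII column, and numbers the ten distinct characters in the leaf order of the code tree (h, e, l, o, w, a, r, y, u, space); the variable names are assumptions.

```python
from collections import Counter

text = "hello how are you"
alphabet = "helowaryu "                       # leaf order of the code tree; last one is the space

# Smallest number of bits that can tell 10 symbols apart: 4
bits_per_char = (len(alphabet) - 1).bit_length()
codes = {ch: format(i, f"0{bits_per_char}b") for i, ch in enumerate(alphabet)}

freq = Counter(text)                          # character frequencies, e.g. freq["o"] == 3
ascii_bits = 7 * len(text)                    # 17 characters * 7 bits = 119
fixed_bits = bits_per_char * len(text)        # 17 characters * 4 bits = 68
encoded = "".join(codes[ch] for ch in text)

print(codes["h"], codes["e"], codes["l"], codes["o"])  # 0000 0001 0010 0011
print(ascii_bits, fixed_bits)                          # 119 68
```

A fixed code spends the same number of bits on every character regardless of its frequency, which is the limitation Huffman coding addresses.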
Huffman Coding
(119 − 51) / 119 = 0.571
Superior method