String Searching Algorithm
Advisor: Prof. 黃三益
Team members: 9142639 蔡嘉文
9142642 高振元
9142635 丁康迪
Outline:
The Naive Algorithm
The Knuth-Morris-Pratt Algorithm
The SHIFT-OR Algorithm
The Boyer-Moore Algorithm
The Boyer-Moore-Horspool Algorithm
The Karp-Rabin Algorithm
Conclusion
Preliminaries:
n: the length of the text
m: the length of the pattern (string)
c: the size of the alphabet
Cn: the expected number of comparisons
performed by an algorithm while searching
the pattern in a text of length n
The Naive Algorithm
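naivesearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[] ;
int n, m ;
{
int i, j, k, lim ; lim=n-m+1 ;
for (i=1 ; i<=lim ; i++) /* search: try every alignment i */
{
k=i ;
for (j=1 ; j<=m && text[k]==pat[j]; j++) k++; /* compare left to right */
if (j>m) Report_match_at_position(i); /* all m characters matched */
}
}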
The Naive Algorithm (cont.)
The idea is to try to match every
substring of length m in the text against the
pattern.
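The idea above can be sketched as a self-contained C function. This is a minimal illustration, not the slide's listing: it uses 0-based C strings instead of the slide's 1-based arrays, and the name naive_search and its output-array interface are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the 0-based start of each occurrence of pat
   in text into out[] (at most max_out entries are stored) and returns the
   total number of occurrences found. */
size_t naive_search(const char *text, const char *pat,
                    size_t *out, size_t max_out)
{
    size_t n = strlen(text), m = strlen(pat), count = 0;
    if (m == 0 || m > n) return 0;
    for (size_t i = 0; i + m <= n; i++) {      /* every alignment */
        size_t j = 0;
        while (j < m && text[i + j] == pat[j]) /* compare left to right */
            j++;
        if (j == m) {                          /* all m characters matched */
            if (count < max_out) out[count] = i;
            count++;
        }
    }
    return count;
}
```

Each of the n-m+1 alignments may cost up to m comparisons, so the worst case is proportional to m times n.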
The Knuth-Morris-Pratt Algorithm
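kmpsearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
int j, k ;
int next[Max_Pattern_Size];
initnext(pat, m+1, next); /* preprocess pattern: build the next table */
j=k=1 ;
do{ /* search */
if (j==0 || text[k]==pat[j] ) { k++; j++; } /* advance in text and pattern */
else j=next[j] ; /* mismatch: shift the pattern, keep k */
if (j>m) Report_match_at_position(k-m);
/* assumption: pat[m+1] holds a character that does not occur in the text */
} while (k<=n);
}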
The Knuth-Morris-Pratt Algorithm (cont.)
The pattern is preprocessed to obtain a table that
gives the next position in the pattern to be
processed after a mismatch, so the search never
moves backward in the text.
Ex:
position:  1  2  3  4  5  6  7  8  9 10 11
pattern:   a  b  r  a  c  a  d  a  b  r  a
next[j]:   0  1  1  0  2  0  2  0  1  1  0
text:      a  b  r  a  c  a  f  ……………
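In the example, the first mismatch is at position 7 (pattern d, text f). Since next[7] = 2, the search resumes by comparing pattern position 2 with the same text character. The sketch below shows one way such a table can be built. The name build_next and its interface are assumptions for illustration; the slide's own initnext routine is not shown in the deck. Positions are 1-based as in the table, so next[0] is unused.

```c
#include <string.h>

/* Hypothetical helper: fills next[1..m] for the NUL-terminated pattern pat.
   next must have room for m+1 ints; the slide's pat[i] is pat[i-1] here. */
void build_next(const char *pat, int *next)
{
    int m = (int)strlen(pat);
    int i = 1, j = 0;
    if (m == 0) return;
    next[1] = 0;
    while (i < m) {
        if (j == 0 || pat[i - 1] == pat[j - 1]) {
            i++; j++;
            /* If both positions hold the same character, retrying position j
               would fail the same way, so reuse next[j] instead. */
            next[i] = (pat[i - 1] != pat[j - 1]) ? j : next[j];
        } else {
            j = next[j];
        }
    }
}
```

For the pattern abracadabra this reproduces the next[j] row of the table.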
The Shift-Or Algorithm
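The main idea is to represent the state of the
search as a number:
$$\mathit{state} = S_1 \cdot 2^0 + S_2 \cdot 2^1 + \dots + S_m \cdot 2^{m-1}$$
For every symbol x of the alphabet, a table entry
is precomputed:
$$T_x = \delta(pat_1 = x) \cdot 2^0 + \delta(pat_2 = x) \cdot 2^1 + \dots + \delta(pat_m = x) \cdot 2^{m-1}$$
where $\delta(C)$ is 0 if the condition C is true, and
1 otherwise.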
The Shift-Or Algorithm (cont.)
Ex: Let {a, b, c, d} be the alphabet and ababc the
pattern.
T[a] = 11010, T[b] = 10101, T[c] = 01111, T[d] = 11111
The initial state is 11111.
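The table T can be computed in one pass over the pattern. The following is a minimal sketch under two assumptions: 8-bit characters, and a pattern no longer than the number of bits in an unsigned int. The name build_masks is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: T[x] has bit i-1 clear exactly when pat[i] == x
   (1-based i), matching the convention that delta is 0 for "true".
   Only the low m bits of each entry are meaningful. */
void build_masks(const char *pat, unsigned T[256])
{
    size_t m = strlen(pat);
    for (int x = 0; x < 256; x++)
        T[x] = ~0u;                              /* no position matches yet */
    for (size_t i = 0; i < m; i++)
        T[(unsigned char)pat[i]] &= ~(1u << i);  /* clear the bit for pat[i+1] */
}
```

For the pattern ababc, the low five bits of T['a'], T['b'], T['c'] and T['d'] are 11010, 10101, 01111 and 11111, as in the example.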
The Shift-Or Algorithm (cont.)
Pattern: ababc
Text: a b d a b a b c
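The slide does not spell out the update rule. In the standard formulation of the algorithm, each text character x updates the state as state = (state << 1) | T[x], and a match ends at the current character when bit m-1 of the state is 0. For this pattern and text, the low five bits of the state after each character are 11110, 11101, 11111, 11110, 11101, 11010, 10101, 01111, so the match is reported at the final c. A minimal sketch follows; the name shift_or_search is hypothetical, and 8-bit characters are assumed.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the 0-based start of the first occurrence
   of pat in text, or -1. Requires 1 <= strlen(pat) <= bits in unsigned long. */
long shift_or_search(const char *text, const char *pat)
{
    unsigned long T[256], state = ~0UL;          /* initial state: all ones */
    size_t m = strlen(pat);
    if (m == 0 || m > sizeof(unsigned long) * CHAR_BIT) return -1;
    for (int x = 0; x < 256; x++) T[x] = ~0UL;
    for (size_t i = 0; i < m; i++)
        T[(unsigned char)pat[i]] &= ~(1UL << i);
    for (size_t k = 0; text[k] != '\0'; k++) {
        state = (state << 1) | T[(unsigned char)text[k]];
        if ((state & (1UL << (m - 1))) == 0)     /* bit m-1 clear: full match */
            return (long)(k + 1 - m);
    }
    return -1;
}
```

Because each step is one shift and one OR on a machine word, the search cost per text character is constant when the pattern fits in a word.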
Ex (T = text, P = pattern; the alignment figures were not recovered):
T: xyxabraxyzabracadabra
P: abracadabra
T: xyzabraxyzabracadabra
P: abracadabra
The Karp-Rabin Algorithm
Use hashing
Compute the signature function of
each possible m-character substring of the text
Check whether it is equal to the signature
function of the pattern
Signature function: h(k) = k mod q, where q is a
large prime
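What makes this efficient is that the signature of one substring can be obtained from the signature of the previous one in constant time. A worked form of that update is given here, assuming the substring starting at text position i, written s_i, is read as an m-digit number in radix d (these symbols are introduced for illustration; the listing that follows uses d = 2^D):

```latex
h(s_i) = \Bigl( \sum_{j=1}^{m} \mathit{text}[i+j-1] \, d^{\,m-j} \Bigr) \bmod q ,
\qquad
h(s_{i+1}) = \Bigl( \bigl( h(s_i) - \mathit{text}[i] \, d^{\,m-1} \bigr) \, d + \mathit{text}[i+m] \Bigr) \bmod q .
```

The leading character is removed, the remaining digits are shifted one place, and the new character is appended, all modulo q.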
The Karp-Rabin Algorithm (cont.)
rksearch( text, n, pat, m ) /* Search pat[1..m] in text[1..n] */
char text[], pat[]; /* (0 < m <= n) */
int n, m;
{
int h1, h2, dM, i, j; /* D = log2 of the radix d, Q = a large prime */
dM = 1;
for( i=1; i<m; i++ ) dM = (dM << D) % Q; /* dM = d^(m-1) mod Q */
h1 = h2 = 0; /* Compute the signature */
for( i=1; i<=m; i++ ) /* of the pattern and of */
{ /* the beginning of the text */
h1 = ((h1 << D) + pat[i] ) % Q;
h2 = ((h2 << D) + text[i] ) % Q;
}
The Karp-Rabin Algorithm (cont.)
for( i = 1; i <= n-m+1; i++ ) /* Search */
{
if( h1 == h2 ) /* Potential match */
{
for(j=1; j<=m && text[i-1+j] == pat[j]; j++ ); /* check */
if( j > m ) /* true match */
Report_match_at_position( i );
}
h2 = (h2 + (Q << D) - text[i]*dM ) % Q; /* update the signature */
h2 = ((h2 << D) + text[i+m] ) % Q; /* of the text */
}
}
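For comparison, here is a self-contained sketch of the same scheme in 0-based C. It is not the slide's code: the name rk_search, the radix 2^8 and the prime 1000000007 are assumptions chosen for illustration, 64-bit arithmetic is used to avoid overflow, and the loop stops before reading past the end of the text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RK_D 8                 /* assumed radix d = 2^8 */
#define RK_Q 1000000007ULL     /* assumed modulus: a large prime */

/* Hypothetical helper: returns the 0-based start of the first occurrence
   of pat in text, or -1 if there is none. */
long rk_search(const char *text, const char *pat)
{
    size_t n = strlen(text), m = strlen(pat);
    uint64_t h1 = 0, h2 = 0, dM = 1;
    if (m == 0 || m > n) return -1;
    for (size_t i = 1; i < m; i++)
        dM = (dM << RK_D) % RK_Q;                /* d^(m-1) mod q */
    for (size_t i = 0; i < m; i++) {             /* initial signatures */
        h1 = ((h1 << RK_D) + (unsigned char)pat[i]) % RK_Q;
        h2 = ((h2 << RK_D) + (unsigned char)text[i]) % RK_Q;
    }
    for (size_t i = 0; ; i++) {
        /* Equal signatures are only a potential match: verify it. */
        if (h1 == h2 && memcmp(text + i, pat, m) == 0)
            return (long)i;
        if (i + m >= n) break;                   /* no further window */
        /* Remove text[i]; adding q * 2^D keeps the value non-negative. */
        h2 = (h2 + (RK_Q << RK_D) - (unsigned char)text[i] * dM) % RK_Q;
        /* Shift one digit and append text[i+m]. */
        h2 = ((h2 << RK_D) + (unsigned char)text[i + m]) % RK_Q;
    }
    return -1;
}
```

The explicit comparison after a signature hit is what keeps the result correct even when two different substrings share a signature.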
Conclusions
Test: random patterns, random text, and English
text
Best: The Boyer-Moore-Horspool Algorithm
Drawback: preprocessing time and space (depend
on alphabet/pattern size)
Small pattern: The Shift-Or Algorithm
Large alphabet: The Knuth-Morris-Pratt Algorithm
Others: The Boyer-Moore Algorithm
Patterns with "don't care" symbols: The Shift-Or Algorithm