
String Search Algorithm

The document defines key terms used in exact string searching algorithms, including patterns, symbols, alphabets, and string matching. It then describes three exact string searching algorithms: (1) the naive string search algorithm, which has average time complexity of O(n+m) but worst case of O(nm); (2) the Knuth-Morris-Pratt algorithm, which improves on naive search with O(n) time complexity; and (3) the Boyer-Moore algorithm, which has average time complexity of O(n/m).


COSC 348: Computing for Bioinformatics
Lecture 4: Exact string searching algorithms
Lubica Benuskova
http://www.cs.otago.ac.nz/cosc348/

Definitions

• A pattern (keyword) is an ordered sequence of symbols.
• Symbols of the pattern and the searched text are chosen from a predetermined finite set, called an alphabet (Σ).
  – In general, an alphabet can be any finite set of symbols/letters.
• In bioinformatics:
  – DNA alphabet Σ = {A,C,G,T}
  – RNA alphabet Σ = {A,C,G,U}
  – protein alphabet Σ = {A,R,N,…,V} (20 amino acids)

Exact string searching or matching

• Much of data processing in bioinformatics involves, in one way or another, recognising certain patterns within DNA, RNA (1st assignment) or protein sequences.
• String-matching consists of finding one, more, or generally all occurrences of a string of length m (called a pattern or keyword) within a text of total length n characters.
• An example of an exact string search (match):
  – Pat: EXAMPLE
  – Txt: HERE IS A SIMPLE EXAMPLE

Exact string search algorithms

Algorithm                                Preprocessing time      Matching time
Naïve string search (brute force)        0 (no preprocessing)    average O(n+m), worst O(nm)
Knuth-Morris-Pratt algorithm             O(m)                    O(n)
Boyer-Moore algorithm                    O(m + |Σ|)              average O(n/m), worst O(n)
Rabin-Karp algorithm                     O(m)                    average O(n+m), worst O(nm)
Aho-Corasick algorithm (suffix trees)    O(n)                    O(m + z)
(z = number of matches)

• 35 algorithms with code at http://www-igm.univ-mlv.fr/~lecroq/string/

Naïve string search (brute force)

• The most intuitive way is to slide a window of length m (the pattern) over the text (of length n) from left to right, one letter at a time.
• Within the window, compare successive characters:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• If there is no copy of the whole pattern in the first m characters of the text, we look for a copy of the pattern starting at the second character of the text:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• If there is no copy of the pattern starting at the second character of the text, we look for a copy of the pattern starting at the third character of the text, and so forth:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• ... until we hit a match; then we continue in the same way along the text and count the number of matches.

  txt: ABCABCDABABCDABCDABDE
  pat: BCD
  Match !

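A minimal Python sketch of this brute-force scan (not from the lecture; the function name naive_search and the variable names are illustrative):

    def naive_search(txt, pat):
        """Return the starting indices of every occurrence of pat in txt."""
        n, m = len(txt), len(pat)
        matches = []
        for i in range(n - m + 1):                     # slide the window one letter at a time
            j = 0
            while j < m and txt[i + j] == pat[j]:      # compare characters within the window
                j += 1
            if j == m:                                 # the whole pattern matched
                matches.append(i)
        return matches

    print(naive_search("ABCABCDABABCDABCDABDE", "BCD"))   # [4, 10, 14]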

Properties of the naïve search

• Can be used on-line (advantage)
• Usually takes O(n+m) steps – not so bad
• The inner loop finds a mismatch quickly and moves on to the next position without going through all the m steps
• Worst case scenario O(nm), e.g. when searching for aaab in aaaaaaaaaaaaaaaaaaaaaaaab

Knuth–Morris–Pratt algorithm

• Integer i denotes the position within the searched txt which is the beginning of the prospective match for pat
• Integer j denotes the position of the character currently under consideration in pat
• '-' denotes a gap in the sequence

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• Slide a window of length m (the pattern) over the text (of length n) from left to right.
• Within the window compare successive characters from left to right until a mismatch is hit.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• When a mismatch occurs, the pattern itself is used to determine where to jump to the next meaningful position to continue, in this case i = 4, j = 0:

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• From the next meaningful position, i.e. i = 4, we proceed in the same way.
• There is a nearly complete match ABCDAB when we hit a mismatch again at pat[6] and txt[10].

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• We passed an "AB" which could be the beginning of a new match, so we simply reset i = 8, j = 2 and continue matching the current character from left to right within the window.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• This search fails immediately, as the pat does not contain a gap, so we return to the beginning of pat by resetting j = 0, and begin searching at i = 11 in the text.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• So we have returned to the beginning of pat and begin searching at i = 11, resetting j = 0.
• Once again we immediately hit upon a partial match "ABCDAB", but the next character, 'C', does not match the final character 'D' of the pat.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• Thus we set i = 15 to start at the two-character string "AB", set j = 2 in the pat, and continue matching from the current position.
• This time we are able to complete the match, whose first character is at txt[15].

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456
  Match !

Properties of the Knuth-Morris-Pratt algorithm

• Can be used on-line (advantage) like the naïve search, but it is substantially improved.
• Time to find a match is only O(n), with O(m) preprocessing time.
• The partial match table allows the algorithm to avoid matching any letter of txt more than once (a code sketch follows below).
• Can be modified to search for multiple patterns in a single search.

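A compact Python sketch of the two ingredients described above: building the partial match (failure) table and then scanning the text with it. This is an illustrative implementation, not code from the lecture; names such as build_failure and kmp_search are assumptions.

    def build_failure(pat):
        """fail[j] = length of the longest proper prefix of pat[:j+1] that is also a suffix of it."""
        fail = [0] * len(pat)
        k = 0
        for j in range(1, len(pat)):
            while k > 0 and pat[j] != pat[k]:
                k = fail[k - 1]                    # fall back within the pattern itself
            if pat[j] == pat[k]:
                k += 1
            fail[j] = k
        return fail

    def kmp_search(txt, pat):
        """Return the starting indices of every occurrence of pat in txt in O(n+m) time."""
        fail, matches, k = build_failure(pat), [], 0
        for i, c in enumerate(txt):
            while k > 0 and c != pat[k]:
                k = fail[k - 1]                    # reuse the partial match instead of restarting
            if c == pat[k]:
                k += 1
            if k == len(pat):                      # full match ending at text position i
                matches.append(i - len(pat) + 1)
                k = fail[k - 1]
        return matches

    print(build_failure("ABCDABD"))                          # [0, 0, 0, 0, 1, 2, 0]
    print(kmp_search("ABC-ABCDAB-ABCDABCDABDE", "ABCDABD"))  # [15]

On the lecture's example the only full match indeed starts at txt[15], in agreement with the walkthrough above.
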
Boyer-Moore algorithm

• Is a particularly efficient string searching algorithm, and it has been the standard benchmark for practical string searching.
• The BM algorithm holds a window containing pat over txt, much as the naïve search does. This window moves from left to right; however, its improved performance is based on two clever ideas:
  1. Inspect the window from right to left.
  2. Recognize the possibility of large shifts of the window without missing a match.

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• By fetching the S underlying the last character of the pat we learn:
  – We are not standing on a match (because S isn't E).
  – We wouldn't find a match even if we slid the pattern right by 1 (because S isn't L), by 2 (because S isn't P), etc.

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Since S doesn't occur in the pattern at all, we can slide the pattern to the right by its own length without missing a match.
• This shift can be pre-calculated for every letter and stored in a table. This table is called the bad character shift table (sketched in the code below).

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Focus your attention on the right end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the txt:

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

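A small Python sketch of such a bad character shift table, in the simplified form where the shift depends only on the text character under the last position of the window (an assumption; the full Boyer-Moore table also takes the mismatch position into account):

    def bad_character_table(pat):
        """Shift for each letter that can appear under the last window position.
        Letters absent from the pattern allow a jump by the full pattern length."""
        m = len(pat)
        return {c: m - 1 - k for k, c in enumerate(pat[:-1])}   # rightmost occurrence wins

    table = bad_character_table("EXAMPLE")
    m = len("EXAMPLE")
    print(table.get("S", m), table.get("P", m), table.get("E", m))   # 7 2 6

For the lecture's pattern EXAMPLE, an S allows a jump of 7 (the whole pattern length) and a P a jump of 2, matching the shifts used in the example.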

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• We have discovered that MPLE occurs in the txt, so let us put it in front of the pat like this:

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Now we can shift the pattern all the way down to align this discovered occurrence in the txt with its last occurrence in the pattern (which is partly imaginary), i.e.:

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

• There are only seven terminal substrings of the pattern, so we can pre-compute all these shifts too and store them in a table. This is sometimes called the good suffix shift table.
• In general, if the algorithm has a choice of more than one shift, it takes the largest one.

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• We've aligned the MPLE, but focus on the end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the txt:

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE
  Match !

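A sketch of the complete search loop driven by the bad character table alone (this simplification is essentially the Boyer-Moore-Horspool variant, not the full algorithm with the good suffix table; the names are illustrative):

    def horspool_search(txt, pat):
        """Right-to-left comparison inside the window, bad-character jumps between windows."""
        n, m = len(txt), len(pat)
        shift = {c: m - 1 - k for k, c in enumerate(pat[:-1])}   # bad character shift table
        matches, i = [], 0
        while i <= n - m:
            j = m - 1
            while j >= 0 and txt[i + j] == pat[j]:               # inspect the window from right to left
                j -= 1
            if j < 0:
                matches.append(i)                                # the whole window matched
                i += 1
            else:
                i += shift.get(txt[i + m - 1], m)                # jump by the pre-computed shift
        return matches

    print(horspool_search("HERE-IS-A-SIMPLE-EXAMPLE", "EXAMPLE"))   # [17]

The individual jumps may differ slightly from the good-suffix walkthrough above, but the match is found at the same place, txt[17], after inspecting only a fraction of the text characters.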

Boyer-Moore algorithm: properties

• Observe that we have found the pattern without looking at all of the characters.
• Its speed derives from the fact that it can determine all occurrences of pat within txt without examining too many characters in txt.
• In fact, its average performance is O(n/m), that is, it gets faster as the pattern gets longer.
• We say the algorithm is "sublinear" in the sense that it generally looks at fewer characters than it passes.

Rabin-Karp algorithm: hashing

• Uses the naïve search method (i.e. a sliding window) but substantially speeds up the testing of equality of the pattern against the substrings of the text by using hashing.
• It is used for multiple pattern matching (in addition to single pattern matching), because it has the unique advantage of being able to find any one of k strings in O(n) time on average, regardless of the magnitude of k.
• The key to performance is the efficient computation of hash values of the successive substrings of the text.

Rabin-Karp algorithm – hashing

• A hash function converts every string into a numerical value, called its hash value (code, sum), using for instance the ASCII values of the characters.
  – For example, hash('hello') = 5.
• The algorithm exploits the fact that if two strings are equal, their hash values are also equal (there might be so-called hash collisions, though, which must be checked for letter by letter).
• All we have to do is to compute the hash value of the pattern we're searching for, and then look for substrings with the same hash value within the text (and then check letter by letter).
• Different variants of the algorithm compute hash values in different ways (adding, multiplying, etc.).

Rabin-Karp algorithm: properties

• One popular and effective hash function treats every substring as a number in some base, the base usually being a large prime.
  – For example, if the substring is "hi" and the base b = 101, then hash('hi') = 'h'*b^1 + 'i'*b^0 = 104*101 + 105*1 = 10,609
• Rabin-Karp is inferior to the Boyer-Moore algorithm for single pattern searching because of its slow worst case behaviour.
• However, Rabin-Karp is an algorithm of choice for multiple pattern search.
  – That is, if we want to find many fixed-length patterns in a text, say of length k, we can create a simple variant of Rabin-Karp that checks whether the hash of a given string in the text belongs to a set of hash values of the patterns we are looking for.

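A Python sketch of a rolling-hash Rabin-Karp along the lines described above. The base, the modulus and the function name are illustrative choices, not values given in the lecture:

    def rabin_karp_search(txt, pat, base=101, mod=2**61 - 1):
        """Compare hashes of successive length-m windows; verify letter by letter on a hash hit."""
        n, m = len(txt), len(pat)
        if m == 0 or n < m:
            return []
        high = pow(base, m - 1, mod)                       # weight of the character leaving the window
        h_pat = h_win = 0
        for k in range(m):                                 # hash of the pattern and of the first window
            h_pat = (h_pat * base + ord(pat[k])) % mod
            h_win = (h_win * base + ord(txt[k])) % mod
        matches = []
        for i in range(n - m + 1):
            if h_win == h_pat and txt[i:i + m] == pat:     # guard against hash collisions
                matches.append(i)
            if i < n - m:                                  # roll the hash to the next window in O(1)
                h_win = ((h_win - ord(txt[i]) * high) * base + ord(txt[i + m])) % mod
        return matches

    print(rabin_karp_search("HERE-IS-A-SIMPLE-EXAMPLE", "EXAMPLE"))   # [17]

For multiple patterns of the same length, the single h_pat comparison can be replaced by a membership test against a set of pattern hashes, as the slide above suggests.
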
Aho-Corasick algorithm

• Used for multiple pattern matching tasks.
• Description from the article and code by Tomas Petricek at http://www.codeproject.com/KB/recipes/ahocorasick.aspx
• The algorithm consists of two parts: the first part is the building of the tree from the keywords/patterns you want to search for, and the second part is searching the text for the keywords using the previously built tree (finite state machine, FSM).
  – An FSM is a deterministic model of behaviour composed of a finite number of states and transitions between those states.

Aho-Corasick algorithm

• In the first phase of the tree building, keywords are added to the tree. (The root node is used only as a place holder and contains links to other letters.)
• Links created in this first step represent the goto function, which returns the next state when a character is matching.
  – Example of the tree for keywords: his, hers, she

Aho-Corasick algorithm

• The fail function is used when a character is not matching.
• For example, in the text shis, the failure function is used to exit from the she branch to the his branch after the first two characters (because the third character is not matching).

Aho-Corasick algorithm

• During the second phase, the BFS (breadth-first search) algorithm is used for traversing all the nodes.
  – At each stage, the node to be expanded is indicated by a marker.
  – In general, all the nodes at a given depth are expanded before any nodes at the next level are expanded.

Help: Find the tutorial on efficient string search with suffix trees written by Mark Nelson at http://marknelson.us/1996/08/01/suffix-trees/

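A compact Python sketch of both phases: adding the keywords to the tree (the goto links) and a BFS pass that fills in the fail links, followed by a single left-to-right scan of the text. This is an illustrative implementation; the data layout and names are assumptions, not the lecture's code.

    from collections import deque

    def build_automaton(keywords):
        """Phase 1: keyword tree (goto links). Phase 2: BFS over the tree to set fail links."""
        goto, fail, out = [{}], [0], [[]]          # node 0 is the place-holder root
        for word in keywords:                      # phase 1: add every keyword to the tree
            node = 0
            for ch in word:
                if ch not in goto[node]:           # create a new state for an unseen letter
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[node][ch] = len(goto) - 1
                node = goto[node][ch]
            out[node].append(word)                 # this state recognises the keyword
        queue = deque(goto[0].values())            # phase 2: expand nodes level by level (BFS)
        while queue:
            node = queue.popleft()
            for ch, child in goto[node].items():
                queue.append(child)
                f = fail[node]
                while f and ch not in goto[f]:     # follow fail links until ch can be matched
                    f = fail[f]
                fail[child] = goto[f].get(ch, 0)
                out[child] += out[fail[child]]     # inherit keywords reachable via the fail link
        return goto, fail, out

    def search(text, keywords):
        """Scan the text once, reporting (start position, keyword) pairs."""
        goto, fail, out = build_automaton(keywords)
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in goto[node]:
                node = fail[node]                  # the fail function: used when ch is not matching
            node = goto[node].get(ch, 0)
            for word in out[node]:
                hits.append((i - len(word) + 1, word))
        return hits

    print(search("shis", ["his", "hers", "she"]))   # [(1, 'his')]

On the lecture's example text shis, the scan walks down the she branch, fails on the third character, follows the fail link into the his branch and reports his starting at position 1.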

Aho-Corasick algorithm

• Assume that a generalised suffix tree has been built for the set of patterns D = {S1, S2, ..., SK} of total length n = |S1| + |S2| + ... + |SK|. All patterns share the same alphabet. You can then search for patterns in such a way that you can:
  – Check if a pattern P of length m is a substring in O(m) time.
  – Find the first occurrence of the patterns P1, ..., Pq of total length m as substrings in O(m) time.
  – Find all z occurrences of the patterns P1, ..., Pq of total length m as substrings in O(m + z) time.

Conclusions

• Although data are stored in various ways, text remains the main form of exchanging information.
• String-matching is a very important subject in the wider domain of text processing (e.g. keyword search), not just bioinformatics.
• In bioinformatics, the patterns in strands of DNA, RNA and proteins have important biological meaning; e.g. they are promoters, enhancers, operators, genes, introns, exons, etc.
• Often these meaningful patterns undergo mutations at some points; therefore we include in the patterns so-called wildcards to replace some of the characters (as in the assignment; a small sketch follows below).

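A tiny illustration of the wildcard idea, using Python's re module as a stand-in (an assumed approach, not the lecture's method): the IUPAC symbol N, "any nucleotide", is expanded into a character class before searching.

    import re

    motif = "TATANT"                                    # hypothetical motif with one wildcard position
    pattern = re.compile(motif.replace("N", "[ACGT]"))  # N may stand for any of A, C, G, T

    dna = "GGTATACTCCTATAGTAA"
    print([m.start() for m in pattern.finditer(dna)])   # [2, 10]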
