String Search Algorithm
COSC 348: Computing for Bioinformatics
• A pattern (keyword) is an ordered sequence of symbols.
• String matching consists of finding one, several, or generally all occurrences of a string of length m (called a pattern or keyword) within a text of total length n characters.
• An example of an exact string search (match):
– Pat: EXAMPLE
– Txt: HERE IS A SIMPLE EXAMPLE

Algorithm                               Preprocessing time   Matching time
Knuth-Morris-Pratt algorithm            O(m)                 O(n)
Boyer-Moore algorithm                   O(m + |Σ|)           O(n/m), O(n)
Rabin-Karp algorithm                    O(m)                 average O(n+m), worst O(n m)
Aho-Corasick algorithm (suffix trees)   O(n)                 O(m+z)
z = number of matches
• 35 algorithms with codes at http://www-igm.univ-mlv.fr/~lecroq/string/
Naïve string search (brute force)
• The most intuitive way is to slide a window of length m (the pattern) over the text (of length n) from left to right, one letter at a time.
• If there is no copy of the whole pattern in the first m characters of the text, we check whether there is a copy of the pattern starting at the second character of the text.
• If there is no copy of the pattern starting at the second character of the text, we check whether there is a copy of the pattern starting at the third character of the text, and so forth,
• until we hit a match; then we continue in the same way along the text and count the number of matches.
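The sliding-window idea above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name naive_search is an assumption, not something from the course material, and the example reuses the Pat/Txt pair from the introduction.

```python
def naive_search(txt, pat):
    """Return the start index of every occurrence of pat in txt (brute force)."""
    n, m = len(txt), len(pat)
    matches = []
    # Slide a window of length m over the text, one letter at a time.
    for i in range(n - m + 1):
        # Compare the whole window with the pattern.
        if txt[i:i + m] == pat:
            matches.append(i)
    return matches


print(naive_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```

Each window costs up to m character comparisons and there are n - m + 1 windows, which is where the O(n m) worst case comes from.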
Knuth–Morris–Pratt algorithm
• From the next meaningful position, i.e. i = 4, we proceed in the same way, matching the current character from left to right within the window.
• There is a nearly complete match ABCDAB when we hit a mismatch again, at pat[6] and txt[10].
• We passed an "AB" which could be the beginning of a new match, so we simply reset i = 8, j = 2 and continue.
i:   01234567890123456789012
txt: ABC-ABCDAB-ABCDABCDABDE
pat: ABCDABD
j:   0123456
• The partial match table allows the algorithm to avoid matching any letter of txt more than once.
• Can be modified to search for multiple patterns in a single search.
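A minimal Python sketch of the Knuth–Morris–Pratt search with a partial match table follows. The function names are assumptions, and the table uses the common "longest proper prefix that is also a suffix" convention, which may be indexed differently from the table shown in the lecture.

```python
def partial_match_table(pat):
    """table[j] = length of the longest proper prefix of pat[:j+1] that is also its suffix."""
    table = [0] * len(pat)
    k = 0  # length of the prefix currently matched
    for j in range(1, len(pat)):
        while k > 0 and pat[j] != pat[k]:
            k = table[k - 1]  # fall back to a shorter prefix
        if pat[j] == pat[k]:
            k += 1
        table[j] = k
    return table


def kmp_search(txt, pat):
    """Return the start index of every occurrence of pat in txt."""
    if not pat:
        return []
    table = partial_match_table(pat)
    matches = []
    j = 0  # number of pattern characters matched so far
    for pos, ch in enumerate(txt):
        # On a mismatch, reuse the matched prefix instead of moving back in txt.
        while j > 0 and ch != pat[j]:
            j = table[j - 1]
        if ch == pat[j]:
            j += 1
        if j == len(pat):
            matches.append(pos - j + 1)
            j = table[j - 1]
    return matches


print(partial_match_table("ABCDABD"))                      # [0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC-ABCDAB-ABCDABCDABDE", "ABCDABD"))    # [15]
```

The index into txt (pos) only ever moves forward, which is why the matching phase is O(n) once the table has been built in O(m).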
Boyer-Moore algorithm
• It is a particularly efficient string searching algorithm, and it has been the standard benchmark for practical string searching.
• The BM algorithm holds a window containing pat over txt, much as the naïve search does. This window moves from left to right; however, its improved performance is based on two clever ideas:
1. Inspect the window from right to left.
2. Recognize the possibility of large shifts in the window without missing a match.
pat: EXAMPLE
txt: HERE-IS-A-SIMPLE-EXAMPLE
• By fetching the S underlying the last character of pat we learn:
– We are not standing on a match (because S isn't E).
– We wouldn't find a match even if we slid the pattern right by 1 (because S isn't L), by 2 (because S isn't P), etc.
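The reasoning about the character S can be turned into a small lookup, often called the bad character rule. The sketch below is illustrative only: the helper names last_occurrence and bad_character_shift are assumptions, not names used in the lecture.

```python
def last_occurrence(pat):
    """Map each character of pat to the index of its rightmost occurrence."""
    return {ch: idx for idx, ch in enumerate(pat)}


def bad_character_shift(pat, j, text_char):
    """Shift suggested when text_char mismatches pat[j] (window read right to left)."""
    last = last_occurrence(pat)
    # A character that does not occur in pat behaves as if it sat at index -1,
    # so the whole window can move past it.
    return max(1, j - last.get(text_char, -1))


pat = "EXAMPLE"
txt = "HERE-IS-A-SIMPLE-EXAMPLE"
# First window: pat[6] = 'E' is compared with txt[6] = 'S', which is not in pat.
print(bad_character_shift(pat, len(pat) - 1, txt[6]))   # 7
# Second window: pat[6] = 'E' is compared with txt[13] = 'P' (rightmost 'P' is pat[4]).
print(bad_character_shift(pat, len(pat) - 1, txt[13]))  # 2
```

Because S occurs nowhere in EXAMPLE, a single comparison is enough to rule out seven alignments at once.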
• We have discovered that MPLE occurs in the txt; let us put it in front of the pat like this:
• Now we can shift the pattern all the way down to align this discovered occurrence in the txt with its last occurrence in the pattern (which is partly imaginary), i.e.:
• There are only seven terminal substrings of the pattern, so we can pre-compute all these shifts too and store them in a table. This is sometimes called the good suffix shift table.
pat: MPLEEXAMPLE
txt: HERE-IS-A-SIMPLE-EXAMPLE
Match!
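The two shift rules can be combined in a short Python sketch. It is a simplified illustration under stated assumptions: the good suffix table is computed by brute force (real implementations build it in O(m)), it uses the weaker form of the rule that does not inspect the character preceding the matched suffix, and the function names are hypothetical.

```python
def good_suffix_shifts(pat):
    """shifts[k] = smallest shift that keeps a matched suffix of length k aligned.

    Brute force over all shifts. Positions that fall off the left end of pat
    are treated as matching anything (the "partly imaginary" occurrence).
    """
    m = len(pat)
    shifts = []
    for k in range(m + 1):
        for s in range(1, m + 1):
            if all(i - s < 0 or pat[i - s] == pat[i] for i in range(m - k, m)):
                shifts.append(s)
                break
    return shifts


def boyer_moore(txt, pat):
    """Return the start index of every occurrence of pat in txt."""
    n, m = len(txt), len(pat)
    if m == 0 or m > n:
        return []
    last = {ch: idx for idx, ch in enumerate(pat)}  # rightmost index of each character
    good = good_suffix_shifts(pat)
    matches = []
    start = 0
    while start <= n - m:
        j = m - 1
        # Inspect the window from right to left.
        while j >= 0 and pat[j] == txt[start + j]:
            j -= 1
        if j < 0:
            matches.append(start)
            start += good[m]
        else:
            matched = m - 1 - j  # length of the suffix that matched
            bad = j - last.get(txt[start + j], -1)
            # Take the larger of the two safe shifts.
            start += max(good[matched], bad)
    return matches


print(good_suffix_shifts("EXAMPLE"))                          # [1, 6, 6, 6, 6, 6, 6, 6]
print(boyer_moore("HERE-IS-A-SIMPLE-EXAMPLE", "EXAMPLE"))     # [17]
```

On this example the windows start at positions 0, 7, 9, 15 and 17, so the search reaches the match after only five alignments.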
• We say the algorithm is “sublinear” in the sense that it generally looks at fewer characters than it passes.
Rabin-Karp algorithm
• The key to performance is the efficient computation of hash values of the successive substrings of the text.
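One common way to compute the hash values of successive substrings efficiently is a rolling polynomial hash, sketched below. The constants BASE and MOD and the function names are assumptions chosen for illustration; the lecture does not prescribe a particular hash function.

```python
BASE = 256         # assumed radix for the polynomial hash
MOD = 1_000_003    # assumed modulus, chosen only to keep values small


def hash_of(s):
    """Polynomial hash of the whole string s, computed from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(txt, m):
    """Yield (start, hash) for every length-m substring of txt, O(1) per step."""
    if m == 0 or m > len(txt):
        return
    high = pow(BASE, m - 1, MOD)  # weight of the leading character
    h = hash_of(txt[:m])
    yield 0, h
    for start in range(1, len(txt) - m + 1):
        # Remove the character leaving the window, then append the new one.
        h = (h - ord(txt[start - 1]) * high) % MOD
        h = (h * BASE + ord(txt[start + m - 1])) % MOD
        yield start, h


txt = "HERE IS A SIMPLE EXAMPLE"
# Every rolled value agrees with the hash recomputed from scratch.
print(all(h == hash_of(txt[s:s + 7]) for s, h in rolling_hashes(txt, 7)))  # True
```

Updating the hash in constant time is what keeps the average running time at O(n + m) rather than O(n m).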
• The algorithm exploits the fact that if two strings are equal, their hash values are also equal (there might be so-called hash collisions, though, which must be checked for letter by letter).
• All we have to do is to compute the hash value of the pattern we're searching for, and then look for substrings with the same hash value within the text (and then check letter by letter).
• Different variants of the algorithm compute hash values in different ways (adding, multiplying, etc.).
• Rabin-Karp is inferior to the Boyer-Moore algorithm for single pattern searching because of its slow worst-case behaviour.
• However, Rabin-Karp is an algorithm of choice for multiple pattern search.
– That is, if we want to find many fixed-length patterns in a text, say of length k, we can create a simple variant of Rabin-Karp that checks whether the hash of a given string in the text belongs to a set of hash values of the patterns we are looking for.
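The multiple-pattern variant described above might look as follows in Python. This is a sketch under assumptions: the function name, the base and the modulus are illustrative choices, and all patterns are required to have the same length k.

```python
def rabin_karp_multi(txt, patterns):
    """Find all occurrences of several patterns of equal length k in txt."""
    patterns = set(patterns)
    if not patterns:
        return []
    k = len(next(iter(patterns)))
    if any(len(p) != k for p in patterns):
        raise ValueError("all patterns must have the same length")
    if k == 0 or k > len(txt):
        return []

    base, mod = 256, 1_000_003  # assumed hash parameters

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    targets = {poly_hash(p) for p in patterns}  # set of pattern hash values
    high = pow(base, k - 1, mod)
    h = poly_hash(txt[:k])
    matches = []
    for start in range(len(txt) - k + 1):
        if start > 0:
            # Roll the hash: drop txt[start - 1], append txt[start + k - 1].
            h = ((h - ord(txt[start - 1]) * high) * base + ord(txt[start + k - 1])) % mod
        # A hash hit may be a collision, so confirm letter by letter.
        if h in targets and txt[start:start + k] in patterns:
            matches.append((start, txt[start:start + k]))
    return matches


print(rabin_karp_multi("GATTACAGATTACA", ["TTA", "ACA"]))
# [(2, 'TTA'), (4, 'ACA'), (9, 'TTA'), (11, 'ACA')]
```

The membership test against the set of hashes takes expected constant time, so the cost per text position does not grow with the number of patterns.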
Aho-Corasick algorithm
• Used for multiple pattern matching tasks.
• Description from the article and code by Tomas Petricek at http://www.codeproject.com/KB/recipes/ahocorasick.aspx
• The algorithm consists of two parts:
• In the first phase of the tree building, keywords are added to the tree. (The root node is used only as a placeholder and contains links to other letters.)
• Links created in this first step represent the goto function, which returns the next state when a character matches.
– Example of the tree for keywords: his, hers, she
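A compact Python sketch of the automaton for the keywords his, hers, she is given below. The goto links correspond to the first phase described above; the failure links and the search loop follow the standard Aho-Corasick construction, which this excerpt does not spell out, so treat those parts (and all names) as assumptions rather than as the code from the cited article.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto links, failure links and outputs for the keywords."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the root
    output = [[]]  # keywords recognised on reaching each state
    # Phase 1: add every keyword to the tree (the goto function).
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Phase 2: breadth-first pass that adds failure links.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(txt, keywords):
    """Return (start, keyword) for every keyword occurrence in txt."""
    goto, fail, output = build_automaton(keywords)
    state = 0
    matches = []
    for pos, ch in enumerate(txt):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((pos - len(word) + 1, word))
    return matches


print(search("ushers", ["his", "hers", "she"]))  # [(1, 'she'), (2, 'hers')]
```

The text is read once from left to right regardless of how many keywords are in the tree.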
• Assume that a generalised suffix tree has been built for the set of strings D = {S1, S2, ..., SK} of total length n = |S1| + |S2| + ... + |SK|. All strings share the same alphabet. You can then search for patterns in such a way that you can:
– Check whether a pattern P of length m is a substring in O(m) time.
– Find the first occurrence of the patterns P1, ..., Pq of total length m as substrings in O(m) time.
– Find all z occurrences of the patterns P1, ..., Pq of total length m as substrings in O(m + z) time.
• Although data are stored in various ways, text remains the main form for exchanging information.
• String matching is a very important subject in the wider domain of text processing (e.g. keyword search), not just bioinformatics.
• In bioinformatics, the patterns in strands of DNA, RNA and proteins have important biological meaning, e.g. they are promoters, enhancers, operators, genes, introns, exons, etc.
• Often these meaningful patterns undergo mutations at some points; therefore we include in the patterns so-called wildcards, which replace some of the characters (as in the assignment).
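As a rough illustration of matching with wildcards, the brute-force search can be relaxed so that a wildcard position accepts any character. The wildcard symbol "?" and the function name are assumptions; the assignment may define wildcards differently.

```python
def wildcard_search(txt, pat, wildcard="?"):
    """Return start indices where pat matches txt, with wildcard matching any one character."""
    m = len(pat)
    return [
        i
        for i in range(len(txt) - m + 1)
        # Every pattern position must be a wildcard or equal the text character.
        if all(p == wildcard or p == t for p, t in zip(pat, txt[i:i + m]))
    ]


print(wildcard_search("TATAAATACA", "TA?A"))  # [0, 2, 6]
```

Here TA?A matches TATA, TAAA and TACA, i.e. the same motif with a different letter at the third position.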