
String Search Algorithm

The document defines key terms used in exact string searching algorithms, including patterns, symbols, alphabets, and string matching. It then describes three exact string searching algorithms: (1) the naive string search algorithm, which has average time complexity of O(n+m) but worst case of O(nm); (2) the Knuth-Morris-Pratt algorithm, which improves on naive search with O(n) time complexity; and (3) the Boyer-Moore algorithm, which has average time complexity of O(n/m).


COSC 348: Computing for Bioinformatics
Lecture 4: Exact string searching algorithms
Lubica Benuskova
http://www.cs.otago.ac.nz/cosc348/

Definitions

• A pattern (keyword) is an ordered sequence of symbols.
• Symbols of the pattern and the searched text are chosen from a predetermined finite set, called an alphabet (Σ).
  – In general, an alphabet can be any finite set of symbols/letters.
• In bioinformatics:
  – DNA alphabet Σ = {A,C,G,T}
  – RNA alphabet Σ = {A,C,G,U}
  – protein alphabet Σ = {A,R,N,…,V} (20 amino acids)

Exact string searching or matching

• Much of data processing in bioinformatics involves, in one way or another, recognising certain patterns within DNA, RNA (1st assignment) or protein sequences.
• String-matching consists of finding one, more, or generally all occurrences of a string of length m (called a pattern or keyword) within a text of total length n characters.
• An example of an exact string search (match):
  – Pat: EXAMPLE
  – Txt: HERE IS A SIMPLE EXAMPLE

Exact string search algorithms

Algorithm                                Preprocessing time      Matching time
Naïve string search (brute force)        0 (no preprocessing)    average O(n+m), worst O(nm)
Knuth-Morris-Pratt algorithm             O(m)                    O(n)
Boyer-Moore algorithm                    O(m + |Σ|)              average O(n/m), worst O(n)
Rabin-Karp algorithm                     O(m)                    average O(n+m), worst O(nm)
Aho-Corasick algorithm (suffix trees)    O(n)                    O(m + z)
(z = number of matches)

• 35 algorithms with code at http://www-igm.univ-mlv.fr/~lecroq/string/

Naïve string search (brute force)

• The most intuitive way is to slide a window of length m (the pattern) over the text (of length n) from left to right, one letter at a time.
• Within the window, compare successive characters:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• If there is no copy of the whole pattern in the first m characters of the text, we look for a copy of the pattern starting at the second character of the text:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• If there is no copy of the pattern starting at the second character of the text, we look for a copy of the pattern starting at the third character of the text, and so forth:

  txt: ABCABCDABABCDABCDABDE
  pat: BCD

Naïve string search (brute force)

• ... until we hit a match; then we continue in the same way along the text and count the number of matches.

  txt: ABCABCDABABCDABCDABDE
  pat: BCD
  Match !

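A minimal Python sketch of this brute-force scan (not from the lecture; the function name naive_search and the variable names are illustrative):

    def naive_search(txt, pat):
        """Return the starting indices of every occurrence of pat in txt."""
        n, m = len(txt), len(pat)
        matches = []
        for i in range(n - m + 1):                     # slide the window one letter at a time
            j = 0
            while j < m and txt[i + j] == pat[j]:      # compare characters within the window
                j += 1
            if j == m:                                 # the whole pattern matched
                matches.append(i)
        return matches

    print(naive_search("ABCABCDABABCDABCDABDE", "BCD"))   # [4, 10, 14]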

Properties of the naïve search

• Can be used on-line (advantage)
• Usually takes O(n+m) steps – not so bad
• The inner loop finds a mismatch quickly and moves on to the next position without going through all the m steps
• Worst case scenario O(nm), e.g. when searching for aaab in aaaaaaaaaaaaaaaaaaaaaaaab

Knuth–Morris–Pratt algorithm

• Integer i denotes the position within the searched txt which is the beginning of the prospective match for pat
• Integer j denotes the position of the character currently under consideration in pat
• '-' denotes a gap in the sequence

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• Slide a window of length m (the pattern) over the text (of length n) from left to right.
• Within the window compare successive characters from left to right until a mismatch is hit.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• When a mismatch occurs, the pattern itself is used to determine where to jump to the next meaningful position to continue, in this case i = 4, j = 0:

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• From the next meaningful position, i.e. i = 4, we proceed in the same way.
• There is a nearly complete match ABCDAB when we hit a mismatch again at pat[6] and txt[10].

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• We passed an "AB" which could be the beginning of a new match, so we simply reset i = 8, j = 2 and continue matching the current character from left to right within the window.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• This search fails immediately, as the pat does not contain a gap, so we return to the beginning of pat by resetting j = 0, and begin searching at i = 11 in the text.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• So we have returned to the beginning of pat and begin searching at i = 11, resetting j = 0.
• Once again we immediately hit upon a partial match "ABCDAB", but the next character, 'C', does not match the final character 'D' of the pat.

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456

Knuth–Morris–Pratt algorithm

• Thus we set i = 15 to start at the two-character string "AB", set j = 2 in the pat, and continue matching from the current position.
• This time we are able to complete the match, whose first character is at txt[15].

  i:   01234567890123456789012
  txt: ABC-ABCDAB-ABCDABCDABDE
  pat: ABCDABD
  j:   0123456
  Match !

Properties of the Knuth-Morris-Pratt algorithm

• Can be used on-line (advantage) like the naïve search, but it is substantially improved.
• Time to find a match is only O(n), with O(m) preprocessing time.
• The partial match table allows the algorithm to avoid matching any letter of txt more than once (a code sketch follows below).
• Can be modified to search for multiple patterns in a single search.

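A compact Python sketch of the two ingredients described above: building the partial match (failure) table and then scanning the text with it. This is an illustrative implementation, not code from the lecture; names such as build_failure and kmp_search are assumptions.

    def build_failure(pat):
        """fail[j] = length of the longest proper prefix of pat[:j+1] that is also a suffix of it."""
        fail = [0] * len(pat)
        k = 0
        for j in range(1, len(pat)):
            while k > 0 and pat[j] != pat[k]:
                k = fail[k - 1]                    # fall back within the pattern itself
            if pat[j] == pat[k]:
                k += 1
            fail[j] = k
        return fail

    def kmp_search(txt, pat):
        """Return the starting indices of every occurrence of pat in txt in O(n+m) time."""
        fail, matches, k = build_failure(pat), [], 0
        for i, c in enumerate(txt):
            while k > 0 and c != pat[k]:
                k = fail[k - 1]                    # reuse the partial match instead of restarting
            if c == pat[k]:
                k += 1
            if k == len(pat):                      # full match ending at text position i
                matches.append(i - len(pat) + 1)
                k = fail[k - 1]
        return matches

    print(build_failure("ABCDABD"))                          # [0, 0, 0, 0, 1, 2, 0]
    print(kmp_search("ABC-ABCDAB-ABCDABCDABDE", "ABCDABD"))  # [15]

On the lecture's example the only full match indeed starts at txt[15], in agreement with the walkthrough above.
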
Boyer-Moore algorithm

• Is a particularly efficient string searching algorithm, and it has been the standard benchmark for practical string searching.
• The BM algorithm holds a window containing pat over txt, much as the naïve search does. This window moves from left to right; however, its improved performance is based on two clever ideas:
  1. Inspect the window from right to left.
  2. Recognize the possibility of large shifts of the window without missing a match.

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• By fetching the S underlying the last character of the pat we learn:
  – We are not standing on a match (because S isn't E).
  – We wouldn't find a match even if we slid the pattern right by 1 (because S isn't L), by 2 (because S isn't P), etc.

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Since S doesn't occur in the pattern at all, we can slide the pattern to the right by its own length without missing a match.
• This shift can be pre-calculated for every letter and stored in a table. This table is called the bad character shift table (sketched in the code below).

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Focus your attention on the right end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the txt:

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

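A small Python sketch of such a bad character shift table, in the simplified form where the shift depends only on the text character under the last position of the window (an assumption; the full Boyer-Moore table also takes the mismatch position into account):

    def bad_character_table(pat):
        """Shift for each letter that can appear under the last window position.
        Letters absent from the pattern allow a jump by the full pattern length."""
        m = len(pat)
        return {c: m - 1 - k for k, c in enumerate(pat[:-1])}   # rightmost occurrence wins

    table = bad_character_table("EXAMPLE")
    m = len("EXAMPLE")
    print(table.get("S", m), table.get("P", m), table.get("E", m))   # 7 2 6

For the lecture's pattern EXAMPLE, an S allows a jump of 7 (the whole pattern length) and a P a jump of 2, matching the shifts used in the example.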

Boyer-Moore algorithm

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• We have discovered that MPLE occurs in the txt, so let us put it in front of the pat like this:

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• Now we can shift the pattern all the way down to align this discovered occurrence in the txt with its last occurrence in the pattern (which is partly imaginary), i.e.:

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

• There are only seven terminal substrings of the pattern, so we can pre-compute all these shifts too and store them in a table. This is sometimes called the good suffix shift table.
• In general, if the algorithm has a choice of more than one shift, it takes the largest one.

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

Boyer-Moore algorithm

  pat: MPLEEXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE

• We've aligned the MPLE, but focus on the end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the txt:

  pat: EXAMPLE
  txt: HERE-IS-A-SIMPLE-EXAMPLE
  Match !

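A sketch of the complete search loop driven by the bad character table alone (this simplification is essentially the Boyer-Moore-Horspool variant, not the full algorithm with the good suffix table; the names are illustrative):

    def horspool_search(txt, pat):
        """Right-to-left comparison inside the window, bad-character jumps between windows."""
        n, m = len(txt), len(pat)
        shift = {c: m - 1 - k for k, c in enumerate(pat[:-1])}   # bad character shift table
        matches, i = [], 0
        while i <= n - m:
            j = m - 1
            while j >= 0 and txt[i + j] == pat[j]:               # inspect the window from right to left
                j -= 1
            if j < 0:
                matches.append(i)                                # the whole window matched
                i += 1
            else:
                i += shift.get(txt[i + m - 1], m)                # jump by the pre-computed shift
        return matches

    print(horspool_search("HERE-IS-A-SIMPLE-EXAMPLE", "EXAMPLE"))   # [17]

The individual jumps may differ slightly from the good-suffix walkthrough above, but the match is found at the same place, txt[17], after inspecting only a fraction of the text characters.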

Boyer-Moore algorithm: properties

• Observe that we have found the pattern without looking at all of the characters.
• Its speed derives from the fact that it can determine all occurrences of pat within txt without examining too many characters in txt.
• In fact, its average performance is O(n/m), that is, it gets faster as the pattern gets longer.
• We say the algorithm is "sublinear" in the sense that it generally looks at fewer characters than it passes.

Rabin-Karp algorithm: hashing

• Uses the naïve search method (i.e. a sliding window) but substantially speeds up the testing of equality of the pattern against the substrings of the text by using hashing.
• It is used for multiple pattern matching (in addition to single pattern matching), because it has the unique advantage of being able to find any one of k strings in O(n) time on average, regardless of the magnitude of k.
• The key to performance is the efficient computation of hash values of the successive substrings of the text.

Rabin-Karp algorithm – hashing

• A hash function converts every string into a numerical value, called its hash value (code, sum), using for instance the ASCII values of the characters.
  – For example, hash('hello') = 5.
• The algorithm exploits the fact that if two strings are equal, their hash values are also equal (there might be so-called hash collisions, though, which must be checked for letter by letter).
• All we have to do is to compute the hash value of the pattern we're searching for, and then look for substrings with the same hash value within the text (and then check letter by letter).
• Different variants of the algorithm compute hash values in different ways (adding, multiplying, etc.).

Rabin-Karp algorithm: properties

• One popular and effective hash function treats every substring as a number in some base, the base usually being a large prime.
  – For example, if the substring is "hi" and the base b = 101, then hash('hi') = 'h'*b^1 + 'i'*b^0 = 104*101 + 105*1 = 10,609
• Rabin-Karp is inferior to the Boyer-Moore algorithm for single pattern searching because of its slow worst case behaviour.
• However, Rabin-Karp is an algorithm of choice for multiple pattern search.
  – That is, if we want to find many fixed-length patterns in a text, say of length k, we can create a simple variant of Rabin-Karp that checks whether the hash of a given string in the text belongs to a set of hash values of the patterns we are looking for.

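A Python sketch of a rolling-hash Rabin-Karp along the lines described above. The base, the modulus and the function name are illustrative choices, not values given in the lecture:

    def rabin_karp_search(txt, pat, base=101, mod=2**61 - 1):
        """Compare hashes of successive length-m windows; verify letter by letter on a hash hit."""
        n, m = len(txt), len(pat)
        if m == 0 or n < m:
            return []
        high = pow(base, m - 1, mod)                       # weight of the character leaving the window
        h_pat = h_win = 0
        for k in range(m):                                 # hash of the pattern and of the first window
            h_pat = (h_pat * base + ord(pat[k])) % mod
            h_win = (h_win * base + ord(txt[k])) % mod
        matches = []
        for i in range(n - m + 1):
            if h_win == h_pat and txt[i:i + m] == pat:     # guard against hash collisions
                matches.append(i)
            if i < n - m:                                  # roll the hash to the next window in O(1)
                h_win = ((h_win - ord(txt[i]) * high) * base + ord(txt[i + m])) % mod
        return matches

    print(rabin_karp_search("HERE-IS-A-SIMPLE-EXAMPLE", "EXAMPLE"))   # [17]

For multiple patterns of the same length, the single h_pat comparison can be replaced by a membership test against a set of pattern hashes, as the slide above suggests.
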
Aho-Corasick algorithm

• Used for multiple pattern matching tasks.
• Description from the article and code by Tomas Petricek at http://www.codeproject.com/KB/recipes/ahocorasick.aspx
• The algorithm consists of two parts: the first part is the building of the tree from the keywords/patterns you want to search for, and the second part is searching the text for the keywords using the previously built tree (finite state machine, FSM).
  – An FSM is a deterministic model of behaviour composed of a finite number of states and transitions between those states.

Aho-Corasick algorithm

• In the first phase of the tree building, keywords are added to the tree. (The root node is used only as a place holder and contains links to other letters.)
• Links created in this first step represent the goto function, which returns the next state when a character is matching.
  – Example of the tree for keywords: his, hers, she

Aho-Corasick algorithm

• The fail function is used when a character is not matching.
• For example, in the text shis, the failure function is used to exit from the she branch to the his branch after the first two characters (because the third character is not matching).

Aho-Corasick algorithm

• During the second phase, the BFS (breadth-first search) algorithm is used for traversing all the nodes.
  – At each stage, the node to be expanded is indicated by a marker.
  – In general, all the nodes at a given depth are expanded before any nodes at the next level are expanded.

Help: Find the tutorial on efficient string search with suffix trees written by Mark Nelson at http://marknelson.us/1996/08/01/suffix-trees/

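A compact Python sketch of both phases: adding the keywords to the tree (the goto links) and a BFS pass that fills in the fail links, followed by a single left-to-right scan of the text. This is an illustrative implementation; the data layout and names are assumptions, not the lecture's code.

    from collections import deque

    def build_automaton(keywords):
        """Phase 1: keyword tree (goto links). Phase 2: BFS over the tree to set fail links."""
        goto, fail, out = [{}], [0], [[]]          # node 0 is the place-holder root
        for word in keywords:                      # phase 1: add every keyword to the tree
            node = 0
            for ch in word:
                if ch not in goto[node]:           # create a new state for an unseen letter
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[node][ch] = len(goto) - 1
                node = goto[node][ch]
            out[node].append(word)                 # this state recognises the keyword
        queue = deque(goto[0].values())            # phase 2: expand nodes level by level (BFS)
        while queue:
            node = queue.popleft()
            for ch, child in goto[node].items():
                queue.append(child)
                f = fail[node]
                while f and ch not in goto[f]:     # follow fail links until ch can be matched
                    f = fail[f]
                fail[child] = goto[f].get(ch, 0)
                out[child] += out[fail[child]]     # inherit keywords reachable via the fail link
        return goto, fail, out

    def search(text, keywords):
        """Scan the text once, reporting (start position, keyword) pairs."""
        goto, fail, out = build_automaton(keywords)
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in goto[node]:
                node = fail[node]                  # the fail function: used when ch is not matching
            node = goto[node].get(ch, 0)
            for word in out[node]:
                hits.append((i - len(word) + 1, word))
        return hits

    print(search("shis", ["his", "hers", "she"]))   # [(1, 'his')]

On the lecture's example text shis, the scan walks down the she branch, fails on the third character, follows the fail link into the his branch and reports his starting at position 1.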

Aho-Corasick algorithm

• Assume that a generalised suffix tree has been built for the set of patterns D = {S1, S2, ..., SK} of total length n = |S1| + |S2| + ... + |SK|. All patterns share the same alphabet. You can then search for patterns in such a way that you can:
  – Check if a pattern P of length m is a substring in O(m) time.
  – Find the first occurrence of the patterns P1, ..., Pq of total length m as substrings in O(m) time.
  – Find all z occurrences of the patterns P1, ..., Pq of total length m as substrings in O(m + z) time.

Conclusions

• Although data are stored in various ways, text remains the main form of exchanging information.
• String-matching is a very important subject in the wider domain of text processing (e.g. keyword search), not just bioinformatics.
• In bioinformatics, the patterns in strands of DNA, RNA and proteins have important biological meaning; e.g. they are promoters, enhancers, operators, genes, introns, exons, etc.
• Often these meaningful patterns undergo mutations at some points; therefore we include in the patterns so-called wildcards to replace some of the characters (as in the assignment; a small sketch follows below).

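A tiny illustration of the wildcard idea, using Python's re module as a stand-in (an assumed approach, not the lecture's method): the IUPAC symbol N, "any nucleotide", is expanded into a character class before searching.

    import re

    motif = "TATANT"                                    # hypothetical motif with one wildcard position
    pattern = re.compile(motif.replace("N", "[ACGT]"))  # N may stand for any of A, C, G, T

    dna = "GGTATACTCCTATAGTAA"
    print([m.start() for m in pattern.finditer(dna)])   # [2, 10]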
