String Matching With Finite Automata: by Caroline Moore
String Matching
Whenever you use a search engine or a
search tool such as grep or sed, you are
using a string-matching program. Many
of these programs build finite automata in
order to search for your string efficiently.
Finite Automata
A finite automaton is a quintuple (Q, Σ, δ, s, F):
Q: the finite set of states
Σ: the finite input alphabet
δ: the transition function from Q × Σ to Q
s ∈ Q: the start state
F ⊆ Q: the set of final (accepting) states
How it works
A finite automaton accepts
the strings of a specific language. It
begins in the start state q_0 (called s in
the definition above) and reads
characters one at a time from
the input string. It makes
transitions according to δ based on these
characters. If it is in one of the
accepting states when it reaches the
end of the input, the string is
accepted by the automaton.
Graphic: Eppstein, David.
http://www.ics.uci.edu/~eppstein/161/960222.html
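The accept/reject loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited sources: the function name accepts, the dictionary representation of δ, and the EVEN_ONES example automaton are all assumptions made for the sketch.

```python
def accepts(delta, start, accepting, text):
    """Run a finite automaton over text; True if it ends in an accepting state."""
    state = start
    for ch in text:
        # One transition per input character, looked up in the table delta.
        state = delta[(state, ch)]
    return state in accepting


# Hypothetical example automaton: binary strings with an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}

print(accepts(EVEN_ONES, "even", {"even"}, "1001"))  # two 1s
print(accepts(EVEN_ONES, "even", {"even"}, "10"))    # one 1
```

The automaton only remembers its current state, which is why a single pass over the input is enough.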
The Suffix Function
In order to search for the
string, the program uses a
suffix function (σ), which
measures how much of
what it has read so far
matches the search string:
σ(x) is the length of the
longest prefix of the pattern P
that is also a suffix of x.
Graphic: Reif, John.
http://www.cs.duke.edu/education/courses/cps130/fall98/lectures/lect14/node31.html
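As a rough illustration, the suffix function can be computed directly from its definition by trying the longest candidate prefix first. The name sigma and its two-argument form are assumptions for this sketch; it favors clarity over speed.

```python
def sigma(pattern, x):
    """Length of the longest prefix of pattern that is also a suffix of x."""
    # Try the longest possible prefix first and shrink until one fits.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # never reached: the empty prefix (k = 0) always matches


print(sigma("nano", "banan"))  # "nan" is a prefix of "nano" and a suffix of "banan"
print(sigma("nano", "xyz"))    # only the empty prefix matches
```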
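Example: nano
Rows are the current state (the part of "nano" matched so far); columns are the next input character.
state    n      a      o      other
empty:   n      empty  empty  empty
n:       n      na     empty  empty
na:      nan    empty  empty  empty
nan:     n      na     nano   empty
nano:    nano   nano   nano   nano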
Graphic & Example: Eppstein, David. http://www.ics.uci.edu/~eppstein/161/960222.html
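The table above can be written down as a dictionary and run over a text. This is only a sketch: the names NANO and contains_nano are made up here, and the "other" column is handled by a default value rather than stored explicitly.

```python
# Transition table for "nano"; keys are states, values map a character
# to the next state. Missing characters fall into the "other" column.
NANO = {
    "":     {"n": "n",    "a": "",     "o": ""},
    "n":    {"n": "n",    "a": "na",   "o": ""},
    "na":   {"n": "nan",  "a": "",     "o": ""},
    "nan":  {"n": "n",    "a": "na",   "o": "nano"},
    "nano": {"n": "nano", "a": "nano", "o": "nano"},
}


def contains_nano(text):
    """True if the automaton reaches the accepting state "nano" on text."""
    state = ""
    for ch in text:
        # "other" column: stay in "nano" once reached, else back to empty.
        other = "nano" if state == "nano" else ""
        state = NANO[state].get(ch, other)
    return state == "nano"


print(contains_nano("banananona"))
print(contains_nano("banana"))
```

Note that in this table the state "nano" is absorbing, so the automaton answers whether the pattern occurs at all, not where.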
String-Matching Automata
For any pattern P of length m, we can
define its string-matching automaton:
Q = {0, 1, ..., m} (states)
q_0 = 0 (start state)
F = {m} (accepting state)
δ(q, a) = σ(P_q a)
Here P_q is the prefix of P of length q, and T_i is the first i characters of the text T.
The transition function chooses the next state to
maintain the invariant:
φ(T_i) = σ(T_i)
where φ(T_i) is the state the automaton is in after reading T_i.
After scanning i characters, the state number is the length of the
longest prefix of P that is also a suffix of T_i.
Finite-Automaton-Matcher
The simple loop structure
implies that the matching
time for a text of length n
is O(n).
However, this is only the
running time of the
actual string matching. It
does not include the time
it takes to compute the
transition function.
Graphic: http://www.cs.duke.edu/education/courses/cps130/fall98/lectures/lect14/node33.html
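A minimal Python sketch of such a matcher loop is given here, assuming the transition function has already been computed. The sparse list-of-dictionaries layout for delta and the hand-written NANO_DELTA table (states 0 to 4 for the pattern "nano") are assumptions made for this example.

```python
def finite_automaton_matcher(text, delta, m):
    """Return every shift at which the pattern (of length m) occurs in text.

    delta[q] maps a character to the next state. Characters missing from
    delta[q] send the automaton back to state 0 (a sparse-table assumption).
    """
    shifts = []
    q = 0
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)   # one table lookup per text character
        if q == m:                # state m means all of P was just read
            shifts.append(i - m + 1)
    return shifts


# Hand-built transitions for P = "nano": delta(q, a) = sigma(P_q a).
NANO_DELTA = [
    {"n": 1},
    {"n": 1, "a": 2},
    {"n": 3},
    {"n": 1, "a": 2, "o": 4},
    {"n": 1},
]

print(finite_automaton_matcher("banananona", NANO_DELTA, 4))
```

Each character of the text is examined exactly once, which is where the O(n) matching time comes from.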
Computing the Transition Function
Compute-Transition-Function(P, Σ)
m ← length[P]
for q ← 0 to m
    do for each character a ∈ Σ
        do k ← min(m+1, q+2)
            repeat k ← k-1
            until P_k ⊐ P_q a    (P_k is a suffix of P_q a)
            δ(q,a) ← k
return δ
This procedure computes
δ(q,a) according to its
definition. The loop on line
2 (over q) cycles through all the
states, while the nested loop
on line 3 (over a) cycles through the
alphabet. Thus all state-character
combinations are
accounted for. Lines 4-7 set
δ(q,a) to be the largest k such
that P_k ⊐ P_q a, that is, such that
P_k is a suffix of P_q a.
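The procedure translates almost line for line into Python. This is a sketch under a few assumptions: the alphabet is passed as a string of characters, δ is returned as a list of dictionaries indexed by state, and the suffix test is done with str.endswith.

```python
def compute_transition_function(pattern, alphabet):
    """Build delta[q][a] = largest k such that P_k is a suffix of P_q a."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):              # every state 0..m
        for a in alphabet:              # every input character
            k = min(m + 1, q + 2)
            while True:                 # repeat ... until
                k -= 1
                if (pattern[:q] + a).endswith(pattern[:k]):
                    break               # always stops by k = 0
            delta[q][a] = k
    return delta


for q, row in enumerate(compute_transition_function("nano", "nao")):
    print(q, row)
```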
Running Time of
Compute-Transition-Function
Running Time: O(m^3 |Σ|)
Outer loops: O(m|Σ|) iterations
Inner loop: runs at most m+1 times
Testing P_k ⊐ P_q a: requires up to m comparisons
Improving Running Time
Much faster procedures for computing the transition
function exist. The time required to compute δ from P can be
improved to O(m|Σ|).
The time it takes to find the string is linear in the text length: O(n).
This brings the total runtime to:
O(n + m|Σ|)
Not bad if your pattern is fairly small relative to the text
you are searching in.
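One way to reach O(m|Σ|) is to fill in each state's row by copying the row of a "fallback" state, the state the automaton would be in after reading the pattern without its first character. This is the same idea as the Knuth-Morris-Pratt prefix function; it is offered here as one possible sketch, not necessarily the procedure the sources have in mind. The name build_automaton is made up, and the code assumes every character of the pattern appears in the alphabet.

```python
def build_automaton(pattern, alphabet):
    """Transition table in O(m * |alphabet|) time via a fallback state."""
    m = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    if m == 0:
        return delta
    delta[0][pattern[0]] = 1
    fallback = 0  # state reached after reading pattern[1:q]
    for q in range(1, m + 1):
        # On a mismatch, state q behaves exactly like the fallback state.
        for a in alphabet:
            delta[q][a] = delta[fallback][a]
        if q < m:
            delta[q][pattern[q]] = q + 1          # the matching character
            fallback = delta[fallback][pattern[q]]
    return delta


for q, row in enumerate(build_automaton("nano", "nao")):
    print(q, row)
```

Each row costs |Σ| copies plus constant extra work, so the whole table takes O(m|Σ|) time.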
Sources
Cormen, et al. Introduction to Algorithms.
MIT Press, Cambridge, 1990. pp. 862-868.
Reif, John.
http://www.cs.duke.edu/education/courses/cps130/fall98/lectures/lect14/node28.html
Eppstein, David.
http://www.ics.uci.edu/~eppstein/161/960222.html