Longest Common Substring
Maxim A. Babenko, Tatiana A. Starikovskaya (MSU) Computing LCS Using Suffix Arrays
CSR 2008
Outline
Problem Definition
LCS
Problem (LCS, Longest Common Substring): Given a collection of N strings A = {α_1, ..., α_N} and an integer K (2 ≤ K ≤ N), find the longest string that is a substring of at least K strings in A.
Tools: suffix arrays.
Time and space: linear and alphabet-independent.
Model of computation: RAM.
Useful Definitions
Definition (Suffix): Let ω = ω_1 ω_2 ... ω_n be an arbitrary string of length n. For each i (1 ≤ i ≤ n), ω[i..] = ω_i ω_{i+1} ... ω_n is a suffix of ω.
Definition (Lexicographic order): Suppose we have some order on the letters of the alphabet Σ. This order can be extended in a standard way to strings over Σ: β < γ iff either β is a proper prefix of γ, or β[1] = γ[1], ..., β[i] = γ[i] and β[i + 1] < γ[i + 1] for some i ≥ 0.
Suffix Arrays
Definition (Suffix Array): Let ω be an arbitrary string of length n. Consider its non-empty suffixes ω[1..], ω[2..], ..., ω[n..] and order them lexicographically. Let SA(i) denote the starting position of the suffix appearing in the i-th place (1 ≤ i ≤ n): ω[SA(1)..] < ω[SA(2)..] < ... < ω[SA(n)..].
Figure: The string mississippi, its suffixes, and the corresponding suffix array.
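The suffix array from the definition can be reproduced with a few lines of code. The sketch below is a naive illustration only: it sorts the suffixes directly, so it is far from the linear-time constructions mentioned in the talk, and the function name suffix_array is a hypothetical helper, not something taken from the slides.

```python
def suffix_array(s):
    """Return the 1-based suffix array of s.

    Naive construction: sort starting positions by the suffix they begin.
    Worst-case cost is O(n^2 log n), so this is for illustration only.
    """
    n = len(s)
    return sorted(range(1, n + 1), key=lambda i: s[i - 1:])


text = "mississippi"
sa = suffix_array(text)
# Print rank, starting position SA(rank), and the suffix itself.
for rank, start in enumerate(sa, start=1):
    print(rank, start, text[start - 1:])
```

For mississippi the first row is the suffix i starting at position 11, and the last row is ssissippi starting at position 3.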
A simple data structure containing all the necessary information. There are many simple and efficient construction algorithms (e.g. Kärkkäinen, Sanders [2003]) with alphabet-independent time and space complexity.
Theorem: Let the total length of the strings α_1, ..., α_N be equal to L. Then the answer to the LCS problem can be computed in O(L) time and in O(L) space.
LCS Example
Consider the following example with N = 3, K = 2: α_1 = abb, α_2 = cb, α_3 = abc. Clearly, the answer is ab.
Observation
The longest common substring for K strings of our set is the longest common prefix of some suffixes of these strings. We calculate the longest common prefix of every K suffixes of different strings and take the longest one; the latter is the answer to the LCS problem.
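The observation can be checked directly on small inputs. The following is a brute-force sketch, assuming plain Python strings as input: it enumerates every prefix of every suffix and counts how many strings contain it. It is quadratic or worse and serves only as a reference against which a faster method can be compared; the name lcs_brute_force is hypothetical.

```python
def lcs_brute_force(strings, k):
    """Longest string that is a substring of at least k of the given strings.

    Every substring is a prefix of some suffix, so we enumerate suffixes
    s[i:] and their prefixes s[i:j] and keep the longest one that occurs
    in at least k strings. Ties are resolved in favour of the first found.
    """
    best = ""
    for s in strings:
        for i in range(len(s)):                 # suffix s[i:]
            for j in range(i + 1, len(s) + 1):  # prefix s[i:j] of that suffix
                cand = s[i:j]
                if len(cand) <= len(best):
                    continue
                if sum(cand in t for t in strings) >= k:
                    best = cand
    return best


print(lcs_brute_force(["abb", "cb", "abc"], 2))  # prints ab
```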
Preprocessing: Step 1
Combine the strings in A as follows: α = α_1 $_1 α_2 $_2 ... α_N $_N. Here $_i are special symbols (sentinels) that are pairwise different and lexicographically less than the other symbols of the initial alphabet.
Example: α = abb$_1 cb$_2 abc$_3
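One possible way to realise this step in code is to work over integers, so that the sentinels need no special characters. In the sketch below, which is an assumption about representation rather than the authors' implementation, the sentinel of the j-th string (0-based) is the integer j, and every letter is shifted by N so that it is larger than all sentinels. The function also records, for each position, the 0-based index of the string it belongs to; the names concatenate, text, and color are hypothetical.

```python
def concatenate(strings):
    """Build the combined sequence and the per-position string index.

    Sentinel of string j (0-based) is the integer j; letters are mapped to
    ord(ch) + N, so all sentinels are distinct and smaller than any letter.
    """
    n = len(strings)
    text, color = [], []
    for j, s in enumerate(strings):
        for ch in s:
            text.append(ord(ch) + n)  # letter, shifted above all sentinels
            color.append(j)
        text.append(j)                # unique sentinel closing string j
        color.append(j)
    return text, color


text, color = concatenate(["abb", "cb", "abc"])
print(len(text))  # prints 11
print(color)      # prints [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```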
Preprocessing: Step 2
Definition (Longest Common Prefix (LCP) array): The array containing the lengths of the longest common prefixes for every pair of consecutive suffixes (w.r.t. lexicographic order). The LCP array can be easily constructed in linear time and space. We construct the suffix array and the LCP array for α.
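The slides do not say which linear-time LCP construction is meant. One well-known option is the algorithm of Kasai et al., sketched below under that assumption. Indices are 0-based here, lcp[r] is the LCP of the suffixes of rank r and r + 1, and the suffix array itself is built naively to keep the example short; all function names are hypothetical.

```python
def suffix_array(s):
    """Naive 0-based suffix array (illustration only, not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """LCP array in O(n) time, following the idea of Kasai et al.

    lcp[r] is the length of the longest common prefix of the suffixes
    of rank r and r + 1, so the result has len(s) - 1 entries.
    """
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):                # suffixes in text order
        if rank[i] == n - 1:          # last in sorted order: no successor
            h = 0
            continue
        j = sa[rank[i] + 1]           # successor of suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                    # the next suffix shares at least h - 1
    return lcp


# The running example, with control characters standing in for sentinels.
alpha = "abb\x00cb\x01abc\x02"
sa = suffix_array(alpha)
print([p + 1 for p in sa])   # prints [4, 7, 11, 1, 8, 3, 6, 2, 9, 10, 5]
print(lcp_array(alpha, sa))  # prints [0, 0, 0, 2, 0, 1, 1, 1, 0, 1]
```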
 i   suffixes               sorted suffixes
 1   abb$_1 cb$_2 abc$_3    $_1 cb$_2 abc$_3
 2   bb$_1 cb$_2 abc$_3     $_2 abc$_3
 3   b$_1 cb$_2 abc$_3      $_3
 4   $_1 cb$_2 abc$_3       abb$_1 cb$_2 abc$_3
 5   cb$_2 abc$_3           abc$_3
 6   b$_2 abc$_3            b$_1 cb$_2 abc$_3
 7   $_2 abc$_3             b$_2 abc$_3
 8   abc$_3                 bb$_1 cb$_2 abc$_3
 9   bc$_3                  bc$_3
10   c$_3                   c$_3
11   $_3                    cb$_2 abc$_3
Further Ideas
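The longest common prefix of suffixes of K different strings in A is the longest common prefix of suffixes of K different colors in α, where the color of a suffix is the index j of the string α_j in which it starts.
Consider K suffixes at positions i_1 < i_2 < ... < i_K in the suffix array. The length of the longest common prefix of these K suffixes is equal to the minimum of LCP[i_1], LCP[i_1 + 1], ..., LCP[i_K - 1], where LCP[i] is the length of the longest common prefix of the suffixes at positions i and i + 1.
Example:
 i   SA[i]   LCP[i]   suffix
 1    4       0       $_1 cb$_2 abc$_3
 2    7       0       $_2 abc$_3
 3   11       0       $_3
 4    1       2       abb$_1 cb$_2 abc$_3
 5    8       0       abc$_3
 6    3       1       b$_1 cb$_2 abc$_3
 7    6       1       b$_2 abc$_3
 8    2       1       bb$_1 cb$_2 abc$_3
 9    9       0       bc$_3
10   10       1       c$_3
11    5       -       cb$_2 abc$_3
For K = 2, the suffixes at positions 4 and 5 have colors 1 and 3 and LCP[4] = 2, which gives the answer ab.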
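These ideas suggest scanning the suffix array with a window that always contains at least K colors and tracking the minimum LCP value inside it. The sketch below is one way to do this and is not claimed to be the authors' exact procedure: it uses two pointers plus a monotone deque for the window minimum, so the scan itself is linear, while the suffix and LCP arrays are built naively for brevity. Indices and colors are 0-based, K ≥ 2 is assumed, and all names are hypothetical.

```python
from collections import deque


def longest_common_substring(strings, k):
    """Longest string occurring in at least k of the strings (k >= 2)."""
    n_str = len(strings)
    text, color = [], []
    for j, s in enumerate(strings):
        text.extend(ord(ch) + n_str for ch in s)  # letters above sentinels
        text.append(j)                            # unique sentinel j
        color.extend([j] * (len(s) + 1))
    n = len(text)

    # Naive suffix array and LCP array (illustration only, not linear).
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp_len(i, j):
        h = 0
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        return h

    lcp = [lcp_len(sa[r], sa[r + 1]) for r in range(n - 1)]

    count = [0] * n_str   # occurrences of each color inside the window
    distinct = 0          # number of colors inside the window
    window_min = deque()  # indices into lcp with increasing lcp values
    best_len, best_start = 0, 0
    left = 0
    for right in range(n):
        c = color[sa[right]]
        count[c] += 1
        if count[c] == 1:
            distinct += 1
        if right > 0:     # lcp[right - 1] enters the window
            while window_min and lcp[window_min[-1]] >= lcp[right - 1]:
                window_min.pop()
            window_min.append(right - 1)
        # Shrink from the left while the window keeps at least k colors.
        while True:
            cl = color[sa[left]]
            if distinct - (count[cl] == 1) < k:
                break
            count[cl] -= 1
            if count[cl] == 0:
                distinct -= 1
            left += 1
        while window_min and window_min[0] < left:
            window_min.popleft()
        if distinct >= k and lcp[window_min[0]] > best_len:
            best_len = lcp[window_min[0]]
            best_start = sa[right]
    # Sentinels are unique, so a common prefix never contains one.
    return "".join(chr(x - n_str) for x in text[best_start:best_start + best_len])


print(longest_common_substring(["abb", "cb", "abc"], 2))  # prints ab
```

Shrinking the window as far as possible is what makes the minimum meaningful: for a fixed right end, the shortest window with K colors has the largest minimum LCP.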
Extensions
Problem: Given a collection of N strings A = {α_1, ..., α_N}, for each K (2 ≤ K ≤ N) find the longest string that is a substring of at least K strings in A.
Theorem: Let the total length of the strings α_1, ..., α_N be equal to L. Then the answer to the above problem can be computed in O(L log L) time and in O(L) space.
Part 4 Conclusions
Open Problem
Acknowledgements
The authors are thankful to the students of the Department of Mathematical Logic and Theory of Algorithms and to Maxim Ushakov and Victor Khimenko (Google Moscow) for many helpful discussions.