Compressed suffix arrays and suffix trees with applications to text indexing and string matching

R Grossi, JS Vitter - Proceedings of the thirty-second annual ACM …, 2000 - dl.acm.org
Proceedings of the thirty-second annual ACM symposium on Theory of computing, 2000dl.acm.org
The proliferation of online text, such as on the World Wide Web and in databases, motivates
the need for space-efficient index methods that support fast search. Consider a text T of n
binary symbols to index. Given any query pattern P of m binary symbols, the goal is to
search f? r P in T quickly, with T being fully scanned only once, nafiaely, when the index is
created. All indexing schemes published in the last thirty years support searching in O (m)
worst-case time and require O (n) memory words (or O (nlogn) bits), which is significantly …
Abstract
The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text T of n binary symbols to index. Given any query pattern P of m binary symbols, the goal is to search f? r P in T quickly, with T being fully scanned only once, nafiaely, when the index is created. All indexing schemes published in the last thirty years support searching in O (m) worst-case time and require O (n) memory words (or O (nlogn) bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only O (n) bits and compares favorably with inverted lists in space. We can search any binary pattern P, stored in O (m/log n) words, in only o (m) time. Specifically, searching takes O (1) time for m= o (log n), and O (m/logn+ log n)= o (m) time for m-~(logn) and any fixed 0< e< 1. That is, we achieve optimal O (m/log n) search time for sufficiently large m=~(log a+~ n). We can list all the occ pattern occurrences in optimal O (occ) additional time when m= f~(polylog (n)) or when occ----f~(n~); otherwise, listing takes O (occ log n) additional time.
ACM Digital Library