Article

Inverted files versus suffix arrays for locating patterns in primary memory

Authors: Simon J. Puglisi, Andrew Turpin

SPIRE'06: Proceedings of the 13th international conference on String Processing and Information Retrieval

Pages 122 - 133

https://doi.org/10.1007/11880561_11

Published: 11 October 2006

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we examine the resource requirements of compressed suffix array algorithms against compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities, rather than simple word based search, making their functionality equivalent to the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.
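To make the idea of locating arbitrary patterns with a q-gram inverted file concrete, here is a minimal sketch in Python. It is not the authors' implementation: the index is uncompressed, and the function names (`build_qgram_index`, `locate`) and the intersection strategy are assumptions chosen for illustration. Every q-gram of the text gets a list of its start positions, and a pattern of length at least q is located by intersecting the position lists of q-grams that together cover the whole pattern, after shifting each list by that q-gram's offset within the pattern.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map each q-gram of `text` to the increasing list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate(index, pattern, q):
    """Return the sorted start positions of `pattern` in the indexed text.

    Illustrative assumption: len(pattern) >= q. Shorter patterns would need
    a scan of the q-gram vocabulary, which this sketch does not attempt.
    """
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")

    # Choose q-gram offsets 0, q, 2q, ... and finally len(pattern) - q,
    # so that every character of the pattern is covered by some q-gram.
    offsets = list(range(0, len(pattern) - q + 1, q))
    last = len(pattern) - q
    if offsets[-1] != last:
        offsets.append(last)

    candidates = None
    for off in offsets:
        gram = pattern[off:off + q]
        # Shift each occurrence back by `off` to get the implied pattern start.
        starts = {p - off for p in index.get(gram, ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


if __name__ == "__main__":
    text = "abracadabra"
    idx = build_qgram_index(text, 2)
    print(locate(idx, "abra", 2))
```

Because the chosen q-grams cover every character of the pattern, the intersection needs no verification pass against the text. The cost of the sketch grows with the lengths of the position lists involved, which is the quantity that a compressed inverted file would store in encoded form.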

Published In

SPIRE'06: Proceedings of the 13th international conference on String Processing and Information Retrieval

October 2006

366 pages

ISBN: 3540457747

Editors: Fabio Crestani (Department of Computer and Information Science, University of Strathclyde, Scotland), Paolo Ferragina (Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, Pisa, Italy), Mark Sanderson (Department of Information Studies, University of Sheffield, Sheffield, UK)

Publisher: Springer-Verlag, Berlin, Heidelberg
