DOI: 10.1007/11880561_11
Article

Inverted files versus suffix arrays for locating patterns in primary memory

Published: 11 October 2006

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we compare the resource requirements of compressed suffix array algorithms with those of compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams rather than words, which allows full pattern matching instead of simple word-based search and makes its functionality equivalent to that of the compressed suffix array structures. When the two structures use equivalent memory, inverted files are faster at reporting the locations of patterns when the number of occurrences of the patterns is high.
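
To make the comparison in the abstract concrete, the following is a minimal Python sketch of the two kinds of index answering the same "locate" query. It is an illustration only, not the paper's implementation: both structures are left uncompressed, the suffix array is built by naive sorting, and all function names (`build_qgram_index`, `locate_inverted`, `build_suffix_array`, `locate_suffix_array`) are hypothetical. The inverted file here assumes patterns are at least q characters long.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map each q-gram of text to the increasing list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate_inverted(index, q, pattern):
    """Return sorted start positions of pattern, assuming len(pattern) >= q."""
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    # Choose q-gram offsets that together cover every character of the pattern:
    # non-overlapping q-grams, plus one final q-gram aligned to the pattern end.
    offsets = list(range(0, len(pattern) - q + 1, q))
    if offsets[-1] != len(pattern) - q:
        offsets.append(len(pattern) - q)
    candidates = None
    for off in offsets:
        # Shift each posting back by the offset to get a candidate pattern start.
        starts = {p - off for p in index.get(pattern[off:off + q], ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate_suffix_array(text, sa, pattern):
    """Binary-search the suffix array for the block of suffixes prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "ACGTACGTTACG"
q = 2
index = build_qgram_index(text, q)
sa = build_suffix_array(text)
print(locate_inverted(index, q, "ACG"))
print(locate_suffix_array(text, sa, "ACG"))
```

In this sketch the inverted file reports occurrences by intersecting shifted position lists, so its cost is driven by the lengths of the lists it reads, while the suffix array first finds a contiguous range and then reports each entry in it. The structures compared in the paper additionally compress the position lists and the suffix array, which this sketch does not attempt.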



Published In

SPIRE'06: Proceedings of the 13th International Conference on String Processing and Information Retrieval
October 2006
366 pages
ISBN: 3540457747
Editors: Fabio Crestani, Paolo Ferragina, Mark Sanderson

Publisher

Springer-Verlag, Berlin, Heidelberg


Cited By

  • (2017) A Bloom filter based semi-index on q-grams. Software: Practice and Experience 47:6 (799-811). DOI: 10.1002/spe.2431. Online publication date: 1-Jun-2017
  • (2015) Sampling the Suffix Array with Minimizers. Proceedings of the 22nd International Symposium on String Processing and Information Retrieval - Volume 9309 (287-298). DOI: 10.1007/978-3-319-23826-5_28. Online publication date: 1-Sep-2015
  • (2012) Word-based self-indexes for natural language text. ACM Transactions on Information Systems 30:1 (1-34). DOI: 10.1145/2094072.2094073. Online publication date: 6-Mar-2012
  • (2012) String matching with alphabet sampling. Journal of Discrete Algorithms 11 (37-50). DOI: 10.1016/j.jda.2010.09.004. Online publication date: 1-Feb-2012
  • (2010) Top-k ranked document search in general text databases. Proceedings of the 18th annual European conference on Algorithms: Part II (194-205). DOI: 10.5555/1882123.1882145. Online publication date: 6-Sep-2010
  • (2010) Compression, indexing, and retrieval for massive string data. Proceedings of the 21st annual conference on Combinatorial pattern matching (260-274). DOI: 10.5555/1875737.1875761. Online publication date: 21-Jun-2010
  • (2010) Engineering basic algorithms of an in-memory text search engine. ACM Transactions on Information Systems 29:1 (1-37). DOI: 10.1145/1877766.1877768. Online publication date: 27-Dec-2010
  • (2008) Algorithms and data structures for external memory. Foundations and Trends® in Theoretical Computer Science 2:4 (305-474). DOI: 10.1561/0400000014. Online publication date: 1-Jan-2008
  • (2008) Output-sensitive autocompletion search. Information Retrieval 11:4 (269-286). DOI: 10.1007/s10791-008-9048-x. Online publication date: 1-Aug-2008
  • (2007) Space-efficient algorithms for document retrieval. Proceedings of the 18th annual conference on Combinatorial Pattern Matching (205-215). DOI: 10.5555/2394373.2394402. Online publication date: 9-Jul-2007
