Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

An efficient, versatile approach to suffix sorting

Published: 12 June 2008 Publication History

Abstract

Sorting the suffixes of a string into lexicographical order is a fundamental task in a number of contexts, most notably lossless compression (Burrows--Wheeler transformation) and text indexing (suffix arrays). Most approaches to suffix sorting produce a sorted array of suffixes directly, continually moving suffixes into their final place in the array until the ordering is complete. In this article, we describe a novel and resource-efficient (time and memory) approach to suffix sorting, which works in a complementary way—by assigning each suffix its rank in the final ordering, before converting to a sorted array, if necessary, once all suffixes are ranked. We layer several powerful extensions on this basic idea and show experimentally that our approach is superior to other leading algorithms in a variety of real-world contexts.

References

[1]
Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 1, 53--86.
[2]
Andersson, A. and Nilsson, S. 1998. Implementing radix sort. ACM Journal of Experimental Algorithmics 3.
[3]
Bentley, J. L. and McIlroy, M. D. 1993. Engineering a sort function. Software---Practice and Experience 23, 11, 1249--1265.
[4]
Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, New Orleans, LA. 360--369.
[5]
Burrows, M. and Wheeler, D. J. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, CA.
[6]
Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS 00). IEEE Computer Society, Redondo Beach, CA. 390--398.
[7]
González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of Fourth Workshop on Efficient and Experimental Algorithms (WEA'05). CTI Press and Ellinika Grammata, Greece. 27--38.
[8]
Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35, 2, 378--407.
[9]
Itoh, H. and Tanaka, H. 1999. An efficient method for in memory construction of suffix arrays. In Proceedings of the sixth Symposium on String Processing and Information Retrieval. IEEE Computer Society, Cancun, Mexico, 81--88.
[10]
Kärkkäinen, J. and Burkhardt, S. 2003. Fast lightweight suffix array construction and checking. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 55--69.
[11]
Kärkkäinen, J. and Sanders, P. 2003. Simple linear work suffix array construction. In Proceedings of the 30th International Colloq. Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 2971. Springer-Verlag, Berlin. 943--955.
[12]
Karp, R. M., Miller, R. E., and Rosenberg, A. L. 1972. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the Fourth Annual ACM Symposium on Theory of Computing. Denver, Colorado. 125--136.
[13]
Kim, D. K., S., S. J., Park, H., and Park, K. 2003. Linear-time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 186--199.
[14]
Ko, P. and Aluru, S. 2003. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 200--210.
[15]
Larsson, N. J. and Sadakane, K. 1999. Faster suffix--sorting. Tech. Rep. LU-CS-TR:99-214 {LUNFD6/(NFCS-3140)/1-20/(1999)}, Department of Computer Science, Lund University, Sweden.
[16]
Mäkinen, V. and Navarro, G. 2005. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12, 2, 40--66.
[17]
Manber, U. and Myers, G. W. 1993. Suffix arrays: A new model for on-line string searches. SIAM Journal of Computing 22, 5, 935--948.
[18]
Manzini, G. 2004. Two space saving tricks for linear time lcp computation. In SWAT 2004: 9th Scandinavian Workshop on Algorithm Theory, T. Hagerup and J. Katajainen, Eds. Lecture Notes in Computer Science, vol. 3111. Springer-Verlag, Berlin. 372--383.
[19]
Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33--50.
[20]
McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Computing Systems 6, 1, 5--27.
[21]
Moffat, A. and Isal, R. Y. K. 2005. Word-based compression using the Burrows-Wheeler transform. Information Processing and Management 41, 1175--1192.
[22]
Musser, D. R. 1997. Introspective sorting and selection algorithms. Software---Practice and Experience 27, 8, 983--993.
[23]
Puglisi, S. J. 2005. Exposition and analysis of a suffix sorting algorithm. Tech. Rep. CAS-05-02-WS, Department of Computing and Software, McMaster University, Canada.
[24]
Puglisi, S. J., Smyth, W. F., and Turpin, A. H. 2005. The performance of linear time suffix sorting algorithms. In Proceedings of the Data Compression Conference DCC '05, M. Cohn and J. Storer, Eds. IEEE Computer Society Press, Los Alamitos, CA. 358--368.
[25]
Puglisi, S. J., Smyth, W. F., and Turpin, A. H. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys. 39, 2, Article 1.
[26]
Sadakane, K. 1998. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proceedings of the Data Compression Conference DCC '98. IEEE Computer Society, Los Alamitos, CA. 129--138.
[27]
Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Thirteenth ACM-SIAM Symposium on Discrete Algorithms. 225--232.
[28]
Schindler, M. 2002. szip homepage. http://www.compressconsult.com/szip/. Available Online.
[29]
Schürmann, K. and Stoye, J. 2005. An incomplex algorithm for fast suffix array construction. In Proceedings of The Seventh Workshop on Algorithm Engineering and Experiments (ALENEX'05). SIAM. 77--85.
[30]
Seward, J. 2000. On the performance of BWT sorting algroithms. In Proceedings of the Data Compression Conference DCC '00. IEEE Computer Society, Los Alamitos, CA. 173--182.
[31]
Seward, J. 2004. The bzip2 and libbzip2 home page. http://sources.redhat.com/bzip2/. Available Online.
[32]
Sinha, R. and Zobel, J. 2004. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Journal of Experimental Algorithmics 9.
[33]
Smyth, W. F. 2003. Computing Patterns in Strings. Addison-Wesley-Pearson Education Limited, Essex, England.
[34]
Vines, P. and Zobel, J. 1998. Compression techniques for Chinese text. Software---Practice and Experience 28, 12, 1299--1314.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Journal of Experimental Algorithmics
ACM Journal of Experimental Algorithmics  Volume 12, Issue
2008
507 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/1227161
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2008
Accepted: 01 June 2006
Received: 01 December 2005
Revised: 01 June 2005
Published in JEA Volume 12

Author Tags

  1. Burrows--Wheeler transform
  2. Suffix sorting
  3. suffix array
  4. suffix tree

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)4
Reflects downloads up to 01 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2018)Optimal In-Place Suffix SortingString Processing and Information Retrieval10.1007/978-3-030-00479-8_22(268-284)Online publication date: 14-Sep-2018
  • (2016)Inducing Suffix and LCP Arrays in External MemoryACM Journal of Experimental Algorithmics10.1145/297559321(1-27)Online publication date: 15-Sep-2016
  • (2014)Faster semi-external suffix sortingInformation Processing Letters10.1016/j.ipl.2013.11.017114:4(174-178)Online publication date: 1-Apr-2014
  • (2013)PASQUALIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2012.19024:5(977-986)Online publication date: 1-May-2013
  • (2012)IN-PLACE UPDATE OF SUFFIX ARRAY WHILE RECODING WORDSInternational Journal of Foundations of Computer Science10.1142/S012905410900702920:06(1025-1045)Online publication date: 30-Apr-2012
  • (2012)Revisiting bounded context block-sorting transformationsSoftware—Practice & Experience10.1002/spe.111242:8(1037-1054)Online publication date: 1-Aug-2012
  • (2011)Backwards Search in Context Bound Text TransformationsProceedings of the 2011 First International Conference on Data Compression, Communications and Processing10.1109/CCP.2011.18(82-91)Online publication date: 21-Jun-2011
  • (2010)Speeding up tandem mass spectrometry-based database searching by longest common prefixBMC Bioinformatics10.1186/1471-2105-11-57711:1Online publication date: 25-Nov-2010
  • (2008)Linear-Time Computation of Similarity Measures for Sequential DataThe Journal of Machine Learning Research10.5555/1390681.13906839(23-48)Online publication date: 1-Jun-2008
  • (2007)Suffix arrays on wordsProceedings of the 18th annual conference on Combinatorial Pattern Matching10.5555/2394373.2394417(328-339)Online publication date: 9-Jul-2007
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media