
Grammar Compression by Induced Suffix Sorting

Published: 26 August 2022

Abstract

This work introduces GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, presented by Nong et al. in 2009. The proposed solution builds on the factorization performed by SAIS during suffix sorting. A context-free grammar is used to replace factors by non-terminals, and the algorithm is then applied recursively to the shorter sequence of non-terminals. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between the right-hand sides of rules, which are sorted according to SAIS. GCIS stands out for the low space and time it requires for compression while obtaining competitive compression ratios. Our experiments on regular and repetitive, moderate and very large texts show that GCIS is a very convenient choice compared to well-known compressors such as Gzip and 7-Zip; RePair, the gold standard in grammar compression; and recent compressors such as SOLCA, LZRR, and LZD. In exchange, GCIS is slow at decompressing. Yet grammar compressors are more convenient than Lempel-Ziv compressors in that one can access text substrings directly in compressed form, without ever decompressing the text. We demonstrate that GCIS is an excellent candidate for this scenario, because it proves competitive with its RePair-based alternatives. We also show that the relation with SAIS makes GCIS a good intermediate structure for building the suffix array and the LCP array during decompression of the text.
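
As a rough illustration of the scheme summarized above, the following Python sketch cuts a sequence at its LMS positions (S-type positions preceded by an L-type position, as in SAIS), replaces each distinct factor by a non-terminal, and recurses on the shorter sequence of non-terminals. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, factors are named by plain lexicographic sorting instead of induced sorting, the last symbol is treated as L-type as if a smaller sentinel followed it, and the rules are stored uncompressed (no prefix-based encoding of right-hand sides).

```python
def lms_positions(seq):
    """Return the LMS positions of seq: S-type positions preceded by an L-type one."""
    n = len(seq)
    if n < 2:
        return []
    # Assumption: the last symbol is L-type, as if a smaller sentinel followed it.
    is_s = [False] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = seq[i] < seq[i + 1] or (seq[i] == seq[i + 1] and is_s[i + 1])
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def compress_round(seq):
    """Cut seq at its LMS positions and replace each distinct factor by a name.

    Returns (prefix, rules, reduced), or None when seq has no LMS position.
    """
    cuts = lms_positions(seq)
    if not cuts:
        return None
    prefix = tuple(seq[:cuts[0]])            # symbols before the first factor
    bounds = cuts + [len(seq)]
    factors = [tuple(seq[a:b]) for a, b in zip(bounds, bounds[1:])]
    rules = sorted(set(factors))             # assumption: plain lexicographic order
    name = {factor: k for k, factor in enumerate(rules)}
    reduced = [name[factor] for factor in factors]
    return prefix, rules, reduced


def compress(text):
    """Apply rounds recursively; return the per-level grammars and the top sequence."""
    seq = [ord(c) for c in text]
    levels = []
    step = compress_round(seq)
    while step is not None:
        prefix, rules, seq = step
        levels.append((prefix, rules))
        step = compress_round(seq)
    return levels, seq


def decompress(levels, top):
    """Expand the top sequence level by level, from the last round back to the text."""
    seq = list(top)
    for prefix, rules in reversed(levels):
        expanded = list(prefix)
        for symbol in seq:
            expanded.extend(rules[symbol])
        seq = expanded
    return "".join(chr(c) for c in seq)
```

In this sketch every factor spans at least two symbols, so each round at least halves the sequence and the recursion terminates. Calling `decompress(*compress(text))` returns the original `text`.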

References

[1]
Tooru Akagi, Dominik Köppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. 2021. Grammar index by induced suffix sorting. CoRR abs/2105.13744 (2021).
[2]
Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Softw., Pract. Exper. 40, 2 (2010), 131–147.
[3]
Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, and Hjalte Wedel Vildhøj. 2018. Time-space trade-offs for Lempel-Ziv compressed indexing. Theor. Comput. Sci. 713 (2018), 66–77.
[4]
Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. 2015. Random access to grammar-compressed strings and trees. SIAM J. Comput. 44, 3 (2015), 513–539.
[5]
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2013. DACs: Bringing direct access to variable-length codes. Inf. Process. Manag. 49, 1 (2013), 392–404.
[6]
Michael Burrows and David J. Wheeler. 1994. A Block-sorting Lossless Data Compression Algorithm. Technical Report. Digital SRC Research Report.
[7]
Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. 2005. The smallest grammar problem. IEEE Trans. Inf. Theor. 51, 7 (2005), 2554–2576.
[8]
Francisco Claude and Gonzalo Navarro. 2010. Self-indexed grammar-based compression. Fundamenta Informaticae 111, 3 (2010), 313–337.
[9]
Francisco Claude and Gonzalo Navarro. 2012. Improved grammar-based compressed indexes. In 19th International Symposium on String Processing and Information Retrieval (SPIRE) (LNCS 7608). Springer, 180–192.
[10]
Genome Reference Consortium. 2009. Genome Reference Consortium Human Reference 37. Retrieved from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/.
[11]
Sebastian Deorowicz. 2003. Silesia Corpus. Retrieved from http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia.
[12]
Jasbir Dhaliwal, Simon J. Puglisi, and Andrew Turpin. 2012. Trends in suffix sorting: A survey of low memory algorithms. In Australasian Computer Science Conference (ACSC). Australian Computer Society Inc., 91–98.
[13]
D. Díaz-Domínguez and G. Navarro. 2021. A grammar compressor for collections of reads with applications to the construction of the BWT. In 31st Data Compression Conference (DCC). 93–102.
[14]
D. Díaz-Domínguez, G. Navarro, and A. Pacheco. 2021. An LMS-based grammar self-index with local consistency properties. In 28th International Symposium on String Processing and Information Retrieval (SPIRE). Retrieved from https://users.dcc.uchile.cl/gnavarro/ps/spire21.2.pdf.
[15]
Paolo Ferragina and Gonzalo Navarro. 2005a. Pizza-Chili Corpus. Retrieved from http://pizzachili.dcc.uchile.cl/texts.html.
[16]
Paolo Ferragina and Gonzalo Navarro. 2005b. Pizza-Chili Repetitive Corpus. Retrieved from http://pizzachili.dcc.uchile.cl/repcorpus.html.
[17]
Johannes Fischer. 2011. Inducing the LCP-array. In Workshop on Algorithms and Data Structures (WADS) (Lecture Notes in Computer Science, Vol. 6844). Springer, Berlin, 374–385.
[18]
Johannes Fischer and Florian Kurpicz. 2017. Dismantling DivSufSort. In Proceedings of the Prague Stringology Conference. Department of Theoretical Computer Science, Faculty of Information Technology, 62–76.
[19]
Travis Gagie, Tomohiro I., Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, and Yoshimasa Takabatake. 2020. Practical Random Access to SLP-Compressed Texts. (2020). Accepted short paper, SPIRE.
[20]
Travis Gagie, Tomohiro I., Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, and Yoshimasa Takabatake. 2019. Rpair: Scaling up RePair with rsync. In 26th International Symposium on String Processing and Information Retrieval (SPIRE) (Lecture Notes in Computer Science, Vol. 11811). Springer-Verlag, Berlin, 35–44.
[21]
Jean-Loup Gailly and Mark Adler. 2011. Accessed: 3/2017. The gzip home page. Retrieved from http://www.gzip.org/.
[22]
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. 2014. From theory to practice: Plug and play with succinct data structures. In Symposium on Experimental and Efficient Algorithms (SEA) (Lecture Notes in Computer Science, Vol. 8504). Springer, Cham, 326–337.
[23]
Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Workshop on Algorithm Engineering and Experimentation (ALENEX). ACM Digital Library, 25–34.
[24]
Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval. Prentice-Hall, Inc., Upper Saddle River, NJ, 66–82.
[25]
Keisuke Goto and Hideo Bannai. 2014. Space efficient linear time Lempel-Ziv factorization for small alphabets. In IEEE Data Compression Conference (DCC). IEEE, NY, 163–172.
[26]
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. 2015. LZD factorization: Simple and practical online grammar compression with variable-to-fixed encoding. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9133). Springer, 219–230.
[27]
Tomohiro I. 2020. Shaped SLP implementation. Retrieved from https://github.com/itomomoti/ShapedSlp.
[28]
Hideo Itoh and Hozumi Tanaka. 1999. An efficient method for in memory construction of suffix arrays. In International Symposium on String Processing and Information Retrieval (SPIRE). IEEE, NY, 81–88.
[29]
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. 2013. Linear time Lempel-Ziv factorization: Simple, fast, small. In Annual Symposium on Combinatorial Pattern Matching (CPM) (Lecture Notes in Computer Science, Vol. 7922). Springer, Berlin, 189–200.
[30]
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted longest-common-prefix array. In Annual Symposium on Combinatorial Pattern Matching (CPM) (Lecture Notes in Computer Science, Vol. 5577). Springer, Berlin, 181–192.
[31]
John C. Kieffer and En-Hui Yang. 2000. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theor. 46, 3 (2000), 737–754.
[32]
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Annual Symposium on Combinatorial Pattern Matching (CPM) (Lecture Notes in Computer Science, Vol. 2676). Springer, Berlin, 200–210.
[33]
Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. 2020. Lempel-Ziv like parsing in small space.
[34]
Florian Kurpicz. 2015. Sais-lite suffix and LCP arrays construction algorithm. Retrieved from https://github.com/kurpicz/sais-lite-lcp.
[35]
Florian Kurpicz. 2016. DivSufSort suffix and LCP arrays construction algorithm. Retrieved from https://github.com/kurpicz/libdivsufsort.
[36]
N. Jesper Larsson and Alistair Moffat. 1999. Offline dictionary-based compression. In IEEE Data Compression Conference (DCC). IEEE, NY, 296–305.
[37]
Weijun Liu, Ge Nong, Wai Hong Chan, and Yi Wu. 2016. Improving a lightweight LZ77 computation algorithm for running faster. Softw. Pract. Exp. 46, 9 (2016), 1201–1217.
[38]
Felipe A. Louza, Simon Gog, and Guilherme P. Telles. 2017a. Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678 (2017), 22–39.
[39]
Felipe A. Louza, Simon Gog, and Guilherme P. Telles. 2017b. Optimal suffix sorting and LCP array construction for constant alphabets. Inf. Process. Lett. 118 (2017), 30–34.
[40]
Matt Mahoney. 2006. Large Text Compression Benchmark. Retrieved from http://mattmahoney.net/dc/text.html.
[41]
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935–948.
[42]
Giovanni Manzini. 2003. Manzini’s Lightweight Corpus. Retrieved from http://people.unipmn.it/~manzini/lightweight/.
[43]
Shirou Maruyama and Yasuo Tabei. 2013. Fully online grammar compression in constant space. In IEEE Data Compression Conference (DCC). IEEE, NY, 173–182.
[44]
Yuta Mori. 2008. DivSufSort suffix array construction algorithm. Retrieved from https://github.com/y-256/libdivsufsort.
[45]
Yuta Mori. 2010. Sais-lite suffix sorting algorithm. Retrieved from https://sites.google.com/site/yuta256/sais.
[46]
NCBI. 2007. Salmonella enterica subsp. enterica serovar Paratyphi B str. SPB7, complete sequence. Retrieved from https://www.ncbi.nlm.nih.gov/nuccore/NC_010102.
[47]
NCBI. 2020. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. Retrieved from https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.
[48]
Takaaki Nishimoto, Tomohiro I., Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. 2020a. Dynamic index and LZ factorization in compressed space. Discret. Appl. Math. 274 (2020), 116–129.
[49]
Takaaki Nishimoto and Yasuo Tabei. 2019. LZRR: LZ77 parsing with right reference. In IEEE Data Compression Conference (DCC). IEEE, 211–220.
[50]
Takaaki Nishimoto, Yoshimasa Takabatake, and Yasuo Tabei. 2020b. A compressed dynamic self-index for highly repetitive text collections. Inf. Comput. 273 (2020), 104518.
[51]
Ge Nong. 2013. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inform. Syst. 31, 3 (2013), 1–15.
[52]
Ge Nong, Sen Zhang, and Wai H. Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In IEEE Data Compression Conference (DCC). IEEE, NY, 193–202.
[53]
Ge Nong, Sen Zhang, and Wai H. Chan. 2011. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60, 10 (2011), 1471–1484.
[54]
Daniel Saad Nogueira Nunes, Felipe Alves da Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. 2018. A grammar compression algorithm based on induced suffix sorting. In IEEE Data Compression Conference (DCC). IEEE, NY, 42–51.
[55]
Enno Ohlebusch and Simon Gog. 2011. Lempel-Ziv factorization revisited. In Annual Symposium on Combinatorial Pattern Matching (CPM) (Lecture Notes in Computer Science, Vol. 6661). Springer, Berlin, 15–26.
[56]
Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In International Symposium on String Processing and Information Retrieval (SPIRE) (Lecture Notes in Computer Science, Vol. 5721). Springer, Berlin, 90–101.
[57]
Igor Pavlov. 2016. Accessed: 10/2017. The 7zip home page. Retrieved from http://www.7-zip.org/.
[58]
Simon J. Puglisi, William F. Smyth, and Andrew H. Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comp. Surv. 39, 2 (2007), 1–31.
[59]
Julian Seward. 1996. The bzip home page. Retrieved from http://www.bzip.org/.
[60]
Dmitry Shkarin. 2006. PPMd algorithm variant j revision 1. Retrieved from http://www.compression.ru/ds/.
[61]
Yoshimasa Takabatake, Tomohiro I., and Hiroshi Sakamoto. 2017. A space-optimal grammar compression. In 25th Annual European Symposium on Algorithms, ESA 2017, September 4–6, 2017, Vienna, Austria (LIPIcs), Vol. 87. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 67:1–67:15.
[62]
Yoshimasa Takabatake, Yasuo Tabei, and Hiroshi Sakamoto. 2014. Improved ESP-index: A practical self-index for highly repetitive texts. In Experimental Algorithms - 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29–July 1, 2014. Proceedings (Lecture Notes in Computer Science), Joachim Gudmundsson and Jyrki Katajainen (Eds.), Vol. 8504. Springer, 338–350.
[63]
Yoshimasa Takabatake, Yasuo Tabei, and Hiroshi Sakamoto. 2015. Online self-indexed grammar compression. In String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1–4, 2015, Proceedings (Lecture Notes in Computer Science), Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz (Eds.), Vol. 9309. Springer, 258–269.
[64]
Andrew Tridgell. 1998. Andrew Tridgell’s Large Corpus. Retrieved from https://www.samba.org/ftp/tridge/large-corpus/.
[65]
Sebastiano Vigna. 2013. Quasi-succinct indices. In 6th ACM International Conference on Web Search and Data Mining. ACM Digital Library, 83–92.
[66]
Raymond Wan. 2014. Offline Dictionary-based Compression (RePair, Recursive Pairing). Retrieved from https://github.com/rwanwork/Re-Pair.
[67]
Wikipedia. 2019. Wikipedia’s Pages and Articles XML Dump. Retrieved from http://wikipedia.c3sl.ufpr.br/enwiki/20191120/.
[68]
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. 1999. Managing Gigabytes (2nd Ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA.
[69]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 3 (1977), 337–343.

Cited By

  • (2024) Computing MEMs and Relatives on Repetitive Text Collections. ACM Transactions on Algorithms 21:1 (1-33). DOI: 10.1145/3701561. Online publication date: 17-Dec-2024
  • (2024) Algorithm design and performance evaluation of sparse induced suffix sorting. Information Processing and Management: an International Journal 61:5. DOI: 10.1016/j.ipm.2024.103777. Online publication date: 1-Sep-2024
  • (2023) Text Compression Algorithms Employed Vowels and Article Pattern. 2023 7th International Conference on Information Technology (InCIT) (346-351). DOI: 10.1109/InCIT60207.2023.10413179. Online publication date: 16-Nov-2023

Published In

ACM Journal of Experimental Algorithmics  Volume 27, Issue
December 2022
776 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/3505192

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 August 2022
Accepted: 01 September 2021
Revised: 01 June 2021
Received: 01 July 2019
Published in JEA Volume 27

Author Tags

  1. Data compression
  2. suffix sorting
  3. extract
  4. suffix-array
  5. LCP-array

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • FAP-DF
  • FAL
  • São Paulo Research Foundation (FAPESP)
  • FAP-DF and CNPq
  • Basal Funds
  • Fondecyt
