research-article

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Author:

Gonzalo Navarro

ACM Computing Surveys (CSUR), Volume 54, Issue 2

Article No.: 26, Pages 1 - 32

https://doi.org/10.1145/3432999

Published: 09 February 2021

Abstract

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outpaces Moore’s Law and challenges our ability to handle them even in compressed form. Fortunately, many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this two-part survey, we cover the algorithmic developments that have led to these data structures.

In this second part, we describe the fundamental algorithmic ideas and data structures that form the basis of all existing indexes, and the various concrete structures that have been proposed, comparing them in both theoretical and practical terms, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.

Index Terms

1. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis

Published In

ACM Computing Surveys Volume 54, Issue 2

March 2022

800 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3450359

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 February 2021

Accepted: 01 October 2020

Revised: 01 October 2020

Received: 01 April 2020

Published in CSUR Volume 54, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Millennium Science Initiative Program - Code ICN17_002
Fondecyt
ANID Basal Funds
