Abstract
We show that both the Lempel–Ziv-77 and the Lempel–Ziv-78 factorization of a text of length \(n\) on an integer alphabet of size \(\sigma\) can be computed in \(\mathcal{O}(n)\) time with either \(\mathcal{O}(n \lg \sigma)\) bits of working space, or \((1+\epsilon) n \lg n + \mathcal{O}(n)\) bits (for a constant \(\epsilon > 0\)) of working space (including the space for the output, but not the text).
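To make the two factorizations concrete, the following is a minimal sketch of naive, quadratic-time reference implementations. It is not the linear-time, space-efficient method of the article. The function names are hypothetical, and the LZ77 variant shown (a factor is either a single fresh character or the longest prefix of the remaining text that also starts at an earlier position, overlaps allowed) is one common definition assumed here.

```python
def lz77_factorize(text):
    """Naive LZ77: each factor is a fresh character or the longest
    prefix of the remaining text that also starts at an earlier position."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        for j in range(i):  # every earlier starting position is a candidate
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        length = max(best, 1)  # a fresh character forms a factor of length 1
        factors.append(text[i:i + length])
        i += length
    return factors


def lz78_factorize(text):
    """Naive LZ78: each factor is the longest previous factor (possibly
    empty) extended by one character, found by walking a trie."""
    trie = {}  # (parent node id, character) -> child node id; 0 is the root
    factors = []
    node, start = 0, 0
    for pos, char in enumerate(text):
        key = (node, char)
        if key in trie:
            node = trie[key]  # keep following the existing factor
        else:
            trie[key] = len(trie) + 1  # new trie node for the new factor
            factors.append(text[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(text):
        factors.append(text[start:])  # trailing factor without a new character
    return factors
```

For the text `abab`, both sketches produce the factors `a`, `b`, `ab`. On `aaaa` they differ: LZ77 yields `a`, `aaa`, whereas LZ78 yields `a`, `aa`, `a`.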
Notes
In the initial submission, the deterministic \(\mathcal{O}(n)\)-time suffix tree construction algorithm of Munro et al. [43] was not yet published. Our former results were \(\mathcal{O}(n)\) randomized or \(\mathcal{O}(n \lg \lg \sigma)\) deterministic time, based on the suffix tree construction algorithm of [2].
More precisely, we use the permuted longest common prefix array that can access \(\mathsf {LCP}\) only in conjunction with \(\mathsf {SA}\).
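The relation between the permuted array and \(\mathsf {LCP}\) can be sketched as follows. This is an illustrative example only: the suffix array is built by plain sorting, the permuted array is kept as an ordinary list rather than in a compressed encoding, and all function names are hypothetical. It shows why an \(\mathsf {LCP}\) value in suffix-array order needs \(\mathsf {SA}\): the permuted array is indexed by text position.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def plcp_array(text, sa):
    """Permuted LCP array: plcp[i] is the LCP of the suffix starting at
    text position i with its lexicographic predecessor."""
    n = len(text)
    phi = [-1] * n  # phi[i] = start of the suffix preceding suffix i in SA
    for rank in range(1, n):
        phi[sa[rank]] = sa[rank - 1]
    plcp = [0] * n
    length = 0
    for i in range(n):
        j = phi[i]
        if j < 0:
            length = 0  # the smallest suffix has no predecessor
        else:
            while i + length < n and j + length < n and text[i + length] == text[j + length]:
                length += 1
        plcp[i] = length
        length = max(length - 1, 0)  # plcp[i + 1] >= plcp[i] - 1
    return plcp


def lcp_at(rank, sa, plcp):
    """LCP value in suffix-array order, accessed through SA."""
    return plcp[sa[rank]]
```

For `banana`, the suffix array is `[5, 3, 1, 0, 4, 2]` and the permuted array is `[0, 3, 2, 1, 0, 0]`, so `lcp_at(2, sa, plcp)` looks up position `sa[2] = 1` and returns 3.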
The time bound for computing the suffix array has recently been improved to \(\mathcal{O}(n)\) by two in-place suffix sorting algorithms [19, 40]. Our succinct suffix tree is composed of both \(\mathsf {SA}\) and \(\mathsf {ISA}\), yielding \((1+\epsilon) n \lg n\) bits and \(\mathcal{O}(n/\epsilon)\) construction time. This construction time is the bottleneck of the succinct suffix tree construction and of the algorithms described later. Hence, we can lower the time from \(\mathcal{O}(n/\epsilon^2)\) to \(\mathcal{O}(n/\epsilon)\) in Theorem 2.8 and Corollaries 3.2, 3.7, and 4.10.
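One classical way to keep a permutation such as \(\mathsf {SA}\) together with access to its inverse in roughly \((1+\epsilon) n \lg n\) bits is to add shortcut pointers on the cycles of the permutation. The sketch below illustrates that general idea under stated assumptions; it is not claimed to be the construction used in the article. The names `build_shortcuts` and `inverse` are hypothetical, the step width `t` plays the role of \(1/\epsilon\), and a Python dictionary stands in for a compact pointer array.

```python
def build_shortcuts(perm, t):
    """On every cycle longer than t, give every t-th element a pointer
    to the element t steps back on the same cycle."""
    n = len(perm)
    back = {}
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        if len(cycle) > t:
            for k in range(0, len(cycle), t):
                back[cycle[k]] = cycle[k - t]  # negative index wraps around the cycle
    return back


def inverse(perm, back, x):
    """Return y with perm[y] == x using O(t) steps along the cycle of x."""
    cur = x
    while cur not in back:  # walk forward to the next shortcut
        if perm[cur] == x:
            return cur  # short cycle: predecessor found directly
        cur = perm[cur]
    cur = back[cur]  # jump t steps back, landing at or before the predecessor
    while perm[cur] != x:
        cur = perm[cur]
    return cur
```

With `perm` set to the suffix array `[5, 3, 1, 0, 4, 2]` of `banana` and `t = 2`, `inverse(perm, back, 0)` returns 3, the rank of the suffix starting at text position 0. Fewer than `2 * n / t` pointers are stored, which is where the additional \(\epsilon n \lg n\) bits would come from in a compact encoding.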
References
Amir, A., Farach, M., Idury, R.M., Poutré, J.A.L., Schäffer, A.A.: Improved dynamic dictionary matching. Inf. Comput. 119(2), 258–282 (1995)
Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Proceedings of the STOC, pp. 148–157. ACM (2014)
Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA, pp. 2053–2071. ACM/SIAM (2016)
Belazzougui, D., Mäkinen, V., Valenzuela, D.: Compressed suffix array. In: Encyclopedia of Algorithms, pp. 386–390. Springer (2016)
Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)
Clark, D.R.: Compact Pat Trees. Ph.D. Thesis. University of Waterloo (1996)
Crochemore, M.: Transducers and repetitions. Theor. Comput. Sci. 45(1), 63–86 (1986)
Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)
Duval, J., Kolpakov, R., Kucherov, G., Lecroq, T., Lefebvre, A.: Linear-time computation of local periods. Theor. Comput. Sci. 326(1–3), 229–240 (2004)
El-Zein, H., Munro, J.I., Robertson, M.: Raising permutations to powers in place. In: Proceedings of the ISAAC, volume 64 of LIPIcs, pp. 29:1–29:12. Schloss Dagstuhl (2016)
Farach, M.: Optimal suffix tree construction with large alphabets. In: Foundations of Computer Science, pp. 137–143. IEEE Computer Society (1997)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, volume 9133 of LNCS, pp. 160–171. Springer (2015)
Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
Franceschini, G., Muthukrishnan, S., Pǎtraşcu, M.: Radix sorting with no extra space. In: Proceedings of the ESA, volume 4698 of LNCS, pp. 194–205. Springer (2007)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Proceedings of the LATA, volume 7183 of LNCS, pp. 240–251. Springer (2012)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Proceedings of the LATIN, volume 8392 of LNCS, pp. 731–742. Springer (2014)
Goto, K.: Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. ArXiv CoRR, arXiv:1703.01009 (2017)
Goto, K., Bannai, H.: Simpler and faster Lempel Ziv factorization. In: Proceedings of the DCC, pp. 133–142. IEEE Computer Society (2013)
Goto, K., Bannai, H.: Space efficient linear time Lempel–Ziv factorization for small alphabets. In: Proceedings of the DCC, pp. 163–172. IEEE Computer Society (2014)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
Gusfield, D., Stoye, J.: Linear time algorithms for finding and representing all the tandem repeats in a string. J. Comput. Syst. Sci. 69(4), 525–546 (2004)
Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. In: Proceedings of the FOCS, pp. 251–260. IEEE Computer Society (2003)
Jacobson, G.J.: Space-efficient static trees and graphs. In: Proceedings of the FOCS, pp. 549–554. IEEE Computer Society (1989)
Jansson, J., Sadakane, K., Sung, W.-K.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631 (2012)
Jansson, J., Sadakane, K., Sung, W.-K.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)
Kärkkäinen, J., Sutinen, E.: Lempel–Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998)
Kärkkäinen, J., Ukkonen, E.: Lempel–Ziv parsing and sublinear-size index structures for string matching. In: South American Workshop on String Processing (WSP), pp. 141–155. Carleton University Press (1996)
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 1–19 (2006)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of the CPM, volume 7922 of LNCS, pp. 189–200. Springer (2013)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Proceedings of the SEA, volume 7933 of LNCS, pp. 139–150. Springer (2013)
Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the ALENEX, pp. 103–112. SIAM (2013)
Kociumaka, T., Kubica, M., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for seeds computation. In: Proceedings of the SODA, pp. 1095–1112. ACM/SIAM (2012)
Kolpakov, R.M., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: Proceedings of the FOCS, pp. 596–604 (1999)
Kolpakov, R.M., Kucherov, G.: Finding repeats with fixed gap. In: Proceedings of the SPIRE, pp. 162–168. IEEE Computer Society (2000)
Köppl, D., Sadakane, K.: Lempel–Ziv computation in compressed space (LZ-CICS). In: Proceedings of the DCC, pp. 3–12. IEEE Computer Society (2016)
Li, M., Sleep, R.: An LZ78 based string kernel. In: Proceedings of the ADMA, volume 3584 of LNCS, pp. 678–689. Springer (2005)
Li, M., Zhu, Y.: Image classification via LZ78 based string kernel: a comparative study. In: Proceedings of the PAKDD, volume 3918 of LNCS, pp. 704–712. Springer (2006)
Li, Z., Li, J., Huo, H.: Optimal in-place suffix sorting. ArXiv CoRR, arXiv:1610.08305 (2016)
Main, M.G.: Detecting leftmost maximal periodicities. Discrete Appl. Math. 25(1–2), 145–153 (1989)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Munro, J.I., Navarro, G., Nekrich, Y.: Space-efficient construction of compressed indexes in deterministic linear time. In: Proceedings of the SODA, pp. 408–424. SIAM (2017)
Nakashima, Y., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Constructing LZ78 tries and position heaps in linear time for large alphabets. Inf. Process. Lett. 115(9), 655–659 (2015)
Navarro, G.: Indexing text using the Ziv–Lempel trie. J. Discrete Algorithms 2(1), 87–114 (2004)
Navarro, G.: Compact Data Structures: A practical approach. Cambridge University Press, Cambridge (2016)
Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)
Navarro, G., Sadakane, K.: Fully functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), 16 (2014)
Nong, G.: Practical linear-time \(\cal{O}(1)\)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 15 (2013)
Ohlebusch, E., Fischer, J., Gog, S.: CST++. In: Proceedings of the SPIRE, volume 6393 of LNCS, pp. 322–333. Springer (2010)
Ouyang, J., Luo, H., Wang, Z., Tian, J., Liu, C., Sheng, K.: FPGA implementation of GZIP compression and decompression for IDC services. In: Proceedings of the FPT, pp. 265–268. IEEE Computer Society (2010)
Richard, G.G., Case, A.: In lieu of swap: analyzing compressed RAM in Mac OS X and Linux. Digit. Investig. 11, 3–12 (2014)
Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully-compressed suffix trees. In: Proceedings of the LATIN, volume 4957 of LNCS, pp. 362–373. Springer (2008)
Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: Proceedings of the SODA, pp. 225–237. ACM/SIAM (2002)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proceedings of the SODA, pp. 1230–1239. ACM/SIAM (2006)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)
Välimäki, N., Mäkinen, V., Gerlach, W., Dixit, K.: Engineering a compressed suffix tree implementation. ACM J. Exp. Algorithm. 14, 2 (2009)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Acknowledgements
We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions. We are especially grateful to the reviewer who pointed out a simplification of our original solution for storing the exploration counters for the LZ78 factorization (Sect. 4.1). Further, we are grateful to Sean Tohidi, who spell-checked the initial submission of this paper during his DAAD RISE internship at TU Dortmund. This research was supported by CREST, JST.
Appendix: List of Identifiers
While describing both factorization algorithms, we used several data structures, among others bit vectors, some with rank or select-support, to achieve the small space bounds. We denote bit vectors with \(B_{\alpha }\) for some letter \(\alpha \).
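As a rough illustration of what rank and select support means for such a bit vector, here is a minimal sketch. It stores the bits in a plain list with one precomputed count per block, which is far from the succinct encodings the space bounds rely on. The class name and block size are assumptions made for this example.

```python
from bisect import bisect_left


class RankSelectBitVector:
    """Bit vector with block-wise precomputed counts (illustration only)."""

    def __init__(self, bits, block=64):
        self.bits = list(bits)
        self.block = block
        # block_ranks[b] = number of 1-bits before block b
        self.block_ranks = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.block
        return self.block_ranks[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1-bit, counting from k = 1."""
        if not 1 <= k <= self.block_ranks[-1]:
            raise ValueError("k out of range")
        b = bisect_left(self.block_ranks, k) - 1  # block containing the k-th 1-bit
        count = self.block_ranks[b]
        for pos in range(b * self.block, len(self.bits)):
            count += self.bits[pos]
            if count == k:
                return pos
```

A plausible use, stated here as an assumption, is to turn a marked position into a dense id: if bit `v` is set in a vector such as \(B_{W}\), then `rank1(v)` counts the marked positions before `v` and can index a companion array, while `select1` maps an id back to its position.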
For all types of LZ-factorizations we use
- \(B_{W}\) marking all witness nodes, and
- the array W mapping witness ids to
  - (LZ77) text positions, or
  - (LZ78) factor indices.
In LZ77 we use
- \(B_{V}\) marking visited nodes.
In LZ78 we use
- \(B_{C}\) storing the counter \(n_v\) of each partially explored node v,
- \(B_{V}\) marking suffix tree nodes represented in the LZ trie (their ingoing edges are fully explored),
- \(B_{LZ}\) marking explicit LZ nodes,
- the array \(W'\) mapping LZ nodes to factor indices, and
- \(B_{E}\) marking the edge witnesses.
The algorithms based on the succinct suffix tree (SST) additionally use
- \(B_{T}\) marking the factor positions, used also for representing the length of a factor.
We count the number of
- factors by z,
- witnesses by \({z_{\text {W}}}\),
- referencing factors by \({z_{\text {R}}}\), and
- fresh factors by \({z_{\text {F}}}\).
Figure 14 highlights the kind of suffix tree representation (either compressed or succinct suffix tree) used in each subsection of the algorithmic part of the article. Table 2 lists the particular data structures of each such subsection.
Cite this article
Fischer, J., I, T., Köppl, D. et al. Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees. Algorithmica 80, 2048–2081 (2018). https://doi.org/10.1007/s00453-017-0333-1
DOI: https://doi.org/10.1007/s00453-017-0333-1