Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees

Algorithmica

Abstract

We show that both the Lempel–Ziv-77 and the Lempel–Ziv-78 factorization of a text of length \(n\) on an integer alphabet of size \(\sigma\) can be computed in \(\mathcal{O}(n)\) time with either \(\mathcal{O}(n \lg \sigma)\) bits of working space, or \((1+\epsilon) n \lg n + \mathcal{O}(n)\) bits (for a constant \(\epsilon > 0\)) of working space (including the space for the output, but not the text).
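
To make the two factorizations concrete, the following is a minimal sketch of naive reference implementations. It assumes the common definitions: an LZ77 factor is either a fresh letter or the longest prefix of the remaining text that also starts at an earlier text position (overlaps allowed), and an LZ78 factor is the longest previously created factor extended by one letter. The function names are hypothetical, and the quadratic-time scan and the dictionary-based trie are stand-ins for illustration only; they are not the space-efficient suffix tree algorithms of the article.

```python
def lz77_factorize(text):
    """Greedy LZ77 factorization (naive quadratic scan, illustration only)."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len = 0
        for j in range(i):  # every earlier starting position is a candidate
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1  # the reference may overlap the factor itself
            best_len = max(best_len, k)
        length = max(best_len, 1)  # a fresh letter forms a factor of length 1
        factors.append(text[i:i + length])
        i += length
    return factors


def lz78_factorize(text):
    """LZ78 factorization with a dictionary-based trie (illustration only)."""
    trie = {}  # (parent node id, letter) -> child node id; node 0 is the root
    factors = []
    node = 0
    start = 0
    for i, letter in enumerate(text):
        child = trie.get((node, letter))
        if child is None:
            # Leaving the trie: the factor is the traversed path plus this letter.
            trie[(node, letter)] = len(trie) + 1
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
        else:
            node = child
    if start < len(text):
        # The text ended inside the trie; the last factor repeats an old one.
        factors.append(text[start:])
    return factors


if __name__ == "__main__":
    example = "aaabaabaaabaa"
    print(lz77_factorize(example))
    print(lz78_factorize(example))
```

For the example text `aaabaabaaabaa`, the sketch yields the LZ77 factors `a, aa, b, aabaa, abaa` and the LZ78 factors `a, aa, b, aab, aaa, ba, a`.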

Notes

  1. In the initial submission, the \(\mathcal{O}(n)\) deterministic time suffix tree construction algorithm of Munro et al. [43] was not yet published. Our former results were \(\mathcal{O}(n)\) randomized or \(\mathcal{O}(n \lg \lg \sigma)\) deterministic time based on the suffix tree construction algorithm of [2].

  2. More precisely, we use the permuted longest common prefix array that can access \(\mathsf {LCP}\) only in conjunction with \(\mathsf {SA}\).

  3. The time bound for computing the suffix array has recently been improved to \(\mathcal{O}(n)\) by two in-place suffix sorting algorithms [19, 40]. Our succinct suffix tree is composed of both \(\mathsf{SA}\) and \(\mathsf{ISA}\), yielding \((1+\epsilon) n \lg n\) bits and \(\mathcal{O}(n/\epsilon)\) construction time. This construction time is the bottleneck of the succinct suffix tree construction and the later described algorithms. Hence, we can lower the time from \(\mathcal{O}(n/\epsilon^2)\) to \(\mathcal{O}(n/\epsilon)\) in Theorem 2.8 and Corollaries 3.2, 3.7, and 4.10.
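
Notes 2 and 3 refer to the arrays \(\mathsf{SA}\), \(\mathsf{ISA}\), and the permuted longest common prefix array. The following minimal sketch illustrates the standard relationships between them: \(\mathsf{ISA}[\mathsf{SA}[i]] = i\), and \(\mathsf{LCP}[i]\) is read from the permuted array at text position \(\mathsf{SA}[i]\), which is why \(\mathsf{LCP}\) is only accessible together with \(\mathsf{SA}\). All function names are hypothetical, the suffix array is built by naive sorting, and the arrays are stored as plain lists; none of this reflects the linear-time construction or the succinct representations used in the article.

```python
def suffix_array(text):
    """Naive suffix array: sort the starting positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """ISA[p] is the lexicographic rank of the suffix starting at position p."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def plcp_array(text, sa, isa):
    """PLCP[p]: longest common prefix of suffix p and its lexicographic
    predecessor, computed in text order (Kasai-style)."""
    n = len(text)
    plcp = [0] * n
    h = 0
    for p in range(n):
        rank = isa[p]
        if rank == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        q = sa[rank - 1]
        while p + h < n and q + h < n and text[p + h] == text[q + h]:
            h += 1
        plcp[p] = h
        h = max(h - 1, 0)  # the value for position p + 1 is at least h - 1
    return plcp


def lcp_at(rank, sa, plcp):
    """LCP in suffix array order, obtained only via SA: LCP[i] = PLCP[SA[i]]."""
    return plcp[sa[rank]]


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    isa = inverse_suffix_array(sa)
    plcp = plcp_array(text, sa, isa)
    print(sa, isa, plcp)
    print([lcp_at(i, sa, plcp) for i in range(len(text))])
```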

References

  1. Amir, A., Farach, M., Idury, R.M., Poutré, J.A.L., Schäffer, A.A.: Improved dynamic dictionary matching. Inf. Comput. 119(2), 258–282 (1995)

  2. Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Proceedings of the STOC, pp. 148–157. ACM (2014)

  3. Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA, pp. 2053–2071. ACM/SIAM (2016)

  4. Belazzougui, D., Mäkinen, V., Valenzuela, D.: Compressed suffix array. In: Encyclopedia of Algorithms, pp. 386–390. Springer (2016)

  5. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)

  6. Clark, D.R.: Compact Pat Trees. Ph.D. Thesis. University of Waterloo (1996)

  7. Crochemore, M.: Transducers and repetitions. Theor. Comput. Sci. 45(1), 63–86 (1986)

  8. Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)

  9. Duval, J., Kolpakov, R., Kucherov, G., Lecroq, T., Lefebvre, A.: Linear-time computation of local periods. Theor. Comput. Sci. 326(1–3), 229–240 (2004)

  10. El-Zein, H., Munro, J.I., Robertson, M.: Raising permutations to powers in place. In: Proceedings of the ISAAC, volume 64 of LIPIcs, pp. 29:1–29:12. Schloss Dagstuhl (2016)

  11. Farach, M.: Optimal suffix tree construction with large alphabets. In: Foundations of Computer Science, pp. 137–143. IEEE Computer Society (1997)

  12. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  13. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, volume 9133 of LNCS, pp. 160–171. Springer (2015)

  14. Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  15. Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  16. Franceschini, G., Muthukrishnan, S., Pǎtraşcu, M.: Radix sorting with no extra space. In: Proceedings of the ESA, volume 4698 of LNCS, pp. 194–205. Springer (2007)

  17. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Proceedings of the LATA, volume 7183 of LNCS, pp. 240–251. Springer (2012)

  18. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Proceedings of the LATIN, volume 8392 of LNCS, pp. 731–742. Springer (2014)

  19. Goto, K.: Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. ArXiv CoRR, arXiv:1703.01009 (2017)

  20. Goto, K., Bannai, H.: Simpler and faster Lempel Ziv factorization. In: Proceedings of the DCC, pp. 133–142. IEEE Computer Society (2013)

  21. Goto, K., Bannai, H.: Space efficient linear time Lempel–Ziv factorization for small alphabets. In: Proceedings of the DCC, pp. 163–172. IEEE Computer Society (2014)

  22. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  23. Gusfield, D., Stoye, J.: Linear time algorithms for finding and representing all the tandem repeats in a string. J. Comput. Syst. Sci. 69(4), 525–546 (2004)

  24. Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. In: Proceedings of the FOCS, pp. 251–260. IEEE Computer Society (2003)

  25. Jacobson, G.J.: Space-efficient static trees and graphs. In: Proceedings of the FOCS, pp. 549–554. IEEE Computer Society (1989)

  26. Jansson, J., Sadakane, K., Sung, W.-K.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631 (2012)

  27. Jansson, J., Sadakane, K., Sung, W.-K.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)

  28. Kärkkäinen, J., Sutinen, E.: Lempel–Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998)

  29. Kärkkäinen, J., Ukkonen, E.: Lempel–Ziv parsing and sublinear-size index structures for string matching. In: South American Workshop on String Processing (WSP), pp. 141–155. Carleton University Press (1996)

  30. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 1–19 (2006)

  31. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of the CPM, volume 7922 of LNCS, pp. 189–200. Springer (2013)

  32. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Proceedings of the SEA, volume 7933 of LNCS, pp. 139–150. Springer (2013)

  33. Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the ALENEX, pp. 103–112. SIAM (2013)

  34. Kociumaka, T., Kubica, M., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for seeds computation. In: Proceedings of the SODA, pp. 1095–1112. ACM/SIAM (2012)

  35. Kolpakov, R.M., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: Proceedings of the FOCS, pp. 596–604 (1999)

  36. Kolpakov, R.M., Kucherov, G.: Finding repeats with fixed gap. In: Proceedings of the SPIRE, pp. 162–168. IEEE Computer Society (2000)

  37. Köppl, D., Sadakane, K.: Lempel–Ziv computation in compressed space (LZ-CICS). In: Proceedings of the DCC, pp. 3–12. IEEE Computer Society (2016)

  38. Li, M., Sleep, R.: An LZ78 based string kernel. In: Proceedings of the ADMA, volume 3584 of LNCS, pp. 678–689. Springer (2005)

  39. Li, M., Zhu, Y.: Image classification via LZ78 based string kernel: a comparative study. In: Proceedings of the PAKDD, volume 3918 of LNCS, pp. 704–712. Springer (2006)

  40. Li, Z., Li, J., Huo, H.: Optimal in-place suffix sorting. ArXiv CoRR, arXiv:1610.08305 (2016)

  41. Main, M.G.: Detecting leftmost maximal periodicities. Discrete Appl. Math. 25(1–2), 145–153 (1989)

  42. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  43. Munro, J.I., Navarro, G., Nekrich, Y.: Space-efficient construction of compressed indexes in deterministic linear time. In: Proceedings of the SODA, pp. 408–424. SIAM (2017)

  44. Nakashima, Y., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Constructing LZ78 tries and position heaps in linear time for large alphabets. Inf. Process. Lett. 115(9), 655–659 (2015)

  45. Navarro, G.: Indexing text using the Ziv–Lempel trie. J. Discrete Algorithms 2(1), 87–114 (2004)

  46. Navarro, G.: Compact Data Structures: A practical approach. Cambridge University Press, Cambridge (2016)

  47. Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)

  48. Navarro, G., Sadakane, K.: Fully functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), 16 (2014)

  49. Nong, G.: Practical linear-time \(\cal{O}(1)\)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 15 (2013)

  50. Ohlebusch, E., Fischer, J., Gog, S.: CST++. In: Proceedings of the SPIRE, volume 6393 of LNCS, pp. 322–333. Springer (2010)

  51. Ouyang, J., Luo, H., Wang, Z., Tian, J., Liu, C., Sheng, K.: FPGA implementation of GZIP compression and decompression for IDC services. In: Proceedings of the FPT, pp. 265–268. IEEE Computer Society (2010)

  52. Richard, G.G., Case, A.: In lieu of swap: analyzing compressed RAM in Mac OS X and Linux. Digit. Investig. 11, 3–12 (2014)

  53. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully-compressed suffix trees. In: Proceedings of the LATIN, volume 4957 of LNCS, pp. 362–373. Springer (2008)

  54. Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: Proceedings of the SODA, pp. 225–237. ACM/SIAM (2002)

  55. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)

  56. Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proceedings of the SODA, pp. 1230–1239. ACM/SIAM (2006)

  57. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)

  58. Välimäki, N., Mäkinen, V., Gerlach, W., Dixit, K.: Engineering a compressed suffix tree implementation. ACM J. Exp. Algorithm. 14, 2 (2009)

  59. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

  60. Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions. We are especially grateful to the reviewer who pointed out a simplification of our original solution on how to store the exploration counters for the LZ78 factorization (Sect. 4.1). Further, we are grateful to Sean Tohidi, who spell-checked the initial submission of this paper during his DAAD RISE internship at TU Dortmund. This research was supported by CREST, JST.

Author information

Corresponding author

Correspondence to Dominik Köppl.

Appendix: List of Identifiers

While describing both factorization algorithms, we used several data structures, among others bit vectors, some with rank or select-support, to achieve the small space bounds. We denote bit vectors with \(B_{\alpha }\) for some letter \(\alpha \).

Fig. 14 Connection between the introduced algorithms and the used suffix tree representations. The figure shows (by marking with a circle) which suffix tree representation is used by an algorithm (introduced in the respective section)

Table 2 List of data structures with names

For all types of LZ-factorizations we use

  • \(B_{W}\) marking all witness nodes,

  • the array W mapping witness ids to

    • (LZ77) text positions, or

    • (LZ78) factor indices.

In LZ77 we use

  • \(B_{V}\) marking visited nodes.

In LZ78 we use

  • \(B_{C}\) storing the exploration counter \(n_v\) of each partially explored node \(v\),

  • \(B_{V}\) marking suffix tree nodes represented in the LZ trie (their ingoing edges are fully explored),

  • \(B_{LZ}\) marking explicit LZ nodes,

  • the array \(W'\) mapping LZ nodes to factor indices, and

  • \(B_{E}\) marking the edge witnesses.

The algorithms based on the SST additionally use

  • \(B_{T}\) marking the factor positions, used also for representing the length of a factor.
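
The way a bit vector with rank and select support can represent factor lengths can be sketched as follows. The sketch assumes that \(B_{T}\) marks the starting position of each factor in the text (marking ending positions would work symmetrically), so that the length of a factor is the distance between two consecutive marks. The class and function names are hypothetical, and rank and select are answered from a sorted list of positions rather than from the succinct support structures a space-efficient implementation would use.

```python
from bisect import bisect_right


class BitVector:
    """Bit vector with naive rank/select support (illustration only)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # Positions of the 1-bits in increasing order.
        self.ones = [i for i, bit in enumerate(self.bits) if bit]

    def rank1(self, i):
        """Number of 1-bits in bits[0..i], both ends inclusive."""
        return bisect_right(self.ones, i)

    def select1(self, k):
        """Position of the k-th 1-bit, counting from k = 1."""
        return self.ones[k - 1]


def mark_factor_starts(factors):
    """Build a bit vector with a 1 at the text position where each factor starts."""
    bits = []
    for factor in factors:
        bits.extend([1] + [0] * (len(factor) - 1))
    return BitVector(bits)


def factor_length(bt, x):
    """Length of the x-th factor (1-based), read off two consecutive marks."""
    start = bt.select1(x)
    # The last factor ends where the text ends.
    end = bt.select1(x + 1) if x < len(bt.ones) else len(bt.bits)
    return end - start


if __name__ == "__main__":
    bt = mark_factor_starts(["a", "aa", "b", "aabaa", "abaa"])
    print([factor_length(bt, x) for x in range(1, 6)])
    print(bt.rank1(5))  # index of the factor covering text position 5
```

With starting positions marked, `rank1(i)` also gives the index of the factor that covers text position `i`.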

We count the number of

  • factors by z,

  • witnesses by \({z_{\text {W}}}\),

  • referencing factors by \({z_{\text {R}}}\), and

  • fresh factors by \({z_{\text {F}}}\).

Figure 14 highlights the kind of suffix tree representation (either compressed or succinct suffix tree) used in each subsection of the algorithmic part of the article. Table 2 lists the particular data structures of each such subsection.

Cite this article

Fischer, J., I, T., Köppl, D. et al. Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees. Algorithmica 80, 2048–2081 (2018). https://doi.org/10.1007/s00453-017-0333-1
