Sublinear Algorithms for Approximating String Compressibility

Authors:

Sofya Raskhodnikova,

Ronitt Rubinfeld,

Adam Smith

Algorithmica, Volume 65, Issue 3

Pages 685 - 709

https://doi.org/10.1007/s00453-012-9618-6

Published: 01 March 2013

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
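As a rough illustration of what approximating compressibility in sublinear time can mean for run-length encoding, the sketch below estimates the number of runs in a string by inspecting only a random sample of positions. It is a minimal sketch under stated assumptions, not the algorithm from the paper: the run count is used as a simplified proxy for the RLE cost, and the names `count_runs` and `estimate_runs` are hypothetical.

```python
import random


def count_runs(s):
    """Exact number of maximal runs of equal symbols; reads the whole string."""
    if not s:
        return 0
    return 1 + sum(1 for i in range(1, len(s)) if s[i] != s[i - 1])


def estimate_runs(s, num_samples, rng=None):
    """Estimate the run count from num_samples random positions.

    Assumption: the run count stands in for the RLE cost. A position i >= 1
    starts a new run exactly when s[i] != s[i - 1], so the fraction of sampled
    positions that start a run, scaled by n - 1, estimates the number of run
    boundaries. Only O(num_samples) symbols of s are read.
    """
    rng = rng or random.Random()
    n = len(s)
    if n <= 1:
        return n
    hits = 0
    for _ in range(num_samples):
        i = rng.randrange(1, n)  # uniform over positions 1..n-1
        if s[i] != s[i - 1]:
            hits += 1
    return 1 + (n - 1) * hits / num_samples
```

With a fixed number of samples the estimate carries an additive error proportional to the string length, which is why a sampling approach of this kind is naturally described in terms of additive rather than exact guarantees.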

Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
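The two quantities related in the paragraph above can be computed exactly on small inputs, which may help make the connection concrete. The following is a minimal sketch, assuming a simple greedy LZ77-style parse in which each phrase is either a single new symbol or the longest prefix of the remaining text that also starts at an earlier position (overlap allowed). The paper's exact LZ77 variant may differ, and the function names are hypothetical.

```python
def distinct_substrings(s, ell):
    """Number of distinct length-ell substrings of s (subword complexity at ell)."""
    if ell <= 0 or ell > len(s):
        return 0
    return len({s[i:i + ell] for i in range(len(s) - ell + 1)})


def lz77_phrase_count(s):
    """Number of phrases in a naive greedy LZ77-style parse of s.

    Assumption: a phrase is the longest prefix of s[p:] that also occurs
    starting at some position q < p, or a single literal symbol if no such
    prefix exists. This brute-force search is far from linear time and is
    meant only for small examples.
    """
    n = len(s)
    p = 0
    phrases = 0
    while p < n:
        length = 0
        # Extend the match while s[p:p+length+1] occurs with a start before p.
        # Restricting the search window to s[0:p+length] forces the start q
        # to satisfy q <= p - 1 while still allowing overlap with position p.
        while p + length < n and s.find(s[p:p + length + 1], 0, p + length) != -1:
            length += 1
        p += max(length, 1)  # a literal symbol advances by one
        phrases += 1
    return phrases
```

On a highly repetitive string such as "aaaa" both values are small, while on a string with no repeated symbols both grow with the length, which is the kind of relationship the structural lemmas make precise.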

Published In

Algorithmica Volume 65, Issue 3

March 2013

229 pages

ISSN:0178-4617

Copyright © 2013 Springer Science+Business Media New York.

Publisher

Springer-Verlag

Berlin, Heidelberg
