Sublinear Algorithms for Approximating String Compressibility

Published: 01 March 2013

Abstract

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
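
To make the sublinear idea concrete for run-length encoding, here is a minimal sketch, assuming the RLE cost is measured simply by the number of runs (maximal blocks of equal symbols). It is not the algorithm from the article: the function names and the plain uniform-sampling estimator are illustrative assumptions. The estimator reads only a fixed number of adjacent symbol pairs instead of the whole string.

```python
import random


def count_runs(s):
    """Exact number of runs in s; reads the whole string (linear time)."""
    if not s:
        return 0
    return 1 + sum(1 for i in range(1, len(s)) if s[i] != s[i - 1])


def estimate_runs(s, num_samples, rng=None):
    """Estimate the number of runs from num_samples random adjacent pairs.

    A position i >= 1 starts a new run exactly when s[i] != s[i - 1], so
    the number of runs is 1 plus the number of such positions. We estimate
    the fraction of such positions by sampling and scale it back up.
    """
    rng = rng or random.Random()
    n = len(s)
    if n < 2:
        return float(n)
    hits = 0
    for _ in range(num_samples):
        i = rng.randrange(1, n)  # uniform over positions 1 .. n-1
        if s[i] != s[i - 1]:
            hits += 1
    return 1 + (n - 1) * hits / num_samples


if __name__ == "__main__":
    text = "a" * 400 + "b" * 300 + "ab" * 150
    print("exact runs:    ", count_runs(text))
    print("estimated runs:", estimate_runs(text, 100, random.Random(1)))
```

With a fixed sample size the estimate carries an additive error proportional to the string length, so it is informative mainly for strings that are far from highly compressible; strings with very few runs would need a different strategy.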
Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
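
The two quantities related by the structural lemmas can be computed exactly on small inputs. The following is a minimal sketch under stated assumptions: `subword_complexity` counts the distinct substrings of length ℓ, and `lz77_phrase_count` uses a simplified greedy parse in which each phrase is either a single new symbol or the longest substring that also starts at an earlier position (overlap allowed). The LZ77 variant analyzed in the article may differ in its details, and both function names are hypothetical. These routines read the entire string, so they are reference computations, not sublinear algorithms.

```python
def subword_complexity(s, ell):
    """Number of distinct substrings of s that have length exactly ell."""
    if ell <= 0 or ell > len(s):
        return 0
    return len({s[i:i + ell] for i in range(len(s) - ell + 1)})


def lz77_phrase_count(s):
    """Number of phrases in a simplified greedy LZ77-style parse of s."""
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 0
        # Try to extend the phrase by one symbol. The candidate
        # s[i:i + length + 1] must occur starting at a position j < i,
        # i.e. lie entirely inside s[0:i + length].
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # no earlier match: emit one new symbol
        phrases += 1
    return phrases


if __name__ == "__main__":
    for w in ["aaaaaaaa", "abababab", "abcdefgh"]:
        print(w, "phrases:", lz77_phrase_count(w),
              "distinct 2-substrings:", subword_complexity(w, 2))
```

Running this on a few short strings shows the qualitative link: strings with few distinct short substrings also parse into few phrases, while strings with many distinct short substrings need many.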

Published In

Algorithmica, Volume 65, Issue 3
March 2013
229 pages

Publisher

Springer-Verlag
Berlin, Heidelberg

Author Tags

1. Lempel-Ziv
2. Lossless compression
3. Run-length encoding
4. Sublinear algorithms