Abstract
We present a dictionary-based lossless text compression scheme in which frequent words are kept in separate lists (list_n contains words of length n). We pursued two alternatives for the sizes of the lists. In the "fixed" approach all lists hold the same number of words, whereas in the "flexible" approach no such constraint is imposed. Results clearly show that the "flexible" scheme is much better in all test cases, possibly because it can accommodate short, medium or long word lists that reflect the word-length distribution of a particular language. Our approach encodes a word as a prefix (the length of the word) followed by the body of the word (an index into the corresponding list). For prefix encoding we employed both a static encoding and a dynamic (Huffman) encoding based on the word-length statistics of the source language. Dynamic prefix encoding clearly outperformed its static counterpart in all cases. A language with a higher average word length can, theoretically, benefit more from a word-list based compression approach than one with a lower average word length. We put this hypothesis to the test using Turkish and English, with average word lengths of 6.1 and 4.4, respectively. Our results strongly support the validity of this hypothesis.
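The prefix-plus-index encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_lists`, `huffman_codes`, `encode_word`), the fixed-width binary index, and the choice to return `None` for words missing from the lists are all assumptions made for illustration, since the abstract does not specify those details.

```python
import heapq
from collections import Counter, defaultdict
from math import ceil, log2


def build_lists(words, max_per_list=None):
    """Group words by length, most frequent first; list_n holds words of length n.

    max_per_list=None imitates the "flexible" variant (no size cap);
    an integer caps every list at the same size ("fixed" variant).
    """
    by_len = defaultdict(Counter)
    for w in words:
        by_len[len(w)][w] += 1
    return {
        n: [w for w, _ in counts.most_common(max_per_list)]
        for n, counts in by_len.items()
    }


def huffman_codes(freqs):
    """Return {symbol: bitstring} for a frequency table (standard Huffman)."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def encode_word(word, lists, prefix_codes):
    """Encode a word as Huffman(length) + fixed-width index into list_n.

    Returns None when the word is not in its list (assumption: the
    abstract does not say how out-of-list words are handled).
    """
    n = len(word)
    if n not in lists or word not in lists[n]:
        return None
    idx = lists[n].index(word)
    width = max(1, ceil(log2(len(lists[n]))))  # bits needed for the index
    return prefix_codes[n] + format(idx, f"0{width}b")


# Usage on a toy corpus: the prefix codes come from word-length statistics.
corpus = "the cat sat on the mat and the cat ran".split()
flexible = build_lists(corpus)                 # lists of any size
fixed = build_lists(corpus, max_per_list=2)    # every list capped at 2 words
prefix_codes = huffman_codes(Counter(len(w) for w in corpus))
bits = encode_word("the", flexible, prefix_codes)
```

Because a decoder that has read the prefix knows the word length, and therefore which list and index width to use, the concatenated bit string can be split back into prefix and body without separators.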
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Celikel, E., Dalkilic, M.E., Dalkilic, G. (2005). Word-Based Fixed and Flexible List Compression. In: Yolum, P., Güngör, T., Gürgen, F., Özturan, C. (eds) Computer and Information Sciences - ISCIS 2005. ISCIS 2005. Lecture Notes in Computer Science, vol 3733. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11569596_80
DOI: https://doi.org/10.1007/11569596_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29414-6
Online ISBN: 978-3-540-32085-2
eBook Packages: Computer Science, Computer Science (R0)