Word-based text compression using the Burrows-Wheeler transform

Published: 01 September 2005

Abstract

Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (BWT); applying a move-to-front (MTF) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to MTF and uniform arithmetic coding. For large files of text, the combination of word-based modeling, BWT, and MTF-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.
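
The first two steps named in the abstract can be illustrated on word tokens with a minimal Python sketch. It is not the paper's implementation: the function names, the regular-expression tokenizer, and the naive rotation sort are all assumptions made for illustration, and the final entropy-coding step is omitted.

```python
import re

def tokenize(text):
    """Split text into word and non-word tokens (an assumed parsing rule)."""
    return re.findall(r"\w+|\W+", text)

def bwt(symbols):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Returns the last column and the row holding the original sequence.
    Quadratic-ish and only suitable for small illustrative inputs.
    """
    n = len(symbols)
    rows = sorted(range(n), key=lambda i: symbols[i:] + symbols[:i])
    last = [symbols[(i - 1) % n] for i in rows]
    return last, rows.index(0)

def inverse_bwt(last, primary):
    """Invert the transform by following the stable-sort links between rows."""
    n = len(last)
    # Stable sort of positions by symbol links each row to the next rotation.
    order = sorted(range(n), key=lambda i: last[i])
    out, row = [], primary
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return out

def mtf_encode(symbols):
    """Emit each symbol's rank in a recency list, then move it to the front."""
    alphabet = sorted(set(symbols))
    recency = list(alphabet)
    ranks = []
    for s in symbols:
        r = recency.index(s)
        ranks.append(r)
        recency.insert(0, recency.pop(r))
    return ranks, alphabet

def mtf_decode(ranks, alphabet):
    """Rebuild the symbol sequence from ranks and the initial alphabet."""
    recency = list(alphabet)
    out = []
    for r in ranks:
        s = recency.pop(r)
        out.append(s)
        recency.insert(0, s)
    return out

if __name__ == "__main__":
    tokens = tokenize("to be or not to be")
    last, primary = bwt(tokens)
    ranks, alphabet = mtf_encode(last)
    # Round trip: ranks -> permuted tokens -> original text.
    restored = inverse_bwt(mtf_decode(ranks, alphabet), primary)
    print("".join(restored))
```

Because the functions operate on lists of arbitrary comparable symbols, the same code handles characters (`list("banana")`) and word tokens alike; in a word-based model the alphabet is the set of distinct tokens rather than the set of byte values, so the rank stream fed to the entropy coder ranges over a much larger alphabet.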

Published In

Information Processing and Management: an International Journal, Volume 41, Issue 5
September 2005
313 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Burrows-Wheeler transformation
  2. Move-to-Front
  3. Recency ranking
  4. Text compression
  5. Word-based model

Qualifiers

  • Article

Cited By

  • (2012) Generalized biwords for bitext compression and translation spotting. Journal of Artificial Intelligence Research 43:1 (389-418). DOI: 10.5555/2387915.2387926. Online publication date: 1-Jan-2012
  • (2011) On Optimally Partitioning a Text to Improve Its Compression. Algorithmica 61:1 (51-74). DOI: 10.5555/2616915.2617136. Online publication date: 1-Sep-2011
  • (2008) An efficient, versatile approach to suffix sorting. ACM Journal of Experimental Algorithmics 12 (1-23). DOI: 10.1145/1227161.1278374. Online publication date: 12-Jun-2008
  • (2007) Edge-guided natural language text compression. Proceedings of the 14th international conference on String processing and information retrieval (14-25). DOI: 10.5555/1778666.1778668. Online publication date: 29-Oct-2007
  • (2006) Mapping words into codewords on PPM. Proceedings of the 13th international conference on String Processing and Information Retrieval (181-192). DOI: 10.1007/11880561_15. Online publication date: 11-Oct-2006
