
An Efficient Compression Scheme for Natural Language Text by Hashing

  • Original Research
SN Computer Science

Abstract

Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. The basic principle behind compression is to devise a scheme that uses fewer bits to represent the original data. Character encoding, which represents each character through an encoding system, is closely related to data compression. We propose a simple and efficient compression algorithm for large natural-language text, named n-Sequence based m Bit Compression (nSmBC), which can outperform WinZip and WinRAR in terms of compression ratio. WinZip and WinRAR are two well-known tools used for text compression in industry. The scheme provides an efficient encoding algorithm that converts each 8-bit character to 5 bits using a lookup table. The lookup table is built from the Zipf distribution, a discrete distribution that models how frequently characters are used in various languages. The 8-bit characters are converted to 5 bits by partitioning them into 7 sets. After the characters are converted to 5 bits, an n-sequence scheme is developed to logically calculate the location number of a particular combination of characters. The reverse algorithm for recovering the original input is also presented. Finally, nSmBC is compared with the well-known WinZip, WinRAR, Huffman, and LZW techniques. Both theoretical and experimental analyses demonstrate promising performance.
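The saving from storing characters in 5 bits instead of 8 can be illustrated with a minimal sketch. This is not the nSmBC algorithm: the abstract does not give the lookup table, the 7 character sets, or the n-sequence step. The sketch assumes a hypothetical fixed alphabet of 32 symbols (`ALPHABET`), and the names `pack_5bit` and `unpack_5bit` are illustrative only. It shows only that packing 5-bit codes turns every 8 characters into 5 bytes.

```python
import string

# Hypothetical 32-symbol alphabet (an assumption for illustration):
# space, 26 lowercase letters, and 5 punctuation marks.
# The paper's actual Zipf-ranked table and its 7 sets are not reproduced here.
ALPHABET = " " + string.ascii_lowercase + ".,'?-"
ENCODE = {ch: i for i, ch in enumerate(ALPHABET)}  # char -> 5-bit code


def pack_5bit(text: str) -> bytes:
    """Pack each character of `text` as a 5-bit code into bytes."""
    acc = 0    # bit accumulator
    nbits = 0  # number of pending bits held in acc
    out = bytearray()
    for ch in text:
        acc = (acc << 5) | ENCODE[ch]
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_5bit(data: bytes, length: int) -> str:
    """Recover `length` characters from bytes produced by pack_5bit."""
    acc = 0
    nbits = 0
    chars = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 5 and len(chars) < length:
            nbits -= 5
            chars.append(ALPHABET[(acc >> nbits) & 0x1F])
        acc &= (1 << nbits) - 1
    return "".join(chars)


message = "hello world"
packed = pack_5bit(message)
restored = unpack_5bit(packed, len(message))
```

Here the 11-character message needs 55 bits, so it is stored in 7 bytes rather than 11. The original length is passed to `unpack_5bit` so that the padding bits in the final byte are ignored. A real scheme covering more than 32 characters would need an additional mechanism, such as the set partitioning the abstract mentions.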




Funding

None.

Author information

Corresponding author

Correspondence to Md. Ashiq Mahmood.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mahmood, M.A., Hasan, K.M.A. An Efficient Compression Scheme for Natural Language Text by Hashing. SN COMPUT. SCI. 3, 314 (2022). https://doi.org/10.1007/s42979-022-01210-0

  • DOI: https://doi.org/10.1007/s42979-022-01210-0
