An Efficient Compression Scheme for Natural Language Text by Hashing

Mahmood, Md. Ashiq; Hasan, K. M. Azharul

doi:10.1007/s42979-022-01210-0

Original Research
Published: 04 June 2022

Volume 3, article number 314, (2022)

SN Computer Science

Abstract

Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. The basic principle behind compression is to devise a scheme that uses fewer bits to represent the original data. Character encoding, which represents each character by a code, is closely related to data compression. We propose a simple and efficient compression algorithm for large natural-language text, named n-Sequence based m Bit Compression (nSmBC), which can outperform WinZip and WinRAR in terms of compression ratio. WinZip and WinRAR are two well-known tools used for text compression in industry. The scheme provides an efficient encoding algorithm that converts each 8-bit character to 5 bits using a lookup table. The lookup table is built from the Zipf distribution, a discrete distribution that describes how frequently characters are used in various languages. The 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets. After the characters are converted to 5 bits, an n-sequence scheme logically calculates the location number of a particular combination of characters. The reverse algorithm that recovers the original input is also presented. Finally, nSmBC is compared with the well-known WinZip, WinRAR, Huffman, and LZW techniques. Both theoretical and experimental analyses show promising performance.
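To make the 8-bit to 5-bit idea concrete, the following is a minimal Python sketch of fixed-width 5-bit packing. It is not the nSmBC algorithm itself: the abstract does not give the lookup table, the partition into 7 sets, or the n-sequence location calculation, so none of those are reproduced. The 32-symbol `ALPHABET` and the helper names `pack5` and `unpack5` are assumptions chosen only for illustration.

```python
# Illustrative 5-bit packing, NOT the paper's nSmBC scheme.
# ALPHABET is a hypothetical 32-symbol table (an assumption); the paper
# derives its own lookup table from Zipf-ranked character frequencies.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,'-\n"
assert len(ALPHABET) == 32  # 32 symbols fit exactly in 5 bits

ENCODE = {ch: code for code, ch in enumerate(ALPHABET)}


def pack5(text: str) -> bytes:
    """Pack each character of `text` into 5 bits, most significant first."""
    acc = 0
    for ch in text:
        acc = (acc << 5) | ENCODE[ch]  # KeyError if ch is outside ALPHABET
    nbits = 5 * len(text)
    pad = (-nbits) % 8  # zero bits appended to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((nbits + pad) // 8, "big")


def unpack5(data: bytes, length: int) -> str:
    """Recover `length` characters from bytes produced by `pack5`."""
    acc = int.from_bytes(data, "big")
    total_bits = 8 * len(data)
    chars = []
    for i in range(length):
        shift = total_bits - 5 * (i + 1)
        chars.append(ALPHABET[(acc >> shift) & 0x1F])
    return "".join(chars)


if __name__ == "__main__":
    sample = "data compression by hashing"
    packed = pack5(sample)
    print(len(sample), "bytes ->", len(packed), "bytes")
    print(unpack5(packed, len(sample)) == sample)
```

With this table, the 27-character sample packs into 17 bytes, close to the 5/8 ratio that fixed 5-bit codes allow. The original length has to be stored alongside the packed bytes, because the padding bits would otherwise decode as extra symbols.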

Funding

None.

Author information

Authors and Affiliations

Khulna University of Engineering and Technology, Khulna, 9203, Bangladesh
Md. Ashiq Mahmood & K. M. Azharul Hasan

Corresponding author

Correspondence to Md. Ashiq Mahmood.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mahmood, M.A., Hasan, K.M.A. An Efficient Compression Scheme for Natural Language Text by Hashing. SN COMPUT. SCI. 3, 314 (2022). https://doi.org/10.1007/s42979-022-01210-0

Received: 22 June 2020
Accepted: 15 May 2022
Published: 04 June 2022
DOI: https://doi.org/10.1007/s42979-022-01210-0
