Abstract
Data compression is the process of modifying, encoding, or converting the bit structure of data so that it requires less space. The fundamental principle behind compression is to devise a scheme that uses fewer bits to express the same data. Character encoding, which represents each character by a code in some encoding system, is closely related to data compression. We propose a simple and efficient compression algorithm for large natural-language text, named n-Sequence based m Bit Compression (nSmBC), which achieves a better compression ratio than WinZip and WinRAR, two well-known tools used for text compression in industry. The scheme provides an efficient encoding algorithm that converts each 8-bit character to 5 bits using a lookup table. The lookup table is built from the Zipf distribution, a discrete distribution that models the frequency of commonly used characters in various languages. The 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets. After the characters are converted to 5 bits, an n-sequence scheme is developed to logically calculate the location number of a particular combination of characters. The reverse algorithm that recovers the original input is also presented. Finally, nSmBC is compared with the well-known WinZip, WinRAR, Huffman, and LZW techniques. Both theoretical and experimental analyses show promising performance.
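To make the 8-bit to 5-bit step concrete, the following is a minimal sketch of fixed-width 5-bit packing through a lookup table. It is an illustration only, not the nSmBC algorithm: the 32-symbol `ALPHABET` below is a hypothetical, roughly frequency-ordered table chosen for this example, and the names `pack5` and `unpack5` are assumptions. The sketch covers a single 32-symbol set, so it does not model the partitioning into 7 sets or the n-sequence location step described in the abstract.

```python
# Hypothetical 32-symbol lookup table (NOT the paper's table): space, the 26
# lowercase letters in a rough frequency order, and a few punctuation marks.
ALPHABET = " etaoinshrdlcumwfgypbvkjxqz.,'-\n"
assert len(ALPHABET) == 32 and len(set(ALPHABET)) == 32

# Map each symbol to its 5-bit code (its index in the table).
ENCODE = {ch: code for code, ch in enumerate(ALPHABET)}


def pack5(text: str) -> bytes:
    """Pack each character of `text` into 5 bits; raises KeyError for
    characters outside the hypothetical table."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for ch in text:
        acc = (acc << 5) | ENCODE[ch]
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:
        # Pad the final partial byte with zero bits on the right.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack5(data: bytes, n_chars: int) -> str:
    """Recover `n_chars` characters from 5-bit packed `data`."""
    acc = 0
    nbits = 0
    chars = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 5 and len(chars) < n_chars:
            nbits -= 5
            chars.append(ALPHABET[(acc >> nbits) & 0x1F])
            acc &= (1 << nbits) - 1
    return "".join(chars)


if __name__ == "__main__":
    sample = "the quick brown fox"
    packed = pack5(sample)
    print(len(sample), "characters ->", len(packed), "bytes")
    print(unpack5(packed, len(sample)) == sample)
```

In this sketch the character count is passed to `unpack5` separately so that the zero padding in the last byte is not decoded as extra symbols. A fixed 5-bit code can at best reduce 8-bit text to 5/8 of its size, so any further gain claimed for nSmBC would have to come from the later stages that this example leaves out.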
Funding
None.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
About this article
Cite this article
Mahmood, M.A., Hasan, K.M.A. An Efficient Compression Scheme for Natural Language Text by Hashing. SN COMPUT. SCI. 3, 314 (2022). https://doi.org/10.1007/s42979-022-01210-0
DOI: https://doi.org/10.1007/s42979-022-01210-0