Basic Arithmetic Coding Based Approach to Compress a Character String

Ipsita Mondal and Subhra J. Sarkar

Abstract Data compression plays an important role in storing and transmitting
text or multimedia information. This paper presents a lossless data compression
algorithm, developed in C, that compresses a character string using Basic
Arithmetic Coding. At this preliminary stage, the algorithm was tested on
character arrays comprising vowels only, with an arbitrarily assumed probability
distribution. The results are encouraging, with compression ratios well above
unity. Although the algorithm was tested on vowels only, the work can be extended
to any character array, using a probability distribution obtained from a survey
of a few randomly selected articles.

Keywords Data compression technique ⋅ Basic arithmetic coding ⋅ Probability
distribution ⋅ Encoding–Decoding ⋅ Compression ratio

1 Introduction

In the present age of digitization, data compression has become extremely
important for reducing the bit size of data. A reduced number of bits lowers the
memory requirement and thereby eases the memory constraints of the system. In
the context of data communication, a reduced number of bits implies a lower
energy requirement and hence better energy efficiency. Data compression not only
reduces the data size; the encoded output also cannot be read directly, which
gives it an inherent encryption-like capability that contributes to data
security. A typical data compression algorithm can be represented by the block
diagram given in Fig. 1 [1–3].

I. Mondal (✉)
Department of CSE, Techno India Batanagar, Kolkata, India
e-mail: ipsita.mondal@yahoo.com
S.J. Sarkar
Department of EE, Techno India Batanagar, Kolkata, India
e-mail: subhro89@gmail.com

© Springer Nature Singapore Pte Ltd. 2017


S.C. Satapathy et al. (eds.), Proceedings of the 5th International Conference on Frontiers
in Intelligent Computing: Theory and Applications, Advances in Intelligent Systems
and Computing 515, DOI 10.1007/978-981-10-3153-3_4

[Block diagram: input character string → encoding algorithm → primary or
secondary memory for storing the encrypted string → decoding algorithm →
decoded character string]

Fig. 1 Block diagram of the proposed system

There are numerous methods of data compression. Broadly, compression can be
classified as lossy or lossless. In lossy compression, some less important data
values present in the file are removed by the algorithm. Examples include
transform coding, Karhunen–Loeve Transform (KLT) coding, and wavelet-based
coding. These algorithms are typically applied to multimedia files such as
audio, video, and images [1]. In lossless data compression, on the other hand,
no information is lost; examples include the Shannon–Fano algorithm, the Huffman
algorithm, and arithmetic coding [1, 4]. Lossless data compression is preferred
for text documents and for images of higher importance, such as images of
cancerous tissue [4].
The application of the work done in [1] was confined to the compression of data
strings for power system applications only. If the algorithm can be extended to
compress any character string, it can be used for applications such as
compressing the files of an office or the contents of the books in a library. As
the data obtained after compression is in encoded form, it is difficult for an
external agency to decode it without knowledge of the decoding scheme. This
method can therefore restrict access to and use of the data to authenticated
users [5, 6]. The proposed algorithm is developed in the C language and tested
offline to obtain the results.

2 Arithmetic Coding

Basic arithmetic coding is a lossless data compression technique in which a data
or character string is encoded as a single fractional number n, where 0.0 ≤
n < 1.0. In this method, the probability distribution of the content of the
source message is applied to narrow the interval successively. Consider a source
message composed of the symbol set S = {‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’} with the
probabilities of occurrence given in Table 1 [1, 6–9].

Table 1 Probability distribution table for symbol set, S

Sl. no.  Character  Probability  Cumulative probability  Range [r_low, r_hi)
1        1          0.3          0.3                     [0, 0.3)
2        2          0.2          0.5                     [0.3, 0.5)
3        3          0.1          0.6                     [0.5, 0.6)
4        4          0.2          0.8                     [0.6, 0.8)
5        5          0.1          0.9                     [0.8, 0.9)
6        6          0.1          1.0                     [0.9, 1.0)

Table 2 Shrinking of range in arithmetic coding for different iterations

Iteration no.  Character (x)  Min      Max     r
1              1              0        0.3     0.3
2              1              0        0.09    0.09
3              2              0.027    0.045   0.018
4              5              0.0414   0.0432  0.0018
5              6              0.04302  0.0432  0.00018

The algorithm for basic arithmetic coding, used to compress a string composed of
the characters given in Table 1, is given below [1, 9].
STEP I: Obtain the string, s and calculate its length (l).
STEP II: Initialize variables min = 0, max = 1, and r = 1.
STEP III: Set a counter i = 1.
STEP IV: Repeat steps V–IX while i ≤ l.
STEP V: x = s (i).
STEP VI: Obtain r_low and r_hi corresponding to x.
STEP VII: Update max = min + r * r_hi and then min = min + r * r_low, so that
both are computed from the previous value of min.
STEP VIII: r = max − min.
STEP IX: i = i + 1.
STEP X: End of the loop.
STEP XI: Obtain a number num with minimum binary string length such that
min < num < max.
STEP XII: End.
Consider a string {‘1’, ‘1’, ‘2’, ‘5’, ‘6’} of 5 characters on which the
arithmetic coding algorithm is to be applied. The process of execution is
illustrated in Table 2 [1].
Initialization: min = 0, max = 1 and r = 1.
l = 5 → No. of iterations = 5.

Table 3 Extracting the encoded string elements for different iterations

Iteration no.  Value            x  l    h    r
1              0.0430908203125  1  0    0.3  0.3
2              0.143636067      1  0    0.3  0.3
3              0.478786892      2  0.3  0.5  0.2
4              0.893934461      5  0.8  0.9  0.1
5              0.939344618      6  0.9  1.0  0.1

Output: num = 0.0430908203125, with binary string (.)0000101100001, which lies
between 0.04302 and 0.0432 and has the minimum binary string length (13 bits in
this case).
This compressed binary string can then be used for communication or storage, as
required. While decoding, the output num of the previous algorithm is compared
repeatedly with the ranges of Table 1 to extract the actual string, as shown in
Table 3. In the present algorithm, the number of characters in the actual string
must also be supplied, since a probability of occurrence for string termination
is not considered. The algorithm for decoding the actual string is given below,
where the inputs are the encoded binary array and the actual string size [1, 9].
STEP I: Obtain the binary string and the length of the actual string (l).
STEP II: Determine the fractional number (num) corresponding to the binary string.
STEP III: Set a counter i = 1 and define a null array arr.
STEP IV: Repeat steps V–VIII while i ≤ l.
STEP V: Find the range in which num lies and the character x that corresponds
to it.
STEP VI: arr (i) = x, low = r_low (x), high = r_hi (x) and r = high − low
(columns l and h of Table 3).
STEP VII: num = (num − low)/r.
STEP VIII: i = i + 1.
STEP IX: End of the loop.
STEP X: Array arr is the decoded string.
STEP XI: End.
The execution of the algorithm for the binary string 0000101100001 is
illustrated in Table 3 [1].
Since l = 5 → No. of iterations = 5.
Output: arr = {1, 1, 2, 5, 6}

3 Proposed Algorithm

Encoding algorithm

STEP 1: Start
STEP 2: Input the string, i.e., str.
STEP 3: Count the length of the string, i.e., length.
STEP 4: Initialize pini = 0.0, pfin = 1.0 and r = pfin − pini.
STEP 5: i = 0
STEP 6: Repeat steps 7 and 8 while i < length
STEP 7: Look up the entry of Table 4 that corresponds to the character str[i] and
store its limits as r_min[i] and r_max[i].
STEP 8: i = i + 1
[End of loop]
STEP 9: i1 = 0
STEP 10: Repeat steps 11 and 12 while i1 < length
STEP 11: pfin = pini + r * r_max[i1], then pini = pini + r * r_min[i1] and
r = pfin − pini
STEP 12: i1 = i1 + 1
[End of loop]
STEP 13: Select a value val that lies between pini and pfin, i.e., pini < val < pfin
STEP 14: Convert val to binary and store the bit string in str1
STEP 15: Check if (length of str1) % 7 != 0
t = (length of str1) % 7
t1 = 7 − t
Add t1 number of 0’s at the start of the string
[End of if] (t1 = 0 when no padding is needed)
STEP 16: Count the length of str1, i.e., l
STEP 17: l1 = l / 7
STEP 18: i2 = 0
STEP 19: Repeat steps 20 to 25 while i2 < l1
STEP 20: j = 0
STEP 21: Repeat steps 22 and 23 while j < 7
STEP 22: Store str1[7 * i2 + j] in a new 7-bit array
STEP 23: j = j + 1
[End of inner loop]
STEP 24: Convert the 7-bit array to its decimal equivalent value
STEP 25: i2 = i2 + 1
[End of outer loop]
STEP 26: Print the decimal values, which form the compressed string.

Table 4 Minimum and maximum values for different strings

Sl. no.  String  r_min  r_max  Range
1        a       0.0    0.3    0.3
2        e       0.3    0.55   0.25
3        i       0.55   0.75   0.2
4        o       0.75   0.9    0.15
5        u       0.9    1.0    0.1

Decoding algorithm

See Table 4.
STEP 1: Read the number of zeros added, i.e., t1
STEP 2: i = 0
STEP 3: Repeat steps 4 and 5 while i < l1
STEP 4: Read the i-th decimal number and convert it to its 7-bit binary equivalent
STEP 5: i = i + 1
[End of loop]
STEP 6: Concatenate all the 7-bit strings; the first t1 bits are the added 0’s
and are to be discarded
STEP 7: i1 = t1
STEP 8: Repeat steps 9 and 10 while i1 < l, the length of the concatenated string
STEP 9: Store bit i1 of the concatenated string in an array
STEP 10: i1 = i1 + 1
[End of loop]
STEP 11: Determine the decimal equivalent of the stored bit string, i.e., deci
STEP 12: Check that the decimal value lies within the coded range, i.e.,
pini < deci < pfin
STEP 13: i2 = 0
STEP 14: Repeat steps 15 and 16 while i2 < length
STEP 15: Find the row of Table 4 whose r_min and r_max values enclose deci, print
the corresponding string, and rescale deci = (deci − r_min) / (r_max − r_min)
STEP 16: i2 = i2 + 1
[End of loop]
STEP 17: The printed string is the required output, i.e., the input string
STEP 18: End

4 Results and Analysis

The proposed algorithm is tested with input strings of various lengths and the
corresponding output size is obtained. The compression ratio is an important
parameter of any data compression algorithm, as it indicates the effectiveness
of the compression.

Table 5 Variation of output string size with string length for best, intermediate, and worst cases

Sl. no.  String length  Compression ratio
                        Best case  Intermediate case  Worst case
1        5              2.5        2.5                2.5
2        15             3.75       3.75               3.75
3        25             5          4.167              3.571
4        35             7          5                  4.375

The compression ratio obtained by the proposed algorithm is well above unity.
The compression ratios obtained for different string lengths are given in
Table 5. The input string can fall into one of three cases: a string containing
only the character with the highest probability (best case), a string containing
only the character with the lowest probability (worst case), and any random
combination of characters (intermediate case). As expected, the compression
ratio for the best case is at least as high as that obtained for the
intermediate or worst case. From Table 5, it is also clear that the compression
ratio increases with the length of the input string in all three cases, so the
algorithm can be used to compress large strings quite effectively.

5 Conclusions

From Table 5, it is clear that the compression ratio improves for longer
strings, reaching 7 in the best case for a 35-character string. In this paper,
only the vowel characters a, e, i, o, u are considered, with arbitrary
probabilities, to test the algorithm. The algorithm can be extended to compress
character strings containing all characters, including special characters, but
the compression ratio is not expected to be as high as in this case. Accurate
determination of the probability of occurrence of the characters is required to
improve the compression ratio. This is possible either by following the
character probability pattern of previously available data or by employing an
adaptive algorithm. The adaptive algorithm, however, has its own limitations
because the probability distribution table is required for decoding. The
variation of actual string size and encrypted data size with the length of the
input string, for the three possible cases, is shown in Fig. 2. From the graph
in Fig. 2, it is clear that although the encrypted data size is the same in all
three cases for shorter strings, for longer strings there is a significant
difference in encrypted data size between the best and worst cases.

Fig. 2 Variation of data size with the length of input array
[Plot: data size (axis 0 to 40) against length of input array (5, 15, 25, 35)
for the series Actual, Best Case, Intermediate, and Worst]

References

1. Sarkar, S. J., Das, B., Dutta, T., Dey, P., Mukherjee, A.: An Alternative Voltage and Frequency
Monitoring Scheme for SCADA based Communication in Power System using Data
Compression. In: International Conference and Workshop on Computing and Communication
(IEMCON), pp. 1–7 (2015)
2. Takahashi, Y., Matsui, S., Nakata, Y., Kondo, T.: Communication Method with Data
Compression & Encryption for Mobile Computing Environment, https://www.isoc.org/inet96/
proceedings/a6/a6_2.html
3. Liu, H.-S., Chuang, C.-C., Lin, C.-C., Chang, R.-I, Wang, C.-H., Hsieh, C.-C.: Data
Compression for Energy Efficient Communication on Ubiquitous Sensor Network. In:
Tamkang Journal of Science and Engineering, Vol. 14, No. 3, pp. 345–354 (2011)
4. Kodituwakku, S. R., Amarasinghe, U. S.: Comparisons of Lossless Data Compression
Algorithms for Text Data. In: Indian Journal of Computer Science and Engineering, Vol. 1,
No. 4, pp. 406–425
5. Brar, R. S., Singh, B.: A survey on different compression techniques and bit reduction
algorithm for compression of text data. In: International Journal of Advanced Research in
Computer Science and Software Engineering (IJARCSSE), Volume 3, Issue 3 (March 2013)
6. Theory of Data Compression, http://www.data-compression.com/theory.shtml
7. Porwal, S., Chaudhary, Y., Joshi, J., Jain, M.: Data Compression Methodologies for
Lossless Data and Comparison between Algorithms. In: International Journal of Engineering
Science and Innovative Technology (IJESIT), Volume 2, Issue 2 (March 2013)
8. Shanmugasundaram, S., Lourdusamy, R.: A Comparative Study of Text Compression
Algorithms. In: International Journal of Wisdom Based Computing, Vol. 1 (3) (December
2011)
9. Li, Z.-N., Drew, Mark S., Liu, J.: Fundamentals of Multimedia, 2nd Edition, Springer (2014)
