Analysis and Comparison of Algorithms For Lossless Data Compression
Hyderabad, INDIA.
Abstract
1. Introduction
Data compression is the art of representing information in a compact form. It reduces
file size, which in turn reduces the required storage space and speeds up the
transmission of data. Compression techniques look for redundant data and remove the
redundancy. Data compression can be divided into two broad classes: lossless data
compression and lossy data compression. In lossless compression, the exact original
data can be recovered from the compressed data. It is used when no difference between
the original data and the decompressed data can be tolerated. Medical images, text needed
for legal purposes and computer executable files are compressed using lossless
140 Anmol Jyot Maan
Example of RLE:
Input: AAABBCCCCD
Output: 3A2B4C1D
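The example above can be reproduced with a short Python sketch. The function names rle_encode and rle_decode are illustrative, and the decoder assumes that the symbols themselves are never digits, since otherwise the counts could not be told apart from the data.

```python
import re
from itertools import groupby


def rle_encode(text):
    """Encode each run of identical symbols as <count><symbol>."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))


def rle_decode(encoded):
    """Invert rle_encode; assumes no symbol is itself a digit."""
    return "".join(symbol * int(count)
                   for count, symbol in re.findall(r"(\d+)(\D)", encoded))


print(rle_encode("AAABBCCCCD"))  # 3A2B4C1D
print(rle_decode("3A2B4C1D"))    # AAABBCCCCD
```

Note that a run of length one still costs two characters (1D), so RLE only pays off when the input contains long runs.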
3. Huffman Coding
The Huffman coding algorithm was first developed by David Huffman in 1951. Huffman
coding is an entropy encoding algorithm used for lossless data compression. In this
algorithm, fixed-length codes are replaced by variable-length codes. When using
variable-length code words, it is desirable to create a prefix code, in which no code
word is a prefix of another, so that no separator is needed to determine code word
boundaries. Huffman coding uses such a prefix code.
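To illustrate why a prefix code needs no separators, the following Python sketch checks the prefix property and decodes a bit string by emitting a symbol as soon as the bits read so far match a code word. The three-symbol code table and the function names are hypothetical, chosen only for illustration.

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(codes.values())
    # In sorted order, a word that is a prefix of some other word is
    # immediately followed by a word that starts with it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def decode(bits, codes):
    """Decode a bit string written with a prefix code, without separators."""
    inverse = {word: symbol for symbol, word in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # a complete code word has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return "".join(out)


toy_codes = {"a": "0", "b": "10", "c": "11"}  # hypothetical code table
print(is_prefix_free(toy_codes))    # True
print(decode("010110", toy_codes))  # abca
```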
The Huffman procedure works as follows:
1. Symbols with a high frequency are expressed using shorter encodings than
symbols which occur less frequently.
2. The two symbols that occur least frequently have code words of the same length.
The Huffman algorithm uses the greedy approach, i.e. at each step the algorithm
chooses the best available option. A binary tree is built from the bottom up. To see
how Huffman coding works, let’s take an example. Assume that the characters in a
file to be compressed have the following frequencies:
A: 25 B: 10 C: 99 D: 87 E: 9 F: 66
The process of building this tree is:
1. Create a leaf node for each symbol and arrange the nodes in order of
frequency, from highest to lowest.
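2. Select the two nodes with the lowest frequencies and create a parent node whose
frequency is the sum of theirs. Add the parent node to the list and remove the two
child nodes from the list. Repeat this step until only one node is left.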
3. Now label each edge. The edge to the left child of each parent is labeled with the
digit 0 and the edge to the right child with 1. The code word for each source letter
is the sequence of labels along the path from the root to the leaf node representing
the letter.
C 00
D 01
F 10
A 110
B 1110
E 1111
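The tree construction can be sketched in Python with a priority queue. This is a minimal illustration rather than the paper's own implementation: the function name huffman_codes is hypothetical, and the convention of placing the more frequent of the two merged nodes on the left (edge 0) is an assumption, chosen because it reproduces the code words listed above.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    tie = count()  # unique tie-breaker so the heap never compares trees
    # Heap entries are (frequency, tie_breaker, tree). A tree is either a
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f_lo, _, lo = heapq.heappop(heap)  # least frequent node
        f_hi, _, hi = heapq.heappop(heap)  # second least frequent node
        # Assumed convention: the heavier child goes on the left (edge 0).
        heapq.heappush(heap, (f_lo + f_hi, next(tie), (hi, lo)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol input gets "0"

    walk(heap[0][2], "")
    return codes


freqs = {"A": 25, "B": 10, "C": 99, "D": 87, "E": 9, "F": 66}
codes = huffman_codes(freqs)
print(codes["C"], codes["D"], codes["F"], codes["A"], codes["B"], codes["E"])
# 00 01 10 110 1110 1111
```

With these code words the 296 characters of the example take 655 bits, compared with 888 bits for a fixed 3-bit code over six symbols.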
4. Arithmetic Coding
Arithmetic Coding is useful for small alphabets with highly skewed probabilities. In
this method, a code word is not used to represent each symbol of the text. Instead, a
single code is produced for the entire message. Arithmetic Coding assigns an interval
to each symbol, and the message is finally encoded as a decimal number lying in the
resulting interval. Initially, the interval is [0, 1). A message is represented by a
half-open interval [x, y), where x and y are real numbers between 0 and 1. The
interval is divided into sub-intervals. The number of sub-intervals is equal to the
number of symbols in the current set of symbols, and the size of each is proportional
to the probability of appearance of its symbol. For each symbol of the message, a new
division takes place within the sub-interval selected by the previous symbol.
Consider an example illustrating encoding in Arithmetic Coding.
In Table 3, the range, high value and low value are calculated as:
Range = High value - Low value
High value = Low value + Range * (high range of the symbol being encoded)
Low value = Low value + Range * (low range of the symbol being encoded)
Both updates use the low value from before the current symbol is processed.
The string “YXX” is represented by an arbitrary number within the interval [0.5,
0.575).
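The interval updates can be traced with a short Python sketch. The symbol table is not reproduced in this section, so the ranges used here (X: [0, 0.5), Y: [0.5, 0.8), Z: [0.8, 1.0)) are an assumption, chosen because they yield the interval quoted above for "YXX". The name encode_interval is likewise illustrative. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Assumed symbol ranges [low, high); hypothetical, not taken from the tables.
RANGES = {
    "X": (Fraction(0), Fraction(1, 2)),
    "Y": (Fraction(1, 2), Fraction(4, 5)),
    "Z": (Fraction(4, 5), Fraction(1)),
}


def encode_interval(message, ranges=RANGES):
    """Return the final half-open interval [low, high) for a message."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        span = high - low                  # Range = High value - Low value
        sym_low, sym_high = ranges[symbol]
        high = low + span * sym_high       # uses the old low value
        low = low + span * sym_low
    return low, high


low, high = encode_interval("YXX")
print(float(low), float(high))  # 0.5 0.575
```

Any number in the returned interval, for example 0.5, identifies the message, provided the decoder knows the message length or an end-of-message symbol is used.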
1. Compression Ratio: It is the ratio between the size of the compressed file and
the size of the source file.
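As a small worked example of this definition, the sketch below computes the ratio for the Run Length Encoding example given earlier, assuming one byte per character (10 bytes in, 8 bytes out). The function name compression_ratio is illustrative.

```python
def compression_ratio(compressed_size, source_size):
    """Size of the compressed file divided by the size of the source file."""
    if source_size <= 0:
        raise ValueError("source size must be positive")
    return compressed_size / source_size


# "AAABBCCCCD" (10 bytes) encoded as "3A2B4C1D" (8 bytes)
print(compression_ratio(len("3A2B4C1D"), len("AAABBCCCCD")))  # 0.8
```

Under this definition a smaller value means better compression, and a value above 1 means the output is larger than the input.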
Conclusion
Arithmetic coding outperforms Huffman coding and Run Length Encoding: its
compression ratio is better than that of the other two algorithms examined above.
Among the algorithms selected for this paper, Arithmetic Coding is therefore found
to be the most efficient.