Chapter 5
Lossless Compression
Computers can handle many different kinds of information like text, equations,
games, sound, photos, and film. Some of these information sources require a
huge amount of data and may quickly fill up your hard disk or take a long time
to transfer across a network. For this reason it is interesting to see if we can
somehow rewrite the information in such a way that it takes up less space. This
may seem like magic, but it does in fact work well for many types of information.
There are two general classes of methods, those that do not change the informa-
tion, so that the original file can be reconstructed exactly, and those that allow
small changes in the data. Compression methods in the first class are called loss-
less compression methods while those in the second class are called lossy com-
pression methods. Lossy methods may sound risky since they will change the
information, but for data like sound and images small alterations do not usually
matter. On the other hand, for certain kinds of information, such as text,
we cannot tolerate any change, so we have to use lossless compression methods.
In this chapter we are going to study lossless methods; lossy methods will be
considered in a later chapter. To motivate our study of compression techniques,
we will first consider some examples of technology that generate large amounts
of information. We will then study two lossless compression methods in detail,
namely Huffman coding and arithmetic coding. Huffman coding is quite simple
and gives good compression, while arithmetic coding is more complicated, but
gives excellent compression.
In section 5.3.2 we introduce the information entropy of a sequence of sym-
bols which essentially tells us how much information there is in the sequence.
This is useful for comparing the performance of different compression strate-
gies.
5.1 Introduction
The potential for compression increases with the size of a file. A book typically
has about 300 words per page and an average word length of four characters. A
book with 500 pages would then have about 600 000 characters. If we write in
English, we may use a character encoding like ISO Latin 1 which only requires
one byte per character. The file would then be about 700 KB (kilobytes), includ-
ing 100 KB of formatting information. If we instead use UTF-16 encoding, which
requires two bytes per character, we end up with a total file size of about 1300 KB
or 1.3 MB. Both files would represent the same book so this illustrates straight
away the potential for compression, at least for UTF-16 encoded documents.
On the other hand, the capacity of present-day hard disks and communication
channels is such that a saving of 0.5 MB is usually negligible.
For sound files the situation is different. A music file in CD-quality requires
44 100 two-byte integers to be stored every second for each of the two stereo
channels, a total of about 176 KB per second, or about 10 MB per minute of
music. A four-minute song therefore corresponds to a file size of 40 MB and a
CD with one hour of music contains about 600 MB. If you just have a few CDs
this is not a problem when the average size of hard disks is approaching 1 TB
(1 000 000 MB or 1 000 GB). But if you have many CDs and want to store the
music in a small portable player, it is essential to be able to compress this in-
formation. Audio formats like MP3 and AAC manage to reduce the files to
about 10 % of the original size without sacrificing much of the quality.
Not surprisingly, video contains even more information than audio so the
potential for compression is considerably greater. Reasonable quality video re-
quires at least 25 images per second. The images used in traditional European
television contain 576 × 720 small coloured dots, each of which is represented
with 24 bits. One image therefore requires about 1.2 MB and one second of
video requires about 31 MB. This corresponds to 1.9 GB per minute and 112 GB
per hour of video. In addition we also need to store the sound. If you have more
than a handful of films in such an uncompressed format, you are quickly going
to exhaust the capacity of even quite large hard drives.
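These figures are simple arithmetic, and the following small Python sketch (our own, using powers of 10 for KB, MB and GB as the text does) recomputes them; the formatting information in the book example is left out.

# A rough recomputation of the uncompressed sizes discussed above,
# with KB, MB and GB taken as powers of 10, as in the text.
KB, MB, GB = 10**3, 10**6, 10**9

book_chars = 500 * 300 * 4                    # 500 pages, 300 words, 4 chars per word
print(book_chars / KB, "KB with one byte per character")        # ~600 KB
print(2 * book_chars / KB, "KB with UTF-16")                    # ~1200 KB

audio_per_second = 44_100 * 2 * 2             # samples * bytes * stereo channels
print(audio_per_second / KB, "KB of audio per second")          # ~176 KB
print(60 * audio_per_second / MB, "MB of audio per minute")     # ~10.6 MB

image_bytes = 576 * 720 * 3                   # dots * 3 bytes (24 bits) per dot
print(image_bytes / MB, "MB per image")                         # ~1.2 MB
print(25 * image_bytes / MB, "MB of video per second")          # ~31 MB
print(3600 * 25 * image_bytes / GB, "GB of video per hour")     # ~112 GB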
These examples should convince you that there is a lot to be gained if we
can compress information, especially for video and music, and virtually all video
formats, even the high-quality ones, use some kind of compression. With com-
pression we can fit more information onto our hard drive and we can transmit
information across a network more quickly.
00000000000000000000000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000111000000000000000000000
00000000000000000000000000000000000000000
Figure 5.1. A representation of a black and white image, for example part of a scanned text document.
If we are working with English text, the sequence x will just be a string of
letters and other characters like x = {h,e, l, l, o, , a, g, a, i, n, .} (the character after
’o’ is space, and the last character a period). The alphabet A is then the ordinary
Latin alphabet augmented with the space character, punctuation characters and
digits, essentially characters 32–127 of the ASCII table, see Table 4.2. In fact, the
ASCII codes define a dictionary since they assign a binary code to each character.
However, if we want to represent a text with few bits, this is not a good dictionary
because the codes of very frequent characters are no shorter than the codes of
the characters that are hardly ever used.
In other contexts, we may consider the information to be a sequence of bits
and the alphabet to be {0, 1}, or we may consider sequences of bytes in which
case the alphabet would be the 256 different bit combinations in a byte.
Let us now suppose that we have a text x = {x 1 , x 2 , . . . , x m } with symbols taken
from an alphabet A . A simple way to represent the text in a computer is to
assign an integer code c(αi ) to each symbol and store the sequence of codes
{c(x 1 ), c(x 2 ), . . . , c(x m )}. The question is just how the codes should be assigned.
Small integers require fewer digits than large ones so a good strategy is to
let the symbols that occur most frequently in x have short codes and use long
codes for the rare symbols. This leaves us with the problem of knowing the
boundary between the codes. Huffman coding uses a clever set of binary codes
which makes it impossible to confuse the codes even though they have different
lengths.
Fact 5.2 (Huffman coding). In Huffman coding the most frequent symbols in
a text x get the shortest codes, and the codes have the prefix property which
means that the bit sequence that represents a code is never a prefix of any
other code. Once the codes are known the symbols in x are replaced by their
codes and the resulting sequence of bits z is the compressed version of x.
Example 5.3. This may sound a bit vague, so let us consider an example. Sup-
pose we have the four-symbol text x = DBACDBD of length 7. We note that the
symbol D occurs most often (three times), B occurs twice, while A and C each
occur only once. A set of Huffman codes for this text is

c(A) = 000,   c(B) = 01,   c(C) = 001,   c(D) = 1.   (5.1)

If we replace each symbol in x by its code we obtain the compressed bit sequence

z = 1010000011011,   (5.2)
altogether 13 bits, while a standard encoding with one byte per character would
require 56 bits. Note also that we can easily decipher the code since the codes
have the prefix property. The first bit is 1 which must correspond to a ’D’ since
this is the only character with a code that starts with a 1. The next bit is 0 and
since this is the start of several codes we read one more bit. The only character
with a code that starts with 01 is ’B’, so this must be the next character. The next
bit is 0 which does not uniquely identify a character so we read one more bit. The
code 00 does not identify a character either, but with one more bit we obtain the
code 000 which corresponds to the character ’A’. We can obviously continue in
this way and decipher the complete compressed text.
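The decoding procedure just described is easy to express in a few lines of code. The following Python sketch (our own; the code table is the one from (5.1)) reads one bit at a time and emits a symbol as soon as the bits read so far match a code, which is unambiguous precisely because of the prefix property.

def decode_prefix_code(bits, codes):
    """Decode a bit string produced with a prefix code.

    codes maps each symbol to its bit string; since no code is a prefix
    of another, we can simply extend the current group of bits until it
    matches a code, exactly as in the example above.
    """
    inverse = {code: symbol for symbol, code in codes.items()}
    text, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a complete code has been read
            text.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string does not end on a code boundary")
    return "".join(text)

# The codes from (5.1).
codes = {"A": "000", "B": "01", "C": "001", "D": "1"}
print(decode_prefix_code("1010000011011", codes))   # prints DBACDBD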
Compression is not quite as simple as it was presented in example 5.3. A
program that reads the compressed code must clearly know the codes (5.1) in
order to decipher the code. Therefore we must store the codes as well as the
compressed text z. This means that the text must have a certain length before it
is worth compressing it.
An example of a binary tree is shown in figure 5.2. The root node which is
shown at the top has two subtrees. The subtree to the right also has two subtrees,
both of which only contain leaf nodes. The subtree to the left of the root only has
one subtree which consists of a single leaf node.
Definition 5.5. A Huffman tree is a binary tree that can be associated with an
alphabet consisting of symbols α_1, ..., α_n with frequencies f(α_i) as follows:
1. Each leaf node is associated with exactly one symbol αi in the alphabet,
and all symbols are associated with a leaf node.
2. Each node is assigned a weight: a leaf node has weight equal to the frequency f(α_i) of its symbol, and each internal node has weight equal to the sum of the weights of its two children.

3. All nodes that are not leaf nodes have exactly two children.
Example 5.6. In figure 5.3 the tree in figure 5.2 has been turned into a Huffman
tree. The tree has been constructed from the text CCDACBDC with the alphabet
{A,B,C,D} and frequencies f (A) = 1, f (B ) = 1, f (C ) = 4 and f (D) = 2. It is easy
to see that the weights have the properties required for a Huffman tree, and by
following the edges we see that the Huffman codes are given by c(C) = 0, c(D) = 10,
c(A) = 110 and c(B) = 111. Note in particular that the root of the tree has weight
equal to the length of the text. (In figure 5.3 the root, of weight 8, has the leaf C
of weight 4 as its left child; its right child is a node of weight 4 with the leaf D of
weight 2 on the left and a node of weight 2, whose children are the leaves A and B,
on the right. Left edges are labelled 0 and right edges 1.)
We will usually omit the labels on the edges since they are easy to remember:
An edge that points to the left corresponds to a 0, while an edge that points to
the right yields a 1.
Algorithm 5.7 (Huffman algorithm). Let the text x with symbols α_1, ..., α_n be
given, and let the frequency of α_i be f(α_i). The Huffman tree is constructed
by performing the following steps:
1. Construct a one-node tree for each symbol α_i and give it the weight f(α_i).

2. Repeat the following step until only one tree is left:

(a) Choose two trees T_0 and T_1 with minimal weights and replace
them with a new tree which has T_0 as its left subtree and T_1 as
its right subtree. The weight of the new tree is the sum of the
weights of T_0 and T_1.
3. The tree remaining after the previous step is a Huffman tree for the
given text x.
Most of the work in algorithm 5.7 is in step 2, but note that the number of
trees is reduced by one each time, so the loop will run at most n times.
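As a concrete illustration, here is a minimal Python sketch of algorithm 5.7 (our own, not taken from the text). It uses a heap to pick the two trees of smallest weight; ties between equal weights may be broken differently than in the examples of this chapter, so individual codes can differ, but the total number of bits is the same.

import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman code.

    A sketch of algorithm 5.7: start with a one-node tree for every
    symbol, weighted by its frequency, and repeatedly merge the two
    trees of smallest weight.
    """
    freq = Counter(text)
    if len(freq) == 1:                        # degenerate one-symbol alphabet
        return {symbol: "0" for symbol in freq}

    # Heap entries are (weight, counter, tree); the counter breaks ties so
    # that trees themselves are never compared.  A tree is either a symbol
    # (a leaf) or a pair (left_subtree, right_subtree).
    heap = [(f, i, symbol) for i, (symbol, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)       # smallest weight  -> left subtree
        w1, _, t1 = heapq.heappop(heap)       # second smallest  -> right subtree
        heapq.heappush(heap, (w0 + w1, counter, (t0, t1)))
        counter += 1

    codes = {}
    def assign(tree, prefix):                 # left edge is 0, right edge is 1
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    assign(heap[0][2], "")
    return codes

text = "then the hen began to eat"
codes = huffman_codes(text)
print(sum(len(codes[c]) for c in text))       # 75 bits, as in example 5.8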
The easiest way to get to grips with the algorithm is to try it on a simple
example.
Example 5.8. Let us try out algorithm 5.7 on the text ’then the hen began to eat’.
This text consists of 25 characters, including the five spaces. We first determine
the frequencies of the different characters by counting. We find the collection of
one-node trees
t: 4,  h: 3,  e: 5,  n: 3,  b: 1,  g: 1,  a: 2,  o: 1,  space: 5,
where the last entry counts the space character. Since ’b’ and ’g’ are two of the
characters with the lowest frequency, we combine them into a tree of weight 2,
t: 4,  h: 3,  e: 5,  n: 3,  a: 2,  o: 1,  space: 5,  (b, g): 2,
where (b, g) denotes the new tree of weight 2 with leaves ’b’ and ’g’.
The two trees with the lowest weights are now the character ’o’ and the tree we
formed in the last step. If we combine these we obtain
t: 4,  h: 3,  e: 5,  n: 3,  a: 2,  space: 5,  (o, b, g): 3,
where (o, b, g) is the new tree of weight 3. The two trees with the lowest weights
are now ’a’ (weight 2) and ’h’ (weight 3); combining them gives
t: 4,  e: 5,  n: 3,  space: 5,  (o, b, g): 3,  (h, a): 5.
Next, ’n’ and the tree (o, b, g), both of weight 3, are combined into a tree of
weight 6, which leaves
t: 4,  e: 5,  space: 5,  (h, a): 5,  (n, o, b, g): 6.
Then ’t’ and ’e’ are combined into a tree of weight 9,
space: 5,  (h, a): 5,  (n, o, b, g): 6,  (t, e): 9,
and after that the two trees of weight 5, (h, a) and the space character, are
combined into a tree of weight 10, leaving the three trees
(n, o, b, g): 6,  (t, e): 9,  ((h, a), space): 10.
Figure 5.4. The Huffman tree for the text ’then the hen began to eat’. The root has
weight 25; its left subtree (weight 10) consists of the tree (h, a) of weight 5 and the
space character of weight 5, while its right subtree (weight 15) consists of the tree
(n, o, b, g) of weight 6 and the tree (t, e) of weight 9.
The next step combines the two smallest of these trees, (n, o, b, g) and (t, e), into
a tree of weight 15, so that only two trees remain, one of weight 10 and one of
weight 15.
By combining these two trees we obtain the final Huffman tree in figure 5.4.
From this we can read off the Huffman codes as

c(t) = 110,  c(h) = 000,  c(e) = 111,  c(n) = 101,  c(b) = 10000,
c(g) = 10001,  c(a) = 001,  c(o) = 1001,  and the code of the space character is 01,

so we see that the Huffman coding of the text ’then the hen began to eat’ is
110 000 111 101 01 110 000 111 01 000 111 101 01 10000
111 10001 001 101 01 110 1001 01 111 001 110
The spaces and the new line have been added to make the code easier to read;
on a computer these will not be present.
The original text consists of 25 characters including the spaces. Encoding
this with standard eight-bit encodings like ISO Latin 1 or UTF-8 would require
200 bits. Since there are only nine distinct symbols we could use a shorter fixed-width
encoding for this particular text. This would require four bits per symbol and
would reduce the total length to 100 bits. In contrast, the Huffman encoding
only requires 75 bits.
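Indeed, the 75 bits can be checked directly by weighting each code length by the frequency of its symbol:

4 · 3 + 3 · 3 + 5 · 3 + 3 · 3 + 2 · 3 + 1 · 4 + 1 · 5 + 1 · 5 + 5 · 2 = 75,

where the terms correspond to t, h, e, n, a, o, b, g and the space character.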
Proposition 5.9 (Prefix property). Huffman coding has the prefix property:
No code is a prefix of any other code.
Proof. Suppose that Huffman coding does not have the prefix property; we will
show that this leads to a contradiction. Let the code c_1 be the prefix of another
code c_2, and let n_i be the node associated with the symbol with code c_i. Then
the node n_1 must lie somewhere on the path from the root down to n_2. But then
n_2 is located further from the root than n_1, so n_1 has children and cannot be a leaf node,
which contradicts the definition of a Huffman tree (remember that symbols are
only associated with leaf nodes).
We emphasise that it is the prefix property that makes it possible to use vari-
able lengths for the codes; without this property we would not be able to decode
an encoded text. Just consider the simple case where c(A) = 01, c(B ) = 010 and
c(C ) = 1; which text would the code 0101 correspond to?
In the Huffman algorithm, we start by building trees from the symbols with
lowest frequency. These symbols will therefore end up furthest from the root
and will receive the longest codes, as is evident from example 5.8. Likewise, the
symbols with the highest frequencies will end up near the root of the tree and
therefore receive short codes. This property of Huffman coding can be quanti-
fied, but to do this we must introduce a new concept.
Note that any binary tree with the symbols at the leaf nodes gives rise to a
coding with the prefix property. A natural question is then which tree gives the
coding with the fewest bits?
Theorem 5.10. Let x be a given text, let T be any binary tree with the symbols of x
as leaf nodes, and let ℓ(T) denote the total number of bits needed to encode x with
the codes defined by T. If T* denotes a Huffman tree for x, then

ℓ(T*) ≤ ℓ(T).
Theorem 5.10 says that Huffman coding is optimal, at least among coding
schemes based on binary trees. Together with its simplicity, this accounts for
the popularity of this compression method.
If we let ℓ(α_i) denote the number of bits in the Huffman code of the symbol α_i,
the total number of bits used to encode the text x is

B = Σ_{i=1}^{n} f(α_i) ℓ(α_i).   (5.3)
However, we note that if we multiply all the frequencies by the same constant,
the Huffman tree remains the same. It therefore only depends on the relative
frequencies of the different symbols, and not the length of the text. In other
words, if we consider a new text which is twice as long as the one we used in
example 5.8, with each letter occurring twice as many times, the Huffman tree
would be the same. This indicates that we should get a good measure of the
quality of an encoding if we divide the total number of bits used by the length of
the text. If the length of the text is m this leads to the quantity
B̄ = Σ_{i=1}^{n} (f(α_i)/m) ℓ(α_i).   (5.4)
If we consider longer and longer texts of the same type, it is reasonable to be-
lieve that the relative frequencies of the symbols would converge to a limit p(αi )
which is usually referred to as the probability of the symbol αi . As always for
probabilities we have Σ_{i=1}^{n} p(α_i) = 1.
Note that the Huffman algorithm will work just as well if we use the prob-
abilities as weights rather than the frequencies, as this is just a relative scaling.
In fact, the most obvious way to obtain the probabilities is to divide the
frequencies by the total number of symbols in a given text. However, it is also pos-
sible to use a probability distribution that has been determined by some other
means. For example, the probabilities of the different characters in English have
been determined for typical texts. Using these probabilities and the correspond-
ing codes will save you the trouble of processing your text and computing the
probabilities for a particular text. Remember however that such pre-computed
probabilities are not likely to be completely correct for a specific text, particu-
larly if the text is short. And this of course means that your compressed text will
not be as short as it would be had you computed the correct probabilities.
In practice, it is quite likely that the probabilities of the different symbols
change as we work our way through a file. If the file is long, it probably contains
different kinds of information, as in a document with both text and images. It
would therefore be useful to update the probabilities at regular intervals. In the
case of Huffman coding this would of course also require that we update the
Huffman tree and therefore the codes assigned to the different symbols. This
may sound complicated, but is in fact quite straightforward. The key is that the
decoding algorithm must compute probabilities in exactly the same way as the
compression algorithm and update the Huffman tree at exactly the same posi-
tion in the text. As long as this requirement is met, there will be no confusion as
the compression and decoding algorithms will always use the same codes.
5.3.2 Information entropy

For an alphabet with symbols α_1, ..., α_n which occur with probabilities
p(α_1), ..., p(α_n), consider the quantity

H = Σ_{i=1}^{n} p(α_i) log2(1/p(α_i)),

where log2 denotes the logarithm to base 2. The quantity H is called the in-
formation entropy of the alphabet with the given probabilities.
Example 5.13. Let us return to example 5.8 and compute the entropy in this
particular case. From the frequencies we obtain the probabilities

p(t) = 4/25,  p(h) = 3/25,  p(e) = 5/25,  p(n) = 3/25,  p(a) = 2/25,
p(b) = p(g) = p(o) = 1/25,

and the probability of the space character is 5/25. Inserting these in the formula
for H gives an entropy of about 2.93 bits per symbol, while the Huffman code in
example 5.8 uses 75 bits for the 25 symbols, that is, 3 bits per symbol.
The entropy can be interpreted as follows: the formula for H suggests that the ideal
length of the code for the symbol α_i is log2(1/p(α_i)), which is roughly the
number of binary digits needed to write the number 1/p(α_i). This indicates that
an optimal compression scheme would use about log2(1/p(α_i)) bits for α_i. Huffman
coding necessarily uses an integer number of bits for each code, and therefore only
has a chance of reaching entropy performance when 1/p(α_i) is a power of 2 for
all the symbols. In fact Huffman coding does reach entropy performance in this
situation, see exercise 3.
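As a small illustration, the following Python sketch (our own) computes the entropy from a probability distribution; applied to the frequencies of example 5.8 it gives about 2.93 bits per symbol, compared with the 3 bits per symbol (75 bits for 25 symbols) used by the Huffman code.

from collections import Counter
from math import log2

def entropy(probabilities):
    """Information entropy H = sum of p * log2(1/p) over the distribution."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

text = "then the hen began to eat"
freq = Counter(text)
m = len(text)
probs = [f / m for f in freq.values()]

print(entropy(probs))    # about 2.93 bits per symbol
print(75 / m)            # 3.0 bits per symbol for the Huffman code of example 5.8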
Idea 5.15 (Basic idea of arithmetic coding). Arithmetic coding associates se-
quences of symbols with different subintervals of [0, 1). The width of a subin-
terval is proportional to the probability of the corresponding sequence of
symbols, and the arithmetic code of a sequence of symbols is a floating-point
number in the corresponding interval.
Figure 5.5. The basic principle of arithmetic coding applied to the text in example 5.16:
the interval [0, 1) is split at 0.8 into the subintervals corresponding to the symbols 0 and 1;
each subinterval is then split in the same proportions for the next symbol (giving the
intervals for 00, 01, 10 and 11, with break points at 0.64 and 0.96), and so on for longer
sequences.
Observation 5.17. Let [a, b] be a given interval with a < b. The function
g (z) = a + z(b − a)
will map any number z in [0, 1] to a number in the interval [a, b]. In particular
the endpoints are mapped to the endpoints and the midpoint to the midpoint,
g(0) = a,   g(1/2) = (a + b)/2,   g(1) = b.
We are now ready to study the details of the arithmetic coding algorithm.
As before we have a text x = {x 1 , . . . , x m } with symbols taken from an alphabet
A = {α1 , . . . , αn }, with p(αi ) being the probability of encountering αi at any
given position in x. It is much easier to formulate arithmetic coding if we introduce
the cumulative probabilities: for each symbol α_i we let F(α_i) = p(α_1) + p(α_2) + ··· + p(α_i)
denote the probability of α_i together with all the preceding symbols, and we set
L(α_i) = F(α_i) − p(α_i), so that L(α_1) = 0.
It is important to remember that the functions F , L and p are defined for the
symbols in the alphabet A . This means that F (x) only makes sense if x = αi for
some i in the range 1 ≤ i ≤ n.
The basic idea of arithmetic coding is to split the interval [0, 1) into the n
subintervals
[0, F(α_1)),  [F(α_1), F(α_2)),  ...,  [F(α_{n−2}), F(α_{n−1})),  [F(α_{n−1}), 1),   (5.6)

so that the width of the subinterval [F(α_{i−1}), F(α_i)) is F(α_i) − F(α_{i−1}) = p(α_i).
If the first symbol is x_1 = α_i, the arithmetic code must lie in the interval [a_1, b_1)
where a_1 = L(x_1) and b_1 = F(x_1).
The next symbol in the text is x_2. If this were the first symbol of the text,
the desired subinterval would be [L(x_2), F(x_2)). Since it is the second symbol, we
must map the whole interval [0, 1) to the interval [a_1, b_1) and pick out the part
that corresponds to [L(x_2), F(x_2)). The mapping from [0, 1) to [a_1, b_1) is given by
g_2(z) = a_1 + z(b_1 − a_1) = a_1 + z p(x_1), see observation 5.17, so our new interval is

[a_2, b_2) = [g_2(L(x_2)), g_2(F(x_2))) = [a_1 + L(x_2)p(x_1), a_1 + F(x_2)p(x_1)).
The third symbol x_3 would be associated with the interval [L(x_3), F(x_3)) if it
were the first symbol. To find the correct subinterval, we map [0, 1) to [a_2, b_2)
with the mapping g_3(z) = a_2 + z(b_2 − a_2) and pick out the correct subinterval as

[a_3, b_3) = [g_3(L(x_3)), g_3(F(x_3))).
This process is then continued until all the symbols in the text have been pro-
cessed.
With this background we can formulate a precise algorithm for arithmetic
coding of a text of length m with n distinct symbols.
Algorithm 5.19 (Arithmetic coding). Let the text x = {x_1, ..., x_m} be given, with
symbols taken from an alphabet A = {α_1, ..., α_n} with probabilities p(α_i) and
cumulative probabilities F(α_i) and L(α_i) = F(α_i) − p(α_i). The arithmetic code of
x is determined by the following steps:

1. Set [a_0, b_0) = [0, 1).

2. For k = 1, ..., m

(a) Define the linear function g_k(z) = a_{k−1} + z(b_{k−1} − a_{k−1}).

(b) Set [a_k, b_k) = [g_k(L(x_k)), g_k(F(x_k))).

The arithmetic code of the text x is the midpoint C(x) of the interval [a_m, b_m),
i.e., the number

(a_m + b_m)/2,

truncated to

⌈−log2(p(x_1)p(x_2) ··· p(x_m))⌉ + 1

binary digits. Here ⌈w⌉ denotes the smallest integer that is larger than or equal
to w.
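As an illustration, here is a small Python sketch of algorithm 5.19 (our own, not an official implementation). It uses exact rational arithmetic to avoid the loss of precision discussed at the end of this section, the symbols are taken in the order in which they appear in the probability table, and the probabilities in the usage line are chosen to match the worked example below.

from fractions import Fraction
from math import ceil, log2

def arithmetic_code(text, p):
    """Arithmetic code of text, following algorithm 5.19.

    p maps each symbol to its probability; exact fractions are used so
    that the narrow intervals [a_k, b_k) are represented without rounding.
    """
    F = {}                                   # cumulative probabilities F(alpha_i)
    L = {}                                   # L(alpha_i) = F(alpha_i) - p(alpha_i)
    cumulative = Fraction(0)
    for s in p:                              # symbols in the order they appear in p
        L[s] = cumulative
        cumulative += Fraction(p[s])
        F[s] = cumulative

    a, b = Fraction(0), Fraction(1)          # step 1: [a_0, b_0) = [0, 1)
    for x in text:                           # step 2: narrow the interval
        width = b - a
        a, b = a + L[x] * width, a + F[x] * width

    # b - a equals p(x_1)...p(x_m), so the number of bits is
    # ceil(-log2(product of probabilities)) + 1.
    nbits = ceil(-log2(float(b - a))) + 1
    midpoint = (a + b) / 2
    bits = ""
    for _ in range(nbits):                   # extract binary digits of the midpoint
        midpoint *= 2
        bits += str(int(midpoint))
        midpoint -= int(midpoint)
    return bits

# Probabilities chosen to match the example below: p(A)=0.5, p(B)=0.3, p(C)=0.2.
p = {"A": Fraction(1, 2), "B": Fraction(3, 10), "C": Fraction(1, 5)}
print(arithmetic_code("ACBBCAABAA", p))      # 0111100011110001, 16 bits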
Let us apply the algorithm to a short example. Suppose we are to encode the text
x = ACBBCAABAA, using the probabilities p(A) = 0.5, p(B) = 0.3 and p(C) = 0.2,
so the cumulative probabilities are F(A) = 0.5, F(B) = 0.8 and F(C) = 1.0. This
means that the interval [0, 1) is split into the three subintervals

[0, 0.5),   [0.5, 0.8),   [0.8, 1).
The first symbol is A, so the first subinterval is [a 1 , b 1 ) = [0, 0.5). The second sym-
bol is C so we must find the part of [a 1 , b 1 ) that corresponds to C . The mapping
from [0, 1) to [0, 0.5) is given by g_2(z) = 0.5z, so [0.8, 1) is mapped to

[a_2, b_2) = [g_2(0.8), g_2(1)) = [0.4, 0.5).
The third symbol is B which corresponds to the interval [0.5, 0.8). We map [0, 1)
to the interval [a 2 , b 2 ) with the function g 3 (z) = a 2 + z(b 2 − a 2 ) = 0.4 + 0.1z so
[0.5, 0.8) is mapped to
[a_3, b_3) = [g_3(0.5), g_3(0.8)) = [0.45, 0.48).
Let us now write down the rest of the computations more schematically in a
table,
g_4(z) = 0.45 + 0.03z,           x_4 = B,   [a_4, b_4) = [g_4(0.5), g_4(0.8)) = [0.465, 0.474),
g_5(z) = 0.465 + 0.009z,         x_5 = C,   [a_5, b_5) = [g_5(0.8), g_5(1)) = [0.4722, 0.474),
g_6(z) = 0.4722 + 0.0018z,       x_6 = A,   [a_6, b_6) = [g_6(0), g_6(0.5)) = [0.4722, 0.4731),
g_7(z) = 0.4722 + 0.0009z,       x_7 = A,   [a_7, b_7) = [g_7(0), g_7(0.5)) = [0.4722, 0.47265),
g_8(z) = 0.4722 + 0.00045z,      x_8 = B,   [a_8, b_8) = [g_8(0.5), g_8(0.8)) = [0.472425, 0.47256),
g_9(z) = 0.472425 + 0.000135z,   x_9 = A,   [a_9, b_9) = [g_9(0), g_9(0.5)) = [0.472425, 0.4724925),
g_10(z) = 0.472425 + 0.0000675z, x_10 = A,  [a_10, b_10) = [g_10(0), g_10(0.5)) = [0.472425, 0.47245875).
The midpoint of the last interval is M = 0.472441875, whose binary expansion
starts with 0.01111000111100011111,
but we just store the 16 bits 0111100011110001. In this example the arithmetic
code therefore uses 1.6 bits per symbol. In comparison the entropy is 1.49 bits
per symbol.
we need to show that b_k − a_k = p(x_1) ··· p(x_k). This follows from step 2 of algo-
rithm 5.19, since

b_k − a_k = g_k(F(x_k)) − g_k(L(x_k))
         = (F(x_k) − L(x_k))(b_{k−1} − a_{k−1})
         = p(x_k) p(x_1) ··· p(x_{k−1}).
Figure 5.6. The two situations that can occur when determining the number of bits in the
arithmetic code: the interval [a_m, b_m) with its midpoint M and the code C(x), shown
together with the nearby numbers (k − 1)/2^λ, (2k − 1)/2^{λ+1} and k/2^λ in the first
situation, and k/2^λ, (2k + 1)/2^{λ+1} and (k + 1)/2^λ in the second.
Let D_λ denote the set of numbers of the form j/2^λ with j an integer in the
range 0 ≤ j < 2^λ. At least one of them, say k/2^λ, must lie in the interval
[a_m, b_m), since the distance between neighbouring numbers in D_λ is 1/2^λ, which
is at most equal to b_m − a_m. Denote the midpoint of [a_m, b_m) by M. There are
two situations to consider, which are illustrated in figure 5.6.
In the first situation, shown in the top part of the figure, the number k/2^λ
is larger than M and there is no number in D_λ in the interval [a_m, M]. If we
form the approximation M̃ to M by only keeping the first λ binary digits, we
obtain the number (k − 1)/2^λ in D_λ that is immediately to the left of k/2^λ. This
number may be smaller than a_m, as shown in the figure. To make sure that the
arithmetic code ends up in [a_m, b_m) we therefore use one more binary digit and
set C(x) = (2k − 1)/2^{λ+1}, which corresponds to keeping the first λ + 1 binary digits
in M.
In the second situation there is a number from D_λ in [a_m, M] (this was the
case in example 5.16). If we now keep the first λ digits in M we would get C(x) =
k/2^λ. In this case algorithm 5.19 therefore gives an arithmetic code with one
more bit than necessary. In practice the arithmetic code will usually be at least
thousands of bits long, so an extra bit does not matter much.
Recall that each x i is one of the n symbols αi from the alphabet so by properties
of logarithms we have
log2(p(x_1)p(x_2) ··· p(x_m)) = Σ_{i=1}^{n} f(α_i) log2 p(α_i).

Dividing the number of bits ⌈−log2(p(x_1) ··· p(x_m))⌉ + 1 by the length m of the
text, we therefore find that the arithmetic code uses approximately
Σ_{i=1}^{n} (f(α_i)/m) log2(1/p(α_i)) bits per symbol, which approaches the entropy H
when the relative frequencies f(α_i)/m approach the probabilities p(α_i).
In other words, arithmetic coding gives compression rates close to the best pos-
sible for long texts.
Corollary 5.22. For long texts the number of bits per symbol required by the
arithmetic coding algorithm approaches the minimum given by the entropy,
provided the probability distribution of the symbols is correct.
Observation 5.23. Let [a, b] be a given interval with a < b. The function
h(y) = (y − a)/(b − a)
will map any number y in [a, b] to the interval [0, 1]. In particular the end-
points are mapped to the endpoints and the midpoint to the midpoint,
h(a) = 0,   h((a + b)/2) = 1/2,   h(b) = 1.
The decoding algorithm reverses this process. We assume that the length m of
the text, the probabilities, and therefore the functions F and L, are known, and
perform the following steps:

1. Set z_1 = C(x).

2. For k = 1, ..., m

(a) Find the integer i such that L(α_i) ≤ z_k < F(α_i) and set

[a_k, b_k) = [L(α_i), F(α_i)).

(b) Output x_k = α_i.

(c) Determine the linear function h_k(y) = (y − a_k)/(b_k − a_k).

(d) Set z_{k+1} = h_k(z_k).
The first step is to determine which of the intervals [L(α_i), F(α_i)) it is that contains
the arithmetic code z_1 = C(x). This requires a search among
the cumulative probabilities. When the index i of the interval is known, we
know that x_1 = α_i. The next step is to decide which subinterval of [a_1, b_1) =
[L(α_i), F(α_i)) contains the arithmetic code. If we stretch this interval out to
[0, 1) with the function h_1, we can identify the next symbol just as we did with
the first one. Let us see how this works by decoding the arithmetic code that we
computed in example 5.16.
Example 5.25 (Decoding of an arithmetic code). Suppose we are given the arith-
metic code 1001 from example 5.16 together with the probabilities p(0) = 0.8
and p(1) = 0.2. We also assume that the length of the text, the probabilities, and
how the probabilities were mapped into the interval [0, 1) are known; this is the
typical output of a program for arithmetic coding. Since we are going to do this
by hand, we start by converting the code to the decimal number z_1 = 0.1001 in
binary, which is 0.5625. This number lies in the interval [0, 0.8), so the first symbol
must be x_1 = 0. We then map the interval [0, 0.8) and the code back to [0, 1) with
the function h_1(y) = y/0.8, which gives

z_2 = h_1(z_1) = z_1/0.8 = 0.703125
relative to the new interval. This number lies in the interval [0, 0.8) so the second
symbol is x 2 = 0. Once again we map the current interval and arithmetic code
back to [0, 1) with the function h 2 and obtain
z 3 = h 2 (z 2 ) = z 2 /0.8 = 0.87890625.
This number lies in the interval [0.8, 1), so our third symbol must be x_3 = 1. At
the next step we must map the interval [0.8, 1) to [0, 1). From observation 5.23
we see that this is done by the function h 3 (y) = (y −0.8)/0.2. This means that the
code is mapped to
z 4 = h 3 (z 3 ) = (z 3 − 0.8)/0.2 = 0.39453125.
This brings us back to the interval [0, 0.8), so the fourth symbol is x 4 = 0. This
time we map back to [0, 1) with the function h 4 (y) = y/0.8 and obtain
z 5 = h 4 (z 4 ) = 0.39453125/0.8 = 0.493164.
Since we remain in the interval [0, 0.8) the fifth and last symbol is x 5 = 0, so the
original text was ’00100’.
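The decoding steps of this example can be collected in a small Python sketch (again our own). As in example 5.25 we assume that the number of symbols in the original text, the probabilities and their ordering within [0, 1) are known.

from fractions import Fraction

def arithmetic_decode(code_bits, p, m):
    """Decode an arithmetic code given as a bit string.

    p maps each symbol to its probability (in the same order that was
    used for coding) and m is the number of symbols in the original text.
    """
    # Convert the bit string to the number z_1 = 0.code_bits in binary.
    z = sum(Fraction(int(bit), 2 ** (k + 1)) for k, bit in enumerate(code_bits))

    # Cumulative probabilities, as in the coding algorithm.
    intervals = []
    cumulative = Fraction(0)
    for symbol, prob in p.items():
        intervals.append((symbol, cumulative, cumulative + Fraction(prob)))
        cumulative += Fraction(prob)

    text = ""
    for _ in range(m):
        for symbol, low, high in intervals:
            if low <= z < high:
                text += symbol
                z = (z - low) / (high - low)   # stretch [low, high) back to [0, 1)
                break
    return text

# The code 1001 from example 5.25, with p(0) = 0.8 and p(1) = 0.2, text length 5.
p = {"0": Fraction(4, 5), "1": Fraction(1, 5)}
print(arithmetic_decode("1001", p, 5))         # prints 00100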
The intervals [a_k, b_k) become extremely narrow for long texts, so very high
numerical precision is needed if standard floating-point arithmetic is
being used, but there are now good algorithms for handling this. The basic idea
is to organise the computations of the endpoints of the intervals in such a way
that early digits are not influenced by later ones. It is then sufficient to only work
with a limited number of digits at a time (for example 32 or 64 binary digits). The
details of how this is done are rather technical, though.
Since the compression rate of arithmetic coding is close to the optimal rate
predicted by the entropy, one would think that it is often used in practice. How-
ever, arithmetic coding is protected by many patents which means that you have
to be careful with the legal details if you use the method in commercial software.
For this reason, many prefer to use other compression algorithms without such
restrictions, even though these methods may not perform quite so well.
In long texts the frequency of the symbols may vary within the text. To com-
pensate for this it is common to let the probabilities vary. This does not cause
problems as long as the coding and decoding algorithms compute and adjust
the probabilities in exactly the same way.
We will discuss compression in more detail in the context of digital sound and
images in later chapters, but want to mention two general-purpose programs for
lossless compression here.
5.6.1 Compress
The program compress is a widely used compression program on UNIX plat-
forms which first appeared in 1984. It uses the LZW algorithm. After the pro-
gram was published it turned out that part of the algorithm was covered by a
patent.
5.6.2 gzip
To avoid the patents on compress, the alternative program gzip appeared in
1992. This program is based on the LZ77 algorithm, but uses Huffman coding
to encode the pairs of numbers. Although gzip was originally developed for
the Unix platform, it has now been ported to most operating systems, see
www.gzip.org.
Exercises
5.1 In this exercise we are going to use Huffman coding to encode the text
’There are many people in the world’, including the spaces.
5.2 We can generalise Huffman coding to numeral systems other than the bi-
nary system.
5.3 In this exercise we are going to do Huffman coding for the text given by
x = {ABACABCA}.
5.4 Recall from section 4.3.1 in chapter 4 that ASCII encodes the 128 most
common symbols used in English with seven-bit codes. If we denote the
alphabet by A = {α_1, α_2, ..., α_128}, the codes are

c(α_1) = 0000000,  c(α_2) = 0000001,  ...,  c(α_128) = 1111111.
Explain how these codes can be associated with a certain Huffman tree.
What are the frequencies used in the construction of the Huffman tree?
x = {AAAAAAABAA}
5.6 The four-symbol alphabet A = {A, B,C , D} is used throughout this exer-
cise. The probabilities are given by p(A) = p(B ) = p(C ) = p(D) = 0.25.
a) Compute the information entropy for this alphabet with the given
probabilities.
b) Construct the Huffman tree for the alphabet. How many bits per
symbol are required if you use Huffman coding with this alphabet?
c) Suppose now that we have a text x = {x 1 , . . . , x m } consisting of m sym-
bols taken from the alphabet A . We assume that the frequencies of
the symbols correspond with the probabilities of the symbols in the
alphabet.
How many bits does arithmetic coding require for this sequence and
how many bits per symbol does this correspond to?
d) The Huffman tree you obtained in (b) is not unique. Here we will fix
a tree so that the Huffman codes are
5.7 The three-symbol alphabet A = {A, B,C } with probabilities p(A) = 0.1,
p(B ) = 0.6 and p(C ) = 0.3 is given. A text x of length 10 has been encoded
by arithmetic coding and the code is 1001101. What is the text x?
5.8 We have the two-symbol alphabet A = {A, B } with p(A) = 0.99 and p(B ) =
0.01. Find the arithmetic code of the text
AAA ··· AAAB,

where the symbol A is repeated 99 times.
5.9 The two linear functions in observations 5.17 and 5.23 are special cases of
a more general construction. Suppose we have two nonempty intervals
[a, b] and [c, d ], find the linear function which maps [a, b] to [c, d ].
Check that your solution is correct by comparing with observations 5.17
and 5.23.