Lecture 35-37: Source Coding
1. Source symbols are encoded in binary.
2. The average codelength must be reduced.
3. Removing redundancy reduces the bit-rate.

Consider a discrete memoryless source with alphabet S = {s_0, s_1, ..., s_{K-1}}. Let the corresponding probabilities be {p_0, p_1, ..., p_{K-1}} and the codeword lengths be {l_0, l_1, ..., l_{K-1}}.
Then, the average codelength (average number of bits per symbol) of the source is defined as

\bar{L} = \sum_{k=0}^{K-1} p_k l_k
If L_min is the minimum possible value of \bar{L}, then the coding efficiency of the source is given by

\eta = \frac{L_{min}}{\bar{L}}

For an efficient code, \eta approaches unity.

The question: What is the smallest average codelength that is possible?
The answer: Shannon's source coding theorem. Given a discrete memoryless source of entropy H(S), the average codeword length \bar{L} of any lossless source coding scheme is bounded as

\bar{L} \geq H(S)

Since H(S) is the fundamental limit on the average number of bits per symbol, we can say L_min = H(S), and hence

\eta = \frac{H(S)}{\bar{L}}

Data compaction:
1. Removal of redundant information prior to transmission.
2. Lossless data compaction: no information is lost.
3. A source code which represents the output of a discrete memoryless source should be uniquely decodable.
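As a quick numerical illustration of these definitions (an added sketch, not part of the original notes), the following Python snippet computes H(S), \bar{L} and \eta for the four-symbol source of Table 1 below, assuming it is encoded with Code II, whose codeword lengths are 1, 2, 3, 3:

from math import log2

# Source of Table 1 (probabilities), encoded with Code II (assumed lengths 1, 2, 3, 3)
p = [0.5, 0.25, 0.125, 0.125]   # symbol probabilities p_k
l = [1, 2, 3, 3]                # codeword lengths l_k for Code II

H = sum(pk * log2(1 / pk) for pk in p)    # entropy H(S) in bits/symbol
L = sum(pk * lk for pk, lk in zip(p, l))  # average codeword length L-bar
eta = H / L                               # coding efficiency

print(f"H(S) = {H:.3f}, L = {L:.3f}, efficiency = {eta:.3f}")
# H(S) = 1.750, L = 1.750, efficiency = 1.000

Because the probabilities are exact powers of 1/2 and Code II matches the lengths -log2 p_k, this source attains the entropy bound exactly.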
Table 1: Illustrating the definition of a prefix code

Symbol   Prob. of occurrence   Code I   Code II   Code III
s_0      0.5                   0        0         0
s_1      0.25                  1        10        01
s_2      0.125                 00       110       011
s_3      0.125                 11       111       0111
(The table is reproduced from S. Haykin's book on Communication Systems.) From Table 1 we see that Code I is not a prefix code, Code II is a prefix code, and Code III is also uniquely decodable but not a prefix code. Prefix codes also satisfy the Kraft-McMillan inequality, which is given by

\sum_{k=0}^{K-1} 2^{-l_k} \leq 1
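As a small check (an added sketch, not from the original notes), the following Python snippet verifies the prefix property and the Kraft-McMillan sum for the three codes of Table 1:

codes = {
    "Code I":   ["0", "1", "00", "11"],
    "Code II":  ["0", "10", "110", "111"],
    "Code III": ["0", "01", "011", "0111"],
}

def is_prefix_code(words):
    # True if no codeword is a prefix of another codeword
    return not any(a != b and b.startswith(a) for a in words for b in words)

for name, words in codes.items():
    kraft = sum(2.0 ** -len(w) for w in words)
    print(f"{name}: prefix = {is_prefix_code(words)}, Kraft sum = {kraft:.3f}")

# Code I:   prefix = False, Kraft sum = 1.500  (violates the inequality)
# Code II:  prefix = True,  Kraft sum = 1.000
# Code III: prefix = False, Kraft sum = 0.938  (uniquely decodable but not prefix-free)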
Figure 1: Decision tree for Code II.

Given a discrete memoryless source of entropy H(S), a prefix code can be constructed with an average codeword length \bar{L} which is bounded as follows:

H(S) \leq \bar{L} < H(S) + 1     (1)

On the left-hand side of the above equation, equality is satisfied when every symbol s_k is emitted with probability

p_k = 2^{-l_k}     (2)
where l_k is the length of the codeword assigned to the symbol s_k. Hence, from Eq. 2, we have

\sum_{k=0}^{K-1} 2^{-l_k} = \sum_{k=0}^{K-1} p_k = 1     (3)
Under this condition, the Kraft-McMillan inequality tells us that a prefix code can be constructed such that the length of the codeword assigned to source symbol s_k is l_k = -\log_2 p_k. Therefore, the average codeword length is given by
\bar{L} = \sum_{k=0}^{K-1} l_k 2^{-l_k}     (4)

and the source entropy is

H(S) = \sum_{k=0}^{K-1} 2^{-l_k} \log_2\!\left(\frac{1}{2^{-l_k}}\right) = \sum_{k=0}^{K-1} l_k 2^{-l_k}     (5)
Hence, from Eqs. 4 and 5, the equality condition on the left side of Eq. 1, \bar{L} = H(S), is satisfied. To prove the inequality on the right side, we proceed as follows. Let \bar{L}_n denote the average codeword length of the extended prefix code, i.e. the prefix code constructed for the n-th extension of the source. For a uniquely decodable code, applying the bound of Eq. 1 to the extended source, whose entropy is nH(S), gives

nH(S) \leq \bar{L}_n < nH(S) + 1

so that H(S) \leq \bar{L}_n / n < H(S) + 1/n. As n grows, the average number of bits per original source symbol, \bar{L}_n / n, can therefore be made arbitrarily close to the entropy H(S).
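To illustrate the bound of Eq. 1 for a source whose probabilities are not powers of 1/2, the following sketch (an added example; the distribution 0.4, 0.3, 0.2, 0.1 is assumed, not taken from the notes) uses codeword lengths l_k = ceil(-log2 p_k), which satisfy the Kraft-McMillan inequality and give H(S) <= L-bar < H(S) + 1:

from math import ceil, log2

def shannon_lengths(p):
    # codeword lengths l_k = ceil(-log2 p_k)
    return [ceil(-log2(pk)) for pk in p]

p = [0.4, 0.3, 0.2, 0.1]                  # assumed example distribution
l = shannon_lengths(p)                    # [2, 2, 3, 4]

H = sum(pk * log2(1 / pk) for pk in p)    # entropy H(S)
L = sum(pk * lk for pk, lk in zip(p, l))  # average codeword length L-bar
kraft = sum(2 ** -lk for lk in l)         # Kraft-McMillan sum

print(f"Kraft sum = {kraft:.3f}")         # 0.688 <= 1
print(f"H(S) = {H:.3f} <= L = {L:.3f} < H(S)+1 = {H + 1:.3f}")
# H(S) = 1.846 <= L = 2.400 < H(S)+1 = 2.846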
Huffman Coding
1. The Huffman code is a prefix code.
2. The length of the codeword for each symbol is roughly equal to the amount of information it conveys.
3. The code need not be unique (see Figure 3).

A Huffman tree is constructed as shown in Figure 3; (a) and (b) represent two forms of Huffman trees. We see that both schemes have the same average length but different variances. The variance is a measure of the variability in the codeword lengths of a source code, and is defined as follows:

\sigma^2 = \sum_{k=0}^{K-1} p_k (l_k - \bar{L})^2     (6)
where p_k is the probability of the k-th symbol, l_k is the codeword length of the k-th symbol, and \bar{L} is the average codeword length. It is reasonable to choose the Huffman tree which gives the smaller variance; a small code sketch follows Figure 3 below.
Figure 3: Two forms of Huffman trees, (a) and (b). For tree (a): average length \bar{L} = 2.2, variance = 0.160, with codewords s_0 -> 10, s_1 -> 00, s_2 -> 01, s_3 -> 110, s_4 -> 111. Tree (b) assigns the same symbols a different set of codewords with the same average length.
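The following Python sketch (an added illustration, not the exact construction of Figure 3) builds a Huffman code with a priority queue and computes the average length and variance. The source probabilities 0.4, 0.2, 0.2, 0.1, 0.1 are assumed, chosen to be consistent with the average length 2.2 and variance 0.160 quoted for tree (a):

import heapq

def huffman_code(probs):
    # Build a Huffman code: repeatedly merge the two least probable nodes.
    # Each heap entry is (probability, tie-break index, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # left branch gets a 0
        merged.update({s: "1" + w for s, w in c2.items()})  # right branch gets a 1
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}  # assumed source
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())               # average length
var = sum(p * (len(code[s]) - L) ** 2 for s, p in probs.items())  # variance
print(code)
print(f"L = {L:.2f}, variance = {var:.3f}")   # L = 2.20, variance = 0.160 here;
                                              # the exact bit patterns depend on tie-breaking

Different but equally valid Huffman trees arise from how ties between equal probabilities are broken during the merging step, which is exactly why the average length is fixed while the variance can differ.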
Drawbacks:
1. Requires proper source statistics.
2. Cannot exploit relationships between words, phrases, etc.
3. Does not consider the redundancy of the language.
Lempel-Ziv Coding
1. Overcomes the drawbacks of Huffman coding.
2. It is an adaptive and simple encoding scheme.
3. When applied to English text it achieves a compaction of approximately 55%, in contrast to Huffman coding, which achieves only about 43%.
4. Encodes patterns in the text.

The algorithm parses the source data stream into segments that are the shortest subsequences not encountered previously. (The example below is reproduced from S. Haykin's book on Communication Systems; a code sketch follows the codebook table.)
Let the input sequence be 000101110010100101... We assume that 0 and 1 are known and already stored in the codebook.

Subsequences stored: 0, 1
Data to be parsed: 000101110010100101...

The shortest subsequence of the data stream encountered for the first time and not seen before is 00.

Subsequences stored: 0, 1, 00
Data to be parsed: 0101110010100101...

The second shortest subsequence not seen before is 01; accordingly, we go on to write

Subsequences stored: 0, 1, 00, 01
Data to be parsed: 01110010100101...

We continue in this manner until the given data stream has been completely parsed. The codebook is shown below:
Numerical position   Subsequence   Numerical representation   Binary encoded block
1                    0
2                    1
3                    00            11                          0010
4                    01            12                          0011
5                    011           42                          1001
6                    10            21                          0100
7                    010           41                          1000
8                    100           61                          1100
9                    101           62                          1101
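The parsing step can be sketched in a few lines of Python (an added example, not the full encoder; it omits the numerical representations and binary blocks and only reproduces the subsequence column of the codebook above):

def lz_parse(stream, codebook=("0", "1")):
    # Parse the stream into the shortest subsequences not encountered before,
    # starting from a codebook that already contains 0 and 1.
    book = list(codebook)
    phrase = ""
    for bit in stream:
        phrase += bit
        if phrase not in book:
            book.append(phrase)   # store the new shortest unseen subsequence
            phrase = ""
    return book

print(lz_parse("000101110010100101"))
# ['0', '1', '00', '01', '011', '10', '010', '100', '101']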