Implementation Details and Examples: Variable-Length Entropy Encoding Lossless Data Compression
Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters are stored with fewer bits and less frequently occurring characters are stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding such as Huffman coding in that, rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 <= n < 1.0.
Implementation

Equal probabilities
In the simplest case, the probability of each symbol occurring is equal. For example, consider a sequence taken from a set of three symbols, A, B, and C, each equally likely to occur. Simple block encoding would use 2 bits per symbol, which is wasteful: one of the bit variations is never used. A more efficient solution is to represent the sequence as a rational number between 0 and 1 in base 3, where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201 in base 3. The next step is to encode this ternary number using a fixed-point binary number of sufficient precision to recover it, such as 0.001011001 in binary; this is only 9 bits, 25% smaller than the naive block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers.
To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.
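To make the base-change idea concrete, here is a minimal Python sketch (illustrative, not from the original article; the SYMBOLS string and function names are our own) that encodes and decodes a sequence over the three-symbol alphabet using exact rational arithmetic:

```python
from fractions import Fraction

SYMBOLS = "ABC"  # three equally likely symbols

def encode(message):
    """Interpret the message as the digits of a base-3 fraction in [0, 1)."""
    value = Fraction(0)
    for i, ch in enumerate(message, start=1):
        value += Fraction(SYMBOLS.index(ch), 3 ** i)
    return value

def decode(value, length):
    """Recover `length` base-3 digits from the fraction."""
    out = []
    for _ in range(length):
        value *= 3
        digit = int(value)         # the integer part is the next digit
        out.append(SYMBOLS[digit])
        value -= digit
    return "".join(out)

v = encode("ABBCAB")               # Fraction(127, 729), i.e. 0.011201 in base 3
print(decode(v, 6))                # -> ABBCAB
```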
Defining a model
In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities (the optimal value is -log2(P) bits for each symbol of probability P; see the source coding theorem). Compression algorithms that use arithmetic coding start by determining a model of the data: basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be. Example: a simple, static model for describing the output of a particular monitoring instrument over time might be:
- 60% chance of symbol NEUTRAL
- 20% chance of symbol POSITIVE
- 10% chance of symbol NEGATIVE
- 10% chance of symbol END-OF-DATA. (The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)
Models can also handle alphabets other than the simple four-symbol set chosen for this example. More sophisticated models are also possible: higher-order modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the context), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be adaptive, so that they continuously change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.

Encoding and decoding: overview
In general, each step of the encoding process, except for the very last, is the same; the encoder has basically just three pieces of data to consider:
- The next symbol that needs to be encoded
- The current interval (at the very start of the encoding process, the interval is set to [0,1), but that will change)
- The probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step)
The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step. Example: for the four-symbol model above:
- the interval for NEUTRAL would be [0, 0.6)
- the interval for POSITIVE would be [0.6, 0.8)
- the interval for NEGATIVE would be [0.8, 0.9)
- the interval for END-OF-DATA would be [0.9, 1).
When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval. It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval.
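The interval-narrowing step described above is easy to state in code. The following is a speculative sketch (names and structure are ours, not the article's) of an exact encoder for the four-symbol model, using Python Fractions so no precision is lost:

```python
from fractions import Fraction

# the static model from the text: symbol -> probability
MODEL = {"NEUTRAL": Fraction(6, 10), "POSITIVE": Fraction(2, 10),
         "NEGATIVE": Fraction(1, 10), "END-OF-DATA": Fraction(1, 10)}

def encode(symbols):
    low, width = Fraction(0), Fraction(1)   # current interval [low, low+width)
    for s in symbols:
        # divide the current interval proportionally to the model...
        for sym, p in MODEL.items():
            if sym == s:                    # ...and keep the sub-interval
                width *= p                  # matching the actual symbol
                break
            low += width * p
    return low, low + width

lo, hi = encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"])
print(float(lo), float(hi))   # 0.534 0.54 -- the fraction 0.538 lies inside
```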
Encoding and decoding: example

[Figure: decoding of 0.538 (the circular point) in the example model. The region is divided into subregions proportional to symbol frequencies, then the subregion containing the point is successively subdivided in the same way.]
Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary, and assuming that there are only as many digits as needed to decode the message). The process starts with the same interval used by the encoder, [0,1), and, using the same model, divides it into the same four sub-intervals that the encoder must have. The fraction 0.538 falls into the sub-interval for NEUTRAL, [0, 0.6); this indicates that the first symbol the encoder read must have been NEUTRAL, so this is the first symbol of the message. Next divide the interval [0, 0.6) into sub-intervals:
- the interval for NEUTRAL would be [0, 0.36) -- 60% of [0, 0.6)
- the interval for POSITIVE would be [0.36, 0.48) -- 20% of [0, 0.6)
- the interval for NEGATIVE would be [0.48, 0.54) -- 10% of [0, 0.6)
- the interval for END-OF-DATA would be [0.54, 0.6) -- 10% of [0, 0.6)
Since .538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE. Again divide our current interval into sub-intervals:
- the interval for NEUTRAL would be [0.48, 0.516)
- the interval for POSITIVE would be [0.516, 0.528)
- the interval for NEGATIVE would be [0.528, 0.534)
- the interval for END-OF-DATA would be [0.534, 0.540).
Now .538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.
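The decoder mirrors the encoder: at each step it locates the sub-interval containing the received fraction. A companion sketch to the encoder above, using the same hypothetical MODEL table:

```python
from fractions import Fraction

MODEL = {"NEUTRAL": Fraction(6, 10), "POSITIVE": Fraction(2, 10),
         "NEGATIVE": Fraction(1, 10), "END-OF-DATA": Fraction(1, 10)}

def decode(fraction):
    """Repeatedly find the sub-interval containing `fraction`, emit the
    corresponding symbol, and rescale, until END-OF-DATA appears."""
    out = []
    low, width = Fraction(0), Fraction(1)
    while True:
        for sym, p in MODEL.items():
            if low <= fraction < low + width * p:   # point falls in this band
                out.append(sym)
                width *= p
                break
            low += width * p
        if out[-1] == "END-OF-DATA":
            return out

print(decode(Fraction(538, 1000)))
# ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']
```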
Sources of inefficiency
The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is approximately 9.966 bits; the same message could have been encoded in the binary fraction 0.10001010 (equivalent to 0.5390625 decimal) at a cost of only 8 bits. (The final zero must be specified in the binary fraction, or else the message would be ambiguous without external information such as compressed stream size.)
This 8-bit output is larger than the information content, or entropy, of the message, which is 1.57 x 3 or 4.71 bits. The large difference between the example's 8 (or 7 with external compressed data size information) bits of output and the entropy of 4.71 bits is caused by the short example message not being able to exercise the coder effectively. The claimed symbol probabilities were [0.6, 0.2, 0.1, 0.1], but the actual frequencies in this example are [0.33, 0, 0.33, 0.33]. If the intervals are readjusted for these frequencies, the entropy of the message would be 1.58 bits per symbol and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); and a binary interval of [0.00101111011, 0.00111000111). This could yield an output message of 111, or just 3 bits. This is also an example of how statistical coding methods like arithmetic encoding can produce an output message that is larger than the input message, especially if the probability model is off.
Adaptive arithmetic coding
One advantage of arithmetic coding over other similar methods of data compression is the convenience of adaptation. Adaptation is the changing of the frequency (or probability) tables while processing the data. The decoded data matches the original data as long as the frequency table in decoding is updated in the same way and in the same step as in encoding. The synchronization is usually based on a combination of symbols occurring during the encoding and decoding process. Adaptive arithmetic coding significantly improves the compression ratio compared to static methods; it may be as much as two to three times as effective.
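A minimal illustration of the idea: both encoder and decoder can derive the current probability table from the symbols processed so far, for example with add-one (Laplace) smoothed counts, so they stay synchronized without transmitting the table. This helper is hypothetical, not from the article:

```python
from collections import Counter
from fractions import Fraction

def adaptive_probabilities(history, alphabet):
    """Probability table from the counts seen so far (add-one smoothing
    keeps every symbol's probability nonzero). Called with the same
    history by the encoder and the decoder, it returns the same table,
    which keeps the two sides in sync."""
    counts = Counter(history)
    total = len(history) + len(alphabet)
    return {s: Fraction(counts[s] + 1, total) for s in alphabet}

print(adaptive_probabilities("AABA", "AB"))
# {'A': Fraction(2, 3), 'B': Fraction(1, 3)}
```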
Precision and renormalization
The above explanations of arithmetic coding contain some simplification. In particular, they are written as if the encoder first calculated the fractions representing the endpoints of the interval in full, using infinite precision, and only converted the fraction to its final form at the end of encoding. Rather than try to simulate infinite precision, most arithmetic coders instead operate at a fixed limit of precision which they know the decoder will be able to match, and round the calculated fractions to their nearest equivalents at that precision. The following example shows how this would work if the model called for the interval [0,1) to be divided into thirds, approximated with 8-bit precision. Since the precision is now known, so are the binary ranges we will be able to use.
Symbol  Probability  Interval reduced to 8-bit    Interval reduced to 8-bit     Range in binary
                     precision (as fractions)     precision (in binary)
A       1/3          [0, 85/256)                  [0.00000000, 0.01010101)      00000000 - 01010100
B       1/3          [85/256, 171/256)            [0.01010101, 0.10101011)      01010101 - 10101010
C       1/3          [171/256, 1)                 [0.10101011, 1.00000000)      10101011 - 11111111
A process called renormalization keeps the finite precision from becoming a limit on the total number of symbols that can be encoded. Whenever the range is reduced to the point where all values in the range share certain beginning digits, those digits are sent to the output. The computer is then handling fewer digits than its precision allows, so the existing digits are shifted left, and at the right, new digits are added to expand the range as widely as possible. Note that this result occurs in two of the three cases from our previous example.
Symbol  Probability  Range                Digits that can be sent  Range after renormalization
A       1/3          00000000 - 01010100  0                        00000000 - 10101001
B       1/3          01010101 - 10101010  None                     01010101 - 10101010
C       1/3          10101011 - 11111111  1                        01010110 - 11111111
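The digit-shipping logic can be sketched as follows (an illustrative fragment under the 8-bit assumption of the table above; integer endpoints stand in for the binary ranges, and the function name is our own):

```python
def renormalize(low, high, bits=8):
    """Shift out the leading bits that low and high already share: a
    minimal sketch of the renormalization step, where low and high are
    integer endpoints of the current range at `bits` precision."""
    out = []
    top = 1 << (bits - 1)                 # mask for the leading bit
    mask = (1 << bits) - 1
    while (low & top) == (high & top):    # leading digits agree: emit them
        out.append(1 if low & top else 0)
        low = (low << 1) & mask           # shift left, new 0 at the right
        high = ((high << 1) & mask) | 1   # shift left, new 1 at the right
    return out, low, high

# Symbol A from the table: range 00000000-01010100 shares a leading 0
print(renormalize(0b00000000, 0b01010100))
# ([0], 0, 169): emitted '0'; new range 00000000-10101001, as in the table
```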
Arithmetic coding as a generalized change of radix
Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a generalized change of radix. For example, we may look at any sequence of symbols:
DABDDB
as a number in a certain base, presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer: A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:

Symbol  Frequency of occurrence  Cumulative frequency
A       1                        0
B       2                        1
D       3                        3
The cumulative frequency of a symbol is the total of the frequencies of all symbols below it in the frequency distribution (a running total of frequencies). In a positional numeral system the radix, or base, is numerically equal to the number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The radix is used to express any finite integer in polynomial form with presumed multipliers. For example, the number 457 is actually 4x10^2 + 5x10^1 + 7x10^0, where base 10 is presumed but not shown explicitly. Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:

    3x6^5 + 0x6^4 + 1x6^3 + 3x6^2 + 3x6^1 + 1x6^0 = 23671
The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the entropy of the message), which is approximately 9 bits. To encode a message with a length closer to the theoretical limit imposed by information theory we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds L and U and choose a number between them. For the computation of L we multiply each term in the above expression by the product of the frequencies of all previously occurring symbols:

    L = 6^5 x 3 + 6^4 x (3x0) + 6^3 x (3x1x1) + 6^2 x (3x1x2x3) + 6^1 x (3x1x2x3x3) + 6^0 x (3x1x2x3x3x1) = 25002

The difference between this polynomial and the polynomial above is that each term is multiplied by the product of the frequencies of all previously occurring symbols. More generally, L may be computed as:

    L = sum over i = 1..n of 6^(n-i) x C_i x (f_1 x f_2 x ... x f_(i-1))

where the C_i are the cumulative frequencies and the f_k are the frequencies of occurrences. Indexes denote the position of the symbol in the message. In the special case where all frequencies f_k are 1, this is the change-of-base formula. The upper bound U will be L plus the product of all frequencies; in this case U = L + (3 x 1 x 2 x 3 x 3 x 2) = 25002 + 108 = 25110. In general, U is given by:

    U = L + (f_1 x f_2 x ... x f_n)
Now we can choose any number from the interval [L, U) to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251x10^2. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes. To decode the integer 25100, the polynomial computation can be reversed as shown in the table below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.
Remainder  Identification    Identified symbol  Corrected remainder
25100      25100 / 6^5 = 3   D                  (25100 - 6^5 x 3) / 3 = 590
590        590 / 6^4 = 0     A                  (590 - 6^4 x 0) / 1 = 590
590        590 / 6^3 = 2     B                  (590 - 6^3 x 1) / 2 = 187
187        187 / 6^2 = 5     D                  (187 - 6^2 x 3) / 3 = 26
26         26 / 6^1 = 4      D                  (26 - 6^1 x 3) / 3 = 2
2          2 / 6^0 = 2       B
During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals and the appropriate symbol is selected from a lookup table. When the symbol is identified the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. The only difference compared to the classical change of base is that there may be a range of values associated with each symbol. In this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This is in exact accordance with our intervals, which are determined by the frequencies. When all intervals are equal to 1 we have a special case of the classic base change.
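Here is a sketch of both directions of the generalized change of radix, using the DABDDB example's frequencies. The function names and table layout are our own; the arithmetic follows the formulas above:

```python
def radix_encode(message, freq, cumfreq):
    """Compute the interval [L, U) for `message`, with base = len(message)."""
    n = len(message)
    low, fprod = 0, 1
    for i, sym in enumerate(message):
        low += fprod * cumfreq[sym] * n ** (n - 1 - i)  # term scaled by the
        fprod *= freq[sym]                              # earlier frequencies
    return low, low + fprod

def radix_decode(value, n, freq, cumfreq):
    """Reverse the polynomial: identify each symbol, subtract its term,
    then divide out its frequency."""
    out = []
    for i in range(n - 1, -1, -1):
        digit = value // n ** i
        # the symbol whose cumulative-frequency band contains `digit`
        sym = max((s for s in freq if cumfreq[s] <= digit),
                  key=lambda s: cumfreq[s])
        out.append(sym)
        value = (value - cumfreq[sym] * n ** i) // freq[sym]
    return "".join(out)

freq    = {"A": 1, "B": 2, "D": 3}
cumfreq = {"A": 0, "B": 1, "D": 3}
print(radix_encode("DABDDB", freq, cumfreq))   # (25002, 25110)
print(radix_decode(25100, 6, freq, cumfreq))   # DABDDB
```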
Theoretical limit of compressed message

The lower bound L never exceeds n^n, where n is the size of the message, and so can be represented in log2(n^n) = n log2(n) bits. After the computation of the upper bound U and the reduction of the message by selecting a number from the interval [L, U) with the longest trail of zeros, we can presume that this length can be reduced by log2(f_1 x f_2 x ... x f_n) bits. Since each frequency in the product occurs exactly as many times as the value of that frequency, we can use the size of the alphabet A for the computation of the product:

    product over k = 1..n of f_k = product over i = 1..A of f_i^(f_i)

Applying log2 to estimate the number of bits in the message, the final message (not counting a logarithmic overhead for the message length and frequency tables) will match the number of bits given by entropy, which for long messages is very close to optimal:

    n log2(n) - sum over i = 1..A of f_i log2(f_i) = -n x sum over i = 1..A of (f_i/n) log2(f_i/n) = n x H
Connections with other compression methods

Huffman coding
Main article: Huffman coding

There is great similarity between arithmetic coding and Huffman coding; in fact, it has been shown that Huffman is just a specialized case of arithmetic coding. But because arithmetic coding translates the entire message into one number represented in base b, rather than translating each symbol of the message into a series of digits in base b, it will sometimes approach optimal entropy encoding much more closely than Huffman can. In fact, a Huffman code corresponds closely to an arithmetic code where each of the frequencies is rounded to a nearby power of 1/2; for this reason Huffman deals relatively poorly with distributions where symbols have frequencies far from a power of 1/2, such as 0.75 or 0.375. This includes most distributions where there is either a small number of symbols (such as just the bits 0 and 1) or where one or two symbols dominate the rest. For an alphabet {a, b, c} with equal probabilities of 1/3, Huffman coding may produce the following code:
- a -> 0: implied probability 1/2
- b -> 10: implied probability 1/4
- c -> 11: implied probability 1/4
This code has an expected length of (2 + 2 + 1)/3, approximately 1.667 bits per symbol, for Huffman coding, an inefficiency of 5 percent compared to log2(3), approximately 1.585 bits per symbol, for arithmetic coding. For an alphabet {0, 1} with probabilities 0.625 and 0.375, Huffman encoding treats them as though they had 0.5 probability each, assigning 1 bit to each value, which does not achieve any compression over naive block encoding. Arithmetic coding, by contrast, approaches the optimal rate of

    -0.625 log2(0.625) - 0.375 log2(0.375), approximately 0.954 bits per symbol.

When the symbol 0 has a high probability of 0.95, the difference is much greater: the entropy is

    -0.95 log2(0.95) - 0.05 log2(0.05), approximately 0.286 bits per symbol,

yet Huffman coding still spends a full bit on each symbol, roughly 3.5 times the optimum.
One simple way to address this weakness is to concatenate symbols to form a new alphabet in which each symbol represents a sequence of symbols in the original alphabet. In the above example, grouping sequences of three symbols before encoding would produce new "super-symbols" with the following frequencies:
- 000: 85.7%
- 001, 010, 100: 4.5% each
- 011, 101, 110: 0.24% each
- 111: 0.0125%
With this grouping, Huffman coding averages 1.3 bits for every three symbols, or 0.433 bits per symbol, compared with one bit per symbol in the original encoding.

Range encoding
Main article: Range encoding

Range encoding is regarded by some engineers as a different technique and by others as merely a different name for arithmetic coding. There is no single accepted distinction. One view is that when processing is applied as one step per symbol it is range coding, and when one step is required per bit it is arithmetic coding. In another view, arithmetic coding is the computation of two boundaries on the interval [0,1) and the choice of the shortest fraction from it, while range encoding is the computation of boundaries on the interval [0, n^n) and the choice of the number with the longest trail of zeros from within it. Some researchers believe that this slight difference in approach makes range encoding patent-free. To support this idea they cite the article by G. Nigel N. Martin, which is terse and subject to interpretation. It is cited in Glen Langdon's article "An Introduction to Arithmetic Coding", IBM J. Res. Develop., Vol. 28, No. 2, March 1984, which makes the method suggested by Martin prior art recognized by an industry expert. It is close to the technique described at the start of this article, with the difference that both the LOW and HIGH limits are computed on every step and that probabilities, rather than frequencies, are used for narrowing down the interval. Martin's article escaped the attention of many researchers who were filing patents on arithmetic coding and describing their algorithms as building a long proper fraction, which put their patents at risk of being circumvented by those who do it differently, since a patent is a very formal document whose language must be precise. It does not follow that all patents on arithmetic coding are void in light of Martin's article, but it opens the ground for debate that could have been avoided had those authors at least mentioned the approach.
Huffman coding

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"; that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to design a Huffman code in linear time if the input probabilities (also known as weights) are sorted.

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.

Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e. a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent (e.g., "cat" is more common than "cta"). Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream. However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically distributed random variables, combining symbols together reduces inefficiency in a way that approaches optimality as the number of symbols combined increases.
History

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.[1] In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down.

Problem definition

Informal description

Given: A set of symbols and their weights (usually proportional to probabilities).
Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Formalized description

Input: An alphabet A = (a_1, a_2, ..., a_n), which is the symbol alphabet of size n, and a set W = (w_1, w_2, ..., w_n) of (positive) symbol weights (usually proportional to probabilities), i.e. w_i = weight(a_i), 1 <= i <= n.
Output: A code C(A, W) = (c_1, c_2, ..., c_n), the set of (binary) codewords, where c_i is the codeword for a_i, 1 <= i <= n.
Goal: Let L(C) = sum over i of w_i x length(c_i) be the weighted path length of code C. The condition is L(C) <= L(T) for any code T(A, W).

For example, for an alphabet of five symbols with the weights below, a Huffman code might assign the following codewords:
Input (A, W)   Symbol (a_i)                                      a      b      c      d      e      Sum
               Weights (w_i)                                     0.10   0.15   0.30   0.16   0.29   = 1
Output C       Codewords (c_i)                                   010    011    11     00     10
               Codeword length (l_i, in bits)                    3      3      2      2      2
               Contribution to weighted path length (w_i x l_i)  0.30   0.45   0.60   0.32   0.58   L(C) = 2.25
Optimality     Probability budget (2^-l_i)                       1/8    1/8    1/4    1/4    1/4    = 1.00
               Information content (-log2 w_i, in bits)          3.32   2.74   1.74   2.64   1.79
               Contribution to entropy (-w_i log2 w_i)           0.332  0.411  0.521  0.423  0.518  H(A) = 2.205
For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated null probabilities) to make the code complete while keeping it biunique. As defined by Shannon (1948), the information content h (in bits) of each symbol a_i with non-null probability is

    h(a_i) = log2(1 / w_i) = -log2(w_i)
The entropy H (in bits) is the weighted sum, across all symbols a_i with non-zero probability w_i, of the information content of each symbol:

    H(A) = sum over i with w_i > 0 of w_i x h(a_i) = -sum over i with w_i > 0 of w_i log2(w_i)

(Note: a symbol with zero probability has zero contribution to the entropy, since the limit of w log2(w) as w approaches 0 from above is 0. So for simplicity, symbols with zero probability can be left out of the formula above.)
As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon. Note that, in general, a Huffman code need not be unique, but it is always one of the codes minimizing L(C).

Basic technique

Compression
A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A binary tree is generated from left to right, taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:
Symbol  Code
a1      0
a2      10
a3      110
a4      111
The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n - 1 internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.

The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent; then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probabilities. With the previous 2 nodes merged into one node (and thus no longer considered), and with the new node now under consideration, the procedure is repeated until only one node remains: the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:

1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest probability) from the queue.
   2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
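As an illustration of the priority-queue construction, here is a compact Python sketch using the standard heapq module (our own code, not a reference implementation; tie-breaking, and therefore the exact bit patterns, may differ from the table above while the code lengths remain optimal):

```python
import heapq

def huffman_code(weights):
    """Build a Huffman code from a {symbol: weight} dict using a priority
    queue. A tree is either a symbol (leaf) or a (left, right) tuple."""
    heap = [(w, i, sym) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)                  # keeps comparisons away from trees
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two lowest-weight nodes
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol alphabet edge case
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}))
# {'a1': '0', 'a4': '100', 'a3': '101', 'a2': '11'}: code lengths
# 1, 2, 3, 3 give an average of 1.85 bits/symbol, as in the example
```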
Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n - 1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), with combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:

1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order, so that the least likely item is at the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.
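The two-queue method can likewise be sketched in a few lines; this hypothetical version assumes the (weight, symbol) pairs arrive already sorted by increasing weight, and it breaks ties in favor of the first queue, which also minimizes variance as noted below:

```python
from collections import deque

def huffman_tree(sorted_weights):
    """Linear-time Huffman tree for (weight, symbol) pairs sorted by
    increasing weight; returns a nested-tuple tree."""
    leaves = deque(sorted_weights)
    internal = deque()                        # combined weights, in order
    def pop_min():
        # the smallest remaining weight is at the front of one of the two
        # queues; ties go to the first (leaf) queue to minimize variance
        if not internal or (leaves and leaves[0][0] <= internal[0][0]):
            return leaves.popleft()
        return internal.popleft()
    while len(leaves) + len(internal) > 1:
        w1, t1 = pop_min()
        w2, t2 = pop_min()
        internal.append((w1 + w2, (t1, t2)))  # appended weights never decrease
    return pop_min()[1]

print(huffman_tree([(0.05, "a4"), (0.2, "a3"), (0.35, "a2"), (0.4, "a1")]))
# ('a1', (('a4', 'a3'), 'a2'))
```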
Although this algorithm may appear "faster" complexity-wise than the previous algorithm using a priority queue, this is not actually the case, because the symbols need to be sorted by probability beforehand, a process that takes O(n log n) time in itself. In many cases, time complexity is not very important in the choice of algorithm here, since n is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded), whereas complexity analysis concerns the behavior when n grows to be very large.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.

[Figure: Huffman tree built from the French string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs"]
Decompression

Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value). Before this can take place, however, the Huffman tree must be somehow reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed (and even statistically adjusted on each compression cycle) and thus reused every time, at the expense of at least some measure of compression efficiency. Otherwise, the information to reconstruct the tree must be sent a priori.

A naive approach might be to prepend the frequency count of each character to the compression stream. Unfortunately, the overhead in such a case could amount to several kilobytes, so this method has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B x 2^B bits of information (where B is the number of bits per symbol). Another method is to simply prepend the Huffman tree, bit by bit, to the output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered the tree-building routine simply reads the next 8 bits to determine the character value of that particular leaf. The process continues recursively until the last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed. The overhead using such a method ranges from roughly 2 to 320 bytes (assuming an 8-bit alphabet). Many other techniques are possible as well. In any case, since the compressed data can include unused "trailing bits", the decompressor must be able to determine when to stop producing output. This can be accomplished by either transmitting the length of the decompressed data along with the compression model or by defining a special code symbol to signify the end of input (the latter method can adversely affect code length optimality, however).
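The bit-by-bit tree walk described above is straightforward; here is a small sketch (our own, using nested tuples for internal nodes and the 0-left/1-right convention):

```python
def huffman_decode(bits, root):
    """Translate a prefix-code bit string by walking the tree; reaching
    a leaf emits its symbol and restarts the walk at the root."""
    out, node = [], root
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if not isinstance(node, tuple):   # leaf reached
            out.append(node)
            node = root
    return out

# tree for the earlier four-symbol code: a1=0, a2=10, a3=110, a4=111
tree = ("a1", ("a2", ("a3", "a4")))
print(huffman_decode("010110111", tree))  # ['a1', 'a2', 'a3', 'a4']
```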
Main properties
The probabilities used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. (This variation requires that a frequency table or other hint as to the encoding must be stored with the compressed text; implementations employ various tricks to store tables efficiently.)

Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. "Blocking", or expanding the alphabet size by grouping multiple symbols into "words" of fixed or variable length before Huffman coding, helps both to reduce that inefficiency and to take advantage of statistical dependencies between input symbols within the group (as in the case of natural language text). The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^-1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding; for the simple case of Bernoulli processes, Golomb coding is a provably optimal run-length code.

Arithmetic coding produces some gains over Huffman coding, although arithmetic coding has higher computational complexity. Also, arithmetic coding was historically a subject of some concern over patent issues. However, as of mid-2010, various well-known effective techniques for arithmetic coding have passed into the public domain as the early patents have expired.

Variations

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding and its variations is given by "Code and Parse Trees for Lossless Source Encoding"[1].

n-ary Huffman coding
The n-ary Huffman algorithm uses the {0, 1, ..., n - 1} alphabet to encode messages and build an n-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (n = 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable. Note that for n greater than 2, not all sets of source words can properly form an n-ary tree for Huffman coding. In this case, additional 0-probability placeholders must be added. This is because the tree must form an n-to-1 contractor; for binary coding, this is a 2-to-1 contractor, and any sized set can form such a contractor. If the number of source words is congruent to 1 modulo n - 1, then the set of source words will form a proper Huffman tree.

Adaptive Huffman coding
A variation called adaptive Huffman coding involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates.

Huffman template algorithm
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only that the weights form a totally ordered commutative monoid, meaning a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max over i of [w_i + length(c_i)], a problem first applied to circuit design [2].

Length-limited Huffman coding
Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem in linear or linearithmic time, unlike the presorted and unsorted conventional Huffman problems, respectively.

Huffman coding with unequal letter costs
In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing. Huffman coding with unequal letter costs is the generalization in which this assumption no longer holds: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding.

Optimal alphabetic binary trees (Hu-Tucker coding)
In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical: the codewords must be in the same lexicographic order as the symbols they encode, so an assignment that swaps that order is not allowed. This is also known as the Hu-Tucker problem, after the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem, which has some similarities to the Huffman algorithm but is not a variation of it. These optimal alphabetic binary trees are often used as binary search trees.

The canonical Huffman code

If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found by calculating these lengths, rendering Hu-Tucker coding unnecessary. The code resulting from numerically (re-)ordered input is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code corresponding to the example is {000, 001, 01, 10, 11}, which, having the same codeword lengths as the original solution, is also optimal.

Applications

Arithmetic coding can be viewed as a generalization of Huffman coding, in the sense that they produce the same output when every symbol has a probability of the form 1/2^k; in particular, arithmetic coding tends to offer significantly better compression for small alphabet sizes. Huffman coding nevertheless remains in wide use because of its simplicity and high speed. Intuitively, arithmetic coding can offer better compression than Huffman coding because its "code words" can have effectively non-integer bit lengths, whereas code words in Huffman coding can only have an integer number of bits. Therefore, there is an inefficiency in Huffman coding where a code word of length k only optimally matches a symbol of probability 1/2^k, and other probabilities are not represented as optimally; whereas the code word length in arithmetic coding can be made to exactly match the true probability of the symbol. Huffman coding today is often used as a "back-end" to some other compression methods. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding (or variable-length prefix-free codes with a similar structure, although perhaps not necessarily designed by using Huffman's algorithm).

Lempel–Ziv–Welch

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement, and has the potential for very high throughput in hardware implementations.[1]
Algorithm
The scenario described in Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence for which there is no code yet in the dictionary. The code for the sequence (without that character) is emitted, and a new code (for the sequence with that character) is added to the dictionary.

The idea was quickly adapted to other situations. In an image based on a color table, for example, the natural character alphabet is the set of color table indexes, and in the 1980s, many images had small color tables (on the order of 16 colors). For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as each code size is used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits).

Further refinements include reserving a code to indicate that the code table should be cleared (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate the end of data (a "stop code", typically one greater than the clear code). The clear code allows the table to be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.

Since the codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on which variety of LZW is being used: the size of the alphabet, the maximum code width, whether variable-width encoding is being used, the initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.

Encoding

A high-level view of the encoding algorithm is shown here:

1. Initialize the dictionary to contain all strings of length one.
2. Find the longest string W in the dictionary that matches the current input.
3. Emit the dictionary index for W to output and remove W from the input.
4. Add W followed by the next symbol in the input to the dictionary.
5. Go to Step 2.

A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string less the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.
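These steps translate almost directly into code. Here is a hypothetical encoder for the 27-symbol alphabet used in the example later in this article ('#' = 0, 'A'-'Z' = 1-26), emitting integer codes and leaving bit-width handling and packing aside:

```python
def lzw_encode(data):
    """A minimal sketch of the LZW encoder described above."""
    dictionary = {chr(ord("A") + i): i + 1 for i in range(26)}
    dictionary["#"] = 0
    next_code = 27
    out, w = [], ""
    for ch in data:
        if w + ch in dictionary:
            w += ch                      # keep growing the matched sequence
        else:
            out.append(dictionary[w])    # emit the code for the longest match
            dictionary[w + ch] = next_code
            next_code += 1
            w = ch                       # restart from the current character
    out.append(dictionary[w])            # flush the final sequence
    return out

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT#"))
# [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]
```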
In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.[2]

Decoding

The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the initialized dictionary. At the same time it obtains the next value from the input, and adds to the dictionary the concatenation of the string just output and the first character of the string obtained by decoding the next input value. The decoder then proceeds to the next input value (which was already read in as the "next value" in the previous pass) and repeats the process until there is no more input, at which point the final input value is decoded without any more additions to the dictionary. In this way the decoder builds up a dictionary which is identical to that used by the encoder, and uses it to decode subsequent input values. Thus the full dictionary does not need to be sent with the encoded data; just the initial dictionary containing the single-character strings is sufficient (and is typically defined beforehand within the encoder and decoder rather than being explicitly sent with the encoded data).
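A matching decoder sketch (again hypothetical, and without bit unpacking); note the branch handling a code that is not yet in the dictionary, the cScSc case analyzed later in this article:

```python
def lzw_decode(codes):
    """Companion to the encoder sketch: rebuilds the dictionary as it
    reads codes, staying exactly one entry behind the encoder."""
    dictionary = {i + 1: chr(ord("A") + i) for i in range(26)}
    dictionary[0] = "#"
    next_code = 27
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                            # code not yet in the table:
            entry = w + w[0]             # it must encode w + w[0] (cScSc case)
        out.append(entry)
        dictionary[next_code] = w + entry[0]   # complete the conjectured entry
        next_code += 1
        w = entry
    return "".join(out)

print(lzw_decode([20, 15, 2, 5, 15, 18, 14, 15, 20, 27,
                  29, 31, 36, 30, 32, 34, 0]))
# TOBEORNOTTOBEORTOBEORNOT#
```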
Variable-width codes
If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data, or they will disagree about where the boundaries between individual codes fall in the stream. In the standard version, the encoder increases the width from p to p + 1 when a sequence ω + s is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2^p (the first code requiring p + 1 bits). The encoder emits the code for ω at width p (since that code does not require p + 1 bits), and then increases the code width so that the next code emitted will be p + 1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it will generate an entry for code 2^p - 1. Since this is the point where the encoder will increase the code width, the decoder must increase the width here as well: at the point where it generates the largest code that will fit in p bits.

Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "Early Change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether Early Change is being used. Most graphic file formats do not use Early Change. When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.

Packing order
Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are LSB-First ("Least Significant Bit First") and MSB-First ("Most Significant Bit First"). In LSB-First packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-First packing aligns the first code so that its most significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with MSB going into the most significant bit not yet used in the current stream byte. GIF files use LSB-First packing order. TIFF files and PDF files use MSB-First packing order.

Example

The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency. The plaintext to be encoded (from an alphabet using only the capital letters) is:
TOBEORNOTTOBEORTOBEORNOT#
The # is a marker used to show that the end of the message has been reached. There are thus 26 symbols in the plaintext alphabet (the 26 capital letters A through Z), plus the stop code #. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for '#'. (Most flavors of LZW would put the stop code after the data alphabet, but nothing in the basic algorithm requires that. The encoder and decoder only have to agree what value it has.)

A computer will render these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values. As the dictionary grows, the codes will need to grow in width to accommodate the additional entries. A 5-bit code gives 2^5 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm will have to switch at that point from 5-bit strings to 6-bit strings (for all code values, including those which were previously output with only five bits). Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry will be labeled 32. (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.)

The initial dictionary, then, will consist of the following entries:
Symbol  Binary  Decimal
#       00000   0
A       00001   1
B       00010   2
C       00011   3
D       00100   4
E       00101   5
F       00110   6
G       00111   7
H       01000   8
I       01001   9
J       01010   10
K       01011   11
L       01100   12
M       01101   13
N       01110   14
O       01111   15
P       10000   16
Q       10001   17
R       10010   18
S       10011   19
T       10100   20
U       10101   21
V       10110   22
W       10111   23
X       11000   24
Y       11001   25
Z       11010   26

Encoding
Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character.
Current Sequence  Next Char  Output Code  Output Bits  Extended Dictionary  Comments
NULL              T
T                 O          20           10100        27: TO
O                 B          15           01111        28: OB
B                 E          2            00010        29: BE
E                 O          5            00101        30: EO
O                 R          15           01111        31: OR
R                 N          18           10010        32: RN               32 requires 6 bits, so use 6 bits for the next output
N                 O          14           001110       33: NO
O                 T          15           001111       34: OT
T                 T          20           010100       35: TT
TO                B          27           011011       36: TOB
BE                O          29           011101       37: BEO
OR                T          31           011111       38: ORT
TOB               E          36           100100       39: TOBE
EO                R          30           011110       40: EOR
RN                O          32           100000       41: RNO
OT                #          34           100010                            # stops the algorithm: send the current sequence,
                             0            000000                            then the stop code
Unencoded length = 25 symbols x 5 bits/symbol = 125 bits
Encoded length = (6 codes x 5 bits/code) + (11 codes x 6 bits/code) = 96 bits.

Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, allowing repeated words to be sent very compactly.

Decoding

To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed as they are always simply concatenations of previous entries.
Input Bits  Code  Output Sequence  New Dictionary Entry (Full)  Conjecture  Comments
10100       20    T                                             27: T?
01111       15    O                27: TO                       28: O?
00010       2     B                28: OB                       29: B?
00101       5     E                29: BE                       30: E?
01111       15    O                30: EO                       31: O?
10010       18    R                31: OR                       32: R?      32 requires 6 bits, so read the next code with 6 bits
001110      14    N                32: RN                       33: N?
001111      15    O                33: NO                       34: O?
010100      20    T                34: OT                       35: T?
011011      27    TO               35: TT                       36: TO?
011101      29    BE               36: TOB                      37: BE?
011111      31    OR               37: BEO                      38: OR?
100100      36    TOB              38: ORT                      39: TOB?
011110      30    EO               39: TOBE                     40: EO?
100000      32    RN               40: EOR                      41: RN?
100010      34    OT               41: RNO                      42: OT?
000000      0     #
At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added, because the encoder emitted X for χ precisely because χ + ? was not in the table, and the encoder goes ahead and adds it. But what is the missing letter? It is the first letter in the sequence coded by the next code Z that the decoder receives. So the decoder looks up Z, decodes it into the sequence ω and takes the first letter z and tacks it onto the end of χ as the next dictionary entry.

This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences. What happens if the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder just generated it, when emitting the previous code X for χ. Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:

1. The decoder sees X and then Z.
2. It knows X codes the sequence χ and Z codes some unknown sequence ω.
3. It knows the encoder just added Z to code χ + some unknown character,
4. and it knows that the unknown character is the first letter z of ω.
5. But the first letter of ω (= χ + ?) must then also be the first letter of χ.
6. So ω must be χ + x, where x is the first letter of χ.
7. So the decoder figures out what Z codes even though it's not in the table,
8. and upon receiving Z, the decoder decodes it as χ + x, and adds χ + x to the table as the value of Z.
This situation occurs whenever the encoder encounters input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary, but cSc is not. The encoder emits the code for cS, putting a new code for cSc into the dictionary. Next it sees cSc in the input (starting at the second c of cScSc) and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this. Although input of form cScSc might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.

Further coding
The simple scheme described above focuses on the LZW algorithm itself. Many applications apply further encoding to the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding; this will increase the encoded length and decrease the compression ratio. Conversely, increased compression can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities.

Uses

LZW compression became the first widely used universal data compression method on computers. A large English text file can typically be compressed via LZW to about half its original size. LZW was used in the public-domain program compress, which became a more or less standard utility in Unix systems circa 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using the LZ77-based DEFLATE algorithm, but as of 2008 at least FreeBSD includes both compress and uncompress as a part of the distribution. Several other popular compression utilities also used LZW, or closely related methods.

LZW became very widely used when it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF and PDF files. (Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.)