# Arithmetic coding for data compression.

ARITHMETIC CODING FOR DATA COMPRESSION Arithmetic coding is superior in most respects to the better-known Huffman [10] method. It represents information at least as compactly--sometimes considerably more so. Its performance is optimal without the need for blocking of input data. It encourages a clear separation between the model for representing data and the encoding of information with respect to that model. It accommodates adaptive models easily and is computationally efficient. Yet many authors and practitioners seem unaware of the technique. Indeed there is a widespread belief that Huffman coding cannot be improved upon.We aim to rectify this situation by presenting an accessible implementation of arithmetic coding and by detailing its performance characteristics. We start by briefly reviewing basic concepts of data compression and introducing the model-based approach that underlies most modern techniques. We then outline the idea of arithmetic coding using a simple example, before presenting programs for both encoding and decoding. In these programs the model occupies a separate module so that different models can easily be used. Next we discuss the construction of fixed and adaptive models and detail the compression efficiency and execution time of the programs, including the effect of different arithmetic word lengths on compression efficiency. Finally, we outline a few applications where arithmetic coding is appropriate.

DATA COMPRESSION

To many, data compression conjures up an assortment of ad hoc techniques such as conversion of spaces in text to tabs, creation of special codes for common words, or run-length coding of picture data (e.g., see [8]). This contrasts with the more modern model-based paradigm for coding, where, from an input string of symbols and a model, an encoded string is produced that is (usually) a compressed version of the input. The decoder, which must have access to the same model, regenerates the exact input string from the encoded string. Input symbols are drawn from some well-defined set such as the ASCII or binary alphabets; the encoded string is a plain sequence of bits. The model is a way of calculating, in any given context, the distribution of probabilities for the next input symbol. It must be possible for the decoder to produce exactly the same probability distribution in the same context. Compression is achieved by transmitting the more probable symbols in fewer bits than the less probable ones.

For example, the model may assign a predetermined probability to each symbol in the ASCII alphabet. No context is involved. These probabilities can be determined by counting frequencies in representative samples of text to be transmitted. Such a fixed model is communicated in advance to both encoder and decoder, after which it is used for many messages.

Alternatively, the probabilities that an adaptive model assigns may change as each symbol is transmitted, based on the symbol frequencies seen so far in the message. There is no need for a representative Using an array NODE: array [edge] of vertex we can store [alpha](e) in NODE[-e] and [omega](e) in NODE [+e] for every edge e. This allows all edge-list-oriented algorithms to be used on our structure by accessing the array NODE alone.

Conversely, the adjacency lists for every vertex v are made traversable by using array indexes as links: FIRST : array [vertex] of edge NEXT : array [edge] of edge

These arrays link all edges incident with a vertex v to v by using the chain of indexes starting at FIRST [v], following the NEXT-entries, and ending with a zero-entry. The indexes are positive for edges going out of v and negative for edges going into v. Then, the nonzero entries in the FIRST/NEXT-arrays are the signed edges themselves, with their direction seen from the corresponding vertex. Figure 5 gives an example by showing a sample graph and its array representation. Figure 6 shows how the traversal procedures of Figure 2 can be implemented using our structure, and Figure 7 gives the implementation of the auxiliary functions of Figure 3. (Note that in practical applications all generally usable procedures should check their arguments. These checks have been skipped to simplify the presentation.)

These implementations of the traversal functions lead to a complexity of for-loops over (for/backward) stars of a vertex v proportional to [gamma](v). Equally, for-loops over V and E have a complexity proportional to n and m, respectively. All the implementations of the auxiliary functions apparently use constant time.

The first/next-pair of functions enumerates self-loops twice. If this is not wanted, first and next should be modified to ignore negative edge values e, if NODE[e] = NODE[-e].

The graph module can of course be augmented by additional operations, if they are needed, such as procedures for determining degrees, for testing the existence of edges, etc. Note that is.sub.-.edge (v, w)-tests have a complexity at least proportional to min ([gamma](v), [gamma](w)).

RECONSTRUCTION AND COMPRESSION

The NODE-array alone (being an edge-list representation of the graph) already contains all incidence information. Thus, for example, for external storage, this array suffices. Figure 8 shows how the graph representation described in the previous section can be reconstructed from its NODE-array in linear time. This algorithm traverses the edge list in reverse order. It inserts each (positive) edge e at the front of the adjacency list of [alpha](e) and inserts -e to the list of [omega](e).

This algorithm is order preserving: If the NODE-array is compatible with all the adjacency lists, then the adjacency lists are reconstructed as they were before. On the other hand, if the NODE-array gets sorted (e.g., according to some weight), then all adjacency lists are sorted as well--after reconstruction.

This procedure might also be used for simplifying graph input to simply reading an edge list. Graph output can be performed by just printing the edge list while traversing the NODE-array. If, however, the NODE-array order is not compatible with the orders of the adjacency lists, some topological sorting has to be done when the graph is written to an external medium. Otherwise, the original order on the stars cannot be reconstructed. (This incompatibility might occur, for instance, if the graph was constructed dynamically using the procedures presented in the next section.)

Figure 9 gives a linear time algorithm for compressing the graph structure into an edge list. But here we need some (linear) additional work space for queueing (up to m) normalized edges, and marking (up to 2m) signed edges. This algorithm prints an edge e only if its predecessors in the adjacency list of [alpha](e) as well as [omega](e) have all been printed. To decide this property, a marking of e and -e is used. A queue helps to keep track of all printable edges. This algorithm fails (by not printing all edges) if there is no NODE-array ordering that is compatible with the adjacency-list ordering.

DYNAMIC GRAPHS

Many applications use graphs whose size and structure change during execution. This implies a need for procedures to create and delete vertices and edges. The representation presented in this article can easily be extended for graphs with a (moderately) varying number of vertices and/or edges, as long as a maximum number (NMAX/MMAX) can be given for both. Figure 10 gives the procedures necessary for handling dynamic graphs.

To implement dynamic graphs, we link all unused (nonnegative) entries in the FIRST/NEXT arrays. Using the zero-entries in these arrays (which are unused up to now) as a start pointer, we can link the unused entries together following a stack discipline. To distinguish used from unused entries, we add a large value LARGE (preferably MMAX) to the link vallues in FIRST and NEXT, if we use them for chaining unused entries.

In this case the graph should be initialized as an empty graph using the following procedure: procedure init ( ); for v := 0 to NMAX do FIRST[v] := v + 1 + LARGE od; for e := 0 to MMAX do NEXT[-e] := 0; NEXT[e] := e + 1 + LARGE od; n := m := 0.

(In this case the procedures for traversing the vertex and edge sets have to be adapted.)

It is even possible to keep track of the order in which the edges are added to the vertices by using an additional array LAST : array[vertex] of edge to keep the last edge entry for every vertex. (This applies to forward and backward stars as well as to (undirected) stars.) In this case all loops varying over adjacency sets traverse these sets in the order of edge creation. This leads to ordered graphs, where the edges incident with every vertex are linearly ordered. Figure 11 shows how the creation/deletion procedures are implemented; two auxiliary procedures are given in Figure 12.

The create procedures use constant time. The complexity of delete.sub.-.vertex is proportional to [gamma](v). The delete.sub.-.edge(e) procedure uses time proportional to [gamma]([alpha](e)) + [gamma]([omega](e)). It could be made constant time by using double chains on edges, but this does not seem worthwhile. sample of text, because each message is treated as an independent unit, starting from scratch. The encoder's model changes with each symbol transmitted, and the decoder's changes with each symbol received, in sympathy.

More complex models can provide more accurate probabilistic predictions and hence achieve greater compression. For example, several characters of previous context could condition the next-symbol probability. Such methods have enabled mixed-case English text to be encoded in around 2.2 bits/character with two quite different kinds of model [4, 6]. Techniques that do not separate modeling from coding so distincly, like that of Ziv and Lempe [23], do not seem to show such great potential for compression, although they may be appropriate when the aim is raw speed rather than compression performance [22].

The effectiveness of any model can be measured by the entropy of the message with respect to it, usually expressed in bits/symbol. Shannon's fundamental theorem of coding states that, given messages randomly generated from a model, it is impossible to encode them into less bits (on average) than the entropy of that model [21].

A message can be coded with respect to a model using either Huffman or arithmetic coding. The former method is frequently advocated as the best possible technique for reducing the encoded data rate. It is not. Given that each symbol in the alphabet must translate into an integral number of bits in the encoding. Huffman coding indeed achieves "minimum redundancy". In other words, it performs optimally if all symbol probabilities are integral powers of 1/2. But this is not normally the case in practice: indeed, Huffman coding can take up to one extra bit per symbol. THe worst case is realized by a source in which one symbol has probability approaching unity. Symbols emanating from such a source convey negligible information on average, but require at least one bit to transmit [7]. Arithmetic coding dispenses with the restriction that each symbol must translate into an integral number of bits, thereby coding more efficiently. It actually achieves the theoretical entropy bound to compression efficiency for any source, including the one just mentioned.

In general, sophisticated models expose the deficiencies of Huffman coding more starkly than simple ones. This is because they more often predict symbols with probabilities close to one, the worst case for Huffman coding. For example, the techniques mentioned above that code English text in 2.2 bits/character both use arithmetic coding as the final step, and performance would be impacted severely if Huffman coding were substituted. Nevertheless, since our topic is coding and not modelling, the illustrations in this article all employ simple models. Even so, as we shall see, Huffman coding is inferior to arithmetic coding.

The basic concept of arithmetic coding can be traced back to Elias in the early 1960s (see [1, pp.61-62]). Practical techniques were first introduced by Rissanen [16] and Pasco [15], and developed further by Rissanen [17]. Details of the implementation presented here have not appeared in the literature before; Rubin [20] is closest to our approach. The reader interested in the broader class of arithmetic codes is referred to [18]; a tutorial is available in [13]. Despite these publications, the method is not widely known. A number of recent books and papers on data compression mention it only in passing, or not at all.

THE IDEA OF ARITHMETIC CODING

In arithmetic coding, a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. The more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits to the message.

Before anything is transmitted, the range for the message is the entire interval [0, 1], denoting the half-open interval 0[is less than or =] x <1. As each symbol is processed, the range is narrowed to that portion of it allocated to the symbol. For example, suppose the alphabet is (a, e, i, o, u, !), and a fixed model is used with probabilities shown in Table I. Imagine transmitting the message eaii!. Initially, both encoder and decoder know that the range is [0, 1]. After seeing the first symbol, e, the encoder narrows it to [0.2, 0.5], the range the model allocates to his symbol. The second symbol, [alpha], will narrow this new range to the first one-fifth of it, since [alpha] has been allocated [0, 0.2]. This produces [0.2, 0.26], since the previous range was 0.3 units long and one-fifth of that is 0.06. The next symbol, i, is allocated [0.5, 0.6], which when applied to [0.2, 0.26] gives the smaller range [0.23, 0.236]. Proceeding in this way, the encoded message builds up as follows: Initially [0, 1) After seeing e [0.2, 0.5) a [0.2, 0.26) i [0.23, 0.236) i [0.233, 0.2336) ! [0.23354, 0.2336)

Figure 1 shows another representation of the encoding process. The vertical bars with ticks represent the symbol probabilities stipulated by the model. After the first symbol has been processed, the model is scaled into the range [0.2, 0.5], as shown in Figure 1a. The second symbol scales it again into the range [0.2, 0.26]. But the picture cannot be continued in this way without a magnifying glass! Consequently, Figure 1b shows the ranges expanded to full height at every stage and marked with a scale that gives the endpoints as numbers.

Suppose all the decoder knows about the message is the final range, [0.23354, 0.2336]. It can immediately deduce that the first character was e, since the range lies entirely within the space the model of Table I allocates for e. Now it can simulate the operation of the encoder: Initially [0, 1] After seeing e [0,2, 0.5]

This makes it clear that the second character is [alpha], since this will produce the range. After seeing [alpha] [0.2, 0.26], which entirely encloses the given range [0.23354, 0.2336]. Proceeding like this, the decoder can identify the whole message.

It is not really necessary for the decoder to know both ends of the range produced by the encoder. Instead, a single number within the range--for example, 0.23355--will suffice. (Other numbers, like 0.23354, 0.23357, or even 0.23354321, would do just as well). However, the decoder will face the problem of detecting the end of the message, to determine when to stop decoding. After all, the single number 0.0 could represent any of [alpha, alpha, alpha, alpha], .... To resolve the ambiguity, we ensure that each message endw with a special EOF symbol known to both encoder and decoder. For the alphabet of TAble I,! will be used to terminate messages, and only to terminate messages. When the decoder sees this symbol, it stops decoding.

Relative to the fixed model of Table I, the entropy of the five-symbol message eaii! is -log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1 = -log 0.00006 [is approx. or =] 4.22 (using base 10, since the above encoding was performed in decimal). This explains why it takes five decimal digits to encode the message. In fact, the size of the final range is 0.2336 - 0.23354 = 0.00006, and the entropy is the negative logarithm of this figure. Of course, we normally work in binary, transmitting binary digits and measuring entropy in bits.

Five decimal digits seems a lot to encode a message comprising four vowels! It is perhaps unfortunate that our example ended up by expanding rather than compressing. Needless to say, however, different models will give different entropies. The best single-character model of the message eaii! is the set of symbol frequencies [e(0.2), a(0.2), i(0.4), !(0.2)], which gives an entropy of 2.89 decimal digits. Using this model the encoding would be only three digits long. Moreover, as noted earlier, more sophisticated models give much better performance in general.

A PROGRAM FOR ARITHMETIC CODING

Figure 2 shows a psuedocode fragment that summarizes the encoding and decoding procedures developed in the last section. Symbols are numbered 1, 2, 3, .... The frequency range for the ith symbol is from cum.sub. .freq[i] to cum.sub. freq[i - 1]. As i decreases, cum.sub. .freq[i] increases, and cum.sub. .freq[0] = 1. (The reason for this "backwards" convention is that cum.sub. .freq[0] will later contain a normalizing factor, and it will be convenient to have it begin the array.) The "current interval" is [low, high), and for both encoding and decoding, this should be initialized to [0, 1).

Unfortunately, Figure 2 is overly simplistic. In practice, there are several factors that complicate both encoding and decoding:

Incremental transmission and reception. The encode algorithm as described does not transmit anything until the entire message has been encoded; neither does the decode algorithm begin decoding until it has received the complete transmission. In most applications an incremental mode of operation is necessary.

The desire to use integer arithmetic. The precision required to represent the [low, high) interval grows with the length of the message. Incremental operation will help overcome this, but the potential for overflow nad underflow must still be examined carefully.

Representing the model so that it can be consulted efficiently. The representation used for the model should minimize the time required for the decode algorithm to identify the next symbol. Moreover, an adaptive model should be organized to minimize the time-consuming task of maintaining cumulative frequencies.

Figure 3 shows working code, in C, for arithmetic encoding and decoding. It is considerably more detailed than the bare-bones sketch of Figure 2! Implementations of two different models are given in Figure 4; the Figure 3 code can use either one.

The remainder of this section examines the code of Figure 3 more closely, and includes a proof that decoding is still correct in the integer implementation and a review of constraints on word lenghts in the program.

Representing the Model

Implementations of models are discussed in the next section; here we are concerned only with the interface to the model (lines 20-38). In C, a byte is represented as an integer between 0 and 255 (a char). Internally, we represent a byte as an integer between 1 and 257 inclusive (an index), EOF being treated as a 257th symbol. It is advantageous to sort the model into frequency order, so as to minimize the number of executions of the decoding loop (line 189). To permit such reordering, the Char/index translation is implemented as a pair of tables, index.sub. .to.sub. .char[] and char.sub. .to.sub. .index[]. In one of our models, these tables simply form the index by adding 1 to the char, but another implements a more complex translation that assigns small indexes to frequently used symbols.

The probabilities in the model are represented as integer frequency counts, and cumulative counts are stored in the array cum.sub. .freq[]. As previously, this array is "backwards," and the total frequency count, which is used to normalize all frequencies, appears in cum.sub. .freq[0]. Cumulative counts must not exceed a predetermined maximum, Max.sub. .frequency, and the model implementation must prevent overflow by scaling appropriately. It must also ensure that neighboring values in the cum.sub. .freq[] aray differ by at least 1; otherwise the affected symbol cannot be transmitted.

Incremental Transmission and Reception

Unlike Figure 2 the program in Figure 3 represents low and high as integers. A special data type, code.sub. .value, is defined for these quantities, together with some useful constants: Top.sub. value, representing the largest possible code.sub. .value, and First.sub. .qtr, Half, and Third.sub. .qtr, representing parts of the range (lines 6-16). Whereas in Figure 2 the current interval is represented by [low, high), in Figure 3 it is [low, high]; that is, the range now includes the value of high. Actually, it is more accurate (though more confusing) to say that, in the program in Figure 3, the interval represented is [low, high + 0.11111 ...). This is because when the bounds are scaled up to increase the precision, zeros are shifted into the low-order bits of low, but ones are shifted into high. Although it is possible to write the program to use a different convention, this one has some advantages in simplifying the code.

As the code range narrows, the top bits of low and high become the same. Any bits that are the same can be transmitted immediately, since they cannot be affected by future narrowing. For encoding, since we know that low [is less than or =] high, this requires code like for (;;) [ if (high < Half) [ output.sub.-bit(0); low = 2*low; high = 2*high+1; ] else if (low [is greater than or =] Half) [ output.sub.-.bit(1); low = 2*(low-Half); high = 2*(high-Half)+1; ] else break; ] which ensures that, upon completion, low < Half [is less than or =] high. This can be found in lines 95-113 of encode.sub. .symbol(), although there are some extra complications caused by underflow possibilities (see the next subsection). Care is taken to shift ones in at the bottom when high is scaled, as noted above.

Incremental reception is done using a number called value as in Figure 2, in which processed bits flow out the top (high-significance) end and newly received ones flow in the bottom. Initially, start.sub. .decoding() (lines 168-176) fills value with received bits. Once decode.sub. .symbol() has identified the next input symbol, it shifts out now-useless high-order bits that are the same in low and high, shifting value by the same amount (and replacing lost bits by fresh input bits at the bottom end): for (;;) [ if (high < Half) [ value = 2*value+input.sub. .bit( ); low = 2*low; high = 2*high+1; ] else if (low > Half) [ value = 2*(value-Half)+input.sub. .bit( ); low = 2*(low-Half); high = 2*(high-Half)+1; ] else break; ] (see lines 194-213, again complicated by precautions against underflow, as discussed below).

Proof of Decoding correctness

At this point it is worth checking that identification of the next symbol by decode.sub. .symbol() works properly. Recall from Figure 2 that decode.sub. .symbol() must use value to find the symbol that, when encoded, reduces the range to one that still includes value. Lines 186-188 in decode.sub. .symbol() identify the symbol for which cum freq[symbol] [is less than or =][ (value - low + 1) * cum freq[0] - 1 / high - low + 1] < cum freq[symbol - 1], where ( ) denotes the "integer part of" function that comes from integer division with truncation. It is shown in the Appendix that this implies low + [ (high -low +1) * cum freq[symbol] / cum freq[0]] [is less than or =] v [is less than or =] low +[ (high - low +1) * cum freq[symbol - 1] / cum freq[0]] - 1, so that value lies within the new interval that decode symbol () calculates in lines 190-193. This is sufficient to guarantee that the decoding operation identifies each symbol correctly.

Underflow

As Figure 1 shows, arithmetic coding works by scaling the cumulative probabilities given by the model into the interval [low, high] for each character transmitted. Suppose low and high are very close together--so close that this scaling operation maps some different symbols of the model onto the same integer in the [low, high] interval. This would be disastrous, because if such a symbol actually occurred it would not be possible to continue encoding. consequently, the encoder must guarantee that the interval [low, high] is always large enough to prevent this. The simplest way to do this is to ensure that this interval is at least as large as Max frequency, the maximum allowed cumulative frequency count (line 36).

How could this condition be violated? The bit-shifting operation explained above ensures that low and high can only become close together when they straddle Half. Suppose in fact they become as close as First qtr [is less than or =] low < Half [is less than or = high < Third qtr.

Then the next two bits sent will have opposite polarity, either 01 or 10. For example, if the next bit turns out to be zero (i.e., high descends below Half and [0, Half] is expanded to the full interval), the bit after that will be one, since the range has to be above the midpoint of the expanded interval. Conversely, if the next bit happens to be one, the one after that will be zero. Therefore the interval can safely be expanded right now, if only we remember that, whatever bit actually comes next, its opposite must be transmitted afterwards as well. Thus lines 104-109 expand [First qtr, Third qtr] into the whole interval, remembering in bits to follow that the bit that is output next must be followed by an opposite bit. This explains why all output is done via bit plus follow() (lines 128-135), instead of directly with output bit().

But what if, after this operation, it is still true that First qtr [is less than or =] low < Half [is less than or =] high < Third qtr?

Figure 5 illustrates this situation, where the current [low, high] range (shown as a thick line) has been expanded a total of three times. Suppose the next bit turns out to be zero, as indicated by the arrow in Figure 5a being below the halfway point. Then the next three bits will be ones, since the arrow is not only in the top half of the bottom half of the original range, but in the top quarter, and moreover the top eighth, of that half--this is why the expansion can occur three times. Similarly, as Figure 5b shows, if the next bit turns out to be a one, it will be followed by three zeros. Consequently, we need only count the number of expansions and follow the next bit by that number of opposites (lines 106 and 131-134).

Using the technique the encoder can guarantee that, after the shifting operations, either low < First qtr < Half [is less than or =] high (1a) or low < Half < Third qtr [is less than or =] high. (1b)

Therefore, as long as the integer range spanned by the cumulative frequencies fits into a quarter of that provided by code values, the underflow problem cannot occur. This corresponds to the condition Max frequence [is less than or =] Top value + 1 / 4 + 1, which is satisfied by Figure 3, since Max frequency = 2.sup.14 - 1 and Top value = 2.sup.16 - 1 (lines 36, 9). More than 14 bits cannot be used to represent cumulative frequency counts without increasing the number of bits allocated to code values.

We have discussed underflow in the encoder only. Since the decoder's job, once each symbol has been decoded, is to track the operation of the encoder, underflow will be avoided if it performs the same expansion operation under the same conditions.

Overflow

Now consider the possibility of overflow in the integer multiplications corresponding to those of Figure 2, which occur in lines 91-94 and 190-193 of Figure 3. Overflow cannot occur provided the product range * Max frequency fits within the integer word length available, since cumulative frequencies cannot exceed Max frequency. Range might be as large as Top value + 1, so the largest possible product in Figure 3 is 2.sup.16(2.sup.14 - 1), which is less than 2.sup.30. Long declarations are used for code value (line 7) and range (lines 89, 183) to ensure that arithmetic is done to 32-bit precision.

Constraints on the Implementation

The constraints on word length imposed by underflow and overflow can be simplified by assuming that frequency counts are represented in f bits, and code value in c bits. The implementation will work correctly provided f [is less than or =] c - 2 f + c [is less than or =] p, the precision to which arithmetic is performed.

In most C implementation, p = 31 if long integers are used, and p = 32 if they are unsigned long. In Figure 3, f = 14 and c = 16. With appropriately modified declarations, unsigned long arithmetic with f = 15 and c = 17 could be used. In assembly language, c = 16 is a natural choice because it expedites some comparisons and bit manipulations (e.g., those of lines 95-113 and 194-213).

If p is restricted to 16 bits, the best values possible are c = 9 and f = 7, making it impossible to encode a full alphabet of 256 symbols, as each symbol must have a count of at least one. A smaller alphabet (e.g., the 26 letters, or 4-bit nibbles) could still be handled.

Termination

To finish the transmission, it is necessary to send a unique terminating symbol (EOF-symbol, line 56) and then follow it by enough bits to ensure that the encoded string falls within the final range. Since done encoding() (line 119-123) can be sure that low and high are constrained by either Eq. (1a) or (1b) above, it need only transmit 01 in the first case or 10 in the second to remove the remaining ambiguity. It is convenient to do this using the bit plus follow() procedure discussed earlier. The input bit() procedure will actually read a few more bits than were sent by output bit(), as it needs to keep the low end of the buffer full. It does not matter what value these bits have, since EOF is uniquely determined by the last two bits actually transmitted.

MODELS FOR ARITHMETIC CODING

The program in figure 3 must be used with a model that provides a pair of translation tables index to char[] and char to index[], and a cumulative frequency array cum frez[]. The requirements on the latter are that

* cum freq[i - 1] [is less than or =] cum freq[i];

* an attempt is never made to encode a symbol i for which cum freq[i - 1] = cum freq[i]; and

* cum freq[0] [is less than or =] Max frequency.

Provided these conditions are satisfied, the values in the array need bear no relationship to the actual cumulative symbol frequencies in messages. Encoding and decoding will still work correctly, although encodings will occupy less space if the frequencies are accurate. (Recall our successfully encoding eaii! according to the model of Table I, which does not actually reflect the frequencies in the message.)

Fixed Models

The simplest kind of model is one in which symbol frequencies are fixed. The first model in Figure 4 has symbol frequencies that approximate those of English (taken from a part of the Brown Corpus [ui]). However, bytes that did not occur in that sample have been given frequency counts of one in case they do occur in messages to be encoded (so this model will still work for binary files in which all 256 bytes occur). Frequencies have been normalized to total 8000. The initialization procedure start model() simply computes a cumulative version of these frequencies (lines 48-51), having first initialized the translation tables (lines 44-47). Execution speed would be improved if these tables were used to reorder symbols and frequencies so that the most frequent came first in the cum freq[] array. Since the model is fixed, the procedure update model)(), which is called from both encode.c and decode.c is null.

An exact model is one where the symbol frequentcies in the message are exactly as prescribed by the model. For example, the fixed model of figure 4 is close to an exact model for the particular excerpt of the Brown Corpus from which it was taken. To be truly exact, however, symbols that did not occur in the excerpt would be assigned counts of zero, rather than one (sacrificing the capability of transmitting messages containing those symbols). Moreover, the frequency counts would not be scaled to a predetermined cumulative frequency, as they have been in Figure 4. The exact model can be calculated and transmitted before the message is sent. It is shown by Clearly and Witten [3] that, under quite general conditions, this will not give better overall compression than adaptive coding (which is described next).

Adaptive Models

An adaptive model represents the chaning symbol frequencies seen so far in a message. Initially all counts might be the same (reflecting no initial information), but they are updated, as each symbol is seen, to approximate the observed frequencies. Provided both encoder and decoder use the same initial values (e.g., equal counts) and the same updating algorithm, their models will remain in step. The encoder receives the next symbol, encodes it, and updates its model. The decoder identifies it according to its current model and then updates its model.

The second half of Figure 4 shows such a adaptive model. This is the type of model recommended for use with Figure 3, for in practice it will outperform a fixed model in terms of compression efficiency. Initialization is the same as for the fixed model, except that all frequencies are set to one. The procedure update model (symbol) is called by both encode symbol() and decode symbol() (Figure 3, lines 54 and 15) after each symbol is processed.

Updating the model is quite expensive because of the need to maintain cumulative totals. In the code of Figure 4, frequency counts, which must be maintained anyway, are used to optimize access by keeping the array in frequency order--an effective kind of self-organizing linear search [9]. Update model() first checks to see if the new model will exceed the cumulative-frequency limit, and if so scales all frequencies down by a factor of two (taking care to ensure that no count scales to zero) and recomputes cumulative values (Figure 4, lines 19-37). Then, if necessary, update model() reorders the symbols to place the current one in its correct rank in the frequency ordering, altering the translation tables to reflect the change. Finally, it increments the appropriate frequency count and adjusts cumulative frequencies accordingly.

PERFORMANCE

Now consider the performance of the algorithm of Figure 3, both in compression efficiency and execution time.

Compression Efficiency

In principle, when a message is coded using arithmetic coding, the number of bits in the encoded string is the same as the entropy of that message with respect to the model used for coding. Three factors cause performance to be worse than this in practice:

(1) message termination overhead;

(2) the use of fixed-length rather than infinite-precision arithmetic; and

(3) scaling of counts so that their total is at most Max)frequency.

None of these effects is significant, as we now show. In order to isolate the effect of arithmetic coding, the model will be considered to be exact (as defined above).

ARithmetic coding must send extra bits at the end of each message, causing a message termination overhead. Two bits are needed, sent by done encoding() (Figure 3, lines 119-123), in order to disambiguate the final symbol. In cases where a bit stream must be blocked into 8-bit characters before encoding, it will be necessary to round out to the end of a block. Combining these, an extra 9 bits may be required.

The overhead of using fixed-length arithmetic occurs because remainders are truncated on division. It can be assessed by comparing the algorithm's performance with the figure obtained froma theoretical entropy calculation that derives its frequencies from counts scaled exactly as for coding. It is completely negligible--on the order of 10.sup.-4 bits/symbol.

The penalty paid by scaling counts is somewhat larger, but still very small. For short messages (less than 2.sup.14 bytes), no scaling need be done. Even with messages of 10.sup.5-10.sup.6 bytes, the overhead was found experimentally to be less than 0.25 percent of the encoded string.

The adaptive model in Figure 4 scales down all counts whenever the total threatens to exceed Max frequency. This has the effect of weighting recent events more heavily than events from earlier in the message. The statistics thus tend to track changes in the input sequence, which can be very beneficial. (We have encountered cases where limiting counts to 6 or 7 bits gives better results than working to higher precision.) Of course, this depends on the source being modeled. Bentley et al. [2] consider other, more explicit, ways of incorporating a recency effect.

Execution Time

The program in Figure 3 has been written for clarity rather than for execution speed. In fact, with the adaptive model in Figure 4, it takes about 420 us per input byte on a VAX-11/780 to encode a text file, and about the same for decoding. However, easily avoidable overheads such as procedure calls account for much of this, and some simple optimizations increase speed by a factor of two. The following alterations were made to the C version shown:

(1) The procedures input-bit(), output-bit(), and bit plus follow() were converted to macros to eliminate procedure-call overhead.

(2) Frequently used quantities were put in register variables.

(3) Multiplies by two were replaced by additions (C "+=").

(4) Array indexing was replaced by pointer manipulation in the loops at line 189 in Figure 3 and lines 49-52 of the adaptive model in Figure 4.

This mildly optimized C implementation has an execution time of 214 us/252 us per input byte, for encoding/decoding 100,000 bytes of English text on a VAX-11/780, as shown in Table II. Also given are corresponding figures for the same program on an Apple Macintosh and a SUN-3/75. As can be seen, coding a C source program of the same length took slightly longer in all cases, and a binary object program longer still. The reason for this will be discussed shortly. Two artificial test files were included to allow readers to replicate the results. "Alphabet" consists of enough copies of the 26-letter alphabet to fill out 100,000 characters (ending with a partially completed alphabet). "Skew statistics" contains 10,000 copies of the string aaaabaaaac; it demonstrates that files may be encoded into less than one bit per character (output size of 12,092 bytes = 96,736 bits). All results quoted used the adaptive model of Figure 4.

A further factor of two can be gained by reprogramming in assembly language. A carefully optimized version of Figures 3 and 4 (adaptive model) was written in both VAX and M68000 assembly languages. Full use was made of registers, and advantage taken of the 16-bit code value to expedite some crucial comparisons and make subtractions of Half trivial. The performance of these implementations on the test files is also shown in Table II in order to give the reader some idea of typical execution speeds.

The VAX-11/780 assembly-language timings are broken down in Table III. These figures were obtained with the UNIX profile facility and are accurate only to within perhaps 10 percent. (This mechanism constructs a histogram of program counter values at real-time clock interrupts and suffers from statistical variation as well as some systematic errors.) "Bounds calculation" refers to the initial parts of encode symbol () and decode symbol () (Figure 3, lines 90-94 and 190-193), which contain multiply and divide operations. "Bit shifting" is the major loop in both the encode and decode routines (lines 95-113 and 194-213). The cum calculation in decode symbol (), which requires a multiply/divide, and the following loop to identify the next symbol (lines 187-189), is "Symbol decode." Finally, "Model update" refers to the adaptive update model () procedure of Figure 4 (lines 26-53).

As expected, the bounds calculation and model update take the same time for both encoding and decoding, within experimental error. Bit shifting was quicker for the text file than for the C program and object file because compression performance was better. The extra time for decoding over encoding is due entirely to the symbol decode step. This takes longer in the C program and object file tests because the loop of line 189 was executed more often (on average 9 times, 13 times, and 35 times, respectively). This also affects the model update time because it is the number of cumulative counts that must be incremented in Figure 4, lines 49-52. In the worst case, when the symbol frequencies are uniformly distributed, these loops are executed an average of 128 times. Worst-case performance would be improved by using a more complex tree representation for frequencies, but this would likely be slower for text files.

SOME APPLICATIONS

Applications of arithmetic coding are legion. By liberating coding with respect to a model from the modeling required for prediction, it encourages a whole new view of data compression [19]. This separation of function costs nothing in compression performance, since arithmetic coding is (practically) optimal with respect to the entropy of the model. Here we intend to do no more than suggest the scope of this view by briefly considering

(1) adaptive text compression,

(2) nonadaptive coding,

(3) compressing black/white images, and

(4) coding arbitrarily distributed integers.

Of course, as noted earlier, greater coding efficiencies could easily be achieved with more sophisticated models. Modeling, however, is an extensive topic in its own right and is beyond the scope of this article.

Adaptive text compression using single-character adaptive frequencies shows off arithmetic coding to good effect. The results obtained using the program in Figures 3 and 4 vary from 4.8-5.3 bits/character for short English text files (10.sup.3.-10.sup.4. bytes) to 4.5-4.7 bits/character for long ones (10.sup.5.-10.sup.6. bytes). Although adaptive Huffman techniques do exist (e.g., [5, 7]), they lack the conceptual simplicity of arithmetic coding. Although competitive in compression efficiency for many files, they are slower. For example, Table IV compares the performance of the mildly optimized C implementation of arithmetic coding with that of the UNIX compact program that implements adaptive Huffman coding using a similar model. (Compact's model is essentially the same for long files, like those of Table IV, but is better for short files than the model used as an example in this article.) Casual examination of compact indicates that the care taken in optimization is roughly comparable for both systems, yet arithmetic coding halves execution time. Compression performance is somewhat better with arithmetic coding on all the example files. The difference would be accentuated with more sophisticated models that predict symbols with probabilities approaching one under certain circumstances (e.g., the letter u following q).

Nonadaptive coding can be performed arithmetically using fixed, prespecified models like that in the first part of Figure 4. Compression performance will be better than Huffman coding. In order to minimize execution time, the total frequency count, cum freq[0], should be chosen as a power of two so the divisions in the bounds calculations (Figure 3, lines 91-94 and 190-193) can be done as shifts. Encode/decode times of around 60 [mu]s/90 [mu]s should then be possible for an assembly-language implementation on a VAX-11/780. A carefully written implementation of Huffman coding, using table lookup for encoding and decoding, would be a bit faster in this application.

Compressing black/white images using arithmetic coding has been investigated by Langdon and Rissanen [14], who achieved excellent results using a model that conditioned the probability of a pixel's being black on a template of pixels surrounding it. The template contained a total of 10 pixels, selected from those above and to the left of the current one so that they precede it in the raster scan. This creates 1024 different possible contexts, and for each the probability of the pixel being black was estimated adaptively as the picture was transmitted. Each pixel's polarity was then coded arithmetically according to this probability. A 20-30 percent improvement in compression was attained over earlier methods. To increase coding speed, Langdon and Rissanen used an approximate method of arithmetic coding that avoided multiplication by representing probabilities as integer powers of 1/2. Huffman coding cannot be directly used in this application, as it never compresses with a two-symbol alphabet. Run-length coding, a popular method for use with two-valued alphabets, provides another opportunity for arithmetic coding. The model reduces the data to a sequence of lengths of runs of the same symbol (e.G., for picture coding, run-lengths of black followed by white followed by black followed by white . . .). The sequence of lenghts must be transmitted. The CCITT facsimile coding standard [11] bases a Huffman code on the frequencies with which black and white runs of different lengths occur in sample documents. A fixed arithmetic code using these same frequencies would give better performance; adapting the frequencies to each particular document would be better still.

Coding arbitrarily distributed integers is often called for in use with more sophisticated models of text, image, or other data. Consider, for instance, the locally adaptive data compression scheme of Bentley et al. [2], in which the encoder and decoder cache the last N different words seen. A word present in the cache is transmitted by sending the integer cache index. Words not in the cache are transmitted by sending a new-word marker followed by the characters of the word. This is an excellent model for text in which words are used frequently over short intervals and then fall into long periods of disuse. Their paper discusses several variable-length codings for the integers used as cache indexes. Arithmetic coding allows any probability distribution to be used as the basis for a variable-length encoding, including--among countless others--the ones implied by the particular codes discussed there. It also permits use of an adaptive model for cache indexes, which is desirable if the distribution of cache hits is difficult to predict in advance. Furthermore, with arithmetic coding, the code spaced allotted to the cache indexes can be scaled down to accommodate any desired probability for the new-word marker.

Printer friendly Cite/link Email Feedback | |

Author: | Witten, Ian H.; Neal, Radford M.; Cleary, John G. |
---|---|

Publication: | Communications of the ACM |

Article Type: | technical |

Date: | Jun 1, 1987 |

Words: | 7854 |

Previous Article: | A versatile data structure for edge-oriented graph algorithms. |

Next Article: | Rule-based versus structure-based models for explaining and generating expert behavior. |