# Modeling word occurrences for the compression of concordances.

1. INTRODUCTIONWith increasing attention being given to digital libraries - massive amounts of information, much of it in textual format, stored and widely accessible by computer - data compression is assuming a new degree of importance. As noted before [Klein et al. 1989], effective compression of a text-based information retrieval (IR) system involves far more than compressing the text itself: it is often overlooked that to be able to access and manipulate text, auxiliary data structures must also be created and stored.

Most large information retrieval systems depend on inverted files for access to their information. In this approach, query processing does not directly involve the original text files (in which keywords might be located using some pattern-matching technique), but rather auxiliary dictionary and concordance files. The dictionary is a list of all the different words appearing in the text and is usually ordered alphabetically. For each entry in the dictionary, there is a pointer into the concordance, which lists each occurrence of the word. While the dictionary is only moderately large, the exact size of a concordance depends on a number of parameters, such as the omission or inclusion of the most frequent words (the so-called stopwords) and whether stemming is first done - in our experiments, all words, as they appear in the text, are used. However, the size of an uncompressed concordance is generally of the same order of magnitude of that of the text itself.

Compressing the concordance serves several purposes. Not only does it save space, but it also saves processing time by reducing the number of I/O operations needed to fetch parts of the concordance into main memory (see Bookstein et al. [1992b], Klein et al. [1989], and Witten et al. [1991]). Thus, to distribute a functional, full-text IR system, consideration must be given how to store the concordance efficiently. But concordance compression has theoretical interest as well. Current approaches to data compression tend to take a two-stage approach: first one models the source, defining the message set and the probability of each message; then one creates the code for each message [Bell et al. 1989]. The concordance of a full-text information retrieval system is the ideal entity on which to test this approach. As a well-structured component of the IR system, it would seem particularly susceptible to modeling.

When choosing an appropriate compression scheme, we note that in a static IR system, compression and decompression are not symmetrical tasks. Compression is done only once while building the system, whereas decompression is needed during the processing of every query and directly affects the response time. One may thus use extensive and costly preprocessing for compression, provided reasonably fast decompression methods are possible. Moreover, in an IR system, while we compress full files (text, concordance, etc.), we decompress only (though possibly many) short pieces on demand; these may be accessed at random by means of pointers to their exact locations. This limits the value of adaptive methods based on tables that systematically change from the beginning to the end of the file.

This article is based on the following conceptualization of a concordance: every occurrence of a word in the database can be uniquely characterized by a sequence of numbers that gives its exact position in the text. Typically, such a sequence would consist of the document number d, the paragraph number p (in the document), the sentence number s (in the paragraph), and the word number w (in the sentence). This defines a hierarchy of four levels: to identify the location of a word, we need give only the quadruple (d, p, s, w), containing the coordinate of the occurrence. The concordance contains, for every word of the dictionary, the lexicographically ordered list of all its coordinates in the text.

In our implementation, we translate this model to an equivalent one, which we describe for a simpler two-level hierarchy In this representation, we indicate (1) the index of the next document containing the word, (2) the number of times the word occurs in the document, followed by (3) the list of word indices of the various occurrences:

[word.sub.1] : ([d.sub.1], [m.sub.1]; [w.sub.1,1], [w.sub.1,2], . . ., [w.sub.1,[m.sub.1]])

([d.sub.2], [m.sub.2]; [w.sub.2,1], . . ., [w.sub.2,[m.sub.2]])

([d.sub.N], [m.sub.N]; [w.sub.N, 1], . . ., [w.sub.N,[m.sub.N]])

[word.sub.2] : . . .

where [d.sub.i] is the document number; [m.sub.i] is the number of occurrences of the given word in the ith document; and [w.sub.i,j] is the index within [d.sub.i] of the jth occurrence of the word To fully represent the concordance, we must model each of the components of the coordinate and develop compression methods appropriate for each entity.

We see that the concordance can be a rather complicated data structure, especially if it permits hierarchical access to the database. But one or more components indicated above can usually be conceptualized as a bitmap (for example, the values indicating which documents a term appears in, or which paragraphs within a document contain a term), with auxiliary information indicating the number of terms in a unit of the hierarchy. Some of the global auxiliary information can be stored in the dictionary.

In this article we shall focus on compressing those components of the concordance that can be represented as bitmaps. Thus, when we refer to a concordance below, we are thinking of a set of bitmaps, each indicating, for example, the documents (or other textual units) in which a word occurs. Formally, each vector is the occurrence map of a given word W. Each bit position corresponds to one document, and the bit in position i is I if and only if the word W appears in document i, i.e., there is at least one coordinate of W which has i in its d-field. In the sequel, we shall use the languages of bits or documents interchangeably.

In order to encode the values of successive bits of the bitmap efficiently, we must obtain a good underlying model for the bit generation process. A variety of methods have been used [Choueka et al. 1988; Linoff and Stanfill 1993; Wisniewski 1986; Witten et al. 1991] to compress concordances. The model of Bookstein et al. [1992a] assumed that all documents are approximately of the same size, that there is no between-document clustering of term occurrences, and that within a single document, words are independently distributed. On the basis of these assumptions, probability distributions were derived for each term, describing the behavior of the variables comprising coordinates in the concordance. The probabilities corresponding to the actual coordinate values were then transformed to bits by using arithmetic encoding, though Huffman or Shannon-Fano codes could have been used as well.

In this article we generalize the independence models by incorporating a tendency of terms to cluster. This is a natural extension of the independence model, since textual data contain inherent dependencies. For example, if the documents of an IR system are grouped by author or some other criterion, the d-values of many terms will tend to appear in clusters because of the specific style of the author, or because adjacent documents might treat similar subjects. Similarly, term occurrences will cluster within documents, reflecting content variations over a document.

In the next part of this article, we begin our study of various models of clustering by introducing a Hidden Markov Model (HMM) to represent bitmap generation. For computational reasons, we then switch to traditional Markov models which approximate the HMM. A set of criteria is developed in Section 2.3 to constrain the allowable set of n-state models, and a full inventory is given for n [less than or equal to] 4. Graph-oriented operations among the various models are defined in Section 3 and are used to simplify the organization of the models. Section 4 concentrates on four-state models. One of the models is analyzed in detail in Section 4.3. Similar results can be derived for the other models and are given without the proofs in Section 4.4.

Section 5 presents the details of the experimental evaluation. The encoding algorithm is formally stated in Section 5.1. The following sections suggest ways for parameter and performance estimation. Finally, the new methods were tested on the concordances of the English Bible and of two large full-text retrieval systems: the Tresor de la Langue Francaise (TLF) and the Responsa Retrieval Project (RRP). The results, including comparisons with other compression methods that appeared in the literature, are presented in Section 5.4.

2. MODELS OF CLUSTERING

The most common method for compressing a concordance is a form of run-length encoding, perhaps using a variable-length encoding scheme for representing the run lengths [Moffat and Zobel 1992; Witten et al. 1994]. In effect this assumes a model of statistically independent term occurrence. But terms do not occur independently. They tend to cluster over authors and over time. Within documents, they cluster over text segments discussing the concept represented by the term. We thus expect a model recognizing term clustering to represent more accurately the occurrences of terms and to offer more effective compression capability.

Should the tendency for clustering be pronounced, codes based on assumptions of term independence will produce poor compression. In this section we look at some Markov models of clustering that potentially can be used to improve compression. The simpler independence model examined in Bookstein et al. [1992a], which ignores clustering effects, can be considered a degenerate Markov model with one state.

We shall conceptualize our bitmap as being generated as follows. At any bitmap site we are in one of two states: a cluster state (C) or a between-cluster state (B). Initially we are in one of these states, as determined by an initial probability distribution. Subsequently, governed by probabilities associated with the state we are in, we generate a bitmap value of zero or one; then, again governed by a probability distribution determined by the current state, we enter a new state as we move to the next bitmap site. This process is continued iteratively until the full bitmap has been generated. Such a model has been referred to as a Hidden Markov Model (HMM) in the literature and will be defined in more detail below.

Unfortunately, while we believe that such a model is likely to be a reasonable representation for most bitmaps that describe concordances, it is analytically difficult to use. The problem is that, even if we know the initial state, we cannot know for certain which state we are in at any bitmap location; even the complete observable history of the generation process is inadequate for determining the state sequence. This makes parameter estimation difficult and complicates the formulae needed for compression and decompression.

However, in many situations, the one-bits are strongly associated with cluster states, and the zero-bits are with the between-cluster states. Thus, while we never know for certain which state we are in, if we have just generated a sequence of ones, or zeros, it is very likely we are in a cluster state, or between-cluster state, respectively. It is mainly when we believe we are in a cluster (between-cluster) state and generate a zero (one) that there is genuine confusion.

For such situations, an approximation to the HMM seems reasonable. We shall introduce several traditional Markov models. These models will have a cluster state and a between-cluster state, reflecting the underlying hidden Markov process. But if we are in a cluster (between-cluster) state and generate a zero (one) then we enter one of the other states, which acts as a transition or uncertainty state. In this way we can introduce some of the aspects of the HMM while enjoying the simplicity of traditional Markov analysis.

2.1 Hidden Markov Model

In a conventional Markov model, given the initial state, we always know the state we are in. The system moves from one state to another governed by the model's matrix of transition probabilities. In making a transition, it emits a symbol, and the symbol uniquely determines the next state.

In an HMM, if we are in a given state, there is a probability distribution describing the symbol that is emitted and a different probability distribution determining the next state. Thus the probability of emitting a symbol has been separated from the probability of making a specific transition to the next state.

To define the HMM formally, we need three probability distributions: one giving the probability of starting in state i (distribution [Pi](i)); one governing transitions between states i and j (distribution A(i, j)); and for each state i, a distribution which gives the probability of generating a given symbol k (distribution B(i, k)). For our models, comprised of two states and two characters, five parameters are needed to define the model completely. However, since the impact of the parameter determining the initial state probability is quickly attenuated, we shall refer to the HMM as a four-parameter model, allowing us to concentrate on the state transitions and character emissions.

HMM's have been used widely in speech recognition [Rabiner 1989] and recently in the prediction of protein structure [Asai et al. 1993; Haussler et al. 1993] and IR research [Mittendorf and Schauble 1994]. In our application the model has two states, one for being in a cluster and another one for being between clusters. In Figure i we have an example of an HMM, taken from TLF: it corresponds to the word extravagant, which appears in 392 of the 39,000 documents. The probability of starting in state [S.sub.1], [Pi](1), is 0.65 and of starting in state [S.sub.2] is 0.35. If we are in state [S.sub.1], we will stay there with probability A(1, 1) = 0.77. In state [S.sub.1] the probability of emitting a zero bit is B(1, 0) = 1.00. The probability of moving from state [S.sub.1] to state [S.sub.2] is small (A(1, 2) = 0.23), but once we do move, the probability of emitting a one-bit increases from 0 to 0.03; once in [S.sub.2], there is a high probability (0.72) that we return back to state [S.sub.1]. Thus, the state [S.sub.1] corresponds to the between-cluster state and the state [S.sub.2] to the cluster state; in this special case of low one-bit density, the two states are quite similar in their likelihood of generating a one-bit.

To use the HMM we must first estimate its parameters. Initially, we choose arbitrary probabilities for all parameters. We then use an iterative procedure that, in each stage, improves the values of the parameters. More about the procedure and the implementation issues can be found in Rabiner [1989]. Once estimated, we store the parameter values of the model (four probabilities in the above model). Given the parameters, we can compute the equilibrium state probabilities of the model. In the actual encoding, we begin with these equilibrium state probabilities and compute the probabilities of a one or zero; this is used to encode the actual value - for example, with arithmetic encoding. The current state probabilities, the actual bit value, and the transition matrix allow us to compute the state probabilities for the next bit site. The procedure continues in this manner until the entire bitmap is encoded.

2.2 Traditional Markov Model

We are conceptualizing bit generation as a process in which we are in either a cluster or noncluster state, and which state we are in determines the probabilities of a one- or zero-bit. An advantage of the HMM is that it recognizes the possibility that we can generate (1) zeros while we are in a cluster state and (2) ones in a noncluster state.

The simplest traditional Markov model permitting clustering would likewise be comprised of two states: one representing the high probability of generating a i within a cluster, the other the between-cluster condition. However, in such a traditional Markov model, the bit generated would determine the next state; thus, for example, a zero-bit would cause us to leave the cluster state.

To incorporate the flexibility of the HMM within a traditional Markov framework, we introduce additional states. In this article, we investigate a family of n-state Markov chains as models of how our bitmap was generated: we assume we know the initial state of the process; then, as we traverse the bitmap from beginning to end, at each position we are in a state, and that state determines the probability of being in any given state at the next location. Each such state transition is associated with a one- or zero-bit being generated. When we encode a bitmap, we simulate this process. We begin in the known initial state, and the bit values scanned determine the state we are in and the probability of the next symbol we will scan; these probabilities are used to construct the code.

2.3 Hierarchy of Models

The simplest traditional Markov model is the degenerate model with a single state, equivalent to the independence model. Such a model cannot exhibit true clustering.

The simplest Markov model permitting clustering is the two-state model referred to above. It is based on two states C and B, with the probability of a transition to C (that is, the probability of generating a one-bit) depending on the state we are in. The state C indicates that we are in a cluster and thus are more likely to generate occurrences of the designated term. We enter B as soon as we leave the cluster (that is, generate a zero-bit).

A limitation of the traditional two-state model is that it does not recognize the possibility of spurious zeros within a cluster or spurious ones between clusters. That is, we would like to incorporate the possibility, inherent in the HMM studied above, that we may be in a cluster and generate zero-bits without leaving the cluster and, conversely, being between clusters and generating one-bits. We accommodate these possibilities by introducing transitional states.

A general n-state model has states C and B as above and n - 2 transitional states ([X.sub.1], [X.sub.2], . . ., [X.sub.n - 2]). It can be represented by a directed graph G = (V, E) in the following way:

* MM1: The set of vertices is V = {[v.sub.0], [v.sub.1], . . ., [v.sub.n - 1]} = {C, [X.sub.1], . . ., [X.sub.n-2], B}.

* MM2: Exactly two arcs (i.e., directed edges) leave each vertex; one arc is labeled by 0, the other by 1, corresponding to the bit value generated in the transition represented by the vertices connected by the arc.

* MM3: One vertex is designated as the start vertex; below we assume the system starts in state B.

* MM4: Transition probabilities (going from state i to state j) are assigned to the corresponding arcs.

Thus, our graph resembles the transition diagram of a finite-state automaton, with the additional assignment of transition probabilities. Such models allow us to simulate remaining in a cluster (or between clusters) even after generating a 0 (1), thereby preserving a very important property of the HMM. In this sense, we shall think of the transitional states as being accomplices of the C and B states: if generating a 0 in a C state takes us to a transitional state, this suggests that we are only tentatively leaving the C state.

At first sight, an analysis of the graph associated with the Markov model suggests there are [n.sup.2n] possible structures that can represent Markov models with n states: two arcs leave each vertex, each of which can enter n possible vertices, so there are [n.sup.2] possibilities for the outgoing arcs of each vertex; since there are n vertices, the total number of possible graphs is [([n.sup.2]).sup.n]. This means that, e.g., even for n = 4, the number is 65,536, and for n = 5 it is almost 10 million. However, almost all of these are unreasonable as representations of our problem. We thus impose several additional conditions for a graph to be an acceptable representation of our Markov model. In summary, a legitimate Markov model can be represented by an arc-labeled graph, satisfying the following conditions C1-C5 in addition to MM1-MM4:

* C1: Each vertex has exactly two arcs leaving it: one labeled by a 1, the other by a 0.

* C2: There are no multiple arcs: there are no states I and J such that the system can go from I to J by either arc, emitting either a one or a zero. Otherwise, we would have states that served only as lag or delay states, with the symbol emitted when within such a state not providing real information; there is no physical justification within our problem for introducing lag states.

* C3: The graph is strongly connected: each vertex can be reached from any other vertex.

* C4: There is a loop on C and one on B, but the intermediate states [X.sub.i] cannot have loops. Thus the transitional states are indeed transitional: we can not remain in any transitional state while generating runs of 1s or 0s; the former should lead to state C, the latter to state B. And, once we are in C (B), we remain there as long as we observe 1s (0s).

* C5: All arcs entering C are labeled with a 1; all arcs entering B are labeled with a 0. This is true in particular for the loops on C and B.

One can represent each graph G by an incidence matrix [M.sub.G] = ([c.sub.ij]). Because of conditions C1 and C2, it seems natural to set [c.sub.ij] as the label of the unique arc from [v.sub.i] to [v.sub.j] (0 [less than or equal to] i, j [less than] n) when such an arc exists; however, we want to reserve the zero value to denote the absence of an arc. We thus use the following consistent labeling:

[Mathematical Expression Omitted]

where, here and below, we use the notation [Mathematical Expression Omitted] if there is an arc from vertex [v.sub.1] to vertex [v.sub.2], and it is labeled [Alpha] [element of] {0, 1}.

Thus, because of C1, each row in [M.sub.G] contains a 1, a 2, and n - 2 zeros. For example, the incidence matrix corresponding to the graph labeled 4S1 in Figure 2 is

[Mathematical Expression Omitted],

and that of 4C1 is

[Mathematical Expression Omitted].

The labels 4S1 and 4C1 describe the structure of the matrix and are described in detail below, after the necessary concepts are developed.

Condition C5 implies that in the first (last) column, corresponding to C (B), there are only 1's (2's) or 0's. From C4 and C5 follows that [c.sub.00] = 1, [c.sub.n - 1,n - 1] = 2 and [c.sub.ii] = 0 for 0 [less than] i [less than] n - 1. To check strong connectivity (C3), note that [([M.sub.G]).sup.k] contains a nonzero value in its (i, j) position, if and only if there is a directed path from [v.sub.i] to [v.sub.j] in G of length exactly k arcs. Thus, G is strongly connected if and only if there are no null entries in the matrix [summation of] [([M.sub.G]).sup.k] where k = 1 to n - 1.

The above conditions impose severe constraints on the incidence matrix and permit an explicit enumeration of the legitimate graphs. Further, many are a relabeling of others, with [X.sub.i] and [X.sub.j] interchanged, for certain i and j. This is easily tested: if interchanging the [X.sub.i] and [X.sub.j] rows and columns of a matrix gives another legitimate matrix, one of the two models can without loss be eliminated. For the case n = 4, this reduces the number of models from 65,536 to 21. We can remove an additional eight models if we introduce, as well, the following plausible condition, which supports our interpretation of the states [X.sub.i] as transitional states.

* C6: If k zeros (ones) are needed to go from C to B (from B to C), and there is a path of length l from [X.sub.i] to B (C) all of whose arcs are labeled by zeros (ones), then l [less than or equal to] k. For example, we considered a graph to be unreasonable if a zero value takes us directly from C to B, but it requires two zeros to go from a transition state to B.

3. STRUCTURE OF MODELS

The models described above have a structure that is best described in terms of two general sets of operations: complementation and reduction. These will be described in the following sections.

3.1 Complementation Operator

The set of resulting graphs has an interesting internal structure: It is possible to divide these graphs into two classes if we introduce the concept of complementation (C) functions, and thereby, of a C-symmetric graph.

In general, given an arc-labeled graph, G, we define its complementary graph, C(G) by means of a pair of self-inverting complementation functions C = ([C.sub.V], [C.sub.L]), defined respectively on the vertices and on the labels of G.

* C(G) has the same set of vertices and labels as G. Since C is self-inverting, any vertex of C(G) is the image of some v in G, and similarly for the labels.

* C(G) has an arc, labeled by [C.sub.L]([Alpha]), between vertices [C.sub.V]([v.sub.i]) and [C.sub.V]([v.sub.j]), for [v.sub.i], [v.sub.j] [element of] V, if and only if G has an arc labeled, from [v.sub.i] to [v.sub.j].

Thus, if we are given a path [v.sub.1][v.sub.2] . . . [v.sub.r] between vertices [v.sub.1] and [v.sub.r] of G, associated with the sequence of labels [[Alpha].sub.1][[Alpha].sub.2] . . . [[Alpha].sub.r - 1], then a C-complementary path [C.sub.V]([v.sub.1])[C.sub.V]([v.sub.2]) . . . [C.sub.V]([v.sub.r]), with labels [C.sub.L]([[Alpha].sub.1])[C.sub.L]([[Alpha].sub.2]) . . . [C.sub.L]([[Alpha].sub.r - 1]), exists in C(G) between vertices [C.sub.V]([v.sub.1]) and [C.sub.V]([v.sub.r]). For the models we are considering, we insist in addition that [C.sub.V](B) = C and that [C.sub.L](1) = 0. With these conditions, we can easily see that a graph G satisfies conditions C1 to C6 if and only if the C-complementary graph C(G) does. Thus the complementation operator defines a structure of our set of models as follows.

A labeled graph is C-symmetric if C(G) - G. Thus, for every path in G, the corresponding C-complementary path also exists in G. This condition is equivalent to being able to order the vertices so that complementary elements are at positions which are symmetric relative to the middle point of the matrix, i.e., [c.sub.ij] and [c.sub.n - 1 - i,n - 1 - j] are either both zero, or, if the one is 1, the other is 2, for 0 [less than or equal to] i, j [less than] n. Thus the first matrix mentioned above, corresponding to 4S1, satisfies the symmetry condition, but the second matrix does not. (A similar observation is valid for more general graphs. If the entries in the incidence matrix are the actual labels of arcs, or the null value (<null>) for vertices that are not connected, then [c.sub.ij] = [C.sub.L]([c.sub.n - 1 - i,n - 1 - j]) if [c.sub.ij] is not null, and [c.sub.ij] = <null> otherwise, for 0 [less than or equal to] i,j [less than] n. The value <null> could be any number distinct from the arc labels.)

We shall only consider cases where [C.sub.L] is binary complementation, though the concept can be generalized to any self-inverting function [C.sub.L]. In terms of our complementation functions, we define [C.sub.V](B) = C, and [C.sub.V]([X.sub.1]) = [X.sub.2]; then the 13 acceptable models divide into three symmetric models and five pairs of models, each pair of which is comprised of a model and its complement.

While particularly useful for organizing our models, the complementation operator is generally applicable to arc-labeled graphs. For example, the well-known de Bruijn diagrams are C-symmetric. In fact, the model labeled 4S1 in Figure 2 is the de Bruijn diagram [G.sub.2,3] (see Even [1973, Sect. 4.3]).

The 13 graphs of the four-state models satisfying the conditions C1-C6, grouped by complementation relation, are schematically displayed in Figure 2, and the graphs for all the models with n [less than or equal to] 3 are displayed in Figure 3. The labeling is implicit in the form of the arrows: a solid line representing an arc labeled i and a dotted line for an arc labeled zero. For ease of description, we shall refer in the sequel to the models by their labels in the figures, which mention the number of their states, e.g., M-4S1 will refer to the model in the upper leftmost corner of Figure 2.

The following conventions were used for naming the various models: each model name consists of a pair nL or a triplet nLi, where n is the number of states, L [element of] {S, C, B} indicating whether the model is C-symmetric, i.e., states B and C play symmetric roles, or whether the model is biased toward the C state or the B state, respectively; i is a running index distinguishing graphs that would otherwise share a given name. We say that a model is C-biased, if it is harder to get from C to B than vice versa; that is, one needs more zeros to get from C to B than one needs ones for the opposite path. For M-4C2 and M-4B2 these paths are both of length 2; we consider M-4C2 to be C-biased because starting in C and generating the sequence 01, we get back to C, whereas starting in B and generating the sequence 10 gets us to an intermediate state. In other words, a C-biased model, once in state C, is more reluctant to accept a zero as an indicator of having left the cluster than would be the case in the complementary situation.

3.2 Reduction Operator

We shall demonstrate below how various properties of our models can be derived, given the model parameters. However, we can save much effort by noting a special relation that exists between some lower- and higher-order models: this relationship allows us to directly derive the results for the lower-order model given the corresponding results for the higher-order model. In this section we shall explore this relation, which allows us to consider certain models as generalizations of lower-order models. To do this we define the notion of reducibility among models.

We first review the formal structure of our Markov models. Each consists, subject to restrictions as stated above, of

* a graph,

* labels for arcs,

* an initial probability for each vertex (here, the probability of B when starting is 1), and,

* transition probabilities.

Suppose we are given such a structure. We wish to determine when an arbitrary partition of the vertices defines a graph that itself represents a legitimate Markov model of the type we are considering; such partitions will be called reductions, an idea closely related to lumpability, as defined, for example, by Kemeny and Snell [1960].

Consider a graph, representing a Markov model, with vertex set V = {[v.sub.0], . . ., [v.sub.n - 1]}. Let P = {[V.sub.1], . . ., [V.sub.n]} be a partition of V. If vertices [v.sub.1] and [v.sub.2] of V are in the same class [V.sub.i], we write [v.sub.1] [equivalent to] [v.sub.2]. Such a partition is called a reduction if

(a) P = {V},

which is the degenerate case for which the partition consists of a single element (that is, all the states are merged into a single one, and the graph consists of a single vertex) or

(b) if the partition contains more than one element and satisfies the following conditions on the underlying graph structure:

* R1: Vertices C and B are in different members of the partition.

* R2: Whenever an arc labeled with [Alpha] goes from vertices [v.sub.1] to [w.sub.1], then for any [v.sub.2] [equivalent to] = [v.sub.1], if an arc labeled with [Alpha] connects [v.sub.2] with [w.sub.2], then [w.sub.2] [equivalent to] [w.sub.1]. That is, all arcs leaving a class and having the same label are directed to the same target class.

More succinctly, we have

[Mathematical Expression Omitted],

which is schematically represented in Figure 4(a).

* R3: Whenever an arc labeled with [Alpha] goes from vertices [v.sub.1] to [w.sub.1], then for any [v.sub.2] [equivalent to] [v.sub.1], if an arc labeled with [Beta], with [Beta] [not equal to] [Alpha], connects [v.sub.2] with [w.sub.2], then [w.sub.2] [not equal to] [w.sub.1]. That is, if [Alpha] [not equal to] [Beta]

[Mathematical Expression Omitted],

as in Figure 4(b). Note that for the binary case we consider, we have [Beta] = 1 - [Alpha]. In other words, for two given classes [V.sub.i] and [V.sub.j] of the partition, all the arcs connecting any vertex in [V.sub.i] to any vertex in [V.sub.j] must be labeled identically.

R2 and R3 tell us that all arcs between two classes have the same label and that all arcs leaving a class with the same label go to the same target class. For example, for model M-4S2, one could define [V.sub.1] = {C, [X.sub.2]} and [V.sub.2] = {B, [X.sub.1]}; then all the arcs going from [V.sub.1] to [V.sub.2] are labeled 0, and all those going from [V.sub.2] to [V.sub.1] are labeled 1.

* R4: If v [equivalent to] C, then all arcs terminating on v must be labeled by 1; similarly, if v [equivalent to] B, then all arcs terminating on v must be labeled by 0.

* R5: There can be no internal arcs in any component not containing either C orB.

In addition, to be consistent with the Markov property, we require that

* R6: If [v.sub.i] [equivalent to] [v.sub.j], then the probability of a transition to a vertex in [V.sub.k] is the same for [v.sub.i] and [v.sub.j]. Specifically, [v.sub.i] and [v.sub.j] have the same probability of generating a i or a 0.

If we are given a graph G and a reduction, then we can define a reduced Markov graph G[prime] as follows:

* RG1: Vertices: P.

* RG2: Arcs: If [Mathematical Expression Omitted] in G, with [v.sub.1] [element of] [V.sub.1] and [v.sub.2] [element of] [V.sub.2], then [Mathematical Expression Omitted] in G[prime]. This defines a single arc labeled [Alpha] leaving any vertex of G[prime], because of condition R2, and the definition is independent of the specific vertices selected because of condition R3. If [V.sub.1] = [V.sub.2], then the arc is a loop, which is possible only for a vertex containing C or B.

* RG3: Initial probabilities: In general, for a vertex [V.sub.i] of G[prime], the initial probability is just the sum of the initial probabilities in G of the constituent vertices of [V.sub.i]. For the case at hand, this will be 0 for all the vertices, except the one containing B, for which it is 1.

* RG4: Transition probabilities: If [V.sub.1] and [V.sub.2] are vertices in G[prime], we define the transition probability between [V.sub.1] and [V.sub.2] as

[T.sub.[V.sub.1],[V.sub.2]] = [T.sub.[v.sub.i],[v.sub.j]],

for [v.sub.i] [element of] [V.sub.1], [v.sub.j] [element of] [V.sub.2], and [T.sub.[v.sub.i],[v.sub.j]], the transition probability between [v.sub.i] and [v.sub.j] in G. (Note that by R6, if [Mathematical Expression Omitted], then [Mathematical Expression Omitted], for all [v.sub.i] [element of] [V.sub.1] and [v.sub.j] [element of] [V.sub.2]). Below, when we derive properties of lower order from higher-order models, we will implicitly equate the above [T.sub.[v.sub.i],[v.sub.j]] and set the transition probabilities between [v.sub.i] and [v.sub.j] to the common value.

* RG5: Special vertices C and B: The component in G[prime] containing C of G is denoted C, and similarly for the component containing B. (For completeness, if there is a single component, we call the vertex C.)

The reduction relation is very much like a congruence in algebra, with the reduced graph like a factor structure.

Several consequences follow from this definition, taken in conjunction with the restrictions defining a legitimate graph structure. The most important is that the reduced graph has all the properties we demand of a legitimate graph.

We assume below that we have at least two components.

* C[prime]1: Each vertex in G[prime] has exactly two arcs leaving it, one labeled by 1, the other by 0; this is a consequence of RG2 given the defining property C1 of G.

* C[prime]2: There are no multiple arcs, because of R3.

* C[prime]3: G[prime] is strongly connected, because G is.

* C[prime]4: In G[prime], as in G, there will be a one-loop attached to C and a zero-loop attached to B. Further, all arcs labeled i in the C component of G[prime] must be internal to the C component, because of R2 and the requirement that C in G has a loop; and all arcs labeled by 0 and starting in the C component must leave the component, because of condition R3: if an arc labeled with 0 connected any vertex in C with a different vertex in C, then all zero-labeled arcs starting in C must end in C; since this is true as well for one-labeled arcs, no path starting with C could connect to B, contradicting the assumption that G is strongly connected. Similar conditions, for arcs labeled 0 and i respectively, are true for the B component. Transitional states can have no loops, because of R5.

* C[prime]5: All arcs entering C are labeled by 1, and all arcs entering B are labeled by 0, by R4.

Looking forward, the notion of a reduction will allow us to deduce properties of lower-order models, given properties of higher-order models. Intuitively, if some analysis assigns a probability to each vertex in G, the parallel probability for a vertex in the reduced graph should be the sum of the corresponding probabilities of the original graph - this is true, for example, for equilibrium probabilities.

Similarly, both the original and reduced structures define a probability for any sequence of 1s and 0s. We will be interested in properties that depend on probabilities of strings for which we can deduce the value for the reduced graph by equating the transition probabilities in the corresponding value for the initial graph. This is the case for the expected number of clusters.

Figure 5 displays all the possible reductions among the n-state models we consider, for n [less than or equal to] 4. Models that are C-symmetric are indicated by a double borderline. If [M.sub.1] can be reduced to [M.sub.2], this is indicated by an arrow from [M.sub.1] to [M.sub.2], near to which appear the states to be put in the same class of the partition.

We note that M - 4S1 reduces to the three-state model M - 3C by identifying state(1) [X.sub.2] with state C, and to M - 3B by identifying state [X.sub.1] with state B. M - 4S2 reduces to M - 3S by identifying the two transitional states [X.sub.1] and [X.sub.2]. Thus, all the three-state models (as well as the two- and one-state models) previously studied are special cases of the symmetric four-state models.

Note also that M - 4S2 does not reduce to another three-state model beside M - 3S; only by identifying both [X.sub.1] with B and [X.sub.2] with C do we get an acceptable partition, that corresponding to the two-state model. For M - 4S3 and the other models appearing at the bottom of Figure 5, an acceptable partition is obtained only when all four states are merged, yielding the independent one-state model.

We finally draw attention to the symmetric aspect of the reduction diagram, which reflects the fact that for each reduction from one model to another there is a corresponding reduction between C-symmetric counterparts.

Further consequences include the following:

* If there is an arc labeled [Alpha] emanating from a vertex in a given class [V.sub.i], and incident onto a vertex outside of [V.sub.i], then there cannot be an arc labeled [Alpha] from one vertex to another in [V.sub.i]. That is, if an arc labeled [Alpha] connects [v.sub.1] and [v.sub.2], for [v.sub.1] [equivalent to] [v.sub.2], then all arcs labeled [Alpha] starting in the equivalence class defined by [v.sub.1] and [v.sub.2] must terminate in that class. This is in particular true for arcs labeled by 1(0) in the class containing C (B). For example, again in M - 4S2, the set V[prime] = {C, [X.sub.1]} cannot be a class in a partition, because there is an arc labeled 0 leaving V[prime], but [Mathematical Expression Omitted]. This property is a consequence of R2.

* If from a given vertex v, there is an arc labeled 0 to [w.sub.1], and another arc labeled 1 to [w.sub.2], then [w.sub.1] and [w.sub.2] cannot possibly belong to the same class in a partition, unless v too belongs to that class. For example, in M - 4S1, we have [Mathematical Expression Omitted] and [Mathematical Expression Omitted], implying that the set {[X.sub.2], B} may not be part of a partition. This is a consequence of R3.

* To any path in a reduced graph, G[prime], there corresponds at least one path in the original graph G. Denote a path in G[prime] by the sequence vertex (output symbol) vertex . . . vertex. Then, given any such path P = [V.sub.0]([[Alpha].sub.0])[V.sub.1]([[Alpha].sub.1])[V.sub.2]([[Alpha].sub.2]) . . . [V.sub.n], and starting with any vertex of G included in [V.sub.0], the path in G associated with [[Alpha].sub.0][[Alpha].sub.1] . . . [[Alpha].sub.n - 1] is consistent with P. This is the essential idea behind the graph congruence and is why everything works.

4. FOUR-STATE MARKOV MODELS

There is a natural tradeoff when deciding the order n of the Markov model to be used. The higher n, the better can the model describe the bit generation process - but more bits are needed to communicate the model parameters to the decoder. The independence model of Bookstein et al. [1992a] corresponds to n = 1, and in Bookstein et al. [1994] we considered n = 3. To maintain the same complexity as the two-state HMM of Section 2.1 (which has four parameters), in this section we shall focus on four-state Markov models.

4.1 Interpretation of the Models

The models we have developed are intended to be simplifications of the HMM. To make explicit the relationship between these two types of models, we describe one in detail.

The scenario represented by M - 4S2, for example, is as follows. Suppose we start within a cluster - that is, in state C. As long as ones are generated, we remain within the cluster. However, if a zero is generated, it suggests the possibility of having left the cluster, but the possibility remains that we are still in C and that the zero is just a random occurrence. This possibility, consistent with the underlying HMM, is taken into account by entering the transitional state [X.sub.1]. If, once in [X.sub.1], the next bit generated is a one, we go back to the cluster state and conclude that the preceding zero was merely an accident; if, however, a second zero is produced, we conclude that we have left the cluster and are now in state B. State B remains then in effect as long as zeros are generated, and the behavior is symmetrical. That is, if a one is generated, we are not yet sure it is the beginning of a cluster, so we go into a transitional state [X.sub.2]. A second one confirms that we are in a cluster, so we go to C, but a zero drops us back to B.

Below, we shall analyze in detail model M - 4S1 and indicate without proof the results for the other C-symmetric models. In addition, for illustration and contrast, we shall present the results for one of the asymmetric models that performed well in our experiments, as well as of its C-symmetric counterpart.

4.2 Relation to Earlier Models

In earlier work [Bookstein et al. 1992a], we assumed an independence model, which is equivalent to a single-state Markov model that ignores that word occurrences may cluster. A primitive clustering effect can be described by a two-state Markov model, with only a cluster state and a between-cluster state. But such a model has no tolerance for zeros (ones) in a cluster (between-cluster) state.

The model discussed in Bookstein et al. [1994] is M - 3C. It differs from M - 4S1 in that a single i is enough to make the transition from B to C. In the complementary three-state model M - 3B, the transitional state is entered on the way from B to C: a single 0 is enough to leave the cluster, but at least two consecutive 1s are needed to pass from the between to the cluster state. The third possible three-state model M - 3S is C-symmetric, e.g., two 0s are needed to get from C to B, and two 1s are needed to pass from B to C. These three models exhaust this set of three-state models satisfying the plausibility requirements enumerated above.

The reason that M - 3B, the complementary model of M - 3C, has not been mentioned in Bookstein et al. [1994] is because of the interpretation of the bitvectors in our application. The ones correspond to documents containing the designated term, the zeros to documents in which the term does not occur. Now most of the terms of any natural-language database occur in only a small part of the documents, so that the probability of a one-bit is much smaller than that of a zero-bit. It is thus reasonable to allow single zeros to appear within a cluster as in M - 3C, while the occurrence of a single 1, which in itself is a relatively rare event, can already be considered as indicating the start of a new cluster. It might thus well be that for our applications the symmetry condition should not hold, and indeed, some nonsymmetric four-state models performed consistently better in our experiments, reported in Section 5.

Our goal for the four-state models is to predict the size of the gaps separating documents that contain the term and then deduce corresponding quantities for the three-state models, two-state models, and the independence model. Gap length is an important characteristic of clustering, and the distribution of gap lengths distinguishes a clustered model from the independence models. We shall derive formulae that indicate how the gap lengths depend on the underlying parameters for the various models. These formulae also offer an indication of whether clustering is a factor for a particular set of bitmaps.

4.3 Analysis of M - 4S1

Let us define the transition probabilities for M - 4S1 by

C [approaches] C : [[Gamma].sub.C], C [approaches] [X.sub.1] : 1 - [[Gamma].sub.C];

[X.sub.1][approaches][X.sub.2] : [[Gamma].sub.[X.sub.1]], [X.sub.1] [approaches] B : 1 - [[Gamma].sub.[X.sub.1]];

[X.sub.2] [approaches] C : [[Gamma].sub.[X.sub.2]], [X.sub.2] [approaches] [X.sub.1] : 1 - [[Gamma].sub.[X.sub.2]];

B [approaches] [X.sub.2] : [[Gamma].sub.B], B [approaches] B : 1 - [[Gamma].sub.B].

That is, if we are in state S, [[Gamma].sub.S] is the probability of emitting a 1.

We next compute several interesting properties of this model.

4.3.1 Equilibrium Probabilities. We first compute the long-run equilibrium probabilities for these states. Let [Pi] = [[[Pi].sub.C], [[Pi].sub.[X.sub.1]], [[Pi].sub.[X.sub.2]], [[Pi].sub.B]] be the vector of probabilities of being in states C, [X.sub.1], [X.sub.2], B respectively in the long run. Ordering the states as induced by [Pi], we get the transition matrix

[Mathematical Expression Omitted].

Since [Pi] satisfies [Pi] = [Pi]T, we have the following system of equations:

[[Pi].sub.C] = [[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]] (1)

[[Pi].sub.[X.sub.1]] = [[Pi].sub.C](1 - [[Gamma].sub.C]) + [[Pi].sub.[X.sub.2]](1 - [[Gamma].sub.[X.sub.2]]) (2)

[[Pi].sub.[X.sub.2]] = [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.B][[Gamma].sub.B] (3)

[[Pi].sub.B] = [[Pi].sub.[X.sub.1]](1 - [[Gamma].sub.[X.sub.1]]) + [[Pi].sub.B](1 - [[Gamma].sub.B]). (4)

In addition, we also have the condition [[Pi].sub.C] + [[Pi].sub.[X.sub.1]] + [[Pi].sub.[X.sub.2]] + [[Pi].sub.B] = 1. Now solving the equations for [[Pi].sub.C], [[Pi].sub.[X.sub.1]], [[Pi].sub.[X.sub.2]], and [[Pi].sub.B] we get

[[Pi].sub.C] = [[Gamma].sub.[X.sub.2]][[Gamma].sub.B]/[[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + 2(1 - [[Gamma].sub.C])[[Gamma].sub.B] + (1 - [[Gamma].sub.C])(1 - [[Gamma].sub.[X.sub.1]]) (5)

[[Pi].sub.[X.sub.1]] = [[Pi].sub.[X.sub.2]] = (1 - [[Gamma].sub.C])[[Gamma].sub.B]/[[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + 2(1 - [[Gamma].sub.C])[[Gamma].sub.B] + (1 - [[Gamma].sub.C])(1 - [[Gamma].sub.[X.sub.1]]) (6)

[[Pi].sub.B] = (1 - [[Gamma].sub.C])(1 - [[Gamma].sub.[X.sub.1]])/[[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + 2(1 - [[Gamma].sub.C])[[Gamma].sub.B] + (1 - [[Gamma].sub.C])(1 -[[Gamma].sub.[X.sub.1]]). (7)

4.3.2 Run Lengths. Each bit in a bitmap can be individually encoded, using arithmetic coding based on the probability of a one-bit at each stage; this is the encoding algorithm used in our experiments and described in more detail in Section 5. Alternatively, it may be more convenient to encode the gaps between one-bits. The reason why this alternative representation is equivalent is shown in the Appendix. To encode the gap sizes, we need their distribution, which is easily derived using the above model.

We thus compute the probability of a run of k zeros, following a 1. Observe that if a 1 was just generated, the associated transition must have left us in state C or state [X.sub.2]. Thus denoting by [[Pi][prime].sub.C] and [[Pi][prime].sub.[X.sub.2]] the probabilities of being in state C and [X.sub.2] respectively after we have just generated a 1, we have [[Pi][prime].sub.C] + [[Pi][prime].sub.[X.sub.2]] = 1. Let [P.sub.k[where]C] be the probability of a run of k zeros followed by a 1, given that we are initially in state C. Then

[p.sub.0[where]C] = [[Gamma].sub.C];

[p.sub.1[where]C] = (1 - [[Gamma].sub.C])[[Gamma].sub.[X.sub.1]];

[p.sub.n[where]C] = (1 - [[Gamma].sub.C])(1 - [[Gamma].sub.[X.sub.1]])[(1 - [[Gamma].sub.B]).sup.n - 2][[Gamma].sub.B], for n [greater than or equal to] 2.

We define [p.sub.k[where][X.sub.2]] similarly as [p.sub.k[where]C]. Then

[p.sub.0[where][X.sub.2]] = [[Gamma].sub.[X.sub.2]];

[p.sub.1[where][X.sub.2]] = (1 - [[Gamma].sub.[X.sub.2]]) [[Gamma].sub.[X.sub.1]];

[P.sub.n[where][X.sub.2]] = (1 - [[Gamma].sub.[X.sub.2]])(1 - [[Gamma].sub.[X.sub.1]])[(1 - [[Gamma].sub.B]).sup.n - 2][[Gamma].sub.B], for n [greater than or equal to] 2.

Let [p.sub.k] be the probability of a run of exactly k zeros following a 1. Then

[p.sub.k] = [p.sub.k[where]C] [[Pi][prime].sub.C] + [p.sub.k[where][X.sub.2]] [[Pi][prime].sub.[X.sub.2]],

so we have

[p.sub.0] = [[Gamma].sub.C][[Pi][prime].sub.C] + [[Gamma].sub.[X.sub.2]][[Pi][prime].sub.[X.sub.2]]; (8)

[p.sub.1] = (1 - [[Gamma].sub.C])[[Gamma].sub.[X.sub.1]][[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]]) [[Gamma].sub.[X.sub.1]][[Pi][prime].sub.[X.sub.2]];

and for n [greater than or equal to] 2, we have

[p.sub.n] = (1 - [[Gamma].sub.[X.sub.1]])[(1 - [[Gamma].sub.B]).sup.n - 2] [[Gamma].sub.B][(1 - [[Gamma].sub.C])[[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]])[[Pi][prime].sub.[X.sub.2]]]. (9)

It remains to compute [[Pi][prime].sub.C] and [[Pi][prime].sub.[X.sub.2]]. Observe that if we are in state C, then the previous state must have been C or [X.sub.2]. Therefore

[[Pi][prime].sub.C] = [[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]]/p(1)

where p(1) is the probability of generating a 1. But we also have

p(1) = [[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]] [[Pi].sub.B][[Gamma].sub.B].

Hence

[[Pi][prime].sub.C] = [[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]]/[[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]] + [[Pi].sub.B][[Gamma].sub.B]. (10)

Similarly, if we are in state [X.sub.2], then the previous state must have been [X.sub.1] or B, and

[[Pi][prime].sub.[X.sub.2]] = [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.B][[Gamma].sub.B]/[[Pi].sub.C][[Gamma].sub.C] + [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]] + [[Pi].sub.B][[Gamma].sub.B]. (11)

Substituting (1) and (3) from the equations for the steady-state probabilities into (10) and (11), we find that

[[Pi][prime].sub.C] = [[Pi].sub.C]/[[Pi].sub.C] + [[Pi].sub.[X.sub.2]], [[Pi][prime].sub.[X.sub.2]] = [[Pi].sub.[X.sub.2]]/[[Pi].sub.C] + [[Pi].sub.[X.sub.2]]. (12)

Substituting the expressions for [[Pi].sub.C] and [[Pi].sub.[X.sub.2]] as functions of the [Gamma]'s using (5) and (6), we get after some simplification

[[Pi][prime].sub.X] = [[Gamma].sub.[X.sub.2]]/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]], [[Pi][prime].sub.[X.sub.2]] = 1 - [[Gamma].sub.C]/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]. (13)

4.3.3 Expected Run Length. Using the above expressions for [p.sub.n] and [Pi][prime], we can calculate the average size of a gap as

[summation of] n[p.sub.n] where n = 0 to [infinity] = (1 - [[Gamma].sub.C]) [[Gamma].sub.[X.sub.1]][[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]])[[Gamma].sub.[X.sub.1]][[Pi][prime].sub.[X.sub.2]]

+ (1 - [[Gamma].sub.[X.sub.1]])[(1 - [[Gamma].sub.C])[[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]]) [[Pi][prime].sub.[X.sub.2]]] [[Gamma].sub.B] [summation of] n [(1 - [[Gamma].sub.B]).sup.n-2] where n = 2 to [infinity]

= (1 - [[Gamma].sub.C]) [[Gamma].sub.[X.sub.1]][[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]]) [[Gamma].sub.[X.sub.1]][[Pi][prime].sub.[X.sub.2]]

+ (1 - [[Gamma].sub.[X.sub.1]])[(1 - [[Gamma].sub.C]) [[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]])[[Pi][prime].sub.[X.sub.2]] 1 + [[Gamma].sub.B]/[[Gamma].sub.B]

= (1 - [[Gamma].sub.C])(1 - [[Gamma].sub.[X.sub.1]] + [[Gamma].sub.B])/[[Gamma].sub.B] [[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]](1 - [[Gamma].sub.[X.sub.1]] + [[Gamma].sub.B])/[[Gamma].sub.B] [[Pi][prime].sub.[X.sub.2]]

= 1 - [[Gamma].sub.[X.sub.1]] + [[Gamma].sub.B]/[[Gamma].sub.B] [(1 - [[Gamma].sub.C])[[Pi][prime].sub.C] + (1 - [[Gamma].sub.[X.sub.2]])[[Pi][prime].sub.[X.sub.2]]]

= 1 - [[Gamma].sub.[X.sub.1]] + [[Gamma].sub.B]/[[Gamma].sub.B] [(1 - [[Gamma].sub.C])[[Gamma].sub.[X.sub.2]]/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]] + (1 - [[Gamma].sub.[X.sub.2]])(1 - [[Gamma].sub.C]/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]]

= 1 - [[Gamma].sub.[X.sub.1]] + [[Gamma].sub.B])(1 - [[Gamma].sub.C])/[[Gamma].sub.B] (1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]])

= (1 - [[Gamma].sub.C]) (1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B])(1/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]). (14)

In fact, we are only interested in genuine gaps, that is, gaps of length at least 1. The expected genuine gap size is 1/(1 - [p.sub.0]) [summation of] n[p.sub.n] where n = 1 to [infinity]; but from (8) and (13), we get

1 - [p.sub.0] = [[Gamma].sub.C][[Gamma].sub.[X.sub.2]] + [[Gamma].sub.[X.sub.2]](1 - [[Gamma].sub.C])/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]] = 1 - [[Gamma].sub.C]/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]].

It thus follows from (14) that the genuine gap size is

1/1 - [p.sub.0] [summation of] n[p.sub.n] where n = 1 to [infinity] = 1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B].

Note that this expression is independent of [[Gamma].sub.C] and [[Gamma].sub.[X.sub.2]].

4.3.4 Density of One-Bits. Let D be the total number of documents; we would like to relate this to N, the number of documents containing the designated term, given the model parameters. The predicted density of one-bits N/D, given the easily obtainable values N and D, allows a test of the model. We give two derivations of this important result.

Heuristic Argument. To estimate N, note that we first have a span of zero or more documents without the designated term. This is followed by N - 1 spans made up of a document containing the term followed by an interterm span of zero or more documents. Finally, the last span consists of a document with the term, followed by a (possibly empty) terminal span. For our heuristic estimate, we assume that each internal span has a length equal to its expected value and that the lengths of each of the terminal spans can be approximated as half of this length. So the D documents are made up of the N documents with a designated term, the two end spans, and the N - 1 internal spans. Therefore

D = N [summation of] n[p.sub.n] where n = 0 to [infinity] + N = N(1 - [[Gamma].sub.C])(1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B])(1 + 1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]) + N.

This result immediately gives us an estimate for the density of one-bits, N/D; this would serve as an estimate of the "p" value of an independence model, should we wish to simplify our model by using an independence approximation.

Formal Derivation. One can also derive the above formula by observing that we are in states C or [X.sub.2] if and only if we have just generated a 1. Thus N, the number of ones, is equal to the number of recurrences of states C and [X.sub.2]: [[Pi].sub.C]D + [[Pi].sub.[X.sub.2]]D = N, which implies

D = N/[[Pi].sub.C] + [[Pi].sub.[X.sub.2]]

= N [[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + 2(1 - [[Gamma].sub.C])[[Gamma].sub.B] + (1 - [[Gamma].sub.C]) (1 - [[Gamma].sub.[X.sub.1]])/[[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + (1 - [[Gamma].sub.C])[[Gamma].sub.B]

= N (1 - [[Gamma].sub.C])[[Gamma].sub.B] + (1 - [[Gamma].sub.C]) (1 - [[Gamma].sub.[X.sub.1]])/[[Gamma].sub.[X.sub.2]][[Gamma].sub.B] + (1 - [[Gamma].sub.C])[[Gamma].sub.B] + N

= N (1 - [[Gamma].sub.C]) (1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B]) (1/1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]) + N,

agreeing with the preceding result.

4.3.5 Reductions to Simpler Models. By identifying [X.sub.2] with C, we get M - 3C. Now [[Gamma].sub.[X.sub.2]] = [[Gamma].sub.C], and we get

[summation of] n[p.sub.n] where n = 0 to [infinity] = (1 - [[Gamma].sub.C])(1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B]), [summation of] n [p.sub.n]/1 - [p.sub.0] where n = 1 to [infinity] = 1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B]

and

D = N(1 - [[Gamma].sub.C])(1 + 1 - [[Gamma].sub.[X.sub.1]]/[[Gamma].sub.B]) + N.

By identifying [X.sub.1] with B in M - 4S1, we get the complementary three-state model M - 3B. Now [[Gamma].sub.[X.sub.1]] = [[Gamma].sub.B], and we get

[summation of] n[p.sub.n] where n = 0 to [infinity] = 1 - [[Gamma].sub.C]/[[Gamma].sub.B](1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]), [summation of] n [p.sub.n]/1 - [p.sub.0] where n = 1 to [infinity] = 1/[[Gamma].sub.B]

and

D = N(1 - [[Gamma].sub.C])/[[Gamma].sub.B](1 - [[Gamma].sub.C] + [[Gamma].sub.[X.sub.2]]) + N.

By identifying [X.sub.1] with B in M - 3C or by identifying [X.sub.2] with C in M - 3B we get the two-state model. The formulas reduce to

[summation of] n[p.sub.n] where n = 0 to [infinity] = (1 - [[Gamma].sub.C])(1 + 1 - [[Gamma].sub.B]/[[Gamma].sub.B]) = 1 - [[Gamma].sub.C]/[[Gamma].sub.B],

[summation of] n [p.sub.n]/1 - [p.sub.0] where n = 1 to [infinity] = 1 + 1 - [[Gamma].sub.B]/[[Gamma].sub.B] = 1/[[Gamma].sub.B]

and

D = N (1 - [[Gamma].sub.C]/[[Gamma].sub.B]) + N.

By identifying B with C in the two-state model, we get the independence model. Now [[Gamma].sub.B] = [[Gamma].sub.C], and let [Gamma] = [[Gamma].sub.B] = [[Gamma].sub.C]. The formulae become

[summation of] n[p.sub.n] where n = 0 to [infinity] = 1/[Gamma] - 1, [summation of] n [p.sub.n]/1 - [p.sub.0] where n = 0 to [infinity] = 1/[Gamma]

and

D = N/[Gamma].

4.4 Summary of the Four-State Models

Using identical techniques to those of the previous section, one can derive similar formulae for the other four-state models we considered earlier. The calculations are straightforward and therefore omitted. We report here on the resulting equations only for the symmetric four-state models, for the C-biased model which performed best on most experiments (M - 4C1) and for its C-symmetric complement (M - 4B1).

The transition probabilities for models M - 4S2, M - 4S3, M - 4C1, and M - 4B1 can be summarized, respectively, by the following transition matrices, corresponding, as above, to the four states in the order (C, [X.sub.1], [X.sub.2], B):

[Mathematical Expression Omitted],

[Mathematical Expression Omitted],

[Mathematical Expression Omitted],

[Mathematical Expression Omitted].

We then proceed by calculating the equilibrium probabilities satisfying [Pi] = [Pi]T and evaluating the probability distributions of the run lengths. Note that one ought to take care of the fact that the models differ slightly; for example, for M - 4S1 and M - 4S2, we are in state C or [X.sub.2] if and only if we have just generated a 1, while for M - 4S3 and M - 4B1, this is true for states C or [X.sub.1] or [X.sub.2], and for M - 4C1, only for state C. This has to be taken into account when evaluating [p.sub.n]. Table I summarizes the formulae for the average genuine gap size ([Sigma] n[p.sub.n])/(1 - [p.sub.0]) for the different models. For the sake of completeness, we have also included the independent, two-state and three-state models.

Making the same assumptions as above about the spans of zeros, we can derive the following relationships between D and N, which are summarized in Table II.

5. EXPERIMENTAL EVALUATION OF MODELS

5.1 Model-Based Coding Algorithm

The following algorithm is a formal description of the encoding process for any of the Markov models discussed above. Given is a bitvector of length n bits: [b.sub.1] . . . [b.sub.n]. First the model M = <Q, [[Delta].sub.M], S> of the transitions is chosen, where Q is the set of states; S [element of] Q is the starting state; and [[Delta].sub.M]: Q x {0, 1} [approaches] Q is the corresponding transition function. A two-dimensional array count[i, j] with i [element of] Q and j [element of] {0, 1} is used to keep track of the number of times each of the possible transitions has occurred. The function encode(b, p) is part of an arithmetic encoder applied to a binary alphabet {0, 1}, where the probability of a 1 is p, and the bit b is to be encoded, i.e., the [TABULAR DATA FOR TABLE I OMITTED] current subinterval I [subset or equal to] [0, 1] is partitioned according to p, and the new subinterval I[prime] [subset or equal to] I is the one corresponding to b.

[Program Listing Omitted]

[TABULAR DATA FOR TABLE II OMITTED]

The parameter evaluation algorithm for the HMM relies on three control parameters, (i, r, t). Recall that the parameters of the HMM cannot be solved directly and that an iterative maximum-likelihood estimation procedure is required. Such procedures generally lead to local, not global, optima. We thus proceed as follows. For each bitvector we form i initial models; for each model we undertake r iterations of our estimation procedure. We select the model giving the best compression and then continue the iterations with the chosen model until the compression gain between two successive repetitions drops below a predetermined threshold value t. At this point we have the final model for the current bitvector, and the parameters (which are the four probabilities A(0, 0), A(1, 0), B(0, 0), B(1, 0)) and the encoding result are stored; then follows the encoding phase using that model.

5.2 Parameter Estimation

We can estimate the parameters directly. We assume that before generating the bitmap we are in state B. The sequence of ones and zeros making up the bitmap completely determines the state at any bitmap position. Thus it is easy to tabulate the number, and hence probability, of each type of transition.

For illustration, consider the following bitmap, as processed by the Markov model, M - 3C:

0 0 1 0 1 1 0 0 . . .

Since we begin in state B, the initial zero indicates that the first transition is back to state B. Continuing in this manner we find the sequence of states corresponding to the above bitmap is given as follows (with the initial B preceding the colon):

B: B B C X C C X B . . .

In this sequence, we are in state B four times, for which we have three transitions. Of these three transitions, one is to state C, so on the basis of the information given, we would estimate [[Gamma].sub.B] [approximately equal to] 1/3; similarly [[Gamma].sub.C] [approximately equal to] 1/3, and [[Gamma].sub.X] [approximately equal to] 1/2. Thus each parameter is easily evaluated, and these parameters can be used as the basis for compressing the bitmap. Further, the standard deviation of this estimate, for state S, is given by [[Sigma].sub.S] = [square root of [[Gamma].sub.S](1 - [[Gamma].sub.S])/[N.sub.S]], if we experience [N.sub.S] transitions from state S in the pertinent bitmap. The standard deviations make it possible to compute confidence intervals and do tests of hypotheses.

For a large bitmap, the [Gamma]-values can be stored with the bitmap to permit decompression. But several mechanisms can be tried to reduce the cost of storing these parameters. For example, we can save the space for storing these parameters by evaluating them adaptively, beginning with reasonable initial values.

We can also lower storage costs by reparameterizing the model. The [Gamma] are our model's basic parameters. But we can also consider a reparameterization that is suggestive: [[Gamma].sub.C] [equivalent to] [Gamma], [[Gamma].sub.X] [equivalent to] [[Theta].sub.X][[Gamma].sub.C], and [[Gamma].sub.B] [equivalent to] [[Theta].sub.B][[Gamma].sub.X], which define the parameters [Gamma], [[Theta].sub.X], and [[Theta].sub.B]. These parameters satisfy simpler constraints (0 [less than or equal to] [[Theta].sub.X], [[Theta].sub.B] [less than or equal to] 1) independently of one another and of [Gamma], which will ease problems of estimation. We also expect that regularities in the data will be more easily expressed in terms of the [Theta] than in terms of the [Gamma]. It is also useful to define [Theta] [equivalent to] [[Theta].sub.X][[Theta].sub.B], so that [[Gamma].sub.B] = [Theta][[Gamma].sub.C]. [Theta] is a single value reflecting the strength of clustering: it indicates the relative likelihood of a term appearing if we are inside a cluster as compared to if we are outside a cluster.

This reparameterization allows us to lower storage costs if we find the [Theta] are related in a regular manner over the terms. For example, since state X is intermediate between states C and B, we might find that [[Gamma].sub.X] is reasonably approximated by the average of [[Gamma].sub.C] and [[Gamma].sub.B], that is, that [[Theta].sub.X] = (1 + [Theta])/2.

It may also be possible, without serious deterioration in performance, to divide the parameters into a small number of categories. Thus, we might divide our terms into four clustering classes: one class would represent no clustering ([Theta] = 1); the other three states would represent varying clustering strengths, with the values of the clustering parameter, [Theta], for these states determined empirically. This simplification allows clustering strength to be represented with the cost of just two bits per term.

5.3 Performance Estimates

The availability of a detailed model permits us to estimate model performance. For illustration, we will establish some properties for model M - 4S1. If we are in state S [element of] {B, [X.sub.1], [X.sub.2], C}, we expect that we can encode the next bitmap element in H([[Gamma].sub.S]) bits (where H(x) [equivalent to] - (x log(x) + (1 - x) log(1 - x)). Using the equations for [[Pi].sub.C], [[Pi].sub.[X.sub.1]], [[Pi].sub.[X.sub.2]], and [[Pi].sub.B], we estimate that the D bits of the bitmap can be reduced to

[B.sub.M] = ([[Pi].sub.C]H([[Gamma].sub.C]) + [[Pi].sub.B]H([[Gamma].sub.B]) + [[Pi].sub.[X.sub.1]]H([[Gamma].sub.[X.sub.1]]) + [[Pi].sub.[X.sub.2]]H([[Gamma].sub.[X.sub.2]]))D

bits, if the model is valid, and we base our codes on that model.

This estimate of performance can be used as a rough test of the Markov model. But even if the model is valid, we are left with the question of whether the savings of using the full model justifies the additional complexity relative to, say, the independence model. The relative performance of the two models is easily computed.

Under the assumption of the independence model, the appearance of a one-bit is determined by a single parameter, p, the density of one-bits. Developing a code using the value p and assuming a zeroth order model, the size of the bitmap is reduced to

[B.sub.I] = H(p)D

bits. But if the true model is M - 4S1, we can compute the value we would find for p from the model parameters, and thereby predict the performance we would get if we pretended the independence model was valid. The density of one-bits is just the probability of a random bit being set to one. The model predicts that the probability of a one-bit is given by

p = [[Pi].sub.B][[Gamma].sub.B] + [[Pi].sub.[X.sub.1]][[Gamma].sub.[X.sub.1]] + [[Pi].sub.[X.sub.2]][[Gamma].sub.[X.sub.2]] + [[Pi].sub.C][[Gamma].sub.C] = [[Pi].sub.C] + [[Pi].sub.[X.sub.2]],

the last equality following from the equilibrium equations. The equation p = [[Pi].sub.C] + [[Pi].sub.[X.sub.2]] could have been asserted directly: the probability of a one-bit is just the probability of going to state C from any of the states, or going to the state [X.sub.2]; but the long-term probability of going to these states is the same as the probability of being in them, as the last equation asserts.

The ratio [B.sub.M]/[B.sub.I] gives the relative advantage of using the full Markov model versus using the independence model even if the true model is more complex.

5.4 Experimental Results

In our earlier papers we used the King James Bible as "database," its chapters acting as documents. Because of its small size, this choice was useful for testing purposes, as well as providing performance data for an interesting database. We can now provide performance statistics for two of the world's largest natural-language IR systems: the TLF, a database of 680MB of French language texts (112 million words) of the 17th-20th Centuries [Bookstein et al. 1992b], and the database of the RRP, 350MB of Hebrew and Aramaic texts (60 million words) written over the past 10 centuries [Fraenkel 1976]. Their uncompressed concordances span about 345MB for TFL (excluding references to the 100 most frequent words, considered as stopwords), and 450MB for the Responsa database, for which each word is referenced. In fact, we took only subcollections of these: for TLF, the 35,070 terms belonging to the (lexicographic) range between elle and flaube, without limiting the document range, so that each uncompressed bitmap had length D = 38,757 bits; for RRP we took all the 300,000 different terms, but restricted the document range to include only the D = 8119 documents written in the 20th Century. In order to allow comparisons with the methods of the earlier papers, we include also results on the English Bible.

Table III gives general statistics on the tested files and summarizes the results. The column corresponding to the English Bible uses the same restriction as in our earlier work, i.e., only words appearing in at least 60 documents are tested. For real-life databases, most of the words occur rarely, so their bitmaps will be encoded by simply enumerating the one-bits (see Choueka et al. [1986]). We thus decided, for TLF and RRP, to consider only words that appear in at least 0.2% of the possible documents, and to partition the terms into three classes, according to the number of documents N in which the terms occur. The classes are N [element of] [0.2%-1%], N [element of] [1%-3%], and N [element of] [3%-100%]. The threshold values thus were 78, 388, and 1162 for TLF and 16, 81, and 244 for RRP.

The upper part of Table III shows, for each class, the number of different terms and their total number of occurrences. In the lower part of the table, each line corresponds to one of the methods discussed above. To understand the values in the table, recall that we are representing the top level of the concordance as follows. For each term, we list sequentially the documents in which the term occurs. For our measure of compression for the list corresponding to a term, we compute the number of bits needed to encode this list with our methods and then divide this value by the number of [TABULAR DATA FOR TABLE III OMITTED] documents in which the term occurs. The table gives the average of this quantity for all the terms in a class. In other words, it is the average, per one-bit, of the number of bits needed to encode the one-bits of all the bitmaps in this class. The results for HMM correspond to the estimation parameters (10, 1, 0.001), that is, the best out of 10 initial models, each after a single iteration, is chosen, and then improved until the relative gain between successive runs falls below 0.001. The four- and three-state models are referred to by their names. The independence model is the one-state Markov model used in Bookstein et al. [1992a]. The row entitled run length gives the best result out of a range of different run length encoding schemes, including Elias' [Gamma] and [Delta] codes [Elias 1975], various Exp-Golomb codes [Teuhola 1978] and Start-Step-Stop codes [Fiala and Greene 1989].

As can be seen for the files for which HMM results are listed, they usually were best among all the tested methods. However, just to produce the numbers on the TLF, more than 6 days of CPU were necessary (as compared to at most one hour for all the other tests). This also explains why not all the values are given. The best values for the other methods are bold faced. We see that on all the real concordance files, the C-biased model M - 4C1 performed best, except for the lowest-density bitmaps, for which M - 4C4 was slightly better. Moreover, when ordering the methods by compression efficiency, similar rankings were obtained, with all the C-biased methods performing better than any of the B-biased ones.

We have not included the cost of storing the necessary parameters, which is negligible in most of the cases. Four probabilities need be stored for the four-state models and three for the three-state models. If we choose the best out of a possible set of eight probabilities (as in Bookstein et al. [1994]), then only three bits are needed to store the parameters for each term. But since only the high-frequency terms are to be compressed, the average number of added bits per term occurrence is very small. For the ranges given in Table III for TLF and RRP, it is always below 1%, except for the low-frequency Responsa terms, for which it reaches 4% for the four-state models.

When we began these experiments, we were concerned that our results might be very database sensitive, since both language and, at the highest level of the concordance with which we are now dealing, the database organization would be expected to influence the extent of clustering. We therefore were surprised to note the extent of similarity between the TLF and RRP results: within the same range of term density, the results for a given method differ by only 1-5%, with compression on TLF being consistently better. We therefore conclude that the main factor acting on the compression efficiency is the one-bit density of the bitmaps, rather than language-dependent features. As expected, compression improves with increasing density.

We are conceptualizing the concordance as indexing term occurrences at a variety of levels, for example, document or section or sentence. Most of our results reflect compression at the document level. The Bible results give us insight into how our algorithms perform at lower levels of the hierarchy, since here we are considering occurrences of terms over text segments of a single document. This observation, together with the fact that the Bible maps were much denser (at least 6.4% one-bits), may explain why the results for the Bible differ so markedly from that of the Responsa and TLF databases.

Note also that the best results improve upon those obtained from the independence model by 5-13%, which is not negligible, considering the sizes of the databases. There is also a clear tendency, both for TLF and RRP, of having greater improvement with increasing density. We thus conclude that the new models presented here may indeed represent a genuine improvement in our ability to compress concordances.

6. CONCLUSION

The main objective of this article has been to develop models that describe the dependencies among term occurrences in text in order to create codes that compress concordances more effectively than those based on the usual assumption of term independence. The assumption that governed our considerations was the existence of an underlying Hidden Markov Model. We did indeed find a small, but significant, improvement in compression effectiveness when using an HMM. However, such models make great demands on computational resources.

To limit these costs, we approximated the HMM by traditional Markov models. A set of criteria we developed allowed us to greatly reduce the very large set of candidate n-state models, making it possible to provide a full inventory for n [less than or equal to] 4. Further simplification is provided by graph-theoretic reduction and complementation operations defined among the various models. These were used to provide a structure relating the models studied and may well find application in other graph-based problems Finally, tests on the concordances of the English Bible and of two of the world's largest full-text retrieval system - the Tresor de la Langue Francaise and the Responsa Project - demonstrated that traditional Markov models allowed great improvements in computational efficiency, at the cost of very little deterioration of compression performance.

The concordance remains a large and still relatively unstudied component of information retrieval systems. This article concentrated on models of localized clustering. But with sparse bitmaps, such models prematurely leave a cluster or between-cluster state. We could compensate for this by introducing more transitional states; instead further work is being planned in which models which better retain the memory of which state it is in are developed.

APPENDIX

EQUIVALENCE OF ENCODING BITS AND RUN LENGTHS

We have contemplated two different ways of encoding a bitmap: we can encode each bit in turn, using arithmetic encoding to attain the entropy limit; or we can count the number of zeros between ones and encode the gap size. One would expect these to be equivalent. Here we argue this is the case.

We assume a state model. Suppose we have just encoded a one-bit (or are at the beginning of the bitmap) and are in state [S.sub.0]. Let us compute the number of bits required to encode the next sequence of bits up to and including the next one-bit, or until we reach the end of the bitmap.

To establish our notation, if we are in state [S.sub.i] and generate a zero-bit, we make a transition into state [S.sub.i+1]. If we are in state [S.sub.i], the probability of generating a one-bit is [[Gamma].sub.i]. Note that we are not making any assumptions on the memory size of the model.

Suppose we have D bits left to encode. Then the probability of a string of r zeros is given as follows:

r [P.sub.r]

0 [[Gamma].sub.0]

1 (1 - [[Gamma].sub.0])[[Gamma].sub.1]

2 (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1])[[Gamma].sub.2]

. . .

D - 1 (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])[[Gamma].sub.D-1]

D (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])(1 - [[Gamma].sub.D-1])

Thus the expected length to encode the r value is

H = -[[Gamma].sub.0] log([[Gamma].sub.0])

-(1 - [[Gamma].sub.0])[[Gamma].sub.1] log((1 - [[Gamma].sub.0])[[Gamma].sub.1])

- . . .

-(1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])[[Gamma].sub.D-1] log((1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1])

. . . (1 - [[Gamma].sub.D-2][[Gamma].sub.D-1])

- (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])(1 - [[Gamma].sub.D-1]) log(1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1])

. . . (1 - [[Gamma].sub.D-2])(1 - [[Gamma].sub.D-1])

Expanding the logarithms, we find that

H = [[Gamma].sub.0][-log [[Gamma].sub.0]]

+ (1 - [[Gamma].sub.0])[[Gamma].sub.1][-log(1 - [[Gamma].sub.0]) - log([[Gamma].sub.1])]

+ . . .

+ (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])[[Gamma].sub.D-1] [-log(1 - [[Gamma].sub.0]) - log(1 - [[Gamma].sub.1])

- . . . - log(1 - [[Gamma].sub.D-2]) - log([[Gamma].sub.D-1])]

+ (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])(1 - [[Gamma].sub.D-1])[- log(1 - [[Gamma].sub.0])

- log(1 - [[Gamma].sub.1]) - . . . - log(1 - [[Gamma].sub.D-2]) - log(1 - [[Gamma].sub.D-1])]

But note that if we are in state [S.sub.i], it requires -log(1 - [[Gamma].sub.i]) bits to encode a single zero-bit and -log([[Gamma].sub.i]) bits to encode a single one-bit. Thus we see that the terms in brackets are just the total number of bits that would be required to encode the string of zero-bits followed by a one-bit (or terminating zero-bit) for the string represented by the term, if we encode a single bit at a time. Thus H is also the expected value of the code size of the next run of zero-bits, encoding each bit individually, as was to be shown.

The second form can be reorganized to give a very nice alternative expression for H. To do this, we bring together terms with the same log [[Gamma].sub.i] and log(1 - [[Gamma].sub.i]):

H = -[[Gamma].sub.0] log [[Gamma].sub.0] - (1 - [[Gamma].sub.0])log(1 - [[Gamma].sub.0])[[[Gamma].sub.1] + (1 - [[Gamma].sub.1]) [[Gamma].sub.2] + (1 - [[Gamma].sub.1])

(1 - [[Gamma].sub.2])[[Gamma].sub.3] + . . . + (1 - [[Gamma].sub.1])(1 - [[Gamma].sub.2]) . . . (1 - [[Gamma].sub.D-1]) [[Gamma].sub.D] + (1 - [[Gamma].sub.1])

(1 - [[Gamma].sub.2]) . . . (1 - [[Gamma].sub.D-1])(1 - [[Gamma].sub.D])]

- . . . -

(1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . [[Gamma].sub.D-2] log [[Gamma].sub.D-2] - (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1])

. . . (1 - [[Gamma].sub.D-2])log(1 - [[Gamma].sub.D-2])[[[Gamma].sub.D-1] + (1 - [[Gamma].sub.D-1])]

-(1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . [[Gamma].sub.D-1]log[[Gamma].sub.D-1] - (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1])

. . . (1 - [[Gamma].sub.D-1])log(1 - [[Gamma].sub.D-1]).

Each sum of terms in brackets equals one (as can be seen by telescoping backward), leaving an expansion of the form

[H.sub.0] + (1 - [[Gamma].sub.0])[H.sub.1] + . . . + (1 - [[Gamma].sub.0])(1 - [[Gamma].sub.1]) . . . (1 - [[Gamma].sub.D-2])[H.sub.D-1],

where [H.sub.i] is just the entropy -[[Gamma].sub.i] log[[Gamma].sub.i] - (1 - [[Gamma].sub.i]) log(1 - [[Gamma].sub.i]), which is the expected number of bits required to encode the next bit position in state [S.sub.i], provided we get that far. But each [H.sub.i] is multiplied by the probability of indeed getting that far, so the sum cumulates the expected contributions to the encoding of the next run of zeros of each bitmap position.

1 More precisely, identifying the probability of generating a 1 in state [X.sub.2] with that of state C.

REFERENCES

ASAI, K., HAYAMIZU, S., AND ONIZUKA, K. 1993. HMM with protein structure grammar. In Proceedings of the 26th Hawaii International Conference on System Sciences. IEEE, New York, 783-791.

BELL, T. C., WITTEN, I. H., AND CLEARY, J. G. 1989. Modeling for text compression. ACM Comput. Surv. 21, 557-591.

BOOKSTEIN, A., KLEIN, S. T., AND RAITA, T. 1992a. Model based concordance compression. In Proceedings of the Data Compression Conference DCC-92. IEEE Computer Society, Washington, D.C., 82-91.

BOOKSTEIN, A., KLEIN, S. T., AND RAITA, T. 1994. Markov models for clusters in concordance compression, In Proceedings of the Data Compression Conference DCC-94. IEEE Computer Society, Washington, D.C., 116-125.

BOOKSTEIN, A., KLEIN, S. T., AND ZIFF, D. A. 1992b. A systematic approach to compressing a full text retrieval system. Inf. Process. Manage. 28, 795-806.

CHOUEKA, Y., FRAENKEL, A. S., AND KLEIN, S. T. 1988. Compression of concordances in full-text retrieval systems. In Proceedings of the 11th ACM-SIGIR Conference. ACM, New York, 597-612.

CHOUEKA, Y., FRAENKEL, A. S., KLEIN, S. T., AND SEGAL, E. 1986. Improved hierarchical bit-vector compression in document retrieval systems. In Proceedings of the 9th ACM-SIGIR Conference. ACM, New York, 88-97.

CORMACK, G. V. AND HORSPOOL, R. N. 1987. Data compression using dynamic Markov modeling. Comput. J. 30, 541-550.

ELIAS, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory IT-12, 194-203.

EVEN, S. 1973. Algorithmic Combinatorics. The Macmillan Company, New York.

FELLER, W. 1968. An Introduction to Probability Theory and Its Applications. Wiley, New York.

FIALA, E. R. AND GREENE, D. H. 1989. Data compression with finite windows. Commun. ACM 32, 490-505.

FRAENKEL, A. S. 1976. All about the Responsa Retrieval Project you always wanted to know but were afraid to ask, expanded summary. Jurimetrics J. 16, 149-156.

HAUSSLER, D., KROGH, A., MIAN, S., AND SJOLANDER, K. 1993. Protein modeling using hidden Markov models: Analysis of globins. In Proceedings of the 26th Hawaii International Conference on System Sciences. IEEE Computer Society, Washington, D.C., 792-802.

KEMENY, J. G. AND SNELL, J. L. 1960. Finite Markov Chains. Van Nostrand, New York.

KLEIN, S. T., BOOKSTEIN, A., AND DEERWESTER, S. 1989. Storing text retrieval systems on CD-ROM: Compression and encryption considerations. ACM Trans. Inf. Syst. 7, 230-245.

KNUTH, D. E. 1973. The Art of Computer Programming. Vol 1, Fundamental Algorithms. Addison-Wesley, Reading, Mass.

LINOFF, G. AND STANFILL, C. 1993. Compression of indexes with full positional information in very large text databases. In Proceedings of the 16th ACM-SIGIR Conference. ACM New York, 88-95.

LLEWELLYN, J. A. 1987. Data compression for a source with Markov characteristics. Comput. J. 30, 149-156.

MITTENDORF, E. AND SCHAUBLE, P. 1994. Document and passage retrieval based on Hidden Markov models. In Proceedings of the 17th ACM-SIGIR Conference. ACM, New York, 318-327.

MOFFAT, A. AND ZOBEL, J. 1992. Coding for compression in full-text retrieval systems. In Proceedings of Data Compression Conference DCC-92. IEEE Computer Society, Washington, D.C., 72-81.

RABINER, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2 (Feb.) 257-286.

TEUHOLA, J. 1978. A compression method for clustered bit-vectors. Inf. Proc. Lett. 7, 308-311.

WISNIEWSKI, J. L. 1986. Compression of index term dictionary in an inverted file oriented database: Some efficient algorithms. Inf. Process. Manage. 22, 493-501.

WITTEN, I. H., BELL, T. C., AND NEVILL, C. G. 1991. Models for compression in full-text retrieval systems. In Proceedings of Data Compression Conference DCC-91. IEEE Computer Society, Washington, D.C., 23-32.

WITTEN, I. H., MOFFAT, A., AND BELL, T. C. 1994. Managing Gigabytes, Compressing and Indexing Documents and Images. International Thomson Publishing, London.

Printer friendly Cite/link Email Feedback | |

Author: | Bookstein, A.; Klein, S.T.; Raita, T. |
---|---|

Publication: | ACM Transactions on Information Systems |

Date: | Jul 1, 1997 |

Words: | 15388 |

Previous Article: | Data structures for efficient broker implementation. |

Next Article: | Recursive hashing functions for n-grams. |

Topics: |