
PIC Matrices: A Computationally Tractable Class of Probabilistic Query Operators.


Howard Turtle, in his doctoral dissertation [Turtle 1990], developed a probabilistic model for information retrieval formulated in terms of a Bayesian Network. The inference network is a general framework which makes possible the consideration of multiple sources of evidence in the process of ranking documents in response to a user's information need. Evidence due to multiple document representations (e.g., titles, abstracts, document bodies, manually produced indices), multiple query formulations (e.g., Boolean, natural language), and even multiple belief systems can be combined in a principled way. An attractive aspect of the inference network approach is that it provides a direct, natural, computationally efficient, probabilistically motivated method for modeling query operators.

In this article, we present a class of query operators which, for reasons to be explained, we have called PIC operators. The PIC operators comprise a subclass of the general class of query operators that are expressible within the inference network framework. The subclass has been chosen to satisfy two important design criteria: (1) the operators belonging to the subclass are computationally tractable, and (2) they are intuitively plausible candidates for the modeling of query operators.

A wide variety of query operators can be defined as PIC operators. The work reported has been principally motivated by the search for more effective Boolean operators. Other approaches to the modeling of Boolean query operators have been tried. As with the inference network, the goal has been to generalize the classical, strict Propositional Logic interpretation of the query operators. The objective of this generalization is twofold: on the one hand, to allow for graduated inputs to the Boolean operators so that the representation of documents in terms of vectors of Boolean characteristics can be extended to vectors of feature weights, while, on the other hand, to generate real-valued operator output so that dichotomous relevance judgments can be replaced by document ranking in response to users' queries.

A particularly notable success in this pursuit was reported by Salton et al. [1983a; 1983b]. Grounded in the geometric metaphor of the vector space model, they defined a general class of "pnorm" operators which extend the traditional operators in a natural way. The experimental results achieved were quite positive. Recent experimentation has shown that system performance is improved by replacing the inference network Boolean operator calculations used in the INQUERY retrieval system [Callan et al. 1992] with the pnorm versions of the and and or operators. Despite these results, the pnorm operator computation has not been incorporated in the INQUERY system because it does not conform to the inference network model.

This situation has prompted the search for an alternative to the current scheme for modeling operators within the inference network framework that might improve system performance. In this article we describe a set of parameterized functions that we have explored for the modeling of Boolean query operators, and how we were able to experimentally determine values for the parameters that resulted in good retrieval effectiveness. A major consideration in this work is the need for the computations associated with the model adopted to respect the computational limitations imposed by modern large-scale information retrieval systems. As we shall see, this can be problematic within the confines of the inference network framework.

The following section provides a brief summary of the inference network framework with emphasis on how queries, particularly Boolean queries, are modeled. In Section 3, the pnorm approach to extended Boolean retrieval is reviewed. The pnorm approach will provide the central point of comparison for the PIC matrix extension of the inference network framework which is presented in this article. Section 4 presents the formal definition of PIC matrices and the PIC-EVAL algorithm that has been designed for their evaluation. Included in this section are discussions of a subclass of the PIC matrices, the piecewise-linear functions, which can be evaluated particularly efficiently, and an extension of the PIC matrix definition that allows for weighted inputs. A major motivation for the definition of an efficient subclass of link matrices was for the modeling of Boolean operators. In Section 5, we discuss experimentation that resulted in the development of PIC matrices for the modeling of Boolean operators that appear to be superior to those previously used in the INQUERY system.


2. The Inference Network Framework

The INQUERY inference network is a Bayesian Network [Charniak 1991; Kim and Pearl 1983; Pearl 1988] designed for supporting information retrieval. The nodes of the network correspond to propositions and are divided into two parts: the document subnetwork and the query subnetwork. In the document subnetwork, the propositions associated with the nodes pertain to the observation of documents, representations of the documents, and representations of document content. Nodes of the query subnetwork correspond to propositions regarding the presence of query concepts, the satisfaction of queries, and the satisfaction of information needs. The subject of this article, the modeling of Boolean queries, is directly concerned with only the concept and query nodes, so our description will focus on this part of the network. The reader is referred to Turtle and Croft [1990] for a more general description of the inference network framework. Callan et al. [1992] discuss the implementation of the inference network in the INQUERY retrieval system.

Within the inference network formulation, a concept node is associated with the presence of a "concept" in the document currently being analyzed. For unstructured queries, there is exactly one query node in the network. This query node is connected to the concept nodes corresponding to the terms of the query. For example, Figure 1 shows the relevant part of the network for the query "baseball umpire strike". The leftmost node corresponds to a proposition asserting the presence in the document of the concept associated with the word baseball. Similarly, there are nodes for the umpire and strike concepts. The query node is associated with the proposition that the user's query is satisfied.


2.1 Bayes Nets

In general, a Bayesian Network encodes a joint probability distribution. The nodes of the network correspond to random variables. In the INQUERY inference network, each random variable may assume a value of either true or false. The topology of the network is interpreted as encoding a set of conditional independence relations among the variables. If the nodes corresponding to the variables P_1, ..., P_n are the immediate predecessors (parents) of a node Q, and Z_1, ..., Z_s are all other nodes that are not descendants of (i.e., are not reachable from) Q, as shown in Figure 2, then Q is considered to be conditionally independent of Z_1, ..., Z_s given P_1, ..., P_n:

p(Q | P_1, ..., P_n, Z_1, ..., Z_s) = p(Q | P_1, ..., P_n)

where p(X | Y) refers to the probability of event X conditioned on event Y.


Given probabilities for the root nodes (i.e., nodes with no parents), a singly connected network, or polytree [Pearl 1988], may be processed in a top-down fashion in order to produce the probabilities relevant to each of its nodes. As a consequence of the conditional independence assumptions implicit in the topology of a Bayesian Network, once the probabilities p_1, ..., p_n have been produced for the parents of a node Q, the probability that Q assumes the value y ∈ D_q is given by

p(Q = y) = Σ_{x_1 ∈ D_1} ... Σ_{x_n ∈ D_n} p(Q = y | P_1 = x_1, ..., P_n = x_n) · p(P_1 = x_1) · ... · p(P_n = x_n)

where D_1, ..., D_n, D_q are the sets of values that may be assumed by the variables P_1, ..., P_n, and Q, respectively. For the INQUERY inference network, D_1 = D_2 = ... = D_n = D_q = {true, false}.
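For INQUERY's binary-valued variables, this marginalization can be sketched directly in code (a minimal Python sketch; the function names are ours, and the parent nodes are treated as independent, as INQUERY's concept nodes are):

```python
from itertools import product

def marginalize(cond, parent_probs):
    """p(Q is true) = sum over all parent truth assignments of
    p(Q is true | assignment) * p(assignment), with the assignment
    probability taken as the product of the parents' probabilities."""
    total = 0.0
    for assignment in product([False, True], repeat=len(parent_probs)):
        weight = 1.0
        for p, value in zip(parent_probs, assignment):
            weight *= p if value else 1.0 - p
        total += cond(assignment) * weight
    return total

# Example conditional: Q is certain iff at least one parent is true.
at_least_one = lambda a: 1.0 if any(a) else 0.0
print(marginalize(at_least_one, [0.5, 0.5]))  # 0.75
```

With two parents at probability 0.5 each, three of the four equally likely assignments contain a true parent, which gives the 0.75 above.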

2.2 Binary-Valued Random Variables

In INQUERY each node corresponds to a proposition, i.e., a variable that may take on one of two values: true or false. For example, in Figure 2, each P_i might correspond to the proposition that some document under consideration is about some concept, c_i, while Q corresponds to the proposition that the query is satisfied.

Since all the variables are binary valued in INQUERY, the dependence of a child on its parents can be given via the specification of

p(Q is true | P_1 = b_1, ..., P_n = b_n)

and

p(Q is false | P_1 = b_1, ..., P_n = b_n)

for each

<b_1, ..., b_n> ∈ {true, false}^n.

Equivalently, the values

ᾱ_R = p(Q is false | P_i is true for all i ∈ R, and P_i is false for all i ∉ R)

and

α_R = p(Q is true | P_i is true for all i ∈ R, and P_i is false for all i ∉ R)

must be specified for every possible subset, R, of {1, ..., n}. In terms of these conditional probabilities, the probability that a child node is true can be calculated once the probabilities of truth, p_1, ..., p_n, are known for each of its parent nodes:

p(Q is true) = Σ_{R ⊆ {1, ..., n}} α_R · Π_{i ∈ R} p_i · Π_{i ∉ R} (1 - p_i).

The set of coefficients involved is conveniently organized as a 2 x 2^n matrix:

            P_{0...000}   P_{0...001}   P_{0...010}   ...   P_{1...111}
Q false     ᾱ_{0...000}   ᾱ_{0...001}   ᾱ_{0...010}   ...   ᾱ_{1...111}
Q true      α_{0...000}   α_{0...001}   α_{0...010}   ...   α_{1...111}

where α_{b_1...b_n} is the probability that Q is true subject to the condition that the parents P_i such that b_i = 1 are true, and the parents P_i such that b_i = 0 are false; ᾱ_{b_1...b_n} is the corresponding probability that Q is false and is equal to 1 - α_{b_1...b_n}. This matrix, known as a link matrix, may be visualized as linking the child node with the parent nodes, as is shown in Figure 3.
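In code, it suffices to store the "Q true" row, since the "Q false" row is determined by it (a Python sketch of one possible layout; the names and the bitmask indexing are our illustration, not INQUERY's representation):

```python
# A link matrix for n parents stored as the "Q true" row only: a list of
# 2**n coefficients alpha[b], where the binary digits of the column
# index b give the truth values of the parents.
n = 3
alpha = [0.0] * (2 ** n)
alpha[0b111] = 1.0   # e.g., Q certain to be true only when all parents are true

def alpha_bar(b):
    """The "Q false" row is fully determined: alpha_bar = 1 - alpha."""
    return 1.0 - alpha[b]

print(alpha[0b011])      # column P_011: parent 1 false, parents 2 and 3 true
print(alpha_bar(0b111))  # 0.0
```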


2.3 Link Matrices as Query Operators

Conceptually, INQUERY can be thought of as examining documents one by one. For each, the inference network is used to evaluate evidence that the document satisfies an information need expressed by the user. A given link matrix form can be viewed as defining an operator for combining evidence. For example, suppose the propositions P_1, P_2, P_3 state that three queries, q_1, q_2, q_3, respectively, have been (in some sense) satisfied by the document currently under scrutiny. A link matrix connecting the parent nodes, P_1, P_2, P_3, with the child node, Q, can be viewed as a way of forming a query, q, that is a composite of the individual subqueries. The child node, Q, would correspond to the proposition that the combined query, q, has been satisfied. The specification of the coefficients of the link matrix defines the way the subqueries are combined in that it gives all the information necessary for determining the probability that q has been satisfied given the probabilities, p_1, p_2, p_3, that the individual subqueries have been satisfied.

In the example of Figure 1, this means that once the three probabilities, p_b, p_u, p_s, that the document under scrutiny is about baseball, umpire, and strike, respectively, have been determined, the probability that the document is relevant to the query can be calculated. The computation requires that the requisite conditional probabilities have been specified. One such probability, for example, is p(Q | ¬B, U, ¬S), the probability that the query is satisfied, given that the document is about umpire, but not about either baseball or strike. There are eight possible truth assignments for the set of three variables associated with the concept nodes of this example. The probability that the query is satisfied is then given by

(1) p(Q) = p(Q | ¬B, ¬U, ¬S) · (1 - p_b) · (1 - p_u) · (1 - p_s)

+ p(Q | ¬B, ¬U, S) · (1 - p_b) · (1 - p_u) · p_s

+ p(Q | ¬B, U, ¬S) · (1 - p_b) · p_u · (1 - p_s)

. . .

+ p(Q | B, U, S) · p_b · p_u · p_s.

One, admittedly arbitrary, way of defining a trinary query composition operator for forming the query, q, from the subqueries, q_1, q_2, q_3, might be to specify that q is to be considered satisfied with

--80% probability if q_1 is satisfied, independent of whether or not q_2 and q_3 are satisfied;

--50% probability if q_1 is not satisfied, but q_2 and q_3 are both satisfied; and

--10% probability in any other situation.

This particular operator corresponds to the link matrix(1)

            P_000   P_001   P_010   P_011   P_100   P_101   P_110   P_111
Q false     0.9     0.9     0.9     0.5     0.2     0.2     0.2     0.2
Q true      0.1     0.1     0.1     0.5     0.8     0.8     0.8     0.8

where the column under P_011, for example, gives the probability that Q is true given that P_1 is false, P_2 is true, and P_3 is true. Clearly, this link matrix has the desired effect. Typically, none of the parents will be known to be either true or false with certainty. Rather, evidence corresponding to each of P_1, P_2, P_3 will, in general, be estimated to be present with certain probabilities: p_1, p_2, p_3. Given these probabilities, the probability that Q is true can be calculated as

p(Q is true) = Σ_{<b_1, b_2, b_3> ∈ {0, 1}^3} α_{b_1 b_2 b_3} · Π_{i: b_i = 1} p_i · Π_{i: b_i = 0} (1 - p_i).

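This calculation is easy to sketch in Python (a brute-force evaluation; the function and variable names are ours, and the 80/50/10 coefficients are the illustrative values from the text):

```python
from itertools import product

def alpha(b1, b2, b3):
    """Columns of the illustrative 80/50/10 link matrix, indexed by the
    truth values of the three parents."""
    if b1:             # q1 satisfied: 80%, regardless of q2 and q3
        return 0.8
    if b2 and b3:      # q1 not satisfied, but q2 and q3 both satisfied
        return 0.5
    return 0.1         # any other situation

def p_q(p1, p2, p3):
    """Evaluate the matrix as in Eq. (1): weight each column by the
    probability of its parent truth assignment."""
    total = 0.0
    for b in product([False, True], repeat=3):
        w = 1.0
        for p, bit in zip((p1, p2, p3), b):
            w *= p if bit else 1.0 - p
        total += alpha(*b) * w
    return total

print(p_q(1.0, 0.0, 0.0))  # parents known with certainty: 0.8
print(p_q(0.0, 1.0, 1.0))  # 0.5
```

When the parents are known with certainty, only one column gets nonzero weight, so the output reproduces the matrix coefficient itself; intermediate inputs blend the columns.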
2.4 Boolean Queries

Turtle [1990] describes how the inference network can be used to model Boolean queries. The section of network shown in Figure 4, for example, could be constructed for the following query:

baseball and (player or umpire) and strike


The link matrix for a node labeled and would contain a one in the column corresponding to all parent nodes being true. All other columns would be zeros. This is equivalent to saying that it is certain that an and query is satisfied if all of its parents are (certain to be) true, and it is certain not to be satisfied when one or more of the parents is (certain to be) false. A matrix for a trinary #and operator would therefore be

            P_000   P_001   P_010   P_011   P_100   P_101   P_110   P_111
Q false     1.0     1.0     1.0     1.0     1.0     1.0     1.0     0.0
Q true      0.0     0.0     0.0     0.0     0.0     0.0     0.0     1.0

An arbitrary member of the and family (binary and, trinary and, quadrary and, etc.) would have 1.0 in the column corresponding to all parents being true, indicating that we are certain that the and query is satisfied if all of the parents are true. All other columns would be 0.0, indicating that we are certain that the and query is not satisfied if any one of the parents is false.

Analogously, an or operator would have 1.0 in all cells of its "Q true" row except for a 0.0 in the column for all parents false, the trinary or being given by

            P_000   P_001   P_010   P_011   P_100   P_101   P_110   P_111
Q false     1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
Q true      0.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0

These link matrices are natural interpretations for the Boolean operators; and they can be evaluated very efficiently. They may be questioned on two counts, however.

First, the interpretation given to the and and or operators, while natural, might be considered overly strict. We ignore, for the moment, the questionable practice of ascribing certainty, in the form of 0.0 and 1.0 probabilities, to any belief under any circumstances. That excepted, we might still find it somewhat unsettling that an and with 10 parents makes no distinction between an event in which nine of these 10 are true and another in which only eight of the 10 are true. In fact, no distinction is made between nine true parents and no true parents! Applying Eq. (1) with the L_and matrix, we see that p(Q) = 0 if any one of the input probabilities is 0, regardless of the probabilities associated with the rest of the parent nodes.

Faced with a user need expressed in terms of and, it is likely that most people's belief system would be inclined toward assigning greater, albeit perhaps only slightly greater, probability to the query being satisfied for those events for which a greater number of parents is true. An analogous argument can clearly be made for the case of the or operator.

With respect to the issue of the strict interpretation of Boolean queries, a subtle distinction merits discussion. Although the interpretation discussed above can be said to be strict, it is nonetheless a generalization of the interpretation traditionally applied to Boolean queries. Under the conventional interpretation, the input to a Boolean operator is either 0 or 1. Were this the case in the inference network, the operators discussed above would indeed be equivalent to this strict interpretation. In the inference network, however, inputs to the Boolean operators are probabilities of truth, which may assume arbitrary values in the [0.0,1.0] interval, rather than dichotomous conditions which are either strictly true or false. The Boolean operators produce probabilities of truth which also range over the entire [0.0,1.0] interval. The and operator, for example, will not produce a hard 0/1 value, unless one of the inputs is exactly 0.0, or unless all of the inputs are exactly 1.0. If all of the inputs to the operator are probabilities in the open interval (0.0,1.0), the output will also lie in this open interval.

These observations do not obviate the comments made above with respect to the strictness of the operators as they are currently implemented, however. The computations performed for inputs [p.sub.1],..., [p.sub.n], though none are exactly 0.0 nor 1.0, are logical consequences of the specification of the probabilities conditioned on certain knowledge of the values of the parent nodes. There is motivation, therefore, to consider the possibility of an interpretation for the and operators that is not strict in the sense that it differentiates between all parent nodes being false and only some of the parent nodes being false, giving some extra credence to the probability of the query when a greater number of parents is true. Similarly, an or operator may be considered that is to some, perhaps minor, degree sensitive to the number of parents that are true. It must be borne in mind that changes to the specification of the link matrices will affect the result of computations for all combinations of input values.

The second cause for concern with regard to the operators, as they are currently defined, is that experiments have shown that they are not as effective as might be hoped for. Experiments performed by Larkey have shown that the pnorm operators yield retrieval performance superior to that of the and and or operators as they are currently implemented (personal communication, L. S. Larkey, 1996).


3. The pnorm Model

A significant advance in the processing of Boolean queries was presented by Salton et al. [1983a; 1983b]. Working within the context of the vector space model, they were able to extend conventional Boolean retrieval in a way that allowed for (1) the weighting of document features, and (2) the production of ranked output. This extended Boolean model was shown to provide significant improvements in retrieval performance. The goal of matching this performance in a manner consistent with the inference network model originally motivated the work presented in this article, as well as providing a benchmark against which performance of Boolean operators based on PIC matrices could be compared.

Salton et al. normalized weights so that all vector components are in the range of 0.0 to 1.0 and, hence, so that all vectors lie in the unit hypercube. Intuitions based on a geometric view of information retrieval led them to consider the output of an operator as a measure of distance from the point <0.0, 0.0, ..., 0.0>. So, for the query(2)

#or(t_1, ..., t_n)

a document with term weights w_1, ..., w_n for terms t_1, ..., t_n would receive a score of

(2) ((w_1^2 + w_2^2 + ... + w_n^2) / n)^(1/2)

where division by n in this formula guarantees that the output also lies in the unit interval [0.0, 1.0].

This operator can be considered or-like. When all weights are 0.0, it will produce a value of 0.0. When all weights are 1.0, it will produce a value of 1.0. When a variety of weights are presented as inputs, the output is more heavily influenced by the weights at the higher end of the scale. The sum produced, prior to the application of the square root, can be viewed as a weighted average of the input values, with each value receiving a weight equal to its own value:

(w_1 · w_1 + w_2 · w_2 + ... + w_n · w_n) / (w_1 + w_2 + ... + w_n)

Hence, higher values can be viewed as receiving greater weights. The result is that lower values can be overpowered, to a degree, by a single large value. This is reasonable behavior for a generalization of an or operator. We may presume that, on specifying an or, the user is indicating that the lack of evidence with respect to most of the terms can be downplayed in the presence of strong evidence with regard to just one of the terms.

The beauty of this approach to extending Boolean operators is that the distance measure shown in Eq. (2) can be viewed as a special case of (a normalized version of) the more general vector norm measure [Ortega 1972]:(3)

((w_1^p + w_2^p + ... + w_n^p) / n)^(1/p)

The normalized Euclidean distance measure is, then, the vector norm measure with p = 2, but other values of p can be considered. For smaller values of p, the operator is less or-like. At p = 1, the operator degenerates to a simple average, and the Boolean structure of the query is essentially ignored. For larger values of p, the calculation can still be understood as effecting a type of weighted averaging. As p grows, more and more weight is given to the larger input components. We can view each input w_i as being weighted by w_i^(p-1) relative to the others:

(w_1^(p-1) · w_1 + w_2^(p-1) · w_2 + ... + w_n^(p-1) · w_n) / (w_1^(p-1) + w_2^(p-1) + ... + w_n^(p-1))

For very large values of p, the contribution of all but the largest component value is negligible. In the limit, the calculation is equivalent to that of the max operator, which is the standard operator used in models of information retrieval based on fuzzy set theory [Bookstein 1980; 1981; Buell and Kraft 1981; Noreault et al. 1977; Waller and Kraft 1979] and corresponds to a strict Boolean or when the input components are limited to strict Boolean values.

In the pnorm model, the and operator is defined in terms of the distance of the input vector from the point <1.0, 1.0, ..., 1.0>; the resulting score is given by

1 - (((1 - w_1)^p + (1 - w_2)^p + ... + (1 - w_n)^p) / n)^(1/p)

For concreteness, we have concentrated on the or operator, but the and operator can be seen as the dual of or, and all aspects of the above discussion have their counterpart with respect to and as well.
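The pnorm operators above are easy to experiment with (a Python sketch under the stated normalization; the function names are ours):

```python
def pnorm_or(weights, p):
    """pnorm or: normalized distance of the weight vector from the origin."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p):
    """pnorm and: one minus the normalized distance from <1.0, ..., 1.0>."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

w = [0.9, 0.1, 0.1]
print(pnorm_or(w, 1))    # p = 1: the plain average of the weights
print(pnorm_or(w, 100))  # large p: close to max(w) = 0.9
```

At p = 1 both operators collapse to the same simple average, and as p grows the or output is pulled toward the single largest weight, matching the limiting behavior described in the text.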

Salton et al. [1983b] report experimental results in which use of the pnorm model consistently improves performance over basic Boolean retrieval on four relatively small collections. Further experimentation comparing pnorm with fuzzy logic models is discussed by Fox et al. [1992]. More recently, Joon Ho Lee has run experiments on a larger TREC subcollection, the Wall Street Journal Disk 2 [Lee 1995]. He shows that pnorm performs favorably compared to a variety of other formalisms. A series of experiments at Virginia Tech combining unstructured queries with Boolean formulations interpreted using pnorm are reported in Fox and Shaw [1994] and Shaw and Fox [1995]. Further work along these lines, in conjunction with work done at Rutgers, is discussed in Belkin et al. [1995].

In all of this work pnorm has performed exceedingly well. For those interested in investigating probabilistic IR models, however, the need to explore alternatives remains. Our interest is in improving the performance of the Boolean operator interpretation within the framework of the inference network. When pnorm is used as an operator in the INQUERY system, performance is superior to that of the Boolean operators currently used. This gives us reason to believe that the modeling of Boolean operators can be improved. At the same time it sets a performance target to be aimed for.

4. PIC Matrices and the PIC-EVAL Algorithm

In this section, the PIC class of link matrices is defined. The PIC-EVAL algorithm, which is able to evaluate these matrices efficiently, is then presented and explained. While the main motivation for the development of the PIC matrices has been the modeling of Boolean operators, the potential for their application is more extensive and worthy of study in its own right. Before entering into the experimentation with the use of the PIC matrices for the modeling of Boolean operators in the following section, we discuss two aspects of the work on PIC matrices that are of independent interest. Section 4.3 discusses the subclass of PIC matrices associated with piecewise-linear functions and explains how the PIC-EVAL algorithm can be made even more efficient in this case. Section 4.4 describes an extension of the PIC matrix class which allows for weighted inputs and shows how the PIC-EVAL algorithm can be modified to evaluate these matrices.

A wide range of functions can be represented as link matrices. Unfortunately, the evaluation of an arbitrary link matrix for an arbitrary set of input probabilities requires O(2^n) floating-point operations. Turtle [1990] shows that closed-form expressions exist for some matrices, such that they may be evaluated in time linear in the number of parents, for arbitrary inputs. The matrices used for Boolean operators have that characteristic. It is not difficult to show that the probability that the child node is true can be given by

p(Q is true) = p_1 · p_2 · ... · p_n

for L_and and

p(Q is true) = 1 - (1 - p_1) · (1 - p_2) · ... · (1 - p_n)

for L_or, where p_i is the probability that parent P_i is true.
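These closed forms can be checked against a direct exponential-time evaluation of the corresponding link matrices (a Python sketch; the names are ours):

```python
from itertools import product
import math

def eval_link_matrix(alpha, probs):
    """Direct O(2**n) evaluation: alpha maps a tuple of parent truth
    values to p(Q is true | that assignment)."""
    total = 0.0
    for b in product([False, True], repeat=len(probs)):
        w = math.prod(p if bit else 1.0 - p for p, bit in zip(probs, b))
        total += alpha(b) * w
    return total

probs = [0.9, 0.8, 0.5]
l_and = lambda b: 1.0 if all(b) else 0.0   # one in the all-true column only
l_or = lambda b: 1.0 if any(b) else 0.0    # zero in the all-false column only

print(eval_link_matrix(l_and, probs))  # equals p1 * p2 * p3 (= 0.36)
print(eval_link_matrix(l_or, probs))   # equals 1 - 0.1 * 0.2 * 0.5 (= 0.99)
```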

4.1 The PIC Matrices

The desire to explore new possibilities for Boolean query operators motivates the search for wider classes of computationally tractable link matrices. A natural candidate for consideration is the class of matrices for which the conditional probability that the child node is true is determined solely by the number of parents that are true, independent of which of them happen to be true and which are false.

We will say that a link matrix satisfies the parent indifference criterion, or simply that it is a PIC matrix if, given the parent nodes P_1, ..., P_n, we have

α_R = α_{R'} whenever |R| = |R'|

where, for each R ⊆ {P_1, ..., P_n}, α_R is the matrix coefficient associated with the event that the P_i belonging to R are true and the P_i not in R are false.

A PIC matrix for n parents requires the specification of n + 1 parameters, α_0, ..., α_n, as compared to the 2^n parameters required for an arbitrary link matrix. Each α_i specifies the common value given by α_R for each R ⊆ {P_1, ..., P_n} such that |R| = i. The sequence of link matrix coefficients, α_0, ..., α_n, can be viewed as a function from the integers {0, 1, ..., n} to the interval [0, 1].
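As a sanity check on the definition, a PIC matrix can be evaluated by brute force using only its n + 1 coefficients (a Python sketch; the names are ours, and the coefficient choice below is one illustrative example):

```python
from itertools import product
import math

def eval_pic_brute(alphas, probs):
    """Direct O(2**n) evaluation of a PIC matrix: the conditional
    probability depends only on how many parents are true, so the
    coefficient for an assignment b is alphas[number of true parents]."""
    total = 0.0
    for b in product([0, 1], repeat=len(probs)):
        w = math.prod(p if bit else 1.0 - p for p, bit in zip(probs, b))
        total += alphas[sum(b)] * w
    return total

# Linear coefficients i/n turn the operator into a plain average.
n = 4
alphas = [i / n for i in range(n + 1)]
probs = [0.2, 0.4, 0.6, 0.8]
print(eval_pic_brute(alphas, probs))  # (0.2 + 0.4 + 0.6 + 0.8) / 4 = 0.5
```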

It is appropriate to note that we have in mind, and will be exploring, coefficients corresponding to nondecreasing functions, such as that shown in Figure 5. Viewing the PIC matrices as operators for the combination of evidence, our interest is in those operators for which more pieces of individual evidence (i.e., a greater number of true parents) translate to a greater probability that the combined evidence is present. Nonetheless, the PIC matrix class is not limited to such functions.


The L_sum matrix discussed earlier satisfies the parent indifference criterion. The coefficients for the general n-ary L_sum matrix are 0, 1/n, 2/n, ..., (n - 1)/n, 1, as shown graphically in Figure 6(a). If the probabilities are viewed as weights of evidence, the matrix can be viewed as an operator that averages these weights, i.e.,

p(Q is true) = (p_1 + p_2 + ... + p_n) / n.


The L_and and L_or matrices used for the Boolean operators also meet the parent indifference criterion, as shown in Figures 6(b) and (c).

Turtle [1990] shows how a fourth type of matrix, the weighted-sum matrix, is made to be computable in linear time as well. This matrix does not satisfy the parent indifference criterion. It will, however, be shown in Section 4.4 to satisfy a generalization of the criterion. This generalization contemplates a relative weighting of the importance of the parent nodes in determining the probability that the child node is true. A simple extension of the PIC-EVAL algorithm allows for the calculation of probabilities involving these weighted PIC matrices in O(n^2) time as well.

The PIC matrices comprise a wide and natural class of link matrices. Clearly not all matrices are covered by the PIC class, since a PIC matrix for n inputs is fully described by only n + 1 parameters, whereas each possible setting of 2^n parameters describes a unique link matrix. The PIC matrix subset can be characterized as those link matrices that correspond to symmetric functions, that is, functions whose value does not vary when their inputs are permuted in any way. The value of a symmetric function for binary inputs can be viewed as a function of the number of inputs that are set to 1. This being the defining characteristic of PIC matrices, we see that the PIC matrices cover all, and only, the link matrices associated with symmetric functions.

It should also be noted that in the current implementation of INQUERY the variables associated with concept nodes are considered independent. However, this is not an intrinsic restriction of the inference network formalism. The inference network framework does allow for the representation of more complex interactions of variables, at the cost of greater complexity in the evaluation of the network.

4.2 The PIC-EVAL Algorithm

The PIC matrices would not constitute a useful class if they could not be evaluated in better than exponential time. Figure 7 shows an algorithm for the efficient evaluation of matrices that satisfy the parent indifference criterion. Starting with the original link matrix, L_0, PIC-EVAL in effect generates a sequence of smaller and smaller link matrices, L_1, L_2, ..., L_n. In the process, it eliminates from consideration each probability in turn (Figure 8).


To begin, the probability associated with parent node P_1 is fixed, and the matrix L_0, with n + 1 coefficients and n parents (Figure 8(a)), is converted to an n-coefficient link matrix with n - 1 parents (Figure 8(b)). As shown in Figure 8, each matrix, L_i, connects with one fewer parent node than the previous one. Correspondingly, the coefficients, α_{i,0}, α_{i,1}, ..., α_{i,n-i}, of each matrix are fewer as well. The matrices, L_i, that result from this sequence of transformations are equivalent to the original matrix, L_0, in the following sense:

for any set of probabilities for the parents P_{i+1}, ..., P_n, L_i yields the same p(Q is true) that L_0 would produce given the same probabilities for P_{i+1}, ..., P_n, together with the probabilities p_1, ..., p_i for P_1, ..., P_i.

After n iterations we arrive at L_n with exactly one coefficient, α_{n,0}, as seen in Figure 8(e). Now that all parent probabilities have been accounted for, this coefficient can be interpreted as the desired probability that the child is true. A formal proof of the correctness of the PIC-EVAL algorithm can be found in Appendix A.

To make this a bit more concrete, suppose we need to evaluate a PIC matrix with coefficients

α_0, α_1, α_2, α_3

for the three input probabilities, [p.sub.1], [p.sub.2], and [p.sub.3]. After the first iteration of the algorithm, the problem has been reduced to that of evaluating a two-input matrix with coefficients

[[Alpha].sub.0][[bar]p.sub.1] + [[Alpha].sub.1][p.sub.1],

[[Alpha].sub.1][[bar]p.sub.1] + [[Alpha].sub.2][p.sub.1],

[[Alpha].sub.2][[bar]p.sub.1] + [[Alpha].sub.3][p.sub.1]

for the input probabilities, [p.sub.2] and [p.sub.3]. After the second iteration, the problem has been further reduced to the evaluation of a matrix with the two coefficients

[[Alpha].sub.0][[bar]p.sub.1][[bar]p.sub.2] + [[Alpha].sub.1]([p.sub.1][[bar]p.sub.2] + [[bar]p.sub.1][p.sub.2]) + [[Alpha].sub.2][p.sub.1][p.sub.2],

[[Alpha].sub.1][[bar]p.sub.1][[bar]p.sub.2] + [[Alpha].sub.2]([p.sub.1][[bar]p.sub.2] + [[bar]p.sub.1][p.sub.2]) + [[Alpha].sub.3][p.sub.1][p.sub.2]

and one input probability, [p.sub.3]. After the final iteration, the data have been reduced to one single coefficient,

[[Alpha].sub.0][[bar]p.sub.1][[bar]p.sub.2][[bar]p.sub.3] + [[Alpha].sub.1]([p.sub.1][[bar]p.sub.2][[bar]p.sub.3] + [[bar]p.sub.1][p.sub.2][[bar]p.sub.3] + [[bar]p.sub.1][[bar]p.sub.2][p.sub.3]) + [[Alpha].sub.2]([p.sub.1][p.sub.2][[bar]p.sub.3] + [p.sub.1][[bar]p.sub.2][p.sub.3] + [[bar]p.sub.1][p.sub.2][p.sub.3]) + [[Alpha].sub.3][p.sub.1][p.sub.2][p.sub.3],

which gives the value of the desired output probability. We see in this example that each combination of the three probabilities and their complements is represented in the final result, and that each combination is multiplied by the appropriate coefficient. For instance, the combination [p.sub.1][[bar]p.sub.2][p.sub.3] is multiplied by [[Alpha].sub.2]. This is correct, since [[Alpha].sub.2] is the conditional probability associated with two true parents.

The PIC-EVAL algorithm is executed in O([n.sup.2]) time. The initialization requires O(n) time, and the main loop is executed precisely n(n + 1)/2 times. Also, the constant factor is small. On top of the base iteration control overhead, two multiplications and one addition are required in each iteration. Although, for the purposes of exposition, the algorithm has been shown as requiring O([n.sup.2]) space as well as time, O(n) space is easily achieved, since only one row's worth of cells need be maintained at any one time.
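To make the elimination scheme concrete, here is a minimal Python sketch (our own illustration, not the INQUERY implementation). Only one row of cells is kept, and each pass overwrites it in place, which is exactly the O(n)-space refinement mentioned above.

```python
def pic_eval(alphas, probs):
    """Evaluate a PIC link matrix.

    alphas: n + 1 coefficients, alphas[j] = p(Q is true | j parents true).
    probs:  n probabilities, one per parent node.
    Returns p(Q is true), in O(n^2) time and O(n) space.
    """
    row = list(alphas)                 # row 0: the original coefficients
    for p in probs:                    # eliminate one parent per pass
        for j in range(len(row) - 1):  # each row has one fewer cell
            row[j] = row[j] * (1.0 - p) + row[j + 1] * p
        row.pop()                      # discard the now-unused last cell
    return row[0]                      # the single remaining coefficient
```

As a sanity check, the strict and matrix, with coefficients [0, 0, 0, 1], applied to three inputs of 0.5 yields 0.125 = [0.5.sup.3], while the strict or matrix, [0, 1, 1, 1], yields 0.875 = 1 - [0.5.sup.3].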

4.3 Piecewise Linear Functions

The PIC-EVAL algorithm evaluates a matrix with arbitrary PIC coefficients in O([n.sup.2]) time. In this section we look at certain PIC matrices for which the evaluation algorithm can be made more efficient.

The PIC-EVAL algorithm can be viewed as filling a triangular portion of a square matrix as shown in Figure 9(a). The first row is initialized with the [[Alpha].sub.j] values, and then each row, i, from 1 to n is processed in turn. Within each row, the cells are set from left to right, with each cell set to the weighted sum of

--the cell immediately above it, weighted by [[bar]p.sub.i], and

--the cell above it and to the right, weighted by [p.sub.i].


As a consequence of this, the value of each cell, [[Alpha].sub.i,j], is dependent only on the values of cells in a triangle extending directly above it on the left and at a 45 [degrees] angle to the right, as is shown in Figure 9(b). That is, [[Alpha].sub.i,j] depends on only those [[Alpha].sub.k,l] such that

0 [is less than or equal to] k [is less than] i

j [is less than or equal to] l [is less than or equal to] j + i - k

or, more concretely, such that [[Alpha].sub.i,j] depends on precisely (i + 1)(i + 2)/2 - 1 cells of the array.
When a subsequence of the PIC matrix coefficients forms an arithmetic progression,

[[Alpha].sub.m], [[Alpha].sub.m] + [Delta], [[Alpha].sub.m] + 2[Delta], ... , [[Alpha].sub.m] + s[Delta],

the cell values in the triangular subsection immediately below these coefficients can be expressed directly in terms of [[Alpha].sub.m], [Delta], and the parent probabilities, [p.sub.1], ... , [p.sub.s]. Of these cell values, the only ones that are needed for calculations outside of the triangle are those on the leftmost edge:

[[Alpha].sub.0,m], ... , [[Alpha].sub.s,m]

Hence, if there were a direct method for determining these cell values, the rest of the cells in the triangle would not need to be calculated at all. Exactly how this may be accomplished follows from the following lemma, a proof of which is given in Appendix B.

LEMMA 1. Given a PIC matrix whose coefficients

[[Alpha].sub.m], [[Alpha].sub.m+1], [[Alpha].sub.m+2], ... , [[Alpha].sub.m+s]

are of the form

[[Alpha].sub.m], [[Alpha].sub.m] + [Delta], [[Alpha].sub.m] + 2[Delta], ... , [[Alpha].sub.m] + s[Delta],

i.e., are such that

[[Alpha].sub.m+j] = [[Alpha].sub.m] + j[Delta] [inverted]Aj = 0, ... , s,

then, [inverted]Ai = 0, ... , s, [inverted]Aj = 0, ... , s - i:

[[Alpha].sub.i,m+j] = [[Alpha].sub.m] + (j + [p.sub.1] + [p.sub.2] + ... + [p.sub.i])[Delta].

In particular, each of the cells at the left edge of the triangle can be computed incrementally as

[[Alpha].sub.i,m] = [[Alpha].sub.i-1,m] + [Delta][p.sub.i],

which requires a total of only s additions and s multiplications, replacing the s(s + 1)/2 general cell computations which would normally be executed.

This technique can result in substantial savings if the number of coefficients involved, and hence the size of the triangle involved, is large. For example, Figure 10(a) shows a function with three linear pieces. The matrix coefficients in this case comprise three arithmetic series, each associated with a corresponding savings in cost of evaluation. Since the domain of the function is discrete, it is not strictly necessary that the pieces connect. Therefore, when speaking of piecewise linear functions, we shall also consider functions such as that shown in Figure 10(b).
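The shortcut licensed by Lemma 1 can be sketched in Python (our own illustration; `left_edge_cells` is a hypothetical helper name). Under our reading of the lemma, each left-edge cell follows from its predecessor with one addition and one multiplication, so the interior of the triangle is never computed.

```python
def left_edge_cells(alpha_m, delta, probs):
    """Left-edge cells of the triangle under an arithmetic run of PIC
    coefficients alpha_m, alpha_m + delta, ..., alpha_m + s*delta.

    By Lemma 1, cell (i, m) equals alpha_m + delta * (p_1 + ... + p_i),
    so s additions and s multiplications replace the s(s + 1)/2
    general cell computations.
    """
    edge = [alpha_m]
    for p in probs:                    # probs = p_1, ..., p_s
        edge.append(edge[-1] + delta * p)
    return edge                        # cells (0, m) through (s, m)
```

The returned list can be fed back into the general recurrence wherever a cell from the triangle's left edge is needed by cells outside the triangle.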


The forms of the two-piece piecewise linear functions shown in Figures 10(c) and (d) are of interest because they can be interpreted as generalizations of the [L.sub.and] and [L.sub.or] link matrices. We will see in a moment that evaluation of these functions is particularly efficient. The function shown in Figure 10(c), for instance, generalizes the [L.sub.and] matrix in that the conditional probability that Q is true may

--take on a (presumably small) value greater than 0 when all parents are false,

--rise (presumably slowly) at a constant rate as more parents are known to be true,

--rise suddenly at some number of true parents, m, where n - m is a (presumably small) constant, independent of n,

--rise at a constant rate as the number of parents known to be true goes from m to n, and

--take on a value less than 1 when all parents are true.

In a similar fashion, the form shown in Figure 10(d) generalizes the [L.sub.or] matrix.

While the savings realized for a piecewise linear function may be significant, the asymptotic cost of execution will not be affected if two or more segments grow with n. All but one of the pieces of the function must cover a constant number of coefficients if the O([n.sup.2]) cost is to be reduced.(4) If [[Alpha].sub.m], [[Alpha].sub.m+1], ... , [[Alpha].sub.m+s] is an arithmetic series, then s(s + 1)/2 cell operations can be eliminated in favor of s additions. This leaves

n(n + 1)/2 - s(s + 1)/2 = ([n.sup.2] + n - [s.sup.2] - s)/2

= ns - [s.sup.2] + ([n.sup.2] - ns + n + [s.sup.2] - ns - s)/2

= (n - s)s + (n(n - s + 1) - s(n - s + 1))/2

= (n - s)s + (n - s)(n - s + 1)/2.

When (n - s) is constant, the second term is constant, while the first term grows with s. Since s must grow with n if n - s is to be kept constant, the cost of evaluation is O(n). Figure 11 shows, pictorially, how the array for the generalized OR function is evaluated. The cells that are evaluated are restricted to a fixed width strip at the left of the array whose length grows with n.
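As an illustration of this strip evaluation, the following Python sketch (our own, with hypothetical helper names) evaluates a two-piece and-like function in O(n) time, assuming [[Alpha].sub.0] = 0.0, a linear run with slope [Gamma]/n through [[Alpha].sub.n-1], and a jump to [[Alpha].sub.n] = 1.0. Only the rightmost cell of each row is computed explicitly; the neighboring triangle cell comes directly from the arithmetic-run formula of Lemma 1.

```python
def gen_and_eval(gamma, probs):
    """O(n) evaluation of a generalized and operator whose PIC
    coefficients rise linearly from alpha_0 = 0.0 with slope gamma/n
    through alpha_{n-1}, with alpha_n fixed at 1.0 (gamma <= 1).
    """
    n = len(probs)
    delta = gamma / n
    right = 1.0        # rightmost cell of row 0: alpha_n = 1.0
    psum = 0.0         # running sum p_1 + ... + p_{i-1}
    for i, p in enumerate(probs, start=1):
        # triangle cell (i-1, n-i), obtained directly from Lemma 1:
        inside = (n - i + psum) * delta
        right = inside * (1.0 - p) + right * p
        psum += p
    return right       # cell (n, 0): p(Q is true)
```

The generalized or function of Figure 11 admits the same treatment, with the fixed-width strip at the opposite edge of the array.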


4.4 Weighted Parents

In Section 4.2 an O([n.sup.2]) algorithm was given for the class of link matrices for which the conditional probabilities are dependent only on the number of parents that are true. In this section we generalize this result, beginning with the class of link matrices under consideration.

The goal of this generalization is to allow for the possibility that the truth of each proposition, [P.sub.i], may have a different impact on the probability, p (Q is true). There will always be one parent whose impact is at least as great as any other. For the purposes of this exposition we will assume, without loss of generality, that the parent, [P.sub.1], has this property. The impact of each other parent may, then, be weighted by some factor, 0 [is less than or equal to] [w.sub.i] [is less than or equal to] 1.0. For generality, we will say that all parents are weighted and that the weight, [w.sub.1], of parent [P.sub.1] is equal to 1. We shall say that a link matrix satisfies the weighted-parent indifference criterion when the probability that the child is true is strictly a function of the number of parents that are true, once the weights of the parents have been taken into consideration. Formally, this is expressed as follows:

Definition 1. We shall say that the weighted-parent indifference criterion is met when the conditional probability that the child is true depends only on the sum of the weights of the true parents; that is, for any two sets, S and T, of parents, if

[summation over ([P.sub.i] [element of] S)][w.sub.i] = [summation over ([P.sub.i] [element of] T)][w.sub.i],

then p(Q is true | exactly the parents in S are true) = p(Q is true | exactly the parents in T are true).
We shall refer to a link matrix satisfying the weighted-parent indifference criterion as a WPIC matrix. A WPIC matrix is completely determined when two sets of parameters have been given: [[Alpha].sub.0], ... , [[Alpha].sub.n] and [w.sub.1], ... , [w.sub.n]. When all the parents have a weight of [w.sub.i] = 1, the WPIC matrix reduces to a simple PIC matrix, where the coefficient, [[Alpha].sub.j], specifies the probability that Q is true when it is known that j parents are true.

For the general WPIC matrix, however, rather than specify the probability that Q is true for j true parents, the coefficient, [[Alpha].sub.j], specifies what that probability would be for j true parents if all the parents had the same impact, i.e., if all the weights were 1. A weight [w.sub.i] [is less than] 1 indicates that, when [P.sub.i] is one of the true parents, the probability p(Q is true) is less than it would be were [P.sub.i] to have the same weight as [P.sub.1]. The weight, [w.sub.i], gives the factor by which p(Q is true) must be discounted when [P.sub.i], rather than a parent whose impact is equal to that of [P.sub.1], is one of the true parents.(5)

The basic algorithm requires only a minor modification in order that WPIC matrices be processed correctly. The WPIC-EVAL algorithm incorporating the necessary modification is given in Figure 12.

Fig. 12. WPIC-EVAL algorithm.

input:

* n weights, [w.sub.1], ... , [w.sub.n]

* n + 1 coefficients, [[Alpha].sub.0], ... , [[Alpha].sub.n], of a WPIC matrix

* probabilities, [p.sub.1], ... , [p.sub.n], for the n parent nodes.

output:

* the probability that the child node is true.

for j := 0 to n do [Alpha][0, j] := [[Alpha].sub.j]

for i := 1 to n do

for j := 0 to n - i do

[Alpha][i, j] := [Alpha][i - 1, j](1 - [w.sub.i][p.sub.i]) + [Alpha][i - 1, j + 1][w.sub.i][p.sub.i]

return [Alpha][n, 0]

The notation, lemma, and theorem introduced in Appendix A to demonstrate the correctness of the PIC-EVAL algorithm can be generalized to accommodate weights. The reasoning used in the proof given there applies, with very minor modifications, to the WPIC-EVAL algorithm. A complete proof of the correctness of the weighted version of the algorithm can be found in Greiff [1996].

It should be observed here that, with respect to computational efficiency, the only difference between the two versions of the algorithm is that the weighted algorithm requires an extra multiplication during each execution of the main loop.
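In Python, the weighted variant might look as follows (our sketch, not the authors' code). One natural reading of the discounting described above, which we adopt here as an assumption, treats a true parent of weight [w.sub.i] as a full-weight parent that is true with probability [w.sub.i][p.sub.i]; the product [w.sub.i][p.sub.i] is where the extra multiplication per iteration arises.

```python
def wpic_eval(alphas, probs, weights):
    """Evaluate a WPIC link matrix (sketch).

    A parent with weight w_i is treated as a full-weight parent that
    is true with probability w_i * p_i, so the only change from
    PIC-EVAL is one extra multiplication per elimination step.
    """
    row = list(alphas)
    for p, w in zip(probs, weights):
        q = w * p                      # discounted contribution of this parent
        for j in range(len(row) - 1):
            row[j] = row[j] * (1.0 - q) + row[j + 1] * q
        row.pop()
    return row[0]
```

With all weights equal to 1, this reduces to the unweighted PIC evaluation, as the text requires.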


5. Experiments

In order to test the potential of the PIC matrices as candidates for the implementation of Boolean query operators, a number of experiments were run with the INQUERY system. The purpose of these experiments was to gain experience with the use, and insight into the behavior, of the PIC class of matrices. The goal was to find a model for these operators whose retrieval effectiveness would be superior to the model currently in use. Intuitions concerning the use of Boolean operators guided the search for plausible PIC matrices to test as models.

For these experiments, two test collections were used: the INSPEC collection and volume I of the TIPSTER collection. For the TIPSTER collection, tests were run using two Boolean versions of queries 51 to 100. In this article, we shall refer to these two sets of queries as TIPSTER-1 and TIPSTER-2. Two Boolean query sets, which we will call INSPEC-1 and INSPEC-2, were also used for testing with the INSPEC collection. For each collection, the two query sets were created by having two different people take the same set of unstructured natural language queries and reformulate them using Boolean operators [Belkin et al. 1993]. For all experiments discussed in this article, the INQUERY system using the and and or operators as defined in Turtle [1990] shall be considered the baseline system for purposes of comparison. For these query sets, experiments have shown that pnorm operators perform from 7% to 28% better than the baseline system in terms of average precision. Our goal was to achieve similar performance using PIC matrices.

A number of different types of PIC matrices were tried using the standard INQUERY formula for estimating the probability that a document is about a concept:

p = 0.4 + 0.6 [multiplied by] [tf/(tf + 0.5 + 1.5 [multiplied by] dl/avg_dl)] [multiplied by] [log((dc + 0.5)/df)/log(dc + 1.0)] (3)

where
tf = term frequency (number of occurrences of term in document)

dl = document length (number of tokens in document)

avg_dl = average document length (over collection)

dc = document count = number of documents in collection

df = document frequency = number of documents containing term

This formula has been adapted from the formula developed in Robertson and Walker [1994] and used in the Okapi retrieval system [Robertson et al. 1995]. In INQUERY, the formula estimates the probability that a document is about a given concept and incorporates a default belief. In Eq. (3), 0.4 is this default belief: the probability of a document being about the concept of interest that will be assigned when the associated term count is zero. Experimentation has indicated that, for the purposes of the Boolean operators, 0.4 is not the optimal setting.
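The belief formula can be rendered as a small Python function. This is our own reconstruction from the variable definitions above (an Okapi-style tf normalization combined with a scaled idf and the default belief); the function name is ours.

```python
import math

def inquery_belief(tf, dl, avg_dl, dc, df, default=0.4):
    """INQUERY-style belief that a document is about a term's concept.

    Returns the default belief when the term count is zero; otherwise
    combines a normalized term frequency with a scaled inverse
    document frequency on top of the default.
    """
    if tf == 0:
        return default
    t = tf / (tf + 0.5 + 1.5 * dl / avg_dl)             # normalized tf
    i = math.log((dc + 0.5) / df) / math.log(dc + 1.0)  # scaled idf
    return default + (1.0 - default) * t * i
```

Note that the belief grows with tf but is bounded away from 1.0, and that documents shorter than the collection average are rewarded by the dl/avg_dl term.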

Intuitively, one might expect that a lack of evidence in favor of a subquery, in the form of a zero term count, should correspond to the same default belief value, independent of whether or not an operator is involved, and when an operator is specified, independent of which operator it is. Take the query

#OR(t1, #AND(t2, t3), #AND(t4, t5, t6))

for example, and let us assume that all term counts are zero. If the probability associated with t1 is [Beta], we could argue that the probabilities associated with #AND (t2, t3) and #AND (t4, t5, t6) should both be [Beta] also; and, further, we could argue that the belief that the entire query is satisfied should be [Beta], as well. That is, there is no evidence favoring the proposition that the document is about the concept associated with t1. The default belief reflects the possibility that the document is about t1 in spite of the absence of evidence supporting that claim. The default belief for #AND(t2, t3) similarly reflects the possibility that the document is about this concept even though there is no supporting evidence, similarly for #AND(t4, t5, t6), as well as for the full #OR query. In general, we might posit the following principle:

[inverted]An: [Beta] = default belief [conjunction] p([P.sub.1]) = p([P.sub.2]) = ... = p([P.sub.n]) = [Beta]

?? p(#AND([P.sub.1], ... , [P.sub.n])) = p(#OR([P.sub.1], ... , [P.sub.n])) = [Beta]

Although the applicability of this principle to the interpretation of Boolean queries is not beyond question, it seemed natural to us to explore belief systems that obey it.

The PIC class is sufficiently rich so that, for any preset value of the default belief, matrix families can be engineered such that the above principle is satisfied. With the default belief set to 0.0, robust behavior of the Boolean operators was obtained, at a level of performance comparable to that of the pnorm operators. We return to the point of the default belief value setting in Section 6.

The experiments described here concentrated on a family of operators, initially described in Turtle [1990], that varied in a simple way from the and and or operators used in the current INQUERY system. The standard and operator can be viewed as a linear sequence of points, [[Alpha].sub.0], [[Alpha].sub.1], ... , [[Alpha].sub.n-1], together with the coefficient [[Alpha].sub.n] = 1.0. We can think of this sequence of coefficients as rising with a slope of 0.0. In these experiments we consider PIC matrices for the and operator that are of the same form, but for which the coefficients may rise with a slope of [[Gamma].sub.and] [is greater than] 0. Presumably, the rise will be gradual, with [[Gamma].sub.and] set to a fairly small fraction, as shown in Figure 13(a) where [[Gamma].sub.and] = 0.33. To simplify the experimentation, the slope was maintained constant over the entire family of and operators. That is, for a given experiment, the coefficients for each different arity of the and operator were determined based on the same setting of [[Gamma].sub.and].


The PIC matrix candidates for the or operators were defined in an analogous manner as is shown in Figure 13(b). As with the and operator, [[Alpha].sub.0] and [[Alpha].sub.n] are fixed at 0.0 and 1.0, respectively. In the case of the or operator, [[Gamma].sub.or], corresponds to the rate of decrease in the values of the coefficients as the number of true parents decreases in the sequence [[Alpha].sub.n], [[Alpha].sub.n-1], ... , [[Alpha].sub.2].

We shall see that, in the case of the [[Gamma].sub.and] parameter, we experimented with values greater than 1.0. As shown in Figure 13(c), for these values the matrix is defined such that the PIC coefficients rise with the slope, [[Gamma].sub.and], up to a maximum of 1.0. Values which would otherwise be greater than 1.0 are truncated to 1.0.
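Generating the coefficient families for a given arity is straightforward; the sketch below (our own helper names) builds the and vector with slope [[Gamma].sub.and]/n, truncated at 1.0 as in Figure 13(c), and an analogous or vector. For the or family we assume, by symmetry with and, that the linear piece runs from [[Alpha].sub.n] down through [[Alpha].sub.1].

```python
def gamma_and_coeffs(n, gamma):
    """Coefficients alpha_0..alpha_n for the generalized and operator:
    alpha_j rises from 0.0 with slope gamma/n (capped at 1.0) for
    j = 0..n-1, and alpha_n is fixed at 1.0."""
    coeffs = [min(1.0, j * gamma / n) for j in range(n)]
    coeffs.append(1.0)
    return coeffs

def gamma_or_coeffs(n, gamma):
    """Coefficients for the generalized or operator: alpha_0 = 0.0,
    alpha_n = 1.0, and the intermediate values fall with slope
    gamma/n as the number of true parents decreases from n."""
    coeffs = [0.0] + [max(0.0, 1.0 - (n - j) * gamma / n) for j in range(1, n)]
    coeffs.append(1.0)
    return coeffs
```

With [[Gamma].sub.and] = 1, for example, gamma_and_coeffs(4, 1.0) returns [0.0, 0.25, 0.5, 0.75, 1.0], which is exactly the sum operator behavior noted below.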

Experiments were run first with the INSPEC collection. Table I shows the percent improvement in 11-point average precision with respect to the baseline system, for different settings of the [[Gamma].sub.and] and [[Gamma].sub.or] parameters. For all values of [[Gamma].sub.and] tested (including values not shown in Table I), optimal performance was achieved with [[Gamma].sub.or] set to 0.6. For all values of [[Gamma].sub.or], performance improves monotonically with increasing values of [[Gamma].sub.and], not reaching a maximum until [[Gamma].sub.and] is at 1.0. This is surprising, because and operators with greater values of [[Gamma].sub.and] can be interpreted as operators that increasingly ignore the semantics usually associated with and, namely, that and is a simple conjunction. When [[Gamma].sub.and] = 1, the and operator behaves exactly like a sum operator. When [[Gamma].sub.and] [is greater than] 1, the operator takes on a distinctly or-like flavor; the presence of a relatively small number of conjoined terms leads to relatively large belief scores.

Table I. INSPEC-1 queries on INSPEC: PIC with [[Gamma].sub.and], [[Gamma].sub.or] [is less than or equal to] 1.0. Boldface denotes greatest improvement.
                            [[Gamma].sub.or] Value

[[Gamma].sub.and]      0.2      0.4      0.6        0.8       1.0

0.2            +14.3    +15.3    +16.6      +16.0     +14.5
0.4            +16.9    +17.8    +18.7      +17.5     +14.4
0.6            +17.6    +18.2    +19.0      +17.6     +14.9
0.8            +18.7    +18.9    +19.4      +17.6     +15.4
1.0            +20.0    +20.3    +20.4      +18.4     +15.5

The consistent peak of performance at [[Gamma].sub.and] = 1 suggested tests with [[Gamma].sub.and] [is greater than] 1. For all values of [[Gamma].sub.or], performance continues to improve as [[Gamma].sub.and] increases until [[Gamma].sub.and] reaches 2.0 (Table II). This suggests that users do not necessarily interpret and as a logical operator. The use of the and connector appears to indicate that the likelihood that a document will be judged relevant grows quickly with the number of terms present in the document. This is in stark contrast with the normal propositional logic interpretation in which high likelihoods are assigned only to documents containing all of the terms.

Table II. INSPEC-1 Queries on INSPEC: PIC with [[Gamma].sub.and] [is greater than or equal to] 1.0. Boldface denotes greatest improvement.
                                [[Gamma].sub.or] Value

[[Gamma].sub.and]      0.2      0.4      0.6      0.8      1.0

1.0                  +20.0    +20.3    +20.4    +18.4    +15.5
2.0                  +23.9    +26.1    +24.9    +22.3    +17.6
3.0                  +25.6    +26.1    +25.3    +22.4    +17.1
4.0                  +25.1    +25.9    +24.4    +21.1    +15.3
5.0                  +25.1    +25.9    +24.5    +21.2    +15.4
6.0                  +25.1    +25.9    +24.5    +21.1    +15.4
7.0                  +25.1    +25.9    +24.5    +21.1    +15.4

Table III shows the performance of the pnorm operator for various values of [p.sub.and] and [p.sub.or], the settings of the parameter, p, for and and or respectively. We see that, as with the PIC matrix operators, the pnorm operators perform robustly over a wide range of parameter settings. Performance of the PIC matrix operator is clearly superior to that of the pnorm operators for this set of queries against the INSPEC collection. Peak performance improvement for the PIC matrix version of the operators reaches 26.1% as compared to 18.2% for the pnorm version. Tables IV and V show the comparative performance for INSPEC-2, an alternative Boolean formulation of the same queries, against the same collection. Performance is similar to that of the previous query set with the optimal performance of the PIC operators slightly lower at 22.8% improvement as compared to 18.0% for pnorm.

Table III. INSPEC-1 Queries on INSPEC: pnorm. Boldface denotes greatest improvement.
                                [p.sub.or] Value

[p.sub.and]     1.0     2.0     3.0     4.0     5.0     6.0     7.0

1.0           +15.5   +16.1   +16.0   +15.7   +15.6   +16.0   +16.4
2.0           +16.4   +17.0   +17.0   +17.1   +17.7   +17.1   +16.1
3.0           +16.6   +17.0   +17.2   +17.8   +17.9   +17.1   +16.2
4.0           +16.7   +18.2   +18.0   +17.7   +17.5   +16.8   +16.7
5.0           +17.6   +17.8   +17.2   +16.6   +16.4   +15.8   +15.5
6.0           +16.4   +16.5   +15.7   +15.5   +15.5   +15.0   +14.3

Table IV. INSPEC-2 Queries on INSPEC: PIC. Boldface denotes greatest improvement.
                          [[Gamma].sub.or] Value

[[Gamma].sub.and]     0.0     0.2     0.4     0.6      0.8      1.0

1.0           +18.6   +19.5   +20.0   +20.1    +18.9    +14.3
2.0           +21.4   +22.0   +22.4   +22.8    +20.7    +14.1
3.0           +21.5   +22.3   +22.6   +22.7    +20.1    +13.3
4.0           +21.7   +22.5   +22.8   +22.8    +20.1    +12.5
5.0           +21.6   +22.5   +22.8   +22.8    +20.1    +12.5
6.0           +21.6   +22.5   +22.8   +22.8    +20.1    +12.5

Table V. INSPEC-2 Queries on INSPEC: pnorm. Boldface denotes greatest improvement.
                                  [p.sub.or] Value

[p.sub.and]     1.0     2.0     3.0     4.0     5.0     6.0     7.0

1.0           +14.3   +15.7   +16.3   +16.8   +17.1   +17.5   +16.9
2.0           +15.2   +16.4   +17.5   +17.6   +17.9   +17.8   +16.8
3.0           +15.5   +16.3   +16.9   +17.2   +17.9   +17.7   +16.9
4.0           +15.9   +16.9   +17.2   +17.5   +18.0   +18.0   +17.1
5.0           +15.2   +16.4   +16.7   +16.9   +17.4   +17.5   +16.7
6.0           +14.9   +15.4   +15.8   +16.5   +16.7   +16.8   +16.1

A similar set of experiments was run on the larger TIPSTER collection. Table VI summarizes the performance of the two approaches. We see here that both classes of operators are capable of significantly outperforming the baseline system on both query sets. On the first query set, the best parameter setting for the pnorm operators slightly outperforms the best settings for the PIC operators. On the second query set, it is the best settings of the PIC operators that yield superior performance. The smaller improvements for both systems on the second query set are due to the better performance of the baseline system. It is unclear why the baseline system is apparently so much more sensitive to the differences in the two query formulations than the pnorm and PIC versions of the system. Surprisingly, in light of the previous results, this best performance for the PIC system on the second query set is obtained with the [[Gamma].sub.and] coefficient set as low as 0.1. To date, we have been unable to explain this phenomenon.

Table VI. Comparison of PIC and pnorm on All Four Query Sets--Optimal Parameter Settings
                                   Percent                Percent
            Baseline     pnorm   Improvement     PIC    Improvement

INSPEC-1      26.5      4.0/2.0      18.2      2.0/0.4      26.1
INSPEC-2      25.6      4.0/5.0      18.0      4.0/0.4      22.8
TIPSTER-1     20.8      9.0/1.0      28.8      2.0/0.8      27.8
TIPSTER-2     25.0     10.0/1.0       7.8      0.1/0.6       9.8

Table VII compares the two operators with pnorm settings of [p.sub.and] = 6.0, [p.sub.or] = 3.0 and PIC settings of [[Gamma].sub.and] = 2.0, [[Gamma].sub.or] = 0.6. These parameter settings were chosen so as to give a best overall performance profile for each of the operator classes. In general, the optimal settings for the pnorm operator in these experiments tend to be somewhat higher than those reported in Salton et al. [1983b]. For comparable experiments they found that values between 1.0 and 2.0 produced best results.

Table VII. [p.sub.and]/[p.sub.or] = 6.0/3.0; and [[Gamma].sub.and]/[[Gamma].sub.or] = 2.0/0.6
                                  Percent              Percent
             Baseline   pnorm   Improvement   PIC    Improvement

INSPEC-1       26.5      30.7       15.7      33.1      24.9
INSPEC-2       25.6      29.6       15.8      31.4      22.8
TIPSTER-1      20.8      26.5       27.3      26.4      26.9
TIPSTER-2      25.0      26.7        6.8      26.4       5.5


6. Conclusions

From the work reported here, we may conclude that combining functions with a well-defined probabilistic interpretation can be associated with Boolean query operators which

--appear to perform as well as the pnorm operators and

--can be realized at reasonable computational cost.

6.1 Probabilistic Interpretation

The definition of the pnorm functions is an outgrowth of intuitions grounded in the vector space model of information retrieval [Salton 1989]. The PIC matrices, on the other hand, are the result of viewing the scoring of documents from a probabilistic standpoint. The ith coefficient of an n-ary PIC matrix corresponds to the conditional probability that the compound query is satisfied by an arbitrary document, given that exactly i of the n component subqueries are satisfied by that document.

The existence of such an interpretation for the combining functions means that these functions may be analyzed within the probabilistic framework. Their comparative behavior on varying collections and/or varying query sets can be studied from a probabilistic vantage point. Attempts to improve overall performance of the operators can be guided by researchers' intuitions as to the probabilities involved. Coherent techniques for combining Boolean operators with others, such as proximity or phrase operators, can be developed in accord with the laws of probability theory.

The incorporation of these Boolean operators, in the context of an encompassing probabilistic IR system, allows for the meaningful analysis of the overall system in probabilistic terms. If the ranking scores produced by a system, as well as the intermediate scores manipulated by that system in arriving at these final values, can justifiably be interpreted as probabilities, measurements can be made of how well these numbers correspond to observed frequencies [Brier 1950; Dawid 1989]. Analysis of these measurements can result in insights into the strengths and weaknesses of the system, unwarranted (possibly, hidden) assumptions, suspect parameter settings, and possibilities for improvement. Such insights may not be easily forthcoming in the absence of semantics for the numbers involved and a theoretical framework within which the manipulation of these numbers can be understood.

6.2 Retrieval Performance

From the four different query set formulations studied, it appears that combining functions based on PIC matrices perform as well as the pnorm functions. For the researcher inclined toward the probabilistic approach, this is comforting. The definition of the pnorm operators is an excellent example of how a mathematical model, in this case the vector space model, can guide the researcher toward the development of fruitful ideas. We have shown here that, at least as far as the current state of the art with respect to Boolean operators is concerned, a probabilistic theory of information retrieval can be equally beneficial in this regard. It is our belief, and apparently that of other "probabilists" (for example, see Cooper [1994] for an interesting exposition of the issues), that in the long run, models founded in probability theory will prove to be more fruitful in this regard. We believe that probability theory will be found to be well suited to the task of modeling the complexities of the information needs of users, the information content of documents, and the matching of one to the other. We think it will ultimately be found to be more useful than a geometric model where the mapping between the formal constructs and the aspects of information retrieval that are being modeled is, to our way of thinking at least, less obvious.

The results obtained here also suggest that some previous experimental results might merit a second look. For example, Turtle [1994] reports that natural language queries consistently outperform Boolean queries for experiments run on legal collections. The difference observed, however, is within the range of the improvements we have been able to obtain, suggesting that more effective implementations of the Boolean query operators might close the apparent gap.

There is some evidence that Boolean queries are not often used in retrieval situations, particularly retrieval realized over the World Wide Web [Jansen et al. 1998]. Nonetheless, there is still a clear need for continued research into the use of Boolean queries, particularly Boolean query models that are not strict in their interpretation of the operators and can be utilized in the context of ranked retrieval.

6.3 Efficiency

Probability calculation involving PIC matrices can be realized in O([n.sup.2]) time, where n is the number of input probabilities. The CPU time for the PIC matrix version of INQUERY was compared against the baseline system for each of the query sets discussed in this article. An increase of 35% to 65% in CPU time was observed. This would seem to be a reasonable price to pay for the corresponding increases in retrieval accuracy, especially since the percentage increase in overall response time can be expected to be significantly smaller when I/O time is factored in. Exactly what that overall increase would be will depend, of course, on the characteristics of the particular hardware used to run the system.(6)

6.4 Future Work

The experimentation reported here indicates that PIC matrices can be advantageously applied to the modeling of Boolean queries, and we are considering extending this work as follows:

6.4.1 Alternative Subclasses of the PIC Matrices. The PIC matrices cover a wide range of functions which can be chosen for modeling belief-combination operators. The definition of an operator family requires the specification of n + 1 parameters for each arity, n = 2, 3, .... We have had success with a subclass of the PIC matrices that can be specified with two parameters: the increasing slope of the and operator and the decreasing slope of the or operator. This class may not be optimal. For one thing, the definition of the slope parameter constrains the rate of rise from 0 to n - 1 for the different members of the and family to be related in a given way. Specifically, the increase in belief is [[Gamma].sub.and]/n for each additional true parent. Equivalently, the belief at the knee (n - 1 true parents) is (n - 1)[[Gamma].sub.and]/n; similarly for the or family. Other relationships among the members of the family could be explored. For example, the position of the probability at the knee could be held constant over the entire family, its value being an empirically determined parameter setting. Alternatively, some simple, but nonlinear, function for the increase in probability from 0 to n - 1 for and (and decrease for or) might give better results. Another possibility worth exploring would be to allow the settings for and/2 and or/2 to be independent of the settings chosen for arities n = 3, 4, ....

6.4.2 Experimenting with Default Probabilities. We have obtained our best results by eliminating the default setting of 0.4 in favor of a 0.0 setting. It is possible that a default setting at some value somewhat above 0.0 will yield small, but consistent, performance gains. When allowing for a nonzero default setting, [Beta], it would probably be best to restrict the operator families such that each operator produces an output of [Beta] when all inputs are at [Beta]. This would impose an algebraic constraint that could be satisfied in a number of ways. There is as little justification for assuming certainty of relevance (p(Q) = 1.0) as there is for assuming certainty of nonrelevance (p(Q) = 0.0), even in the (hypothetical) case of certainty with respect to the truth or falsity of the parent propositions.
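The algebraic constraint can be stated concretely: if every input sits at [Beta], the number of true parents is binomially distributed, so the requirement is that the coefficients average to [Beta] under Binomial(n, [Beta]) weights. A small check (a hypothetical helper, written for illustration):

```python
from math import comb

def satisfies_default_constraint(alpha, beta, tol=1e-9):
    # With all inputs at beta, p(exactly j parents true) is the binomial
    # probability C(n, j) * beta^j * (1 - beta)^(n - j), so the operator
    # output is the binomial average of the coefficients; the constraint
    # requires that average to equal beta itself.
    n = len(alpha) - 1
    out = sum(comb(n, j) * beta ** j * (1 - beta) ** (n - j) * alpha[j]
              for j in range(n + 1))
    return abs(out - beta) < tol
```

For instance, the linear family alpha[j] = j/n satisfies the constraint for every [Beta] (the binomial mean is n[Beta]), whereas the strict and does not.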

6.4.3 Correlating Beliefs with Long-Term Frequencies. If the design of the Boolean operators follows probabilistic intuitions, it is reasonable to expect that their success in practice will be correlated with the extent to which the values being manipulated correspond to probabilities in the real world. It should be instructive to analyze the behavior of the operators after performing some kind of regression to convert the document scores into scores more accurately reflecting observed frequencies of relevance for large text collections.
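Such an analysis might begin with something as simple as a score-versus-relevance calibration table (a hypothetical sketch; a real study would fit a proper regression over large collections):

```python
def calibration_table(scores, relevant, n_bins=10):
    # Sort documents by score, split them into roughly equal bins, and
    # compare each bin's mean score with its observed frequency of
    # relevance. For well-calibrated scores the two values should track
    # each other across bins.
    pairs = sorted(zip(scores, relevant))
    size = max(1, len(pairs) // n_bins)
    table = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        rel_rate = sum(r for _, r in chunk) / len(chunk)
        table.append((mean_score, rel_rate))
    return table
```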


The authors would like to express their gratitude to Leah Larkey, whose original experimentation with pnorm operators in the context of the INQUERY inference network was the starting point for the experimentation reported here.

(1) Since each column must sum to 1.0, only the row corresponding to Q = true will be shown from this point on.

(2) #or(...) and #and(...) are the syntactic forms for the specification of Boolean and and or operators in the INQUERY system.

(3) It should be said that the pnorm formula is more general than the one shown in that it allows for the weighting of query terms as well as document terms. As this aspect of the formula is not relevant to the work discussed here, we focus our attention on a somewhat simplified version of the formula.

(4) Prof. David Barrington (personal communication) has pointed out that computational improvement may be realized even if this condition is not fully met. For example, if one piece grows with n, while all others grow only as log(n), the asymptotic running time will be only O(n log(n)).

(5) Another way of understanding the WPIC matrices is in terms of gated inputs. Each parent is viewed as one input to an and gate whose other input corresponds to an independent activation variable. Only when the parent and the activation variable are both true is a value of true seen by the Q node. The parent weight becomes the probability that the activation variable is true. From this, we see that the class of WPIC matrices can be viewed as a generalization of the noisy or matrices often utilized in Bayesian Network applications.
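The gated-input view makes the connection to the noisy or explicit. A minimal sketch (the function and weight names are illustrative, not from the paper):

```python
def noisy_or(p, w):
    # Parent i 'fires' only when it is true AND its activation variable
    # is true, i.e., with probability w[i] * p[i]; Q is true unless every
    # gated input fails to fire.
    q_false = 1.0
    for pi, wi in zip(p, w):
        q_false *= 1.0 - wi * pi
    return 1.0 - q_false
```

With all weights at 1.0 this reduces to the ordinary probabilistic or, 1 - (1 - p[0]) * ... * (1 - p[n-1]).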

(6) As mentioned in Section 4.3, a linear time version of the PIC-EVAL algorithm is applicable when the PIC matrix is a piecewise linear function. Although this more efficient version of the algorithm would be applicable to the PIC matrices utilized for these particular experiments, the general PIC matrix algorithm was used. The times reported correspond to the general version of the algorithm.


BELKIN, N. J., COOL, C., CROFT, W. B., AND CALLAN, J. P. 1993. The effect of multiple query representations on information retrieval system performance. In Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '93, Pittsburgh, PA, June 27-July 1), R. Korfhage, E. Rasmussen, and P. Willett, Eds. ACM Press, New York, NY, 339-346.

BELKIN, N. J., KANTOR, P., FOX, E. A., AND SHAW, J. A. 1995. Combining the evidence of multiple query representations for information retrieval. Inf. Process. Manage. 31, 3 (May-June), 431-448.

BOOKSTEIN, A. 1980. Fuzzy requests: An approach to weighted Boolean searches. J. Am. Soc. Inf. Sci. 31, 4 (July), 240-247.

BOOKSTEIN, A. 1981. A comparison of two systems of weighted Boolean queries. J. Am. Soc. Inf. Sci. 32, 4 (July), 275-279.

BRIER, G. W. 1950. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78, 1 (Jan.), 1-3.

BUELL, D. A. AND KRAFT, D. H. 1981. Threshold values and Boolean retrieval systems. Inf. Process. Manage. 17, 3, 127-136.

CALLAN, J. P., CROFT, W. B., AND HARDING, S. M. 1992. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, 78-83.

CHARNIAK, E. 1991. Bayesian networks without tears. AI Mag. 12, 4 (Winter), 50-63.

COOPER, W. S. 1994. The formalism of probability theory in IR: A foundation or an encumbrance? In Proceedings of the 17th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '94, Dublin, Ireland, July 3-6), W. B. Croft and C. J. van Rijsbergen, Eds. Springer-Verlag, New York, NY, 242-247.

DAWID, A. P. 1989. Probability forecasting. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc., New York, NY, 210-218.

FOX, E., BETRABET, S., KOUSHIK, M., AND LEE, W. 1992. Extended Boolean models. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 393-418.

FOX, E. A. AND SHAW, J. A. 1994. Combination of multiple searches. In Proceedings of the 2nd Text Retrieval Conference (TREC-2), D. K. Harman, Ed. 243-248.

GREIFF, W. R. 1996. Computationally tractable, conceptually plausible classes of link matrices for the INQUERY inference network. Tech. Rep. TR-96-66. Department of Computer Science, University of Massachusetts, Amherst, MA.

JANSEN, B. J., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life information retrieval: a study of user queries on the Web. SIGIR Forum 32, 1, 5-17.

KIM, J. H. AND PEARL, J. 1983. A computational model for combined causal and diagnostic reasoning in inference systems. In Proceedings of the 8th International Joint Conference on Artificial Intelligence (Karlsruhe, Germany) 190-193.

LEE, J. H. 1995. Analyzing the effectiveness of extended Boolean models in information retrieval. Tech. Rep. TR95-1501. Cornell University, Ithaca, NY.

NOREAULT, T., KOLL, M., AND MCGILL, M. J. 1977. Automatic ranked output from Boolean searches in SIRE. J. Am. Soc. Inf. Sci. 28, 6, 333-339.

ORTEGA, J. M. 1972. Numerical Analysis: A Second Course. Academic Press, Inc., New York, NY.

PEARL, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufmann series in representation and reasoning. Morgan Kaufmann Publishers Inc., San Francisco, CA.

ROBERTSON, S. E. AND WALKER, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '94, Dublin, Ireland, July 3-6), W. B. Croft and C. J. van Rijsbergen, Eds. Springer-Verlag, New York, NY, 232-241.

ROBERTSON, S. E., WALKER, S., AND HANCOCK-BEAULIEU, M. M. 1995. Large test collection experiments on an operational, interactive system: Okapi at TREC. Inf. Process. Manage. 31, 3 (May-June), 345-360.

SALTON, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.

SALTON, G., BUCKLEY, C., AND FOX, E.A. 1983a. Automatic query formulations in information retrieval. J. Am. Soc. Inf. Sci. 34, 4 (July), 262-280.

SALTON, G., FOX, E. A., AND WU, H. 1983b. Extended Boolean information retrieval. Commun. ACM 26, 11 (Nov.), 1022-1036.

SHAW, J. A. AND FOX, E. A. 1995. Combination of multiple searches. In Proceedings of the 3rd Text Retrieval Conference (TREC-3) 105-108.

TURTLE, H. 1994. Natural language vs. Boolean query evaluation: a comparison of retrieval performance. In Proceedings of the 17th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR '94, Dublin, Ireland, July 3-6), W. B. Croft and C. J. van Rijsbergen, Eds. Springer-Verlag, New York, NY, 212-220.

TURTLE, H. AND CROFT, W. B. 1990. Inference networks for document retrieval. In Proceedings of the 13th International Conference on Research and Development in Information Retrieval (SIGIR '90, Brussels, Belgium, Sept. 5-7), J.-L. Vidick, Ed. ACM Press, New York, NY, 1-24.

TURTLE, H. R. 1991. Inference networks for document retrieval. Ph.D. Dissertation. Department of Computer Science, University of Massachusetts, Amherst, MA.

WALLER, W. G. AND KRAFT, D. H. 1979. A mathematical model for a weighted Boolean retrieval system. Inf. Process. Manage. 15, 5, 235-245.

Received: April 1998; revised: August 1998 and February 1999; accepted: March 1999



The correctness of the PIC algorithm follows directly from Lemma 2, which expresses the output of the PIC-EVAL algorithm in terms of the coefficient produced during the ith iteration. Presentation and proof of the lemma will be facilitated by the introduction of notation for the abbreviation of complex expressions that will continually arise. Specifically, notation is introduced for the expression giving the probability that a specified subset of the parents is true, and another for the probability that a specified number of parents is true.

NOTATION 1. With respect to the execution of the PIC matrix evaluation algorithm, [L.sub.i] shall refer to the link matrix generated during the ith iteration, and the coefficient of [L.sub.i] associated with j parents being true shall be referred to as [[Alpha].sub.i,j]. We note that for the initial matrix [L.sub.0], we have

[[Alpha].sub.0,j] = [[Alpha].sub.j] for all j = 0, ..., n.

NOTATION 2. Given an arbitrary subset, R, of {[i.sub.1], ... , [i.sub.2]}, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] shall denote the probability that precisely the parents [P.sub.i] such that i [element of] R are true, while those parents [P.sub.i] such that i [is not an element of] R are false. Equivalently, we have


NOTATION 3. The notation [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] will be used to denote the probability that exactly j of the propositions of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] are true. We observe that


(When R = {}, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is to be interpreted as the probability that exactly j of the parents in the empty set of parents, {}, are true; i.e., it is equal to 1 for j = 0 and to 0 for j [is greater than or equal to] 1.)

It is worth noting that the probability that Q is true can be given by


(recalling that [[Alpha].sub.R] has been defined to be the probability that Q is true given that the parents corresponding to elements of R, and only those parents, are true). Also, when the parent indifference criterion is met, we have


Proof of the correctness of the PIC-EVAL algorithm will be a direct consequence of the following lemma. The lemma effectively states that the set of coefficients, [[Alpha].sub.i, 0], ... , [[Alpha].sub.i, n-i], can be interpreted as specifying a PIC matrix, [L.sub.i], connecting parent nodes [P.sub.i+1], ... , [P.sub.n], to Q, and that this matrix is, in a sense, equivalent to the original matrix, [L.sub.0].

LEMMA 2. Assuming all coefficients [[Alpha].sub.i,j] are those produced by the PIC-EVAL algorithm, then, for all i = 0, 1, ..., n,


PROOF. This lemma is proved by induction on the value of i. For i = 0, we have


Assume the theorem to be true for i = k:


Then the sum for i = k + 1 is


(by step (5) of the algorithm)


(as a result of changing the variable for the second summation)


The term [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] expresses the probability that none of the propositions [P.sub.k+2], ... , [ P.sub.n] is true. It is composed of only one product. Hence, the first term of (4) reduces to


Similarly, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], which expresses the probability that all n - k - 1 of the propositions {[P.sub.k+2], ... , [P.sub.n]} are true, is composed of only one product. Therefore, the last term of (4) reduces to


Combining the middle two terms of (4) gives


Finally, applying Eqs. (5), (6), and (7) to Eq. (4), we have


= p(Q is true) (by the induction hypothesis).

Setting i = n in Lemma 2 immediately yields the following theorem, which states that the PIC-EVAL algorithm is correct.

THEOREM 1. Given that [[Alpha].sub.n, 0] has been produced by the PIC-EVAL algorithm,

[[Alpha].sub.n, 0] = p(Q is true).
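For small n, the theorem's claim can also be checked directly against exhaustive enumeration (a hypothetical test harness, assuming parent indifference and independent parents):

```python
from itertools import product

def brute_force_pq(alpha, p):
    # Enumerate all 2^n truth assignments of the parents; each assignment
    # contributes its probability times alpha[number of true parents].
    # alpha[j] = p(Q = true | exactly j parents true); p[i] = p(P_i true).
    total = 0.0
    for bits in product([0, 1], repeat=len(p)):
        prob = 1.0
        for b, pi in zip(bits, p):
            prob *= pi if b else (1.0 - pi)
        total += alpha[sum(bits)] * prob
    return total
```

Any correct implementation of PIC-EVAL must agree with this O([2.sup.n]) computation on every input.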


The following lemma shows that, when a subsequence of PIC coefficients forms an arithmetic progression, there exists a simplified expression for the value of key elements of the array computed as part of the PIC-EVAL algorithm. In other words, if, when the coefficients are viewed as a function of the number of true parents, a segment of the function is linear, then a computational shortcut can be effected as part of the PIC-EVAL algorithm.

LEMMA 1. Given a PIC matrix whose coefficients,

[[Alpha].sub.m], [[Alpha].sub.m+1], [[Alpha].sub.m+2], ... , [[Alpha].sub.m+s]

are of the form

[[Alpha].sub.m], [[Alpha].sub.m] + [Delta], [[Alpha].sub.m] + 2[Delta], ... , [[Alpha].sub.m] + s[Delta],

i.e., are such that

[[Alpha].sub.m+j] = [[Alpha].sub.m] + j[Delta] for all j = 0, ..., s,

then, for all i = 0, ..., s and all j = 0, ..., s - i:


PROOF (BY INDUCTION ON i). The base case follows directly from the hypothesis of the lemma.


Assuming the lemma to be true for i = k - 1, we have


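The practical payoff of Lemma 1 (cf. footnote 6) is easiest to see in the extreme case where the entire coefficient sequence is a single arithmetic progression: by linearity of expectation, the quadratic dynamic program collapses to one O(n) sum. A sketch, with illustrative names and assuming independent parents:

```python
def pic_eval_linear(alpha0, delta, p):
    # If alpha[j] = alpha0 + j * delta for every j = 0, ..., n, then
    # p(Q = true) = alpha0 + delta * E[number of true parents]
    #             = alpha0 + delta * (p[0] + ... + p[n-1]),
    # which needs only a single pass over the input probabilities.
    return alpha0 + delta * sum(p)
```

The general piecewise-linear case handled by Lemma 1 applies the same collapse segment by segment.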
This material is based on work supported in part by the National Science Foundation, Library of Congress, and Department of Commerce under cooperative agreement number EEC9209623. This material is also based on work supported in part by the United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235. Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.

Authors' addresses: W. R. Greiff and W. B. Croft, Computer Science Department, University of Massachusetts, LGRC A243 Box 34610, Amherst, MA 01003-0729; email:;; H. Turtle, West Publishing Co., Eagan, MN; email:

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
COPYRIGHT 1999 Association for Computing Machinery, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.

Article Details
Publication: ACM Transactions on Information Systems
Date: Oct 1, 1999