
Probabilistic and genetic algorithms for document retrieval.

Document retrieval involves searching bibliographic databases for references to relevant documents. Usually, the record stored in a bibliographic database to describe a document consists of a set of subject terms or keywords (sometimes with associated weights). In this research, documents receive multiple descriptions in an attempt to resolve problems arising from different inquirers seeking the same document in dissimilar ways. A probabilistic algorithm is also used to adjust these descriptions and provide a better means of getting documents to just those inquirers who will find them useful. The algorithm is free of the assumptions of subject term independence that weaken most probabilistic models.

First, the problem of representing documents should be considered. Particular attention will be paid to probabilistic models, primarily to see how the limitations in implementing these theoretically-based models can inspire other forms of document retrieval. Document redescription will then be considered, and a probabilistic approach to redescription that relies on multiple descriptions of documents will be discussed. Finally, we will see the effectiveness of this approach.


The number of documents that can be stored in computer-accessible fashion is quickly growing due to convenient, inexpensive word-processing and subscription document retrieval services that house large numbers of document references. As a result, both demand and need for effective document retrieval systems are on the rise.

Document retrieval is primarily a problem of representation: representing documents by storing some form of description of them in a database, and representing an inquirer's information needs with queries that can be processed by computer. If forming adequate representations of both documents and users' needs were a fully understood process, there would be no need for further research in this field: inquirers would easily find useful documents. In practice, it is quite difficult both to represent a document adequately and to express an information need adequately. Zunde and Dexter have documented the inconsistency among trained experts in attempts to describe identical documents. Blair points out that, if trained experts disagree so considerably in representing documents, one cannot hope for inquirers to be any more consistent in trying to retrieve them. Also, as many people are aware, the results of performing a computerized literature search can often be disappointing; a recent report likewise found full-text retrieval effectiveness to be less than satisfactory.

The difficulty of representing documents adequately has produced various theories on how documents might best be described. For instance, it is argued that the discrimination value of a term indicates whether that term will make it easier or more difficult to distinguish relevant from non-relevant documents. Models that formally incorporate probability have also been described. Cooper and Maron give a utility-theoretic argument for deciding whether a term should be used to describe a document: use the term if and only if retrieval satisfaction will be better with the term supplied than without. Harter claims the number of times a term occurs in a document is a theoretically justified indicator of the term's suitability in describing a document. Bookstein and Swanson explore the use of Harter's model as a decision-theoretic tool for indexing documents.

Indeed, models based on probability and decision theory are appealing because of their theoretical nature. A discussion of two variants of probabilistic models will point out certain difficulties in their implementation that, if overcome, might lead to enhanced retrieval performance.

In the Maron and Kuhns probabilistic model, each point in the sample space indicates: the query an inquirer would use on a particular occasion, the name of a document that might potentially be furnished for the query, and the relevance evaluation the inquirer would give concerning that document. Thus, the events of primary importance in the model are of the form P(document_i is relevant | query Y), Y being a non-empty set of query terms {y_1, y_2, ..., y_n} an inquirer might use to look for documents. In other words, what is suggested is that, at different times, different relevance assessments of the same document will be made for identically constructed queries; and what is to be predicted is the probability that any given document will be relevant to such a query in the future. Further, the model prescribes a method for indexing documents.

Whereas the model just described relies on past inquiring behavior to estimate probabilities about document relevance, the second probabilistic model relies on the keywords (i.e., subject terms) used to describe a document to estimate the same probabilities. In this model, sample points indicate: the subject terms with which a document is described and the relevance judgment for that document with respect to an implicit, undescribed query. The events of importance have this form: P(document_i is relevant | document_i's description is X), where X is a set of binary keywords. We refer to this model as the Robertson and Sparck Jones model for their work in deriving query weights, noting historical antecedents to this approach in Bookstein and Swanson. Van Rijsbergen more recently articulated the Robertson and Sparck Jones model further. Croft has extended the model to the case where keywords are assigned probabilistically to documents rather than deterministically.

Under both the Maron and Kuhns and the Robertson and Sparck Jones models, Bayes' theorem is invoked to calculate probabilities. For instance, the latter ultimately ranks documents according to P(X | Rel)/P(X | ¬Rel) or, identically as a ranking, by P(X | Rel)/P(X). (X, Rel, and ¬Rel denote the events that a document is described by the set of subject terms X, that a document is relevant, and that a document is non-relevant, respectively.)
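To make the ranking concrete, here is a small illustrative sketch (not code from this research) of the Robertson and Sparck Jones ranking computed in log form under a term-independence assumption; the three terms and their probabilities are hypothetical:

```python
from math import log

def rsj_score(doc_terms, p, q):
    """Rank a document by log of P(X | Rel)/P(X | notRel), assuming
    term independence.  p[t] and q[t] are the estimated probabilities
    that term t indexes a relevant / non-relevant document."""
    score = 0.0
    for t in set(p) | set(q):
        pi, qi = p[t], q[t]
        if t in doc_terms:
            score += log(pi / qi)       # term present in the description
        else:
            score += log((1 - pi) / (1 - qi))  # term absent
    return score

# Hypothetical estimates for three index terms.
p = {"genetic": 0.8, "retrieval": 0.6, "opera": 0.1}
q = {"genetic": 0.2, "retrieval": 0.3, "opera": 0.1}
score = rsj_score({"genetic", "retrieval"}, p, q)
```

A document described by terms that are much more common among relevant documents (here "genetic" and "retrieval") receives a higher score than one described only by a non-discriminating term.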

Whereas the attractiveness of probabilistic models comes from their theoretical grounding, their implementation presents certain difficulties. First, in computing the probability that a document is described with the set of subject terms, X, the assumption is usually made that these terms are distributed independently within both the set of relevant documents and the set of non-relevant documents. (An independence assumption is also used in the Maron and Kuhns model.) The assumption buys mathematical tractability. As Van Rijsbergen and others have pointed out, however, there is a discrepancy between the assumption and the fact that it is precisely subject term combinations (statistical dependencies among subject terms) that indicate the content of documents. Therefore, even though experiments report that probabilistic models based on this independence assumption can improve retrieval performance, efforts have been made to devise simple dependency models which are less restricted by the independence assumption. Such models require costly, computationally inefficient calculations, however, and also neglect dependence among more than two terms.

A second problem that arises in probabilistic models is estimation. The Maron and Kuhns model relies on human estimations of the probability of using given search terms in retrieving and finding a document relevant. Thus, it is hampered by the same problems in its implementation as manual retrieval models: inconsistent and possibly unreliable human judgments.

The Robertson and Sparck Jones model, on the other hand, makes empirically based estimates of the distribution of terms within the relevant and non-relevant sets of documents. Given a small number of retrieved documents (usually ten or twenty) and an inquirer's relevance assessment of them, frequency data are used to calculate both the p_i's (the probability that term_i is used in indexing a relevant document) and the q_i's (the same probability for a non-relevant document). The model relies on this same pattern of subject term distributions existing in the collection as a whole so that relevant and non-relevant documents may be distinguished.
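The frequency-based estimation step might be sketched as follows; the documents and judgments are invented, and the 0.5/(n + 1) correction for small samples is one conventional choice rather than a prescription of the model:

```python
def estimate_term_probs(retrieved, relevant_flags, vocabulary):
    """Estimate p_i (term indexes a relevant document) and q_i (term
    indexes a non-relevant document) from a small retrieved sample
    with relevance judgments.  The 0.5 correction guards against
    zero counts in small samples."""
    rel = [d for d, r in zip(retrieved, relevant_flags) if r]
    non = [d for d, r in zip(retrieved, relevant_flags) if not r]
    p, q = {}, {}
    for t in vocabulary:
        p[t] = (sum(t in d for d in rel) + 0.5) / (len(rel) + 1)
        q[t] = (sum(t in d for d in non) + 0.5) / (len(non) + 1)
    return p, q

# Four retrieved documents (as term sets) and the inquirer's judgments.
docs = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
flags = [True, True, False, False]
p, q = estimate_term_probs(docs, flags, {"a", "b", "c"})
```

With only four judged documents, these point estimates are exactly the kind of small-sample figures whose reliability is questioned below.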

Such estimations of subject term distributions may be misleading, however. First, what should be sought is a population parameter for the p_i's and q_i's. If documents are well described by subject terms, documents likely to be found useful together will be similarly described and retrieved together by inquirers searching for information. Individual differences and inconsistencies among inquirers suggest it will never be possible to divide a document collection (here is one set of documents that will be useful together; here is another; etc.) in the same way for all inquirers. At best, subject terms can establish sets of similarly described documents which, collectively, are likely to be useful to inquirers as a whole. Therefore, if there is a pattern described by subject terms, it is more likely to demarcate relevant and non-relevant documents for the population of inquirers in general than for any individual making a query. Yet, in implementation, the Robertson and Sparck Jones model demarcates relevant and non-relevant documents based on the assessment of a single inquirer. To the extent that this separation is an artifact, as opposed to some existing pattern in the document database at large, it will not lead to useful prediction for retrieval. In addition, the point estimates we obtain for parameters such as p_i and q_i may be quite unreliable owing to the small sample sizes on which they are based. In document retrieval, as Van Rijsbergen points out, "it should now be apparent that the problem of estimating binomial parameters from small samples has no obvious best solution."

A probabilistic model unifying the Maron and Kuhns model with the Robertson and Sparck Jones model has been proposed. The unified model suggests that the distribution of query terms, together with document subject terms, should provide evidence for retrieval. In this model, X is a set of binary random variables x_i, each of which indicates, for an arbitrary document_k, either that it is described by term_i (i.e., x_i(document_k) = 1) or that it is not (x_i(document_k) = 0). Y is a set of binary random variables y_i, which indicate which terms have been employed in a particular query. Probabilities of the form P(Rel | X, Y) are to be calculated in making a retrieval decision about a document. In practice, by Bayes' rule, this requires knowledge of the joint distribution of X ∪ Y.

In a life-sized document retrieval system, document descriptions and queries can each be composed from vocabularies that contain thousands of terms. Even in principle, such a joint distribution of subject and query terms is available only at exorbitant cost, after analyzing all possible queries in conjunction with every possible document description. Similarly, a Bahadur-Lazarsfeld expansion of a joint distribution of 1,000 terms exhibiting third-order dependencies requires the estimation of over 166,000,000 parameters. Thus, any feasible implementation of this model would again rely on less than realistic independence assumptions.
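The parameter count can be checked directly: a third-order expansion needs one parameter for every single term, every pair of terms, and every triple of terms.

```python
from math import comb

# Parameters in a Bahadur-Lazarsfeld expansion of 1,000 terms
# truncated at third-order dependencies: one per term, per pair,
# and per triple of terms.
n = 1000
params = comb(n, 1) + comb(n, 2) + comb(n, 3)
print(params)  # 166667500 -- over 166 million
```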

To summarize, if a (probabilistic) retrieval model is to provide satisfactory document retrieval, the following three conditions should be met:

(1) the model should not rely on independence assumptions;

(2) feedback data should not be based on single inquiries (or a small set of inquiries);

(3) the computational cost of the model should be acceptable.


Since describing documents well is so difficult, one way to improve document description is to perform the description process repeatedly. Each repetition attempts to improve the description attached to a document. In essence, document redescription is an attempt to determine from past inquiries how a document should have been described so that its description can be modified and made more satisfactory for future inquirers. Such an approach is based on the assumption that there will be identifiable regularities in a large enough set of inquirers' requests for a given document.

Other attempts to modify a document's description have been reported. Brauen adjusts document term weights to improve retrieval effectiveness. Though successful, his adaptation is worse when based on feedback from both successful and unsuccessful searches rather than only the former, and retrieval is more effective for control queries than for the test queries toward which a document redescription is directed. More recently, Furnas has described an adaptive indexing system that learns alternate terms to use in identifying various sources of information (i.e., "documents"). In essence, once a sought document is identified, the system uses all the (single-term) queries that the searcher has used unsuccessfully in trying to locate the document to update frequency counts relating it to these terms. Furnas' approach has been applied only to very small databases (255 documents in the most extensive system studied) and permits analysis only for single-term queries. Clearly, customary bibliographic retrieval is of quite a different character, requiring correspondingly different approaches.


We have seen that representing documents so that they may be effectively retrieved is difficult. Further, we have noted that an attempt can be made to make document description an iterative process, but that research results have not conclusively established the viability of this approach. Now we will see a means of adaptive document redescription that provides improved retrievability of relevant documents.

As explained earlier, inquirers are bound to disagree about the proper description of a document (that is, they will each be requesting it somewhat differently), so the best indexing results will arise from describing a document to best represent it for the group of inquirers who will find it useful (rather than for any particular individual). As a result, we adopt a novel approach to indexing by simultaneously supplying alternative descriptions to the same document and then deriving better descriptions based on feedback that indicates which of the alternative descriptions best describes the document. The retrieval model (its use of feedback momentarily ignored) is this: a document is described by several complete descriptions (for instance, several sets of keywords, or several sets of keyword weights); an inquirer issues a query; and each description of the document is matched against the query as if the document were described with only a single description. The average of these separate matching scores serves as the basis for retrieving or not retrieving the document. (Other functions of these separate matching scores are possible, too, such as the maximum or median.)
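A minimal sketch of this retrieval rule follows; the term sets are invented, and the Jaccard association used here as the matching function is the one employed in the simulations described later:

```python
def jaccard(a, b):
    """Jaccard association between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def match(query, descriptions):
    """Match a query against every description of a document and
    average the separate scores (the mean could be replaced by the
    maximum or median)."""
    return sum(jaccard(query, d) for d in descriptions) / len(descriptions)

# A document carrying three alternative, complete descriptions.
descriptions = [{"genetic", "algorithm", "retrieval"},
                {"adaptive", "indexing"},
                {"document", "retrieval", "feedback"}]
score = match({"adaptive", "retrieval"}, descriptions)
```

The query need not match any single description well; the averaged score lets several partial matches jointly argue for retrieving the document.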

As mentioned, the retrieval model makes use of feedback to change the way a document is represented, rather than keeping the representation static. This is done with a genetic (or adaptive) algorithm, which has demonstrated success in many domains [12] and is currently being studied in artificial intelligence research aimed at promoting learning [13]. Consider a set of objects, each performing an identical task, and assume each object can be represented by a string of symbols. The genetic algorithm operates on such a set of representations, replacing it with another set, then another, and so on. The replacements attempt to produce new sets of objects (more precisely, object representations) in each succeeding generation which, on the whole, are more fit (perform the designated task better) than their predecessors.

The algorithm, applied to the task of document redescription, iterates this two-step process:


1) For any particular document, measure the worth (i.e., performance or fitness) of each of its (fixed number of) descriptions. That is, determine how well each description serves in providing the document to just the right inquirers.

2) Replace the set of descriptions currently associated with that document:

a) Throw away its current set of descriptions.

b) Establish a new set of descriptions out of the set just discarded, using more parts of descriptions that had higher worth. Each of these newly created descriptions will likely be different from all descriptions in the just discarded set.

until some stopping criterion is met.

In other words, what is occurring is a process that attempts to mimic genetics, promoting a population of descriptions built up of parts ("genes") of its fittest members. The first step in the process seeks to determine which descriptions are best doing their jobs (getting a document only to those inquirers who will find it relevant); the second step exploits the information gained in the first but also introduces variety. Together, the two steps seek regularities among the best descriptions, promote descriptions exhibiting such regularities in succeeding generations, and try out novel descriptions in an effort to improve upon the descriptions already tested. Even after adaptation, a document retains a set of descriptions. If deemed desirable, these could then be replaced by a single description derived from this set.
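The two-step process might be sketched as follows. Everything here is illustrative: the worth measure is the average Jaccard association with relevant queries, selection is fitness-proportional, and the "crossover" on term sets is a simplified stand-in for the operators detailed in the Appendix:

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def worth(description, relevant_queries):
    """Step 1: average association between one description and the
    document's relevant queries -- the fitness used for selection."""
    return sum(jaccard(description, q) for q in relevant_queries) / len(relevant_queries)

def next_generation(descriptions, relevant_queries, rng=random):
    """Step 2: replace the description set, selecting parents in
    proportion to worth and recombining them so new descriptions are
    built from parts of the fittest old ones."""
    fitness = [worth(d, relevant_queries) for d in descriptions]
    weights = [f + 1e-6 for f in fitness]   # small floor avoids all-zero weights
    new = []
    for _ in range(len(descriptions)):
        a, b = rng.choices(descriptions, weights=weights, k=2)
        # "Crossover": keep the terms the parents share, then add a
        # random half of the terms found in only one parent.
        child = set(a & b)
        for t in a ^ b:
            if rng.random() < 0.5:
                child.add(t)
        new.append(child or set(a))         # never emit an empty description
    return new

queries = [{"genetic", "retrieval"}, {"adaptive", "retrieval"}]
pop = [{"genetic"}, {"adaptive", "indexing"}, {"retrieval", "opera"}]
rng = random.Random(0)
for _ in range(40):                         # forty generations, as in the simulations
    pop = next_generation(pop, queries, rng)
```

Terms that co-occur in high-worth descriptions tend to survive recombination together, which is how term combinations, not just single terms, proliferate.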

With this background, the results of performing adaptive redescription are now described, and the details of the algorithm are left to the Appendix.


The effectiveness of applying the genetic algorithm to document redescription was tested experimentally. The basic paradigm assumes the same set of queries (with identical relevance judgments) is repeatedly issued to the retrieval system. In adapting document descriptions, the algorithm uses only knowledge indicating the queries to which a document is relevant and the queries to which it is not. Thus, to adapt the way a document is described, it was necessary to collect a set of relevant queries (queries to which a document is judged relevant) and non-relevant queries for each document studied. The judgments that provided these query sets were made by undergraduate college students [10]. In total, a relevant query set and non-relevant query set were obtained for each of eighteen different documents. On average, a document had seventeen descriptors as well as seventeen relevant and seventeen non-relevant queries. (Each was a set of unweighted terms without any Boolean connectives.) Using these, a series of eighteen separate simulations was conducted.

For any given document, an initial set of descriptions was needed to begin the simulation. The same students who made the relevance judgments about documents provided these initial descriptions. In fact, the initial set of document descriptions and the set of relevant queries for that document were identical, the assumption being that the query one uses to find a relevant document and the description one would provide for that document ought to be the same. Each of the eighteen documents studied was treated independently, meaning it had its own set of initial descriptions and its own set of relevant and non-relevant queries.

A snapshot of generation_g of the simulation for hypothetical document_X can be seen in Figure 1. Several things should be noticed. First, there are N generation_g descriptions of this document. (Each of these is a set of subject terms.) In generation_1, these were the descriptions originally supplied; subsequently, they will have been modified. Second, there is an associated set of M relevant queries for this document. Third, a Jaccard's score matching function is used to compute the similarity of every description of document_X to each of its relevant queries. (The Jaccard's score association between two sets X and Y is #(X ∩ Y)/#(X ∪ Y), #(S) denoting the cardinality of set S. In this case, X and Y are the set of terms used to describe a document and the set of terms used in posing a relevant query for that document, respectively.) The Jaccard's score is a common measure of association in document retrieval [19], and use of other association measures would not influence the operation of the simulation or the expected results. Next, notice that the Average Matching Score, that is, the worth, of each of the descriptions currently in force is indicated. It is this worth that is exploited by the genetic algorithm in producing descriptions in generation_g+1. Finally, notice that the overall level of association (the Overall Average, G_g) between descriptions in use during the current generation, g, and the set of relevant queries is indicated. It is this statistic that indicates how well the current generation of descriptions is performing its job.

If genetic adaptation were to succeed in improving document descriptions, then, on average, the level of association between a document and its relevant queries should be higher in generation 40, upon completion of the simulation, than it was in generation 1 using the original set of descriptions. That is, G_40 should exceed G_1. Such improvement did, in fact, occur for each of the eighteen documents studied (averaging approximately 25 percent improvement in Jaccard's score). Further, although a redescribed document exhibited some increase in similarity to its non-relevant queries (used as an experimental control), in seventeen of the eighteen cases this was less than the increase relative to relevant queries. The increase in the latter, on average, was nearly five times as great. See Figure 2 and Table I.

In a second series of simulations, the adaptive procedure was put to a more severe test: redescription attempted to raise the document's average level of association to its relevant queries and, at the same time, to reduce the document's average level of association to its non-relevant queries (which were selected on the basis of their similarity to relevant queries; see [10]). In other words, the attempt was made to redescribe a document so that it would be more likely to be retrieved by those who would find it useful and less likely to be retrieved by those who would not. Again, each of the eighteen documents was adaptively redescribed independently of all the others. As before, these simulations were run for forty generations.

Figure 3 provides an in-progress snapshot of the redescription of hypothetical document_X in conformance with the goals just outlined. Notice that a document has one set of descriptions associated with it, but that these will be matched against both relevant queries and non-relevant queries. G_g, as before, indicates how well, on average, the prevailing descriptions of the document match relevant queries. G'_g makes that same indication for non-relevant queries. For adaptation to succeed, G_g should increase from its generation_1 level while G'_g should fall. In practical terms, this would mean that documents are more likely to be retrieved by relevant queries (since they are now described more similarly to these queries) and less likely to be retrieved by non-relevant queries (due to decreased similarity). If G_g and G'_g should both rise, adaptation still might be deemed successful if the increase in the former is greater than that in the latter. The advantage, in that case, is that there is now a greater difference between the expected level of association between a document and a relevant query and the expected level of association between that same document and a non-relevant query. Thus, when presented with a query of uncertain relevance, it becomes easier to tell whether or not to retrieve the document.

Indeed, G_g did increase as a result of adaptation for each of the eighteen documents studied. In fifteen of the eighteen cases, G'_g dropped (signifying that, in absolute terms, non-relevant queries matched a document less well once it was redescribed to match its relevant queries). In the remaining three cases, G'_g rose slightly, but less than the corresponding rise in G_g for the same document (signifying that, relative to the increased association between relevant queries and adapted document descriptions, an improvement in filtering out non-relevant queries was made). See Figure 4 and Table II (on the following page).


Earlier, we suggested that document retrieval can be improved if: the underlying model is not reliant on independence assumptions; the elicited feedback pertaining to relevance assessments is aggregated across a sufficiently large group of inquirers with similar information needs; and these two criteria are attainable at reasonable computational cost. Success in meeting each of these criteria is now examined for the model of genetic adaptation described.

For any document in the simulation, the distribution of query terms over its set of relevant queries is easily tabulated. Let the proportion of relevant queries (for a given document) using term_i be p_i. Then, one procedure for indexing a document that employs a set of M descriptions suggests that this set employ subject term_i in p_i × M of its descriptions and omit term_i in the remaining (1 − p_i) × M cases. Under the assumption that the distribution of any term_i is independent of any term_j (j ≠ i), and by making use of complete knowledge of all relevant queries used to retrieve a given document, we would obtain identical distributions of subject terms in both the set of descriptions used to describe a document and the queries employed in an attempt to retrieve it. Such a theoretically derived set of document descriptions might be considered near optimal on the assumption that maximal similarity between a set of document descriptions and its relevant queries will arise when every term is distributed identically in both sets.
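This independence-based construction can be sketched directly; the queries are invented, and assigning term_i to a random round(p_i × M) of the M descriptions is an illustrative way of realizing the prescribed proportions:

```python
import random
from collections import Counter

def theoretical_descriptions(relevant_queries, m):
    """Build m descriptions so that term_i appears in about p_i * m
    of them, where p_i is the proportion of relevant queries using
    term_i.  Terms are placed independently of one another -- the
    benchmark the adapted descriptions were compared against."""
    n = len(relevant_queries)
    counts = Counter(t for q in relevant_queries for t in q)
    descriptions = [set() for _ in range(m)]
    for term, c in counts.items():
        k = round(c / n * m)              # p_i * m descriptions get the term
        for d in random.sample(descriptions, k):
            d.add(term)
    return descriptions

queries = [{"genetic", "retrieval"}, {"genetic", "indexing"},
           {"adaptive", "retrieval"}, {"genetic", "retrieval"}]
descs = theoretical_descriptions(queries, 4)
```

Because each term is placed without regard to the others, term combinations that co-occur in queries are not preserved; that is precisely the weakness the genetic algorithm avoids.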

The actual effectiveness of describing a document with such a theoretically derived set of descriptions was compared to the effectiveness of genetically adapting descriptions for the same documents. The results pointed out the superiority of adaptation: for each of the documents studied, the adapted descriptions more effectively matched relevant queries than did the theoretical descriptions, the improvement averaging approximately 25 percent (as measured by Jaccard's score).

The genetic algorithm is responsible for this improvement. By its action, combinations of index terms that best serve to describe a document, rather than just individual terms, proliferate from generation to generation. More technically, promoting competition among a set of objects represented as strings of symbols, and then introducing variability after reproducing these strings in proportion to their effectiveness, will yield an increasing proportion of the fittest schemata over time. (A schema names a hyperplane, or set. For instance, the binary schema 01#001# stands for the set of four seven-place strings {0100010, 0100011, 0110010, 0110011}, "#" standing for either 0 or 1.) In the case of document descriptions, fit schemata are those subject term combinations that best describe documents.

Thus, descriptions of documents are built up out of index term combinations quite differently, and more effectively, than if index terms were supplied independently to documents. This is consonant with Holland's commentary on the proof of the genetic algorithm: "In effect, useful linkages are preserved and non-linearities (epistases) are explored ... giving a performance ... which is not simply the sum of their [for us, subject terms'] individual performances." And, with a suitable number of descriptions attached to a document, by the central limit theorem, sets of subject terms "with the higher average fitness quickly predominate [12]." In other words, the genetic algorithm produces document descriptions that surpass (in performance) those that could be generated from identical information using assumptions of statistical independence. The operation of the genetic algorithm differs in this respect from other document retrieval feedback techniques (used to alter document descriptions or queries) which modify the weight of any given subject or query term independently of all others.

A second criterion for a successful document retrieval system is that it collect feedback on which to base retrieval from the population of inquirers, rather than an individual query or single inquirer. Such feedback underlies the operation of the system described. The current descriptions associated with a document are created as a result of relevance assessments of the document by all past inquirers who were furnished the document.

In being independence-free and in describing documents to meet the needs of a population of inquirers, the model has theoretical value. In addition, the computational cost of the model does not outweigh its advantages. In supplying a document with several (n = 15, for instance) descriptions rather than one, we suffer a linear (i.e., fifteen-fold) increase in both the storage required for descriptions and the time required to match queries to document descriptions. Although the increase in storage is inevitable, efficient means of compressing descriptions and decreasing storage costs could mitigate this disadvantage. (It is also possible to multiply describe only the most actively sought documents in the collection, or to replace the set of descriptions with which a document is represented by a single, consensual description once adaptation is complete.) Importantly, notice that by incurring this increased storage cost, we obtain a distribution of subject terms within the description set of a document which is more effective than that derived theoretically using an independence assumption together with complete information about relevant queries. In short, we are accomplishing what is so difficult in probabilistic models: obtaining effective probability estimates which can be used to improve the retrieval of documents.

The matching costs are potentially more damaging, yet are more easily dealt with in practice. With clustered files of document descriptions, or by means of file index construction, the documents deemed potentially relevant to a query can be restricted to a greatly reduced fraction of the entire database. Thus, although we have a linear increase in the number of matches to be made, constraining the search to a small subset of the database considerably lessens this concern. (It is also possible to match a query against just a few of the descriptions of the set of documents initially suspected to be relevant. Then, using these matching scores, the retrieval system can continue to match against all descriptions of just those documents indicating greatest likelihood of relevance.) Research into improving the time and space complexity of genetic adaptation should, of course, precede implementation of a system based on this model.

It is important to understand the design principles that underlie commercial document retrieval systems. Usually these systems allow full-text searching of documents and/or abstracts, and sometimes keyword searching as well. (Both full-text and keyword retrieval logically represent documents by sets of subject terms. The difference between them is that full-text indexing chooses as subject terms all the words in a document's text or abstract--with the exception of stop words such as prepositions and articles--while keyword indexing is performed by human indexers, who normally select fewer than a dozen subject terms to represent a document's subject.)

Since full-text retrieval prescribes that a document be indexed by each of its substantive terms, commercial systems have little control over the logical representation of documents. Nevertheless, current retrieval systems attempt to improve retrieval effectiveness in various ways.

One way to improve retrieval effectiveness is to improve presentation. By furnishing documents that match an inquirer's query more rapidly, systems allow inquirers to spend more of their time productively browsing through documents to determine which are relevant and less time waiting for text to be displayed. Such improvements in response time can be achieved by devising new physical representations for sets of document terms, together with algorithms that exploit these representations for more efficient retrieval. Presentation can also be improved by preserving the original details of stored documents for display. For instance, inquirers may come to quicker, more accurate judgments about whether a presented document is relevant if they can see the way its author formatted it (including the author's use of indentation, italics, bullets, etc.).

In addition to improving presentation, full-text systems have tried to boost recall by altering the grain of text that one can search. For instance, some systems allow searching for phrases of text, such as "hostile takeover" or "beta test site," as easily as searching for individual words. Formerly, proximity operators were required to describe such phrases by indicating adjacent words, or words within a specified number of terms of each other. Phrases containing stop words, such as "balance of trade," can today be more easily and reliably located by some retrieval systems. This is accomplished by character position indexing of text tokens rather than word position indexing. That is, knowing a document contains the word "trade" beginning five text positions after the word "balance" is stronger evidence for the phrase "balance of trade" occurring in a document than knowing that "trade" and "balance" co-occur within the document, or even that they are separated by a single word. Searching at a more natural grain of text is also provided by systems that support searching for acronyms, abbreviations, words with punctuation (such as CD-ROM), and even made-up words (such as the part number "111-22-34(7).b").
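The character-position indexing just described can be sketched in miniature. The index layout below (each substantive token mapped to the character offsets where it begins) is an assumption for illustration; real systems' index formats differ:

```python
# Sketch of character-position indexing: because the index records where
# each token begins, the phrase "balance of trade" can be verified even
# though "of" is a stop word and is never indexed itself.

def char_index(text):
    """Map each non-stop word to the character offsets where it begins."""
    stop = {"of", "the", "a"}
    index, pos = {}, 0
    for word in text.split(" "):
        if word not in stop:
            index.setdefault(word, []).append(pos)
        pos += len(word) + 1  # +1 for the separating space
    return index

def phrase_present(index, first, gap, second):
    """Does `second` begin exactly `gap` characters after `first` begins?"""
    return any(p + gap in index.get(second, [])
               for p in index.get(first, []))

idx = char_index("the balance of trade improved")
# "of " occupies 3 characters, so "trade" begins len("balance") + 1 + 3 = 11
# characters after the start of "balance".
print(phrase_present(idx, "balance", 11, "trade"))  # True
```

This is exactly the evidence the text describes: a fixed character gap between "balance" and "trade" is far stronger support for the phrase than mere co-occurrence.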

Finally, commercial systems provide tools for augmenting the basic operation of retrieval. Thesauruses describe semantic relationships among terms (synonymy, generalization, specialization, etc.); built either when the collection is indexed or just before a search, they can thus broaden or refine queries and improve retrieval. Additionally, some retrieval systems allow their operations to be modified or enhanced to support customized retrieval. For instance, the product of a full-text search can serve as input to a document-ranking algorithm. Thus, two original equipment manufacturers (OEMs) may use the same full-text retrieval software on the same database of documents yet provide considerably different retrieval effectiveness by ranking documents differently. The use of rankings and other modifications and enhancements is an acknowledgment of some of the difficulties known to beset commercial retrieval, particularly the difficulty of finding the appropriate, precise vocabulary for searching and of composing suitable requests using the Boolean connectives AND, OR, and NOT.
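The "ranking on top of Boolean search" customization mentioned above can be sketched as follows. The term-frequency ranker here is a deliberately simple stand-in; an actual OEM's ranking algorithm would be far more elaborate:

```python
# Hedged sketch: a Boolean AND full-text search whose result set is then
# handed to a simple term-frequency ranker.

def boolean_and(query_terms, docs):
    """Full-text AND search: keep documents containing every query term."""
    return [d for d in docs
            if all(t in d["text"].lower().split() for t in query_terms)]

def rank(query_terms, matched):
    """Order matched documents by total query-term frequency."""
    def freq(doc):
        words = doc["text"].lower().split()
        return sum(words.count(t) for t in query_terms)
    return sorted(matched, key=freq, reverse=True)

docs = [
    {"id": 1, "text": "hostile takeover rumors; the takeover failed"},
    {"id": 2, "text": "a hostile takeover was avoided"},
]
matched = boolean_and(["hostile", "takeover"], docs)
print([d["id"] for d in rank(["hostile", "takeover"], matched)])  # [1, 2]
```

Both documents satisfy the Boolean query, so the Boolean result alone cannot distinguish them; only the ranking layer does.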


The difficulty of describing documents well has been indicated. Probabilistic models of document retrieval have been discussed to show how hard it is to achieve theoretical soundness along with effective implementation. Although less theoretically guided than strict probabilistic retrieval models, an adaptive approach has been described that overcomes the problems probabilistic models suffer: 1) implementations based on independence assumptions; and 2) probability estimates either of uncertain reliability or value, or attained only at prohibitive cost. (We note, too, that even rigorous probabilistic models are, in practice, heuristic to a considerable extent. For instance, the Robertson-Sparck Jones model formally calculates probabilities of relevance without regard to queries at all; in practice, though, queries are used heuristically to restrict computations.) The effectiveness of the adaptive approach has been documented.

In an operational retrieval system, an initial set of descriptions for a document could be obtained by means of competing automatic indexing procedures. Alternatively, models suggesting the probable effectiveness of employing given subject terms to documents could be used to stochastically generate an initial set of descriptions.

The simulation experiments described model a document that is repeatedly requested by the same set of queries. In actuality, the way a document should be described is likely to change over time as it is requested differently. Because it redescribes documents rather than leaving their descriptions fixed, the adaptive retrieval model discussed automatically accommodates such changes. Since the model employs multiple descriptions of any document, one of these could consist of just those query terms from a recent query to which the document was, or should have been, judged relevant. In this way, entirely new subject terms can be incorporated in describing a document. In the model as presented, however, the terms already supplied are simply redistributed more effectively across the various descriptions of the document.
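The simplest version of the redescription move just described can be sketched in code. This is a toy: the article's model adapts descriptions by genetic operators, whereas the sketch below shows only the single step of replacing a document's worst-matching description with the terms of a recent relevant query:

```python
# Toy sketch (not the article's genetic algorithm): after a relevance
# judgment, swap the description least like the relevant query for the
# query's own terms, letting entirely new subject terms enter.

def redescribe(descriptions, relevant_query):
    """Replace the description least similar to a recent relevant query."""
    def similarity(d):
        return len(d & relevant_query)
    worst = min(range(len(descriptions)),
                key=lambda i: similarity(descriptions[i]))
    new = list(descriptions)
    new[worst] = set(relevant_query)  # new terms can be incorporated here
    return new

descs = [{"retrieval", "indexing"}, {"databases", "search"},
         {"genetic", "models"}]
print(redescribe(descs, {"probabilistic", "retrieval"}))
```

Repeating this step as judgments accumulate lets a document's description set track changing patterns of request over time.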

In a broader sense, this article suggests that the theoretical principles behind probabilistic document retrieval can be used to guide the design of implementable retrieval systems that provide a high degree of retrieval effectiveness. Though other forms of adaptation may be even better suited to certain retrieval environments, the lesson remains that theory can successfully influence practice.
COPYRIGHT 1988 Association for Computing Machinery, Inc.

Article Details
Author: Gordon, Michael
Publication: Communications of the ACM
Article Type: technical
Date: Oct 1, 1988
