Abstracts and Abstracting in Knowledge Discovery

ABSTRACT

Various levels of criteria for judging the quality of abstracts and abstracting are presented. Requirements for abstracts to be read by humans are compared with requirements for those to be searched by computer. It is concluded that the wide availability of complete text in electronic form does not reduce the value of abstracts for information retrieval activities even in such more sophisticated applications as knowledge discovery.

INTRODUCTION

Abstracts were first developed to be read by humans, providing concise summaries or descriptions of published items suitable for inclusion in printed indexing services or in scholarly journals along with the articles to which they relate. When computers started to have a serious impact on information retrieval in the 1960s, abstracts became important as human-readable output from electronic databases. Later, as storage and processing costs declined, they began to assume a new role--that of computer-searchable surrogates for larger bodies of text.

Today, of course, it is economically feasible to store vast quantities of text in computer-searchable form. Nevertheless, this has not made abstracts redundant. They remain useful summaries to be read by humans. Furthermore, if recall and precision are both taken into account, they may still be optimum for retrieval purposes because the searching of full text will frequently cause an unacceptable level of irrelevancy.
Several investigators (e.g., Tenopir, 1985) have shown that searching abstracts may be more effective or more cost-effective than searching of full text, and Salton (1971) found that, although full text gave better overall results than abstracts in the type of automatic processing employed in his SMART retrieval system, the differences were not great and the abstracts allowed more cost-effective processing.

On the surface, one might assume that knowledge discovery operations would be most likely to succeed when the complete text of items is processed. This is not necessarily so because full text can generate so many spurious relationships that significant and useful associations will be virtually impossible to recognize. Abstracts may still have great value in knowledge discovery activities as they do in many others.

This article will review various criteria by which the quality of abstracts may be judged. It will then discuss which criteria apply most clearly to the value of abstracts in knowledge discovery applications.

QUALITY IN GENERAL

The word "quality" occurs frequently in everyday life and, in this general setting, stands for an idea that, while not necessarily exact, seems readily understood. On the other hand, in more formal and restricted applications--such as science, technology, commerce, and education--much less agreement exists on what "quality" really means and how the quality of something is to be measured and expressed.

This is less true, of course, when applied to things that are concrete. The quality of many manufactured products can be precisely quantified. This results from the fact that they must conform to standards that are strictly enforceable and are precisely quantifiable--e.g., steel either meets a standard relating to its composition or it does not. In the manufacturing situation, then, "quality control" is not a nebulous idea--it relates to the extent to which products meet the required standards.

In less concrete settings, such as those relating to various types of services, quality is less easily defined. For example, we may refer to the "quality of law enforcement" or the "quality of library service," but these are notions that are more subjective than objective.

Although quality is an imprecise idea in many contexts, it is obvious
that the last decade or so has brought a great increase in concern for
"quality" in virtually all areas of human endeavor. The growth
of the literature on the subject is a tangible manifestation of this.

Nevertheless, it is somewhat misleading to speak of quality as though it were a single idea. Instead, one may recognize various levels or perspectives, as illustrated in Table 1. At the one extreme, there is the abstract or transcendental idea of quality, one that is static, absolute, and existing only in philosophical and metaphysical speculation. At the other extreme is the "user" perspective, which is personal and even, perhaps, idiosyncratic. It is also dynamic and "relative"--in the sense that it often involves a comparison and the choice of one among several alternatives. Frequently the choice will be made on the basis of cost, which could be a cost in monetary form or in terms of time and convenience.

Table 1. VARIOUS POSSIBLE LEVELS OR PERSPECTIVES RELATING TO QUALITY
Perspective       Basis of judgment          Characteristics

ABSTRACT          Philosophy                 Absolute
                  Speculation                Static

ORGANIZATIONAL
  PROCESS         Standards, regulations,    Some processes may be strictly
                  norms                      regulated; others not
  PRODUCT         Standards                  For manufactured products, may be
                                             objective and enforceable
  SERVICE         Standards or norms         More subjective than objective;
                                             rarely enforceable

USER/CUSTOMER     Cost                       Dynamic
                  Value                      Relative
                  Personal value system
Between these extremes, we have other levels or perspectives, identified in the table as being "organizational." Quality related to products varies greatly with type of product. For the many products that must be manufactured to conform to standards, quality can be considered close to absolute, at least relative to the standards, but not completely so since most manufacturing standards accept a range of values, albeit a very narrow one in many cases. Intellectual products, such as various forms of publication, are less susceptible to true standardization. At least, this is true of their content. The container (paper, binding, and so on) can be standardized.

The process perspective is heterogeneous. Some processes can be standardized. In fact, in some cases, processes may be subjected to absolute regulation--e.g., concerning cleanliness, safety, and other health-related issues. Again, intellectual processes are not as susceptible to regulation or standardization.

The service perspective falls midway between the product perspective and the user perspective. Services can rarely be judged in absolute terms.
Although some aspects of service can be quantified--e.g., number of seats per reader, number of students per instructor--the standards are rarely completely enforceable so they tend to be normative values rather than true standards, and some services (e.g., associated with organized religion or with certain social agencies) seem not susceptible to evaluation against any type of standard. Nevertheless, approaches to the enforcement of quality within service agencies have become increasingly sophisticated in the last several years, culminating in adoption of the principles of total quality management (TQM), which include emphasis on customer satisfaction and on continuous improvement.

QUALITY IN INFORMATION SERVICE SETTINGS

Since information tends to be intangible, it is quite difficult to obtain agreement on appropriate measures of quality for most elements of information service. All of the various perspectives represented in Table 1, except for the purely philosophical, can apply in the information service environment. Quality can be considered in tangible terms for many aspects of information products but can be quite elusive elsewhere, especially in both the service perspective and the user perspective.

Take, for example, the case of an electronic database. Quantifiable measures of quality can be applied when the database is considered as a product--i.e., its coverage of the literature within its scope, the average number of access points per item, up-to-dateness, and so on. Retrospective search and current awareness services derived from use of the database present more difficult problems. While certain measures of service quality can be objective and quantified (e.g., average time elapsing from demand to delivery of response), the more important measures, such as those of recall and precision, are both subjective and difficult to apply.

When the user perspective is considered here, of course, the situation becomes even more subjective. For example, a database search can retrieve many items that match a user's stated request or stored interest profile but may still be judged of little value by the user, because the actual information needed did not appear in the search results, because the items retrieved were already known to the user, because he considered them as insignificant contributions to the subject, or for some other reason that might be quite idiosyncratic. Moreover, if the user has to pay for the service, he may apply a purely cost-effectiveness measure to judge the quality of the search results--i.e., the cost per useful item retrieved.

The process perspective on quality is not as nebulous as the user perspective, but it is still an area in which it is difficult to apply true standards. This is because many of the processes are intellectual. While certain applications can be standardized (e.g., form of name in catalog
entries), others, such as subject indexing, are not susceptible to standardization except in very trivial aspects. Quality concerns applied to another intellectual process, abstracting, are the focus of our present discussion.

QUALITY CONSIDERATIONS APPLIED TO ABSTRACTING

From a psycholinguistic perspective, abstracting is more ambitious and complex than indexing: not only must the text of documents be analyzed in some detail but text (the abstract) must also be produced. This text must be coherent syntactically and semantically and, at the same time, be a reasonable summary of the original document. Abstracting is the most difficult of all operations normally applied in a document processing environment because, today at least, an abstract must act as both content description and retrieval tool. Fidel (1986) has shown that these two uses may not be completely compatible.

A possible model of the abstracting process is presented in Figure 1. In actual fact, four levels of processing are represented. The goals are defined by the service or journal producing the abstracts and may be embodied or reflected in guidelines for the abstractors. The individual abstractor observes the goals by following these guidelines. The two processes, "content interpretation/selection" and "content transformation," are directly equivalent to the conceptual analysis and translation stages of subject indexing (Lancaster, 1998). The former is concerned with understanding what is discussed in the original text and deciding which elements should be included in the abstract, while the latter is concerned with the composition of the abstract--i.e., how the selected elements are to be presented in the text of the abstract.

[Figure 1 ILLUSTRATION OMITTED]

The process headed "checking" is the process directly related to quality. It has several possible dimensions: the individual abstractor may impose his/her own review of quality before submitting the abstract for further processing, the abstractor's work may later be checked by an editor or senior abstractor before publication, and readers may apply their own quality checks relating to the intelligibility of the abstract and its value in predicting the relevance of the original item to their own interests.

Figure 1 suggests that the quality of the abstract is largely determined by the quality of the knowledge base of the abstractor. The knowledge base incorporates both linguistic knowledge (ability to interpret the language of texts in the subject area dealt with) and nonlinguistic knowledge: understanding of the subject matter, of the needs and interests of the audience served, and of the guidelines under which the abstractor is to operate.
Although the application of abstracts in retrieval (as substitutes for or complements to sets of index terms) makes them more important now than ever before, especially in the Internet environment (Wheatley & Armstrong, 1997), there exist no generally accepted measures of their quality. Of course, many writers have identified their desirable attributes. Borko and Bernier (1975), for example, regard abstracting as a form of writing that has a unique style (it is not a "natural" form); abstracts must be brief, accurate, and clearly written. Unlike Cremmins (1996), they do not claim that abstracts must have "elegance." Lancaster (1998) suggests two broad criteria for judging quality: are the major points of the article covered and are they represented accurately, succinctly, and unambiguously? The latest English-language standard (National Information Standards Organization, 1997), while it gives guidance on style, makes no attempt to provide criteria that can be used to assess quality. Other writers (e.g., Brown & Day, 1983) have focused on the art of text summarization or on the skills needed by a good abstractor (e.g., see Endres-Niggemeyer, Maier, & Sigel, 1995).

Interest in the evaluation of abstracts can be traced back to at least the late 1950s. For example, Edmundson et al.
(1959) proposed several criteria: comparison with an "ideal" abstract, the retrievability of a document by the abstract, the extent to which the abstract could be used to answer test questions, and the use of intuitive subjective judgment. Payne, Munger, and Altman (1962) also suggested a test of the value of abstracts in answering questions, as well as a measure of the amount of text reduction achieved in an abstract, and the use of a consistency test in which the similarity of different abstracts, prepared from the same document, is compared. Vinsonhaler (1966) recommended use of a seven-point scale to determine the similarity between an abstract and the document it relates to; also proposed was a more conventional approach, one of predictive validity--the extent to which abstracts are able to correctly predict the relevance of documents. Mathis (1972) offered a numerical value, known as the "data coefficient" (DC), for the evaluation, expressed by a formula that incorporates a data retention factor and a length retention factor. The value of the DC is increased by reducing the number of words in the abstract, by increasing the number of concepts ("data elements") represented, or both.

Several of these approaches have been applied over the years. The most favored is a test of the ability of an abstract to predict the relevance of a document to a particular information need. Investigators who have applied this to abstracts, or to extracts derived by computer, include Rath, Resnick, and Savage (1961); Resnick (1961); Kent et al. (1967); Dym (1967); Shirey and Kurfeerst (1967); Saracevic (1969); Marcus, Benenfeld, and Kugel (1971); Thompson (1973); and Keen (1976). Hartley, Sydes, and Blurton (1996) provide an example of a study in which abstracts are judged on their ability to answer various questions; in this case, they were comparing "structured" abstracts with unstructured ones. Salton et al. (1997) used a variation of the similarity approach: the extent to which an automatically-derived extract resembles one derived by humans.

Other approaches have assessed the "readability" of abstracts using standard readability formulas, comprehension measures, or both. Examples can be found in the work of Dronberger and Kowitz (1975), King (1976), Tenopir and Jacso (1993), and Hartley (1994). More recently, Wheatley and Armstrong (1997) studied readability of a variety of abstracts drawn from Internet sources.

A more "linguistic" approach was used by Salager-Meyer (1991), who analyzed a sample of medical abstracts from this perspective, finding almost half to be "poorly structured" (i.e., having discoursal deficiency). Since "discoursal deficiency" can include such things as conceptual scatter (e.g., results reported in different places in the abstract), as well as omission
of an important element (e.g., purpose of research) from the abstract, the author implies that abstracts flawed in this way will be less effective in conveying information. Elsewhere, Pinto (1992, 1994, 1995) has dealt in detail with the process of text summarization from the viewpoint of linguistic structure.

It is clear that the various quality criteria proposed or used in the past look at abstracts/abstracting from different perspectives. In fact, virtually all perspectives represented in Table 1 can apply to abstracts or abstracting, as shown in Table 2.

Table 2. ATTRIBUTES OF QUALITY ASSOCIATED WITH DIFFERENT PERSPECTIVES ON ABSTRACTS AND ABSTRACTING
Perspective                   Attributes of quality

Process perspective           Exhaustivity; Accuracy; Readability;
                              Cohesion/coherence; Cost
Product perspective           Consistency; Brevity; Cost
Process/product perspective   Density; Cost
Service perspective           Customer satisfaction; Cost-effectiveness
User perspective              Cost; Value
The process perspective deals primarily with attributes of cognitive representation. Here analogies can be drawn between the process of abstracting and the process of indexing (Lancaster, 1998).

The exhaustivity of the abstract relates to its breadth of coverage. In essence, it is a measure of the extent to which all of the themes of the original text are represented in the abstract. Clearly, an abstract is unlikely to include all the content of the original text (unless it is completely trivial) so the exhaustivity of the abstract can be considered as the extent to which all of the themes (ideas, conclusions, or whatever) judged important are covered in the abstract. This implies that some group of people, presumably specialists in the subject area dealt with, can agree on what is important in the original and what is not.

In an ideal situation, of course, an abstract should be tailored to the needs of a particular audience. This is most obvious in the case of one written for an in-house bulletin prepared, for example, to serve a particular company or research organization. In this case, an exhaustive abstract would be one that covers all the themes of the original that are of potential interest to the limited community. In an extreme case, this might be a single theme--e.g., results of applying a particular drug extracted from a medical article discussing multiple approaches to the treatment of some disease. Clearly, the writer of such an abstract must have a good knowledge of the needs and interests of the target community as well as familiarity with the subject matter dealt with.
The more heterogeneous the interests of the audience served, the less likely one is to reach agreement on which themes to include in the abstract and which not: difficult in the case of general mission-oriented abstracts (e.g., serving the needs of an entire industry), more difficult still in the case of abstracts intended to serve the needs of an entire discipline.

Accuracy refers to the extent to which the abstract correctly represents the original text. A theme covered in the abstract could be an inaccurate representation of the original because of an intellectual error (the abstractor misinterprets the text) or an error of carelessness (the abstractor records incorrectly--e.g., gives a wrong numerical value). The former should be relatively rare but could occur if the abstractor is not fully familiar with the subject matter or if the original text is somewhat obscure. A special case would be the situation of an abstractor dealing with a language in which he is not completely fluent. Accuracy errors of the second type would be attributable to personal characteristics of the abstractor (ability to concentrate, ability to transcribe correctly), including qualities that could vary considerably from one day to the next, and to working conditions. Most significant of the latter would be pressures associated with required productivity, where an abstractor may be required to produce a specified number of abstracts in a particular time period.
Of course, once the abstract has been printed and distributed, it would be impossible to determine whether an error of this type was attributable to the abstractor or was introduced at some later stage of the production process.

The readability of an abstract is determined by the ability of the abstractor to express himself clearly, concisely, and unambiguously, by the rules or guidelines under which he operates, and by the format of the abstract (e.g., some claim that abstracts structured into paragraphs with topical headings are easier to comprehend). To the extent that general tests of the readability of text (e.g., the Flesch Reading Ease formula) or of comprehension (e.g., cloze criteria) are applicable to abstracts, readability can be an objective measure and one that can be quantified.

Cohesion/coherence is related to readability but is not identical with it. These properties relate to connectivity between different parts of a text. Extracts prepared by computer (selecting sentences on the basis of statistical, positional, or linguistic criteria) will frequently be lacking in these properties, even though the total extract may be a satisfactory representation of the principal themes of the original text. Salager-Meyer (1991) is perhaps the only author to apply such linguistic criteria to humanly prepared abstracts. A major measure used was that of conceptual scatter--the extent to which related elements (e.g., results) are separated in an abstract.
Since structured abstracts (see Haynes, 1993; Hartley, 1994; Hartley, Sydes, & Blurton, 1996) are formatted into paragraphs with preestablished subheads (e.g., methods, results), they are less likely to exhibit such conceptual scatter. Factors affecting cohesion/coherence are the same as those affecting readability.

The product perspective (see Table 2) relates to the technical adequacy of the abstract. The idea of consistency in abstracting is similar to consistency in subject indexing. It refers to the degree to which two individuals produce abstracts that are similar to each other (interabstractor consistency) or the degree to which the same individual agrees with himself when abstracting a document on different occasions (intra-abstractor consistency). In the indexing situation, a distinction can be made between consistency in conceptual analysis and consistency in the translation of the conceptual analysis into a particular vocabulary (e.g., terms drawn from a thesaurus). Consistency in abstracting, however, applies only at the conceptual level since it is unrealistic to expect different individuals to use exactly the same words or grammatical constructions. Presumably, consistency will be greatest when abstractors work to precise rules as to what to include and what not. For obvious reasons, structured abstracts should be more consistent than others.

In abstracting, just as in indexing, consistency is not the same as quality (Cooper, 1969). Nevertheless, if two abstractors (or indexers) consistently produce similar results, while a third agrees little with the other two, one is generally inclined to believe that the consistent abstracting (indexing) will be "better."
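One hedged way to make consistency at the conceptual level concrete is to reduce each abstract to the set of concepts it covers and to measure the overlap between two such sets. The sketch below uses the ratio of shared concepts to all concepts used by either abstractor (the Jaccard coefficient). The choice of this particular measure, the function name `concept_consistency`, and the sample concept labels are all illustrative assumptions; none of the studies cited here prescribes them.

```python
from typing import AbstractSet


def concept_consistency(a: AbstractSet[str], b: AbstractSet[str]) -> float:
    """Share of concepts common to two abstracts (Jaccard coefficient).

    Assumption: each abstract has already been reduced, by hand, to a set
    of concept labels; wording and grammar are deliberately ignored.
    """
    union = a | b
    if not union:
        return 1.0  # two empty concept sets are treated as fully consistent
    return len(a & b) / len(union)


# Hypothetical concept sets for three abstracts of the same document.
abstractor_1 = {"purpose", "method", "results", "conclusion"}
abstractor_2 = {"purpose", "results", "conclusion", "limitations"}
abstractor_3 = {"background", "method"}

pair_12 = concept_consistency(abstractor_1, abstractor_2)
pair_13 = concept_consistency(abstractor_1, abstractor_3)
pair_23 = concept_consistency(abstractor_2, abstractor_3)
print(pair_12, pair_13, pair_23)
```

In this made-up example the first two abstractors share three of five concepts, while the third shares at most one concept with either of them, which is the pattern described above: two abstractors agree with each other and a third agrees little with both. As the text cautions, a high score shows agreement only; it does not by itself show that the agreed-upon abstract is a good one.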
Salton, Singhal, Mitra, and Buckley (1997) justify their automatic procedures for selecting and linking pieces of text on the grounds that the summary thus produced is as likely to agree with a humanly-produced summary as one humanly-produced summary is to agree with another. In translating from one language to another also, consistency (similarities) has been suggested as an indicator of quality (Brew & Thompson, 1994).

Brevity is an obviously desirable attribute of a good abstract, and it is susceptible to exact measurement. Moreover, length is one of the few attributes that the published standards can and do address precisely, at least in terms of a recommended range in number of words. Nevertheless, brevity should always be secondary to other considerations such as exhaustivity and accuracy. Moreover, absolute standards make little sense since several factors would influence the brevity: the length, complexity, or diversity of the original, type of abstract (indicative, informative, critical), and accessibility of the original (one could argue that materials less physically or intellectually accessible--e.g., published in obscure sources or unfamiliar languages--should be abstracted more fully).

Cost can be related to abstracts at different levels: the intellectual cost of creating an abstract, the cost per abstract of producing a printed publication, the cost per abstract in distribution (e.g., as part of a current awareness service), and so on. Factors affecting cost differ from level to level. For example, abstract length has a major effect on the cost of producing a printed publication but much less effect on the inclusion of an abstract in an electronic database.
Cost of writing the abstract in the first place depends most obviously on who the writer is, how much he or she is paid, and who is paying. The cost of abstracting can be looked at from several different perspectives. For example, use of author-generated abstracts is economical for database producers. From the much broader societal perspective, however, they are very expensive since the time of such authors as research scientists can be considered so valuable that it is perhaps better spent on other things. Carried to its logical conclusion, of course, this argument implies that the greatest cost associated with abstracting is the cost of the time spent by people in reading the abstracts (thus the importance of such factors as brevity and readability) and in taking actions based upon them (thus the importance of such factors as accuracy and exhaustivity). Cost, then, is a multifaceted attribute when related to abstracts and abstracting. For this reason, it appears within all the perspectives illustrated in Table 2.

Density is a measure that relates the attribute of exhaustivity to that of brevity. It thus, in a sense, combines the process and product perspectives. Given that the abstract includes everything that should be included (all the topics of potential interest to the intended audience), the briefer the abstract the better, provided, of course, that other requirements, such as readability, are not significantly degraded. Density, then, refers to the amount of information content provided by an abstract of a certain length.
The density of an abstract can be considered related to its entropy, that is, the extent to which uncertainty about the original document is reduced for the reader of the abstract. Standard tests of the relevance predictability of abstracts address this issue. The data coefficient proposed and tested by Mathis (1972) was a precise measure of density, defined by the equation DC = C/L; that is, the data coefficient (DC) is the "data retention factor," C, divided by the "length retention factor," L. The C value is the measure of exhaustivity as defined earlier in this discussion, while the L value is the number of words in the abstract divided by the number in the original. Clearly, the DC of an abstract improves as either exhaustivity or brevity increases.

While the process and product perspectives consider abstracts as entities in their own right, the service perspective is obviously concerned with their application. Providers of abstracts, whether publishers and editors of scholarly journals or producers of secondary databases in printed or electronic form, are presumably concerned with offering a product that the majority of their customers (journal readers, database users) will find acceptable. Customer satisfaction will most obviously be associated with the process and product parameters discussed earlier, perhaps most closely with accuracy, readability, and exhaustivity. Clearly, the providers will also be concerned with production and distribution costs so, ultimately, "quality" becomes a matter of cost-effectiveness, i.e., customer satisfaction at least cost.

As mentioned earlier, the user perspective on quality will tend to be subjective, relative, dynamic and, perhaps, idiosyncratic. Users of abstracts will be likely to judge their quality in practical and pragmatic terms. They are unlikely to demand elegance but they will expect readability.
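As a worked illustration of the data coefficient of Mathis (1972) described above, the following minimal sketch computes DC = C/L from word counts. It assumes that C is supplied as a proportion between 0 and 1 (the text defines C only as the measure of exhaustivity); the function names and the sample figures are hypothetical and are not drawn from Mathis's data.

```python
def length_retention(abstract_words: int, original_words: int) -> float:
    """L: words in the abstract divided by words in the original."""
    if abstract_words <= 0 or original_words <= 0:
        raise ValueError("word counts must be positive")
    return abstract_words / original_words


def data_coefficient(c: float, abstract_words: int, original_words: int) -> float:
    """DC = C / L.

    c is the data retention factor (exhaustivity). Treating it as a
    proportion in [0, 1] is an assumption made for this sketch.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c is assumed to be a proportion between 0 and 1")
    return c / length_retention(abstract_words, original_words)


# Hypothetical figures for a 4,000-word original.
baseline = data_coefficient(0.5, 250, 4000)          # L = 0.0625
more_exhaustive = data_coefficient(0.75, 250, 4000)  # higher C, same length
longer = data_coefficient(0.5, 500, 4000)            # same C, twice the length

print(baseline, more_exhaustive, longer)  # 8.0 12.0 4.0
```

With these figures, raising exhaustivity at constant length raises DC, and doubling the length at constant exhaustivity halves it, which matches the observation that DC improves as either exhaustivity or brevity increases.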
Ultimately, they will judge abstracts and abstracting services in terms of costs and value to themselves. Taking the user's own time into account, the predictive validity of the abstract is of paramount importance. That is, users will be unhappy with a service whose abstracts frequently cause them to incur the costs of obtaining complete texts that turn out to be irrelevant. Nor will they be satisfied with one that frequently fails to lead them to sources that they would judge valuable if seen in full form.

CURRENT METHODS

The automatic processing of text has increased considerably over the years as computing power has increased, computing and storage costs have decreased, and more and more text has become available in electronic form, largely as a byproduct of various forms of publishing. The development of the Internet and the World Wide Web, which makes vast quantities of text accessible to huge numbers of users, has made text search the norm rather than the exception. As might be expected from all of this, interest in automatic text processing methods has increased very greatly in the 1990s, in the research community as well as in government and commercial sectors. Current approaches to the processing of text, for information retrieval and related purposes, are well portrayed in the proceedings of a series of conferences.
Most important among these have been the Text Retrieval Conferences (TREC) organized by the (U.S.) National Institute of Standards and Technology (Sparck Jones, 1995; Harman, 1997), the Message Understanding Conferences (MUC), the Conferences on Applied Natural Language Processing, and the International Conferences on Document Analysis and Recognition. The TREC and MUC conferences are particularly important for their methodology: all participating research groups must apply their text processing procedures to some common pre-established tasks, allowing performance comparisons across the methods.
Current methods of text processing for information-retrieval-like purposes go beyond text search, automatic indexing, and automatic extracting procedures (all of which have existed, to some extent at least, since the late 1950s), now including such activities as text linkage, text augmentation, and text generation. Nevertheless, while current approaches may achieve rather better results, they do not differ much in principle from those first introduced forty to fifty years ago, even though they may be given different names ("text summarization" in place of abstracting/extracting, "text categorization" in place of indexing/classification, and so on) and may be more sophisticated in some respects (e.g., not just extracting text but putting the extracts into a pre-established template). While some current approaches claim to apply techniques drawn from artificial intelligence research, and the term "intelligent text processing" is sometimes used to refer to procedures of this type (see, for example, Jacobs, 1992), it is doubtful that any can be considered to exhibit true intelligence (Lancaster & Smith, 1999).

KNOWLEDGE DISCOVERY

The great majority of the criteria of quality proposed and used in the past apply most obviously to abstracts intended to be read by humans. As mentioned earlier, if abstracts are intended primarily as useful document surrogates for search purposes, the quality criteria become somewhat different. Unfortunately, a good abstract for search purposes is unlikely to be good for a human reader.
Indeed, an abstract prepared solely for computer searching, such as the telegraphic abstracts of the semantic code system (Perry & Kent, 1958), may not be readable by humans at all, and abstracts prepared primarily for search purposes, such as the mini-abstracts proposed by Lunin (1967), may be somewhat difficult for humans to comprehend.

For retrieval purposes, and especially in knowledge discovery tasks, exhaustivity and accuracy are extremely important, and the other attributes in Table 2 diminish in significance. In fact, for abstracts intended solely for search purposes, such criteria as readability and coherence/cohesion are not important at all, while other attributes are applicable in opposite ways. Most obviously, brevity is not necessarily desirable since the retrievability of an abstract will be directly related to its length (i.e., number of access points provided). Nevertheless, for reasons mentioned before, there is likely to be an optimum length for effective search and discovery operations. The data coefficient proposed by Mathis (1972) seems a particularly appropriate criterion in knowledge discovery applications since it relates length to completeness of content coverage. Also undesirable for knowledge discovery purposes is internal consistency because redundancy improves retrievability.
That is, if a particular idea is expressed in different ways in an abstract (no synonym control), this increases the probability that the text will match an expression selected by a particular searcher or that meaningful relationships between related ideas will be revealed.

CONCLUSION

Text surrogates for larger bodies of text, whether one refers to them as "abstracts," "summaries," or some other term, have proved extremely useful in a wide variety of information processing applications for very many years. The increasing application of computers to text processing has not reduced their value (although criteria for judging their quality may have changed somewhat), and one has no reason to suppose that their value diminishes as more critical or sophisticated operations, including those of knowledge discovery, are applied to the text.

REFERENCES

Borko, H., & Bernier, C. L. (1975). Abstracting concepts and methods. New York: Academic Press.
Brew, C., & Thompson, H. S. (1994). Automatic evaluation of computer generated text: A progress report on the TextEval project. In Proceedings of the Human Language Technology Workshop (March 8-11, 1994) (pp. 108-113). San Francisco, CA: Morgan Kaufmann.
Brown, A. L., & Day, J. D. (1983). Macrorules for summarizing texts: The development of expertise. Journal of Verbal Learning and Verbal Behavior, 22(1), 1-14.
Cooper, W. S. (1969). Is inter-indexer consistency a hobgoblin? American Documentation, 20(3), 268-278.
Cremmins, E. T. (1996). The art of abstracting, 2d ed. Arlington, VA: Information Resources Press.
Dronberger, G. B., & Kowitz, G. T. (1975). Abstract readability as a factor in information systems. Journal of the American Society for Information Science, 26(2), 108-111.
Dym, E. D. (1967). Relevance predictability: I. Investigation, background and procedures. In A. Kent, O. E. Taulbee, J. Belzer, & G. D. Goldstein (Eds.), Electronic handling of information: Testing and evaluation (pp. 175-185). Washington, DC: Thompson Book Co.
Edmundson, H. P.; Oswald, V. A., Jr.; & Wyllys, R. E. (1959). Automatic indexing and abstracting of the contents of documents. Los Angeles, CA: Planning Research Corporation.
Endres-Niggemeyer, B.; Maier, E.; & Sigel, A. (1995). How to implement a naturalistic model of abstracting: Four core working steps of an expert abstractor. Information Processing & Management, 31(5), 631-674.
Fidel, R. (1986). Writing abstracts for free-text searching. Journal of Documentation, 42(1), 11-21.
Harman, D. (1997). The TREC conferences. In K. Sparck Jones & P. Willett (Eds.), Readings in information retrieval (pp. 247-256). San Francisco, CA: Morgan Kaufmann.
Hartley, J. (1994). Three ways to improve the clarity of journal abstracts. British Journal of Educational Psychology, 64(1), 331-343.
Hartley, J., & Sydes, M. (1996). Which layout do you prefer? An analysis of readers' preferences for different typographic layouts of structured abstracts. Journal of Information Science, 22(1), 27-37.
Hartley, J.; Sydes, M.; & Blurton, A. (1996). Obtaining information accurately and quickly: Are structured abstracts more efficient? Journal of Information Science, 22(5), 349-356.
Haynes, R. B. (1993). More informative abstracts: Current status and evaluation. Journal of Clinical Epidemiology, 46, 595-597.
Jacobs, P. S. (Ed.). (1992). Text-based intelligent systems: Current research and practice in information extraction and retrieval. Hillsdale, NJ: Lawrence Erlbaum.
Keen, E. M. (1976). A retrieval comparison of six published indexes in the field of library and information science. Unesco Bulletin for Libraries, 30(1), 26-36.
Kent, A.; Belzer, J.; Kurfeerst, M.; Dym, E. D.; Shirey, D. L.; & Bose, A. (1967). Relevance predictability in information retrieval systems. Methods of Information in Medicine, 6(2), 45-51.
King, R. (1976). A comparison of the readability of abstracts with their source documents. Journal of the American Society for Information Science, 27(2), 118-121.
Lancaster, F. W. (1998). Indexing and abstracting in theory and practice, 2d ed. Urbana-Champaign: University of Illinois.
Lancaster, F. W., & Smith, L. C. (In press). Intelligent technologies in library and information service applications: A realistic appraisal. Medford, NJ: Information Today.
Lunin, L. (1967). The development of a machine-searchable index-abstract and its application to biomedical literature. In B. Flood (Ed.), Three Drexel information science-research studies (pp. 47-134). Philadelphia, PA: Drexel Press.
Marcus, R. S.; Benenfeld, A. R.; & Kugel, P. (1971). The user interface for the Intrex retrieval system. In D. E. Walker (Ed.), Interactive bibliographic search: The user/computer interface (pp. 159-201). Montvale, NJ: AFIPS Press.
Mathis, B. A. (1972). Techniques for the evaluation and improvement of computer-produced abstracts. Columbus: Ohio State University, Computer and Information Science Research Center (OSU-CISRC-TR-72-15. PB 214 675).
National Information Standards Organization. (1997). Guidelines for abstracts. Bethesda, MD: NISO.
Payne, D.; Munger, S. J.; & Altman, J. W. (1962). A textual abstracting technique: A preliminary development and evaluation support. Pittsburgh, PA: American Institutes for Research (2 vols. AD 285081-285082).
Perry, J. W., & Kent, A. (1958). Tools for machine literature searching. New York: Interscience Publishers Inc.
Pinto, M. (1995). Documentary abstracting: Toward a methodological model. Journal of the American Society for Information Science, 46(3), 225-234.
Pinto, M. (1994). Interdisciplinary approaches to the concept and practice of written text documentary content analysis (WTDCA). Journal of Documentation, 50(2), 111-133.
Pinto, M. (1992). El resumen documental: Principios y metodos. Madrid: La Fundación German Sanchez Ruiperez.
Rath, G. J.; Resnick, A.; & Savage, T. R. (1961). Comparison of four types of lexical indicators of content. American Documentation, 12(2), 126-130.
Resnick, A. (1961). Relative effectiveness of document titles and abstracts for determining relevance of documents. Science, 134(3484), 1004-1006.
Salager-Meyer, F. (1991). Medical English abstracts: How well are they structured? Journal of the American Society for Information Science, 42(7), 528-531.
Salton, G. (Ed.). (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.
Salton, G.; Singhal, A.; Mitra, M.; & Buckley, C. (1997). Automatic text structuring and summarization. Information Processing & Management, 33(2), 193-207.
Saracevic, T. (1969). Comparative effects of titles, abstracts and full texts on relevance judgements. Proceedings of the American Society for Information Science, 6, 293-299.
Shirey, D. L., & Kurfeerst, M. (1967). Relevance predictability: II. Data reduction. In A. Kent, O. E. Taulbee, J. Belzer, & G. D. Goldstein (Eds.), Electronic handling of information: Testing and evaluation (pp. 187-198). Washington, DC: Thompson Book Co.
Sparck Jones, K. (1995). Reflections on TREC. Information Processing & Management, 31(3), 291-314.
Tenopir, C. (1985). Full text database retrieval performance. Online Review, 9(2), 149-164.
Tenopir, C., & Jacso, P. (1993). Quality of abstracts. Online, 17(3), 44-55.
Thompson, C. W. N. (1973). The functions of abstracts in the initial screening of technical documents by the user. Journal of the American Society for Information Science, 24(4), 270-276.
Vinsonhaler, J. F. (1966). Some behavioral indices of the validity of document abstracts. Information Storage and Retrieval, 3(1), 1-11.
Wheatley, A., & Armstrong, C. J. (1997). Metadata, recall, and abstracts: Can abstracts ever be reliable indicators of document value? Aslib Proceedings, 49(8), 206-213.

Maria Pinto, Departamento de Biblioteconomia y Documentacion, Universidad de Granada, 18071 Granada, Spain
F. W. Lancaster, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign

MARIA PINTO is Professor in the Documentation Faculty of the Granada University, where she teaches courses on information processing and management of quality in library and information science. She is the author of six books (two in second editions) in areas related to knowledge representation, content analysis, abstracting methods and products, and the role of quality management in information processes. Ms. Pinto has also published chapters in monographs and articles in international reviews, one of which received the MIP Award of the FID as the best article of 1994. She has participated as a partner in Project I+D financed by the European Community and has been responsible for research projects financed by the Education Ministry of Spain.

F. W. LANCASTER, editor of Library Trends and Professor Emeritus of Library and Information Science at the University of Illinois at Urbana-Champaign, has been working in or around libraries for almost fifty years. He is author or co-author of eleven books (several of which have earned prestigious national awards) and editor or co-editor of twelve others. He has lectured at more than seventy universities or colleges in sixteen countries.