
Effects of inconsistent relevance judgments on information retrieval test results: a historical perspective.


Only by continuous self-appraisal can a large information system make itself responsive to the needs of the scientific community.

Concluding sentence in Lancaster (1969)

ABSTRACT

The main objective of information retrieval (IR) systems is to retrieve information or information objects relevant to user requests and possible needs. In IR tests, retrieval effectiveness is established by comparing IR systems retrievals (systems relevance) with users' or user surrogates' assessments (user relevance), where user relevance is treated as the gold standard for performance evaluation. Relevance is a human notion, and establishing relevance by humans is fraught with a number of problems--inconsistency in judgment being one of them. The aim of this critical review is to explore the relationship between relevance on the one hand and testing of IR systems and procedures on the other. Critics of IR tests raised the issue of validity of the IR tests because they were based on relevance judgments that are inconsistent. This review traces and synthesizes experimental studies dealing with (1) inconsistency of relevance judgments by people, (2) effects of such inconsistency on results of IR tests, and (3) reasons for retrieval failures. A historical context for these studies and for IR testing is provided, including an assessment of Lancaster's (1969) evaluation of MEDLARS and its unique place in the history of IR evaluation.

INTRODUCTION

Information retrieval systems came into being shortly after the Second World War, addressing the problem of controlling the information explosion, primarily as related to scientific and technical information. Vannevar Bush (1890-1974) is credited with defining the problem and suggesting a solution that caught wide attention. As to the problem, he defined it this way: "The summation of human experience is being expanded at a prodigious rate" and "our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose" (Bush, 1945, p. 2). Bush suggested a technological solution in the form of a device he called memex--"a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory" (ibid., p. 6). As yet, memex has not been built. It was a vision. However, the idea of inadequacy of existing methods for controlling the information explosion and of providing a technological solution caught on immediately after the Second World War. Among other things, it affected the development of information retrieval (IR) by using new techniques and systems that rested on technology. Importantly, Bush's ideas were a motivation for funders, such as the National Science Foundation in the United States, to support IR development and testing.

As defined by Calvin Mooers (1919-94), a mathematician, physicist, and pioneer in the field, "information retrieval ... embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, technique, or machines that are employed to carry out the operation" (Mooers, 1951, p. 25). Of course, IR systems and techniques have undergone evolutionary and even revolutionary changes since 1951, but basically, they still concentrate on the same aspects Mooers defined.

The difference between IR and related methods and systems that long preceded it--classifications, subject headings, various indexing methods, or bibliographic descriptions, including the contemporary Functional Requirements for Bibliographic Records (IFLA, 1998)--is that IR specifically included "specification for search." The others did not include searching in their specification; searching was simply assumed. In IR, searching is specified in algorithmic detail and the algorithms keep changing and improving. This is the first key difference.

The second key difference was the choice (at the beginning more by assumption than deliberate selection) of relevance as the underlying, basic notion:
   The fundamental notion used in bibliographic description and in all
   types of classifications or categorizations, including those used
   in contemporary databases, is aboutness. The fundamental notion
   used in IR is relevance. It is not about any kind of information,
   and there are great many, but about relevant information.
   Fundamentally, bibliographic description and classification
   concentrate on describing and categorizing information objects; IR
   is also about that but, and this is a very important "but," in
   addition IR is about searching as well, and searching is about
   relevance. (Saracevic, 2007a, p. 1917)


Retrieval of relevant information or information objects became and still is the primary objective of IR systems.

The two choices in IR, algorithms for searching and relevance as the basic notion and objective, not only affected but even governed testing that grew to be a very important activity in IR. From the outset of IR testing, which had already started by the mid-1950s, relevance served as the criterion on the basis of which the performance of various IR systems or algorithms was compared. Relevance is a human notion and relevance judgments are human assessments, bringing with them all kinds of issues and problems common to many human notions and types of assessments. Well, they are human. One of the issues is that human relevance assessments (like a great many other human assessments) are not consistent, raising the obvious question of the effect of inconsistency in judgments on the results of IR testing.

The aim of this article is to review studies that contained data (as opposed to discussion only) related to questions implied above: What are the effects of inconsistent human relevance judgments on relative performance of different IR algorithms or approaches? Does inconsistency affect test results? In the process, I am providing a historical perspective to these questions and to the general description of IR testing that follows. In addition, I am reviewing and honoring the classic test of Wilf Lancaster (1969) that differed in significant ways from IR tests that followed. His was a unique contribution to IR testing.

Note that the present article is an enlargement of one part of the relevance study reported in Saracevic (2007a, 2007b). In that study I dealt comprehensively with relevance as the basic notion in information science, while in this review I am focusing and enlarging on the part that dealt with the relation between relevance and information retrieval testing.

TESTING IN INFORMATION RETRIEVAL

From the very start of practical development of IR systems dating to the late 1940s, searching was based on Boolean logic (AND, OR, NOT), even though at the start "Boolean" was not mentioned by name and computing technology was yet to be used (Mooers, 1951; Perry, 1951). Shortly thereafter, coordinate indexing, developed by Mortimer Taube and colleagues at a company named Documentation Inc., was a direct outgrowth of these ideas and it took the IR world by storm; interestingly, Taube referred to coordinate indexing, following Bush, as "association of ideas" (Taube and Associates, 1955). It was based on uniterms, single terms assigned to documents to represent the content, that were later "coordinated" in searching, meaning searched in a Boolean fashion. Uniterms were predecessors of modern techniques in IR. While originally they were assigned and searched by human indexers and searchers, now computers are doing a similar job using various algorithms. In other words, uniterms were a granddaddy of IR. With a wide adoption of coordinate indexing, Boolean logic was fully recognized as the basis for searching in IR. A variety of specific, even competing, approaches and tools were developed and applied in practical realizations of coordinate indexing and IR in general.
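The idea of uniterms "coordinated" in a Boolean fashion can be illustrated with a minimal sketch. It is not a reconstruction of Taube's manual card procedure; the function names, the toy document identifiers, and the aeronautics terms are all hypothetical and chosen only for illustration. Each uniterm is mapped to the set of documents it was assigned to, and AND, OR, and NOT become set operations.

```python
from collections import defaultdict


def build_uniterm_index(documents):
    """Map each uniterm to the set of document ids it was assigned to."""
    index = defaultdict(set)
    for doc_id, uniterms in documents.items():
        for term in uniterms:
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents that carry every one of the given uniterms (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Documents that carry at least one of the given uniterms (Boolean OR)."""
    postings = [index.get(term, set()) for term in terms]
    return set.union(*postings) if postings else set()


def search_not(index, all_doc_ids, term):
    """Documents that do not carry the given uniterm (Boolean NOT)."""
    return set(all_doc_ids) - index.get(term, set())


# Hypothetical toy collection: document id -> assigned uniterms.
documents = {
    "d1": {"wing", "flutter", "supersonic"},
    "d2": {"wing", "icing"},
    "d3": {"flutter", "panel"},
}
index = build_uniterm_index(documents)

wing_and_flutter = search_and(index, "wing", "flutter")
icing_or_panel = search_or(index, "icing", "panel")
not_wing = search_not(index, documents, "wing")
```

In this sketch "coordination" happens only at search time: the indexer assigns single terms, and the combination of terms is decided by the searcher.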

Very soon, the perennial questions asked of all systems were raised: What is the effectiveness and performance of given IR approaches? How do they compare? It is not surprising that these questions were raised in IR. At the time, most developers, funders, and users associated with IR were engineers or scientists or worked in related areas where the question of testing was natural, even obligatory. In addition, IR testing began in the late 1950s within a certain context as described by Cyril Cleverdon in his acceptance speech for the 1991 Association for Computing Machinery, Special Interest Group on Information Retrieval Gerard Salton Award:
   These new techniques generated considerable argument, not only
   between the proponents of the different systems, but also among the
   library establishment, many of whom saw these new methods as
   degrading their professional mystiques.... Controversy over the new
   methods was still raging, with extravagant claims on one side being
   countered by absurd arguments on the other side, without any firm
   data being available to justify either viewpoint. (Cleverdon, 1991,
   pp. 3, 4)


Kent et al. (1955) were the first to propose measures for testing IR effectiveness; they suggested "recall" and "relevance" (later, because of confusion, renamed "precision"), where relevance was the underlying criterion for these measures. Respectively, they measure the probability of agreement between what the system retrieved or failed to retrieve as relevant (systems relevance) and what the user assessed as relevant (user relevance), where user relevance is the gold standard on the basis of which evaluations are made. (2) Other measures were suggested, but not adopted. With some variation on the theme, precision and recall have remained the standard measures of IR effectiveness to this day, with relevance as the underlying criterion.
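As a minimal sketch of the two measures in their now-common set-based form (not the notation of Kent et al.), precision is the share of retrieved items that the user judged relevant, and recall is the share of user-judged relevant items that the system retrieved. The function name and the document identifiers below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the system returned (systems relevance).
    relevant:  ids the user judged relevant (user relevance, the gold standard).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three judged relevant overall.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were retrieved. Because both figures are computed against the user's judgments, any change in those judgments changes the figures, which is why inconsistency among judges matters for test results.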

The first IR test on record was attempted in the early 1950s, as reported by Gull (1956) and recounted later in the section, Inconsistency in Human Relevance Assessments. In short, the test collapsed because of disagreement in relevance assessments between two competing groups. Historically, early IR tests that were most influential were collectively known as "Cranfield tests," done in the 1950s and 1960s at the (U.K.) Cranfield College of Aeronautics (to become Cranfield Institute of Technology in 1969 and Cranfield University in 1993) under the leadership of Cyril Cleverdon (1914-1997). As summarized in Cleverdon (1962, 1967, 1991) and in Cleverdon, Mills, & Keen (1966), the 1962 report refers to Cranfield I and the 1966 and 1967 reports to Cranfield II tests. (3) Cranfield tests also became controversial. For instance, Swanson (1965, 1971), among others, argued that the method of obtaining relevance judgments had influenced the results. Thus, as in the Gull (1956) test, relevance assessments entered again as a point of contention in IR testing. They remain contentious to this day.

In Cranfield I tests, four methods for representing information were compared: Universal Decimal Classification (UDC), alphabetical subject catalog, faceted classification, and uniterms. This was the first and last time that traditional library techniques (the first three) were tested together with a technique representing IR (uniterms). The results were not anticipated by the proponents of any of the systems; namely, on many counts the four systems performed pretty much the same:
   No system that has been investigated has shown itself to be so
   markedly superior as to justify its use in all conditions.... The
   most surprising finding was that "uniterm," as a descriptor
   language, can be given a high rating on many counts. It achieved
   the best overall figures in the test, it presented no serious
   difficulties for the technical searchers ... and was notably
   successful with short indexing time. (Cleverdon, Mills, and Keen,
   1966, p. 92)


Of course, there were numerous critiques of the tests and findings. Today, it is hard to imagine the emotionalism that followed the tests--the findings were contrary to many firmly held beliefs. My favorite critique that Cleverdon repeated a number of times was: "You had no right to be so intelligent with the uniterm system; it is meant to be used by people of low intellect" (Cleverdon, Mills, & Keen, 1966, p. 6).

Cranfield II was devoted to testing various index language devices based on natural language. Thirty-three types of index languages were investigated starting with single terms and then adding word forms and synonyms; broader, related, and narrower terms; and term phrases, hierarchies, and combinations thereof, with alterations of levels of specificity and exhaustivity of indexing (Cleverdon, 1967). Some results were surprising, even revolutionary at the time: "Neither we nor anybody else had considered it as remotely possible that an index language based on single terms in the natural language of the documents would be so effective that the performance could only be improved by confounding word forms or true synonyms" (Cleverdon, 1991, p. 8). This can be done by computers. The Cranfield results paved the way.

Cranfield tests were significant for two other reasons. First, they established a model of IR, called the traditional or laboratory IR model, that was used in IR testing later by Gerard Salton (1927-95) in the famous SMART experiments (summarized in Salton, 1971 and Salton & McGill, 1983), that later morphed into the comprehensive Text Retrieval Conference (TREC) experiments conducted from 1992 to date (Voorhees & Harman, 2005). (4) Unlike Cranfield tests, SMART and TREC were fully automated. The model that came out of Cranfield tests has been in continuous use in IR testing for half a century. The emphasis in the model is on processing information objects by IR systems and then matching them with queries to produce retrieved results. The processing and matching is algorithmic; the goal of the algorithms is to maximize retrieval of relevant information or information objects. In the purest form of this model, the user is represented by a query only and not considered beyond that at all; also, interaction with anything outside the system is not a consideration, as if the system is a self-contained black box. Relevance assessments are done by a user, or user surrogate, and the effectiveness of retrieved outputs, using different approaches or algorithms, is compared to these assessments. Testing is based on a number of assumptions, one of them being that human judgments of relevance are consistent (Saracevic, 2007b, p. 2132). Needless to say, the evident restrictions of the model came under numerous critiques, more recently and thoroughly by Ingwersen & Jarvelin (2005).

Second, for the first time in Cranfield tests the familiar precision-recall graphs were drawn and the "law" of inverse performance between recall and precision was formulated (Cleverdon, 1962, pp. 72, 89, 90). To this day, graphing of precision-recall figures is an established way to demonstrate and compare performance, and improving on the inverse relation is a major goal of most procedures in IR tests.
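The points of a precision-recall graph can be sketched as follows. This assumes a ranked output list evaluated at successive cutoffs, which is the later, ranking-based way of producing such graphs and not necessarily how the Cranfield figures were derived; the function name and the toy ranking are hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """Return a list of (recall, precision) pairs, one per rank cutoff.

    ranking:  document ids in the order the system ranked them.
    relevant: ids judged relevant by the user (must not be empty).
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0
    for cutoff, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        # Recall uses the total number of relevant documents,
        # precision uses the number of documents seen so far.
        points.append((hits / len(relevant), hits / cutoff))
    return points


# Hypothetical ranked output for one query.
points = precision_recall_points(
    ranking=["d1", "d4", "d3", "d2", "d5"],
    relevant={"d1", "d3", "d5"},
)
```

In this toy ranking, recall can only rise as the cutoff deepens, while precision falls from 1.0 at the first cutoff to 0.6 at the last, which is the kind of trade-off the inverse relation describes.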

SMART tests also signified a departure of IR from the original Boolean logic for searching and retrieval to more sophisticated approaches that allowed for different information organizations and subsequent outputs, such as ranking and clustering by relevance, where relevance is determined by the system, of course. A variety of approaches and algorithms were used and tested, so tests became more involved as well. TREC further extended these approaches and algorithms, even involving numerous new areas for IR, such as retrieval of recordings of speech, across multiple languages and much more, as recounted on the TREC site, http://trec.nist.gov/. Not surprisingly, IR tests became still more involved.

DETERMINING RELEVANCE IN INFORMATION RETRIEVAL TESTS

As mentioned, IR tests are based on comparing systems relevance--responses to a query that a system deemed and retrieved as relevant following whatever procedure--and user relevance--user's (or a surrogate's) assessment as to relevance of retrieved answers or of any information or information objects in the system, even if not retrieved. User relevance is the gold standard against which system relevance, that is, system performance, is compared. Thus, performance assessment of a given system (algorithm, procedure ...) follows from and is based on human judgment of relevance of given information or information object to a given query or need. The key issue is obtaining acceptable relevance judgments that can then be used as a standard for calculating recall and precision. Once these are obtained, calculations are straightforward. Well, almost. The assessments have to involve not only the retrieved answers, but also all potentially relevant documents in the collection (or in a representative sample, or in a pooled set of answers) so that recall can be calculated. One of the best descriptions of these and other requirements of IR testing was concisely provided by Tague-Sutcliffe (1992).
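The "pooled set of answers" option can be sketched as follows: the top-ranked items from several systems are merged into one pool, only the pool is judged, and recall is then computed against the relevant items found in the pool. This is a simplified illustration under stated assumptions; the pool depth, the system names, the document identifiers, and the judgments are all hypothetical, and unjudged documents are simply treated as not relevant.

```python
def build_pool(runs, depth):
    """Merge the top `depth` documents of every system's ranking into one pool."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def pooled_recall(retrieved, judged_relevant):
    """Recall of one system relative to the relevant documents found in the pool."""
    judged_relevant = set(judged_relevant)
    if not judged_relevant:
        return 0.0
    return len(set(retrieved) & judged_relevant) / len(judged_relevant)


# Hypothetical rankings from two systems for the same query.
runs = {
    "system_a": ["d1", "d2", "d3", "d9"],
    "system_b": ["d3", "d4", "d1", "d8"],
}
pool = build_pool(runs, depth=3)

# Hypothetical human judgments, made only on documents in the pool.
judged_relevant = {"d1", "d4"}

recall_a = pooled_recall(runs["system_a"], judged_relevant)
recall_b = pooled_recall(runs["system_b"], judged_relevant)
```

The design choice to judge only the pool keeps the assessment effort manageable, but it means recall is relative to what the pooled systems happened to find: documents such as "d8" and "d9" above are never judged at all.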

Establishing this gold standard is one of the main problems, even conundrums, of IR testing. Not surprisingly then, in many reports of IR tests, the critical step showing how relevant objects became relevant is often shrouded in mystery. Or, it is glossed over. Or, it is accepted from a previous source without further ado. Or some collective group, such as "judges" or "librarians" or "searchers" or "students" is mentioned as bearing the responsibility. Or, some such explanation. It is hard to get at it.

The objective of relevance judgments in IR tests is to get as close as possible to real-life situations so that test results would have real-life validity. This is very, very difficult to achieve. Thus, simulation methods have been developed. Basically, there are four methods by which relevance judgments have been obtained that are regarded as gold standards:

1. By the user or questioner--the person who posed the question made the judgment as well;

2. By a user surrogate(s)--such as a specialist (or by consensus of a group of specialists) who perform judgments on the topic of a given question in their specialty;

3. By an information professional (or by consensus of a group of professionals) who is professionally entrusted or involved with some aspect of the process and who performs judgments on the topic of a given question that is not necessarily in their specialty, but who is familiar with what is going on; and

4. By "bystanders" signifying none of the above--for example, by students asked to do a given task of judgment, including possible prescreening.

The first method involves "real users" and the others "laboratory-type users." Here are some examples. In Cranfield I, "the search questions had been obtained from several hundred individuals in 58 different organisations, mainly in England and America. Each question was based on a single document in the test collection, and a search was considered successful if that particular paper was located in the catalogue" (Cleverdon, 1991, p. 4; full report in Cleverdon, 1962, pp. 8-9, 52). This is a variation of the theme of the second method above. Questions came from an unknown number of individual specialists who were asked to pose a question(s) on the basis of a source document, and the gold standard was the document from which the question came. But additional documents were retrieved, and the issue became how to deal with them as to relevance. These "were assessed in relation to the appropriate question" (ibid., p. 52). Presumably, the project members did the additional relevance assessments, thus bringing in the third method. In Cranfield II, the procedure for getting the gold standard was changed: a number of authors of recent research papers (in aeronautics) provided a question based on the problem that led to the research, together with more questions that arose during the conduct of research; the authors also were given a set of references to judge as to their relevance to these questions (Cleverdon, Mills, & Keen, 1966, p. 16). The source documents and evaluated references comprised the gold standard for each question. This is a combination of the first and second method. However, some prescreening also was done by students, so the fourth, or bystander, method was used as well. Generously, the Cranfield collection with relevance assessments was provided as open source for sharing. Subsequently, it was used in many IR tests, including SMART. With this, Cranfield relevance assessments migrated as well.

All IR tests that followed used one or more of these methods for establishing gold standards, with the first method used the least because it is the most difficult to secure. Here is a sampling: Lancaster (1969) and Saracevic et al. (1988) used the first method; SMART test collections used the second and third method; TREC uses the second method, with some derivative tests using the third and fourth method; Shaw et al. (1991) used the second and third method. Needless to say, all of these tests faced difficulties similar to those of the Cranfield tests in obtaining gold standards, but subsequently, all abandoned the use of a source document as the standard the way it had been used in the Cranfield tests. In some form or other, sometimes real users but mostly surrogates--specialists, information professionals, or bystanders--were the ultimate relevance judges for gold standards.

ANALYSIS OF RETRIEVAL FAILURES IN IR TESTS

For any system or process, diagnosing the reason(s) for failure is often a key issue in testing in general. Here, we are considering IR tests where analysis of failures was done on the basis of retrieval effectiveness measures, namely precision and recall. These were the Cranfield I test (failure was not analyzed in Cranfield II), Lancaster's (1969) test of MEDLARS, and Blair & Maron's (1985) test of a legal collection. That's it. Diagnosing failure has not become a part of major IR tests. Thus, we are dealing here with a very limited universe. Just to mention a connection: Wilfrid Lancaster was in 1963 a member of the Cranfield team.

Analysis of failures was one of the objectives of the Cranfield I test. By failure it was meant "analysis of all cases ... where source document was not retrieved" (Cleverdon, 1962, p. 38). The reasons for failure were classified as to (1) question (six reasons), (2) indexing (ten reasons), (3) searching (six reasons), and (4) system (six reasons). The analysis to determine causes of failure proved to be time consuming, from one to two hours per case, and complex, often involving consultation. The results indicated that the following percentages of failures were due to factors related to: question, 17 percent; indexing, 60 percent; searching, 17 percent; and system, 6 percent. Human decisions were most often causes for failure, particularly as to how questions were handled and interpreted, how indexing was done, and how searching was conducted.

Lancaster (1969) conducted a large and comprehensive evaluation of MEDLARS (Medical Literature Analysis and Retrieval System) operated by the U.S. National Library of Medicine. At the time it was a computerized system for retrospective searching on demand and had some 800,000 citations. When MEDLARS moved online it became MEDLINE, the most widely used biomedical resource in the world, which annually adds some 600,000 articles. Lancaster's was not a laboratory evaluation. It involved 299 regular, real questions posed over a twelve-month period by MEDLARS users who agreed to be part of the study. Users received a random sample of 25 to 30 retrieved articles plus additional articles found by means outside of MEDLARS (articles known by requesters to be relevant from searches outside MEDLARS) and evaluated these articles as to relevance to their request. (Additional articles were supplied in order to create a base for calculation of recall.) The average precision was 50 percent and recall was 58 percent--these figures were later widely used as general indicators of performance for IR systems. But Lancaster cautioned that averages can be misleading--some searches operated with high precision and recall at the same time, while others operated with very low recall.

Lancaster analyzed two types of failures: recall failures (relevant documents that were not retrieved) and precision failures (retrieved documents that were not relevant). There were 797 recall failures and 3,038 precision failures. As to recall failures, 10 percent were due to index language, 35 percent due to searching, 37 percent due to indexing, and 25 percent due to inadequate user-system interaction. (A document can be missed due to more than one cause, thus the percentages add to more than 100.) As to precision failures, 36 percent were due to index language, 32 percent due to searching, 13 percent due to indexing, 17 percent due to inadequate user-system interaction, and 2 percent due to value judgment. A large number of failures were due to inadequate searching and user-computer interaction; Lancaster made a number of suggestions on how to improve them. These suggestions are still relevant today. In practice, searching and human-computer interactions still involve a great many human decisions, no matter how automated and sophisticated the systems may be.
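The two measures and the two failure types defined above can be illustrated with a minimal sketch. It assumes each search is reduced to two sets of document identifiers: what the system retrieved and what the requester judged relevant (the recall base). The function name and the identifiers are hypothetical and chosen only for illustration; this is not Lancaster's procedure, which also involved sampling and manual diagnosis of each failure.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall, precision_failures, recall_failures).

    retrieved: documents the system returned for one request
    relevant:  documents judged relevant for that request (the recall base)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision_failures = retrieved - relevant  # retrieved but not relevant
    recall_failures = relevant - retrieved     # relevant but not retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall, precision_failures, recall_failures


# Hypothetical request: identifiers are made up for illustration.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}  # includes an item found outside the system
p, r, p_fail, r_fail = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
print("precision failures:", sorted(p_fail))
print("recall failures:", sorted(r_fail))
```

The sketch only identifies which documents are failures; assigning a cause to each one (index language, indexing, searching, interaction) is the human step described in the text.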

Here follows a summary of another large study involving failure analysis. It is also the last study of this kind. Blair & Maron (1985) conducted a study that involved retrieval from a system named STAIRS (Storage and Information Retrieval System) developed by IBM that automatically indexed full texts of documents. Like Lancaster's, the test was not laboratory but real-life based. The collection involved 40,000 documents (about 350,000 pages of text) that were assembled and used in the defense of a large corporate lawsuit. Two lawyers, principal defense attorneys in the suit, generated fifty-one information requests that were searched by paralegals who were also information professionals. The searches were repeated until lawyers (requestors) indicated that they had enough relevant information to defend the lawsuit on that issue or question. Lawyers indicated the relevance of answers. Precision, as always, was easily calculated. To establish a recall base, Blair and Maron also included answers from "sample frames consisting of subsets of the unretrieved database that we believed to be rich in relevant documents" and took random samples from these subsets--these were also provided to lawyers for judging. Precision was 79 percent but recall was 20 percent--which they considered a surprisingly low figure. They gave reasons for "deterioration of recall" (i.e., the system retrieving only one in five relevant documents) as being due to the large file size, restrictions of natural language indexing, and failures in searching. They did not provide figures for each reason, only examples. Test results became controversial, as were all test results from IR testing. Salton (1986) provided a critique of the test by showing examples from other tests and concluded at the outset: "that not only is this level of performance typical of what is achievable in existing, operational retrieval environments, but that it actually represents a high order of retrieval effectiveness" (ibid., p. 649). Blair & Maron (1990) answered and clarified the results. In essence, Salton defended full-text indexing vigorously by questioning Blair & Maron's conclusion about the ineffectiveness of automatic full-text indexing. Today, the controversy is forgotten. Full-text indexing is fully accepted, but failure analyses, a la Lancaster and Blair & Maron, are no longer conducted.

A lot can be learned from failure analyses, particularly about human performance. Regrettably, failure tests are no longer conducted, mostly because they are complex, very time consuming, and CANNOT be done by a computer. This type of testing is now relegated to history. Lancaster is the major contributor to that history. His explanation of difficulties also provides the reasons why we have not seen more failure tests:
   The "hindsight" analysis of a search failure is the most
   challenging aspect of the evaluation process. It involves, for each
   "failure," an examination of the full text of the document; the
   indexing record for this document (i.e., the index terms assigned
   ...); the request statement; the search formulation upon which the
   search was conducted; the requester's completed assessment forms,
   particularly the reasons for articles being judged "of no value";
   and any other information supplied by the requester. On the basis
   of all these records, a decision is made as to the prime cause or
   causes of the particular failure under review. (Lancaster, 1969, p.
   123)


INCONSISTENCY IN HUMAN RELEVANCE ASSESSMENTS

People differ, sometimes considerably, in decisions related to a variety of information processes, such as indexing, classification, searching, and yes, relevance as well. Measured are individual or group differences in terms of a degree of agreement/disagreement, overlap, or inter- or intraconsistency. For illustration, here are some results from studies of individual differences in information processes other than relevance:

* In a recent study of inter-indexer consistency, Medelyan & Witten (2006) found an average consistency of 38 percent according to one measure and 49.5 percent with another measure, while in an older study Zunde & Dexter (1969) found indexing consistency of 24 percent according to one and 41 percent according to another measure (averages differ depending on what measure is used--measures are not standardized).

* In studies of selection of search terms for the same questions by different searchers, Iivonen (1995) found 40.3 percent consistency for specific and 24.4 percent for general searches, and Saracevic, Kantor, Chamis, & Trivison (1988) found that the mean overlap was 27 percent.

In information science, observations of relevance inconsistency started with IR tests. As mentioned, Gull (1956) reported on the first study aimed at IR evaluation. The study is worth recounting because inadvertently it showed that relevance assessments differ significantly among groups of judges. (5) Actually, consistency of relevance judgments was not the purpose of the study at all. IR evaluation was. The original goal was to compare two different and competing indexing systems--one developed by the Armed Services Technical Information Agency (ASTIA) using subject headings, and the other by Documentation Inc. using coordinate indexing uniterms, that is, index terms searched in Boolean manner. In the test, each group indexed separately the same 15,000 documents, searched 98 requests, and then separately judged retrieved answers as to relevance. Then, not the performance of different systems, but the relevance judgments became contentious. The first group found that 2,200 documents were relevant to the 98 requests, while the second found that 1,998 were relevant. There was not much overlap between groups. The first group judged 1,640 documents relevant that the second had not, and the second group judged 980 relevant that the first had not. Then they tried to reconcile and considered each other's relevant documents and again compared judgments. Each group accepted some more as relevant, but in the end, they still disagreed; their rate of agreement at the end was 30.9 percent. The first-ever IR test did not continue.

Cleverdon was very much aware of this study and discussed it and the associated relevance problems at some length in both the 1962 and 1966 reports. The collapse of Gull's study influenced Cleverdon's selection of the method for obtaining relevance judgments, as it did every IR test done since then. The lesson was learned: Never, ever use more than a single judge (or a single object, such as source document) for establishing the gold standard for comparison. No test ever does.

With the test fiasco reported by Gull (1956), the whole field of information retrieval became very conscious of the fact that human relevance judgments are not consistent. It was a rude awakening. Not unexpectedly, researchers started asking: How consistent, or rather how inconsistent are relevance judgments? and What factors affect consistency?

Consistency or rather inconsistency of relevance judgments became an object of study in a number of experiments. For some studies, this was one of a number of objectives (e.g., Rees & Schultz, 1967), for others this was the main objective (e.g., Sormunen, 2002), while still for others, like in the Gull study, this was not an objective at all, but data on relevance judgment consistency can be derived (e.g., Haynes et al., 1990).

Table 1 provides a list of studies with relevance consistency data--this is not just a representative sample, but almost the total universe of such studies. Other consistency data can be derived from studies presented in Table 2 in the next section, where all the studies were of the third category mentioned above (objective different, but consistency data derivable).

Studies are summarized following the pattern: "[author] used [subjects] to do [tasks] in order to study [object of research]." In this way, the sample, method, research question, and results are put together for direct familiarization and for observation of considerable differences between various studies, which make generalizations difficult and hypothetical. Note that seven of the ten studies in the table were also reviewed in Saracevic (2007b); three older studies (Resnick & Savage, 1964, Rees & Schultz, 1967, and Cuadra et al., 1967) were added here to provide a longer historical perspective.

Before making conclusions, here is a note of caution. As was mentioned in Saracevic (2007b, p. 2129), for synthesizing findings caveats abound:
   Numerous aspects of the studies reviewed can be questioned and
   criticized. Easily! Criteria, measures, and methods used in these
   studies are not standardized. While no study was an island, each
   study was done more or less on its own.... Thus, the results are
   hardly comparable. Still, it is really refreshing to see
   conclusions made on the basis of data, rather than on the basis of
   examples, anecdotes, authorities or contemplation. Summary
   conclusions ... derived from the studies reviewed should be really
   treated as hypotheses.


From the nine studies in Table 1 and from data in seven studies in Table 2 reported in the next section, we can draw some hypothetical generalizations (Saracevic, 2007b, p. 2137):
   The inter- and intra-consistency or overlap in relevance judgments
   varies widely from population to population and even from
   experiment to experiment, making generalizations particularly
   difficult and tentative.


* However, it seems that higher expertise and laboratory conditions can produce an overlap in judgments up to 80% or even more. The intersection is large.

* With lower expertise the overlap drops dramatically. The intersection is small.

* In general, it seems that the overlap using different populations hovers around 30 percent.

* Higher expertise results in a larger overlap. Lower expertise results in smaller overlap.

* Whatever the overlap between two judges, when a third judge is added it falls, and with each addition of a judge it starts falling dramatically. Each addition of a judge or a group of judges reduces the intersection dramatically.

* More judges result in less overlap.

* The lowest overlap reported was 3.5% when three search groups were used (Haynes et al., 1990).

* Subject expertise affects consistency of relevance judgments. Higher expertise results in higher consistency and stringency. Lower expertise results in lower consistency and more inclusion.
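The notion of overlap in the generalizations above can be made concrete with a small sketch. As noted earlier, measures of consistency are not standardized, so the formula here is an assumption: overlap is taken as the number of documents that every judge marked relevant, divided by the number that at least one judge marked relevant (intersection over union). The judges and document identifiers are hypothetical.

```python
def overlap(*judgments):
    """Intersection over union of the sets each judge marked relevant.

    Assumed measure for illustration only; the studies reviewed used
    a variety of agreement and consistency measures.
    """
    sets = [set(j) for j in judgments]
    if not sets:
        raise ValueError("at least one judge is required")
    union = set().union(*sets)
    if not union:
        return 1.0  # nobody marked anything relevant: trivially in agreement
    common = set.intersection(*sets)
    return len(common) / len(union)


# Hypothetical judgments on the same pool of documents.
judge_a = {1, 2, 3, 4}
judge_b = {3, 4, 5, 6}
judge_c = {4, 6, 7}

two = overlap(judge_a, judge_b)             # 2 shared out of 6 judged relevant
three = overlap(judge_a, judge_b, judge_c)  # 1 shared out of 7 judged relevant
print(f"two judges: {two:.3f}, three judges: {three:.3f}")
```

Adding a judge can only shrink the intersection and enlarge the union, which is consistent with the observation that more judges result in less overlap.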

TESTS OF USING HUMAN RELEVANCE JUDGMENTS IN IR TESTS

Cranfield and SMART tests, and later TREC tests as well, stirred a wide debate and generated a considerable amount of harsh criticism. Critics concentrated especially on relevance judgments used as gold standards--on methods by which they were obtained, on their inadequacy, shortcomings, and so on (e.g., Swanson, 1965, 1971). The critiques are succinctly summarized by Harter (1996, pp. 37, 38, 43, 45):
   Relevance judgments form the bedrock on which traditional
   experimental evaluation model is constructed.... Relevance
   assessments are anything but stable and they vary significantly
   depending on the variable being investigated.... That variations in
   relevance judgments are likely to change the values of recall and
   precision is obvious.... We can no longer rest the evaluation of
   information retrieval systems on the assumption that such
   variations do not significantly affect the measurement of
   information retrieval performance.... On the other hand, the
   reaction to this research [showing variations in relevance
   judgments] and criticism from experimental researchers who use
   relevance assessment to conduct Cranfield-like experiments on
   information retrieval systems has been mostly silence ... with very
   few exceptions [As exceptions, Harter discusses studies by Lesk &
   Salton, 1968; Cleverdon, 1970; Kazhdan, 1979; and Burgin, 1992
   included in Table 2; mostly, he dismisses them because of "their
   lack of involvement with the variables associated with real
   users."]


Despite sometimes emotional criticism, Harter (and others in the same vein) raises serious and even critical questions: Given that relevance judgments are inconsistent, which they are to various degrees as amply demonstrated, how does this affect results of IR evaluation? Because of that, are IR test results valid, reliable and to be trusted in a scientific sense? Answers need to be decisive for accepting results of such tests.

There were seven experimental studies conducted to date trying to answer these questions--I believe this is the whole universe of such studies. Considering hundreds of IR tests done over the years since Cranfield, this is a small universe; nevertheless, I do not believe they can be dismissed as Harter (1996) did. Table 2 presents descriptions of and conclusions from these seven studies.

Before making conclusions, note that the same caveats mentioned above apply to these studies as well. Here are some hypothetical generalizations derived from data in seven studies in Table 2 and summarized in Saracevic (2007b, p. 2138):
   In evaluating different IR systems under laboratory conditions,
   disagreement among judges seems not to affect or affects minimally
   the results of relative performance among different systems when
   using average performance over topics or queries. The conclusion of
   no effect is counter-intuitive, but a small number of experiments
   bear it out. However, note that the use of average performance
   affects or even explains this conclusion.


* Rank order of different IR techniques seems to change minimally, if at all, when relevance judgments of different judges, averaged over topics or queries, are applied as test standards.

* However, swaps--changes in ranking--do occur with a relatively low probability. The conclusion of no effect is not universal.

* Another however: Rank order of different IR techniques does change when only highly relevant documents are considered--this is another (and significant) exception to the overall conclusion of no effect.

* Still another however: Performance ranking over individual queries or topics differs significantly depending on the query.
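The kind of comparison behind these generalizations can be sketched as follows: score each system against each judge's gold standard, average over topics, rank the systems, and count pairs of systems whose order changes (swaps). Everything in the sketch is an assumption made for illustration: the two systems, three judges, and document identifiers are hypothetical, and simple set-based precision stands in for whatever measures the studies in Table 2 actually used.

```python
from itertools import combinations


def mean_precision(run, qrels):
    """Precision averaged over topics for one system.

    run:   {topic: set of retrieved documents}
    qrels: {topic: set of documents one judge marked relevant}
    """
    scores = [len(docs & qrels.get(topic, set())) / len(docs)
              for topic, docs in run.items() if docs]
    return sum(scores) / len(scores) if scores else 0.0


def ranking(systems, qrels):
    """System names ordered from best to worst mean precision."""
    return sorted(systems,
                  key=lambda name: mean_precision(systems[name], qrels),
                  reverse=True)


def swaps(rank_a, rank_b):
    """Number of system pairs ordered differently in the two rankings."""
    pos_b = {name: i for i, name in enumerate(rank_b)}
    return sum(1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y])


# Hypothetical runs and judgments.
systems = {
    "A": {"t1": {"d1", "d2"}, "t2": {"d3", "d4"}},
    "B": {"t1": {"d1", "d5"}, "t2": {"d4", "d6"}},
}
judge_1 = {"t1": {"d1", "d2"}, "t2": {"d3"}}
judge_2 = {"t1": {"d1", "d2", "d5"}, "t2": {"d3"}}  # disagrees on d5 only
judge_3 = {"t1": {"d1", "d5"}, "t2": {"d4", "d6"}}  # disagrees widely

r1, r2, r3 = (ranking(systems, j) for j in (judge_1, judge_2, judge_3))
print(r1, r2, swaps(r1, r2))  # same order despite the disagreement
print(r1, r3, swaps(r1, r3))  # order reversed: one swap
```

In this toy case a small disagreement between judges leaves the averaged rank order intact, while a large one reverses it; how often the second situation arises in practice is the empirical question the studies in Table 2 address.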

CONCLUSIONS

The basic aim of IR systems is to provide information that is relevant to user questions and possible needs. Thus, relevance became the criterion for measures of the effectiveness of performance for IR systems and procedures. IR tests are based on comparing systems relevance with user relevance, where user relevance assessments serve as the gold standard for comparison and evaluation. Relevance is a human notion, and establishing relevance by humans is fraught with a number of problems, inconsistency in judgment being one of them. The aim of this review is to explore the relation between relevance on the one hand and testing of information retrieval systems and procedures on the other. In the process, a historical perspective is provided on the testing of IR systems, and on studies that addressed the inconsistency of relevance judgments and the effect of that inconsistency on results of IR tests.

Conclusions from these studies are provided as hypothetical generalizations (with proper caveats) at the end of the last two sections. Thus, they are not repeated here. Instead, some general observations about IR tests are made here in conclusion.

Information retrieval has a proud history. It started right at the conclusion of the Second World War by addressing the problem of information explosion, particularly in science and technology, and applying modern information technology as a solution. Over the ensuing decades, IR systems and techniques spread worldwide and are successfully used in a great many endeavors, including the contemporary search engines. In part, this is due to advances in information technology--databases are larger and enable inclusion of full texts, not just representations as when IR started--searches are faster, interfaces more elaborate and flexible, and so on. And in part, this is also due to improvements in IR algorithms and procedures. But again, in many respects, these were predicated on advances in technology. The two are intertwined.

It is true that human relevance judgments are affected by a host of factors that produce significant individual and group disagreements. Tests and pragmatic experiences, as well as common sense, have shown that. Concluding that there are no effects of inconsistent relevance judgments on rank order of tested IR procedures, as optimistically proclaimed in early tests, may not be completely warranted. Averaging has an effect; rank switches do occur at times, and the issue needs a lot of further research.

But it is also easily observable that significant advances were made over decades in IR. By many pragmatic ways of figuring, contemporary IR systems and processes are better than those of a few decades ago. Along with technology, testing played a major role in improvements of IR algorithms and processes. In other words, despite observed relevance problems from the human side, IR systems improved from the systems side.

On the historical side, it is quite interesting, if not amazing, to note that the basic methodological principles and model for testing laid down a half century ago are still governing IR testing today. IR testing is like a river that became broader and deeper but never changed its course. The course seems to be cemented.

IR systems, as conceptualized, will never get away from relevance. For people, relevance is here to stay. Thus, it is here to stay with all associated problems for IR systems as well.

ACKNOWLEDGMENTS

First of all, I wish to thank researchers who reported on experiments reviewed here. They did the hard work and provided data. Moreover, they were the inspiration for this review.

Thanks to Yuelin Li and Ying Zhang, my assistants at the time, who tirelessly searched the literature for sources about relevance and then organized them. As with my previous work, Sandra Lanman's editing and thoughtful suggestions were really relevant.

I also wish to thank Keith Russell and Lorraine Haricombe, editors for this issue, for their invitation to contribute to this Festschrift in honor of Professor F. W. Lancaster. Wilf is a valued friend of many years. The invitation was an honor.

REFERENCES

Blair, D. C., & Maron, M. E. (1985). An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28(3), 289-291.

Blair, D. C., & Maron, M. E. (1990). Full-text information retrieval: Further analysis and clarification. Information Processing & Management, 26(3), 437-447.

Burgin, R. (1992). Variations in relevance judgments and the evaluation of retrieval performance. Information Processing and Management, 28(5), 619-627.

Bush, V. (1945). As we may think. Atlantic Monthly, 176(11), 101-108. Retrieved Nov. 7, 2007, from http://www.theatlantic.com/doc/194507/bush.

Cleverdon, C. W. (1962). Report on the testing and analysis of an investigation into the comparative efficiency of indexing. Cranfield, UK: ASLIB Cranfield Research Project. Retrieved Nov. 17, 2007, from http://hdl.handle.net/1826/836.

Cleverdon, C. W. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6), 173-194.

Cleverdon, C. W. (1970). The effect of variations in relevance assessments in comparative experimental tests of indexing languages. Cranfield, UK: Cranfield Library Report no.3. Retrieved Nov. 16, 2007, from http://hdl.handle.net/1826/967.

Cleverdon, C. W. (1991). The significance of the Cranfield tests on index languages. Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1-3.

Cleverdon, C. W., Mills, J., & Keen, M. (1966). Factors determining the performance of indexing systems; Volume 1, Design; Part 1, Text. Retrieved Nov. 17, 2007, from http://hdl.handle.net/1826/861.

Cuadra, C. A., Katter, R. V., Holmes, E. H., & Wallace, E. M. (1967). Experimental Studies of Relevance Judgments: Final Report. 3 vols. Santa Monica, CA: System Development Corporation. NTIS: PB-175 518/XAB, PB-175 517/XAB, PB-175 567/XAB.

Gull, C. D. (1956). Seven years of work on the organization of materials in special library. American Documentation, 7(4), 320-329.

Harter, S. P. (1971). The Cranfield II relevance assessments: A critical evaluation. Library Quarterly, 41(3), 229-243.

Harter, S. P. (1996). Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47(1), 37-49.

Haynes, B. R., McKibbon, A., Walker, C. Y., Ryan, N., Fitzgerald, D., & Ramsden, M. F. (1990). Online access to MEDLINE in clinical setting. Annals of Internal Medicine, 112(1), 78-84.

Iivonen, M. (1995). Consistency in the selection of search concepts and search terms. Information Processing & Management, 31(2), 173-190.

Ingwersen, P., & Jarvelin, K. (2005). The turn: Integration of information seeking and retrieval in context. Dordrecht: Springer.

International Federation of Library Association and Institutions (IFLA) (1998). Functional Requirements for Bibliographic Records-Final Report. Retrieved Nov. 15, 2007, from http://www.ifla.org/VII/s13/frbr/frbrl.htm#2.1.

Janes, J. W. (1994). Other people's judgments: A comparison of users' and others' judgments of document relevance, topicality, and utility. Journal of the American Society for Information Science, 45(3), 160-171.

Janes, J. W., & McKinney, R. (1992). Relevance judgments of actual users and secondary users: A comparative study. Library Quarterly, 62(2), 150-168.

Kent, A., Berry, M., Leuhrs, E U., & Perry, J. W. (1955). Machine literature searching VIII. Operational criteria for designing information retrieval systems. American Documentation, 6(2), 93-101.

Kazhdan, T. V. (1979). Effects of subjective expert evaluation of relevance on the performance parameters of document-based information retrieval system. Nauchno-Tekhnicheskaya Informatsiya, Seriya 2(13), 21-24.

Lancaster, F. W. (1969). MEDLARS: Report on the evaluation of its operating efficiency. American Documentation, 20(2), 119-142.

Lee, H., Belkin, N. J., & Krovitz, B. (2006). Rutgers information retrieval evaluation project on IR performance on different precision levels. Journal of the Korean Society for Information Management, 23(2), 97-111.

Lesk, M. E., & Salton, G. (1968). Relevance assessment and retrieval system evaluation. Information Processing & Management, 4(4), 343-359.

Medelyan, O., & Witten, I. H. (2006). Measuring inter-indexer consistency using a thesaurus. Proceedings of the 2006 ACM/IEEE Joint Conference on Digital Libraries, 274-275.

Mooers, C. N. (1951). Zatocoding applied to mechanical organization of knowledge. American Documentation, 2(1), 20-32.

Perry, J. W. (1951). Superimposed punching of numerical codes on handsorted punched cards. American Documentation, 2(4), 205-212.

Rees, A. M., & Schultz, D. G. (1967). A field experimental approach to the study of relevance assessments in relation to document searching. 2 vols. Cleveland, OH: Western Reserve University, School of Library Science, Center for Documentation and Communication Research. NTIS: PB-176 080/XAB, PB-176 079/XAB. ERIC: ED027909, ED027910.

Resnick, A., & Savage, T. R. (1964). The consistence of human judgments of relevance. American Documentation, 15(2), 93-95.

Salton, G. (Ed.). (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.

Salton, G. (1986). Another look at automatic text-retrieval systems. Communications of the ACM, 29(7), 648-656.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw Hill.

Saracevic, T. (1991). Individual differences in organizing, searching and retrieving information. Proceedings of the American Society for Information Science, 28, 82-86.

Saracevic, T. (2007a). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13), 1915-1933.

Saracevic, T. (2007b). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58(13), 2126-2144.

Saracevic, T., Kantor, P., Chamis, A. Y., & Trivison, D. (1988). A study of information seeking and retrieving. I. Background and methodology. Journal of the American Society for Information Science, 39(3), 161-176.

Shaw, W. M., Jr., Wood, J. B., Wood, R. E., & Tibbo, H. R. (1991). The cystic fibrosis database: Content and research opportunities. Library & Information Science Research, 13(4), 347-366.

Sormunen, E. (2002). Liberal relevance criteria of TREC: Counting on negligible documents? Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval of the Special Interest Group on Information Retrieval, Association for Computing Machinery, 324-330.

Swanson, D. R. (1965). Evidence underlying the Cranfield results. Library Quarterly, 35(1), 1-20.

Swanson, D. R. (1971). Some unexplained aspects of the Cranfield tests of indexing performance factors. Library Quarterly, 41(3), 223-228.

Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation revisited. Information Processing & Management, 28(4), 467-490.

Taube, M. and Associates. (1955). Storage and retrieval of information by means of the association of ideas. American Documentation, 6(1), 1-17.

Vakkari, P., & Sormunen, E. (2004). The influence of relevance levels on the effectiveness of interactive information retrieval. Journal of the American Society for Information Science and Technology, 55(11), 963-969.

Voorhees, E. M. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5), 697-716.

Voorhees, E. M. (2001). Evaluation by highly relevant documents. Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval of the Special Interest Group on Information Retrieval, Association for Computing Machinery, 74-82.

Voorhees, E. M., & Harman, D. K. (Eds.). (2005). TREC. Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.

Wallis, P., & Thom, J. A. (1996). Relevance judgments for assessing recall. Information Processing & Management, 32(3), 273-286.

Zunde, P., & Dexter, M. E. (1969). Indexing consistency and quality. American Documentation, 20(3), 259-267.

NOTES

(1.) Parts of this paper were reported in Saracevic (2007a and 2007b). Verbatim quotes are clearly indicated.

(2.) Recall can be defined as the probability that a relevant information object will be retrieved, and precision as the probability that a retrieved object will be relevant.

(3.) Interestingly enough, Cranfield tests did not use a computer but simulated computer searching: "At that time there was no program which was remotely capable of doing what was required but fortunately a member of my staff, Michael Keen, came up with an ingenious idea which allowed us to simulate computer searching, albeit with considerable clerical effort." (Cleverdon, 1991, p. 8)

(4.) TREC is a long-term effort at the [US] National Institute of Standards and Technology (NIST) that brings various IR teams together annually to compare results from different IR approaches under laboratory conditions.

(5.) This study and studies that follow are reported and commented upon in Saracevic (2007b, pp. 2134ff.).
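
Note 2 defines recall and precision as probabilities. As a minimal sketch, assuming binary relevance judgments held as sets of document identifiers (the function name and the sample identifiers are hypothetical and do not come from any study cited here), the two measures can be estimated for a single query like this:

```python
def precision_recall(retrieved, relevant):
    """Estimate precision and recall for one query from two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were also retrieved
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 documents retrieved, 5 judged relevant, 2 in common.
precision, recall = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d2", "d4", "d5", "d6", "d7"},
)
print(precision, recall)
```

Because the set of relevant documents comes from human judges, changing the judge changes both figures even when the retrieved set stays the same.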

Tefko Saracevic is Professor II at School of Communication, Information and Library Studies, Rutgers University in New Brunswick, New Jersey. He was the president of the American Society for Information Science and received the Society's Award of Merit (the highest award given by the society). He also received the Gerard Salton Award for Excellence in Research, by the Special Interest Group on Information Retrieval, Association for Computing Machinery (also the highest award given by the group). In a histogram of citations from papers in the Journal of the American Society for Information Science and Technology (JASIST & predecessor names), done by Eugene Garfield from the Web of Science for years 1956-2004 and involving 3,575 authors, Tefko Saracevic ranked first in citations to his work both in articles in the Journal (Total Local Citation Score), as well as in articles globally from that Journal (Total Global Citation Score).

Table 1. Studies Reporting on Consistency of Relevance Judgments.

Resnick & Savage (1964), in the first relevance consistency study on
record, used forty-six technical professionals to assess the relevance
of thirty-four technical reports and patent disclosures, indicating
which of these were relevant to their interests, in order to observe
intra-consistency of relevance judgments. The judges were divided
into four groups, each receiving a different representation: full
text, citation, abstract including citation, and title. The
experiment was repeated after one month. Intra-relevance agreement
on judgments was 54% for full documents, 70% for citations, 61% for
abstracts, and 63% for titles.

Rees & Schultz (1967) used a total of 153 judges divided in seven
groups (as listed below) that were given sixteen documents in
diabetes related to a real research project to judge the relevance
of the documents to each of three research stages in order to,
among others, observe the inter-consistency of relevance judgments
by each group. Respectively, interrelevance agreement for
twenty-one medical librarians--searchers was 44%, twenty-one
medical librarians--non-searchers was 40%, fourteen medical
experts--researchers was 58%, fourteen medical
experts--non-researchers was 56%, twenty-nine scientists was 55%,
twenty-five residents was 51% and twenty-nine medical students was
50%.

Cuadra & Katter (1967) used 230 seniors and graduate students in
psychology (with different levels of experience) to rate relevance
of each of nine psychology journal abstracts against each of eight
short information requirement statements in order, among others, to
observe the degree of inter-judge agreement in relevance ratings as
related to the level of training of the judges in the field. Four
levels of experience were established. The inter-judge correlations
for the four experience levels from lowest to highest were .41,
.41, .49, and .44.

Haynes et al. (1990) studied MEDLINE use in a clinical setting and
not relevance consistency. However, their report does include data
from which consistency rates can be derived. They used forty-seven
attending physicians and 110 trainees who retrieved 5,307 citations
for 280 searches related to their clinical problem, and assessed
the relevance of the retrieved citations. Authors then used two
other search groups of thirteen physicians experienced in searching
and three librarians to replicate 78 of those searches where
relevance was judged by a physician with clinical expertise in the
topic area in order to compare retrieval of relevant citations
according to expertise. For the replicated searches, all searcher
groups retrieved some relevant articles, but only 53 of the 1,525
relevant articles (3.5%) were retrieved by all three search groups.
This is the only real-life study on the question.

Shaw, Wood, Wood, & Tibbo (1991) used four judges to assess the
relevance of 1,239 documents in the cystic fibrosis test collection
to 100 queries. Judged documents were divided into four sets: A
from query author/researcher on the subject, B from 9 other
researchers, C from four postdoctoral fellows, and D from one
medical bibliographer, in order to enable performance evaluations
of different IR representations and techniques using any or all of
the judgment sets. The overall agreement between judgment sets was
40%.

Janes & McKinney (1992) used four students as users with
information requests to judge as to relevance two sets of retrieved
documents that differed in the amount of information presented
(primary judges) and then used four undergraduate students without
and four graduate students with searching expertise (secondary
judges) to re-judge the two sets in order to compare changes in
judgments due to increase in provided information between primary
and secondary judges. The overlap in judgment of relevant documents
(calculated here as sensitivity) between all secondary judges and
primary judges was 68%.

Janes (1994) used thirteen students inexperienced in searching,
twenty experienced student searchers and fifteen librarians to
re-judge twenty documents in each of two topics that were
previously judged as to relevance by users in order to compare
users' versus non-users' relevance judgments. The overall agreement
in ratings between original users' judgments and judgments of the
three groups was 57% and 72% for the respective document sets.

Sormunen (2002) used nine master's students to reassess 5,271
documents already judged on relevance in thirty-eight topics in
TREC-7 and 8 on a graded four-point scale (as opposed to a binary
scale used in TREC) in order to compare the distribution of
agreement on relevance judgment between original TREC and newly
reassessed documents and seek resolution in cases of disagreement.
He found that 25% of documents rated relevant in TREC were rated
not relevant by the new assessors; 36% of those relevant in TREC
were marginally relevant; and 1% of documents rated not relevant in
TREC were rated relevant.

Vakkari & Sormunen (2004) used twenty students to search four
TREC-9 topics that already had pre-assigned relevance ratings by
TREC assessors on a system that provided interactive relevance
feedback capabilities, in order to study the consistency of user
identification of relevant documents as pre-defined by TREC and
possible differences in retrieval of relevant and non relevant
documents. They found that the student users identified 45% of
items judged relevant by TREC assessors.

Lee, Belkin, & Krovitz (2006) used ten experienced searchers (not
indicated as to status) to compare two lists of thirty documents
each for ten TREC topics. The documents were beforehand judged as
to relevance by three judges; then the lists were ordered so that
precision level varied from 30% to 70%. Subjects indicated their
preference between two lists of various precision levels for each
topic. The study was done in order to examine the ability of
subjects to recognize lists that have a higher precision level,
called "right lists" as they contain more relevant documents. The
range of recognition of right lists varied from 14.6% to 31.2%.
Agreement in relevance judgments was 24%.
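
The studies in Table 1 do not all compute agreement in the same way, and the summaries above do not give their formulas. The sketch below illustrates two common choices, both stated here as assumptions rather than as the method of any particular study: overlap (documents both judges call relevant, divided by documents either judge calls relevant) and simple percent agreement over all jointly judged documents. All names and data are hypothetical.

```python
def overlap(relevant_a, relevant_b):
    """Intersection over union of two judges' sets of relevant documents."""
    a, b = set(relevant_a), set(relevant_b)
    union = a | b
    # Two empty sets agree completely by convention.
    return len(a & b) / len(union) if union else 1.0


def percent_agreement(judgments_a, judgments_b):
    """Share of jointly judged documents that received the same binary label."""
    shared = judgments_a.keys() & judgments_b.keys()
    if not shared:
        raise ValueError("the two judges have no documents in common")
    same = sum(judgments_a[doc] == judgments_b[doc] for doc in shared)
    return same / len(shared)


# Hypothetical binary judgments (True = relevant) by two judges on five documents.
judge_a = {"d1": True, "d2": True, "d3": False, "d4": False, "d5": True}
judge_b = {"d1": True, "d2": False, "d3": False, "d4": True, "d5": True}

overlap_ab = overlap(
    {doc for doc, rel in judge_a.items() if rel},
    {doc for doc, rel in judge_b.items() if rel},
)
agreement_ab = percent_agreement(judge_a, judge_b)
print(overlap_ab, agreement_ab)
```

Overlap ignores documents that both judges reject, so it is usually lower than percent agreement on the same data. That is one reason figures reported by different studies are not directly comparable.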

Table 2. Studies Reporting on the Effect of Inconsistency of
Relevance Judgments on IR Test Results

Lesk & Salton (1968) used eight students or librarians (not
specified as to which) who posed forty-eight different queries to
the SMART system containing a collection of 1,268 abstracts in the
field of library and information science, to assess the relevance
of those 1,268 documents to their queries (called the A judgments).
Then a second, independent set of relevance judgments (B judgments)
was obtained by asking each of the eight judges to assess for
relevance six additional queries not of his/her own in order to
rank system performance obtained using four different judgments
sets (A, B, their intersection and union). They found that the
overall agreement between original assessors (A) and eight new
assessors (B) was 30% and concluded after testing three different
IR techniques that all sets of relevance judgments produce stable
performance ranking of the three techniques.

Cleverdon (1970) used three subject experts in aerodynamics (the
field of the collection) to separately judge relevance of documents
retrieved for forty-two questions in Cranfield II tests for which
known relevance scores were originally established by users in
order to observe "whether the new sets of relevance decisions made
any significant difference in the order of merit, as determined by
the normalized recall of the indexing language" (ibid., p. 11).
Nineteen indexing languages were tested. Rank correlation showed
that relevance decisions by different judges did not significantly
affect the comparative results of original rankings for these
languages--the rank correlation between original results and three
new sets was .92, .92, and .94 respectively. Overall agreement in
relevance decisions was not given, although it could be calculated
from data in appendices.

Kazhdan (1979) took the findings from the Lesk & Salton (1968) study
as a hypothesis and used a collection of 2,600 documents in
electrical engineering that had sixty queries with two sets of
relevance judgments--one from a single expert and the other from a
group of thirty experts--in evaluating seven different document
representations in order to compare the performance of different
representations in relation to different judgment sets. He found
that the Lesk & Salton hypothesis was confirmed: the relative ranking of
the seven different representations remained the same over two sets
of judgments; however, there was one exception where ranking
changed.

Burgin (1992) used a collection of 1,239 documents in the cystic
fibrosis collection (Shaw et al., 1991) that had one hundred
queries with four sets of relevance judgments in the evaluation of
six different document representations in order to compare
performance as a function of different document representations and
different judgment sets. The overall agreement between judgment
sets was 40%. He found that there were no noticeable differences in
overall performance averaged over all queries for the four judgment
sets; however, there were many noticeable differences for
individual queries.

Wallis & Thom (1996) used seven queries from the SMART CACM
collection of 3,204 computer science documents (titles and in most
cases, abstracts) that already had relevance judgments by SMART
judges in order to compare two retrieval techniques. Then two
judges (paper authors, called judge 1 and 2) assessed separately 80
pooled top-ranked retrieved documents for each of seven queries in
order to rank system performance using three different judgments
sets (SMART, intersection and union of judge 1 and 2). They found
that the overall agreement between original assessors (SMART) and
two new assessors (judge 1 and 2) on relevant documents was 48%.
After testing two different IR techniques they concluded that the
three sets of relevance judgments did not produce the same
performance ranking of the two techniques, but the performance
figures for each technique are close to each other in all three
judgment sets.

Voorhees (2000) (also in Voorhees & Harman, 2005, pp. 44, 68-70)
reports on two studies involving TREC data. (Reminder: A pool of
retrieved documents for each topic in TREC is assessed for
relevance by a single assessor, the author of the topic, called
here the primary assessor). In the first study, two additional (or
secondary) assessors independently rejudged a pool of up to 200
relevant and 200 nonrelevant documents as judged so by the primary
assessor for each of the 49 topics in TREC-4. Then the performance
of 33 retrieval techniques was evaluated using three sets of
judgments (primary, secondary union, and intersection). In the
second study, an unspecified number of assessors from a different
and independent institution, Waterloo University, judged more than
13,000 documents for relevance related to fifty TREC-6 topics;
next, the performance of seventy-four IR techniques was evaluated
using three sets of judgments (primary, Waterloo union and
intersection). Both studies were done in order to look at the
effect of relevance assessments by different judges on the
performance ranking of the different IR techniques tested. She
found that in the first study, the mean overlap between all
assessors (primary and secondary) was 30%, and in the second study,
33%. After testing thirty-three different IR techniques in the
first and seventy-four in the second test, she concluded: "The
relative performance of different retrieval strategies is stable
despite marked differences in the relevance judgments used to
define perfect retrieval" (Voorhees 2000, p. 714). Swaps in ranking
did occur but the probability of the swap was relatively small.

Voorhees (2001) used fifty topics created for the TREC-9 Web track
and asked assessors to judge retrieved pages on a three point
scale: relevant, highly relevant, not relevant (as opposed to
general TREC assessments that use a binary relevance
scale--relevant and not relevant). The assessments were done by a
primary judge and then the relevant and highly relevant documents
were re-assessed by two other secondary assessors. All assessors
were also asked to identify the best page or pages for a topic. The
study was done in order to examine the effect of highly relevant
documents on the performance ranking of the different IR techniques
tested. She found that "different retrieval systems are better at
finding the highly relevant documents than those that are better at
finding generally relevant documents." (ibid., p. 76) This
conclusion contradicts the finding of the previous (Voorhees, 2000)
study which concluded that relative effectiveness of retrieval
systems is stable despite differences in relevance judgment sets.
"The ability to separate highly relevant documents from generally
relevant documents evidently is correlated with systems
functionality, and thus differences among systems are reflected in
the average score" (ibid., p. 77). The agreement among three
assessors as to the best pages for a topic was 34%.
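
Several studies in Table 2 ask whether the ranking of systems changes when the judgment set changes, and Cleverdon (1970) reports rank correlations for that comparison. The summaries do not say which coefficient was used, so the sketch below assumes Spearman's rho without ties; the system names and scores are hypothetical.

```python
def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score tables (system -> score).

    Assumes every system appears in both tables and no scores are tied.
    """
    systems = sorted(scores_a)
    if set(systems) != set(scores_b):
        raise ValueError("both tables must cover the same systems")
    n = len(systems)
    if n < 2:
        raise ValueError("need at least two systems to compare rankings")

    def ranks(scores):
        # Rank 1 goes to the best-scoring system.
        ordered = sorted(systems, key=lambda s: scores[s], reverse=True)
        return {s: position + 1 for position, s in enumerate(ordered)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    squared_diffs = sum((ranks_a[s] - ranks_b[s]) ** 2 for s in systems)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical effectiveness scores for four systems under two judgment sets.
under_judges_a = {"s1": 0.40, "s2": 0.35, "s3": 0.30, "s4": 0.20}
under_judges_b = {"s1": 0.30, "s2": 0.32, "s3": 0.22, "s4": 0.15}

rho = spearman_rho(under_judges_a, under_judges_b)
print(rho)
```

In this made-up example every score drops under the second judgment set and the top two systems swap places, yet the correlation stays high. That is the general pattern the table describes: absolute figures move with the judges while relative rankings mostly hold.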
COPYRIGHT 2008 University of Illinois at Urbana-Champaign
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2008 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Saracevic, Tefko
Publication:Library Trends
Article Type:Report
Geographic Code:1USA
Date:Mar 22, 2008
Words:10286