Finding needles in database haystacks.

Computers have become the repositories of vast amounts of information, ranging from electronic messages and bulletins to newspaper articles, research papers, textbook materials, documents and dictionaries. Whereas storing large masses of information is relatively easy, retrieving particular items from such enormous stocks can prove both time-consuming and frustrating. Text retrieval is especially difficult when a database contains material covering an unlimited range of subjects expressed in widely varying vocabularies. Because some words (plasma, for example) mean different things in different contexts, conventional search and retrieval methods, which rely on indexes consisting of sets of key words or phrases, are unreliable and difficult to apply.

Computer scientists Gerard Salton and Chris Buckley of Cornell University have now developed an alternative approach for extracting relevant information from a large, diverse database. Their scheme, described in the Aug. 30 SCIENCE, relies on automated techniques for evaluating the degree of similarity between different pieces of text. The method involves breaking down each piece of text into such units as sections, paragraphs and sentences, then assigning to each unit a set of terms used to represent its content.
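
The decomposition step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the article does not say how units are delimited or how terms are chosen, so the blank-line and punctuation splitting rules, the stop-word list, and the names `index_text`, `terms` and `STOP_WORDS` are all hypothetical.

```python
import re

# Hypothetical stop-word list; the article does not say which words are dropped.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
              "for", "on", "with", "that", "it", "as", "by"}

def split_paragraphs(text):
    """Split text into paragraphs on blank lines (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive sentence split: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def terms(unit):
    """Return the set of lowercased non-stop words standing in for a unit's terms."""
    words = re.findall(r"[a-z]+", unit.lower())
    return {w for w in words if w not in STOP_WORDS}

def index_text(text):
    """Return (level, unit_text, term_set) tuples for each paragraph and sentence."""
    units = []
    for para in split_paragraphs(text):
        units.append(("paragraph", para, terms(para)))
        for sent in split_sentences(para):
            units.append(("sentence", sent, terms(sent)))
    return units
```

Calling `index_text` on a two-paragraph passage yields one entry per paragraph plus one per sentence, each paired with its term set, so that units of different sizes can later be compared with one another.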

Suppose that a user of a digitally stored encyclopedia wants to find all material related to astronomical instruments. The user selects a single article, perhaps on telescopes, as her starting point. She then asks the computer to look for all other articles containing material similar to that in the telescope article. The computer proceeds by evaluating the degree of similarity, expressed according to a set of special formulas, between the telescope article and the material in the rest of the database. On the basis of those calculations, the computer then selects other articles that appear relevant to the topic. Instead of starting with a text excerpt or article already in the database, a user can also write out a request for information, expressed in English-language sentences that provide a good description of the required material.
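
A rough sketch of such a similarity search follows. The article does not give the "special formulas," so this example substitutes a common stand-in, cosine similarity over raw word counts, and should not be read as the researchers' actual measure. The function names, the sample articles and the `threshold` cutoff are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count lowercased words; a simple stand-in for a weighted term vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_similar(query_text, articles, threshold=0.1):
    """Score every article against the query and return (title, score) pairs,
    best first, keeping only scores at or above the assumed threshold."""
    query = term_vector(query_text)
    scored = [(title, cosine_similarity(query, term_vector(body)))
              for title, body in articles.items()]
    kept = [(title, score) for title, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical miniature "encyclopedia" for illustration only.
articles = {
    "binoculars": "binoculars use lenses to magnify distant objects",
    "sugar": "firms sell low calorie sugar substitutes",
}
results = rank_similar("telescopes use lenses and mirrors to observe distant objects",
                       articles)
```

Because `rank_similar` takes plain text as its query, the same call serves both cases the article mentions: the query can be the text of an existing article or a request written out in ordinary English sentences.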

The scheme's efficiency and convenience depend on how effectively it identifies related text passages. Preliminary tests have proved encouraging, the researchers say. "No other text search and retrieval approach currently contemplated appears to offer equal promise for unrestricted text environments and arbitrary subject matter," they conclude.
COPYRIGHT 1991 Science Service, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1991, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:new method for text retrieval
Publication:Science News
Date:Sep 7, 1991