Pigeonholing text

The enormous task of categorizing and retrieving information from the vast quantities of text stored in digital form has spurred the development of a variety of strategies for finding the textual needle in the database haystack. Most of these automated techniques rely on the identification of specific words and phrases after sentences and paragraphs are stripped of extraneous material (SN: 9/7/91, p.155). However, such methods often require some degree of expert human participation in their development and setup. They have trouble with misspellings and garbled text, and they are usually suitable only for specific topics or languages.

Now, Marc Damashek of the Department of Defense's National Computer Security Center at Fort George G. Meade, Md., has developed a text categorization and retrieval technique that works equally well in any language and requires practically no human preparation. His method, known as Acquaintance, is purely statistical. "No prior information about document content or language is required," Damashek says. His software divides text samples into sequences made up of a given number of consecutive characters, then computes how often each distinct sequence appears in the document.
To gauge similarity, Damashek assumes that two documents showing comparable patterns are likely to deal with related subjects. Tests of the technique show that it performs well for grouping documents by language, topic, and subtopic, Damashek says. He describes the method in the Feb. 10 Science.
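
The general idea of counting character sequences and comparing the resulting patterns can be illustrated with a short Python sketch. This is not Damashek's Acquaintance software. The function names `ngram_profile` and `similarity`, the default sequence length of 5, and the use of cosine similarity as the comparison measure are all assumptions made for illustration; the article says only that documents with comparable patterns are treated as related.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Relative frequency of each distinct run of n consecutive characters.

    The sequence length n=5 is an illustrative assumption.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than n characters
    return {gram: c / total for gram, c in counts.items()}


def similarity(p, q):
    """Cosine of the angle between two profiles.

    Cosine similarity is an assumed stand-in for the comparison step;
    it is 0.0 when the profiles share no sequences.
    """
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm = sqrt(sum(w * w for w in p.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up sample sentences, two in English and one in Spanish.
    english_a = ngram_profile("the cat sat on the mat while the dog slept")
    english_b = ngram_profile("the dog sat on the rug while the cat slept")
    spanish = ngram_profile("el gato se sienta en la alfombra mientras el perro duerme")
    print("English vs. English:", similarity(english_a, english_b))
    print("English vs. Spanish:", similarity(english_a, spanish))
```

Because the sketch works on raw characters and never looks up words, it needs no dictionary, word list, or language-specific setup, which is the property the article attributes to the statistical approach.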