Vaulting the language barrier: computers are helping to search texts and data now shrouded in linguistic differences.

Marjorie Hlava can't read Russian, but that doesn't stop her from learning the contents of a document printed in the Cyrillic alphabet. She simply places each page under the cover of the flatbed scanner in her Albuquerque office, presses a button, and waits as her computer displays an English-language version. Using only English, she can also search Russian databases, such as files of published scientific reports. She types in the key words or phrases that describe her interests, then lets a series of computer programs take over. After converting her request into Russian, they sift through data files for references to documents that seem to match, convert those matches back into English, and display them on her computer.

More than once she has even conversed via her laptop computer (on a plane, for instance) with Russians who know no English. She types her side of the dialogue in English, which the computer converts into a Russian display. The other party types his or her responses in Russian, which the computer translates for Hlava. They can chat for hours that way, provided they restrict their words and phrases to those in the thesauruses, or set lists of words, on her machine. That isn't too hard, Hlava notes, since the Russian-to-English portion currently contains some 750,000 words and phrases and the English-to-Russian one nearly 600,000.

Most of the software programs that allow fairly inexpensive, off-the-shelf computer hardware to translate Russian are preliminary versions being developed by Gerold G. Belonogov and Boris A. Kuznetsov at VINITI, the All-Russian Institute for Scientific and Technical Information in Moscow. Hlava's company, Access Innovations, helped channel some U.S. government financing into the creation of those systems.

As the Internet has been demonstrating over the past few years, "we now have access to an enormous amount of information that didn't used to be available," notes Douglas W. Oard of the University of Maryland.

Because users seldom pay for data they find on the Web, there is little incentive for those who post the information to invest in expensive, time-consuming multilanguage translations or indexing. What a user needs to make full and efficient use of a foreign database or the Internet, Oard explains, is a system that translates among languages, searches effectively for answers to a user's query or stated interests, and then ranks any matches by the likelihood of their satisfying a particular user's needs.

For many persons interested in focused areas of science or engineering, such as the microwave heating of plasmas or drugs to treat cancer patients, "the things that Marjorie Hlava [and her VINITI colleagues] do are just as good as you would like," observes Oard. "The limitation is that humans can find them difficult to use"; that is, they need to be trained in effective search strategies.

He and a host of others are working to make foreign data and files easily accessible to an even broader audience, one with little training in data searches. Unfortunately, he says, "we're only about half as good as you'd like at doing this. And getting halfway turns out to have been rather easy." It's the second half that will prove costly in both time and money, he maintains.

The payoff could prove substantial, he and Hlava agree. Such efforts could uncloak a world of research and data for people who don't speak a foreign language.

Today, computer technologies are being developed to translate a wide range of mother tongues. At the behest of the European Parliament, for instance, several ambitious programs are working to make documents prepared in English or French intelligible to those who read any of the other nine official languages of the European Union. Even more challenging projects around the world seek to pair English with languages written in non-Roman characters, such as Japanese, Chinese, Greek, Arabic, Russian, Korean, and Vietnamese.

Few of these efforts are designed to provide full machine translation of the documents; rather, their aim is a more limited rendering of some important aspects, such as titles, key words, or abstracts. Indeed, this may be sufficient if the goal is merely to identify a few particularly valuable documents that a user might then choose to have translated in full, Oard observes.

The projects could also help electronic browsers identify more circumscribed information, such as images posted on the Internet with captions in a foreign language, names and affiliations of foreign scientists who have conducted research on a topic of interest, or newly coined foreign terms or short quotations in a text.

Even limited cross-language identification and retrieval of electronically stored text represents a tall order, Oard notes.

For instance, even within a single language, commercial database searching remains a fairly unscientific, "seat-of-the-pants thing," observes Richard S. Marcus, an information scientist at the Massachusetts Institute of Technology. What's not well recognized, he says, is that unless someone is an expert in searching or has the services of a good librarian, "you typically are able to retrieve only about 5 percent of the relevant documents available."

By employing certain computer techniques that he says are available only on experimental systems, "you can bring the comprehensiveness of a search as close to 100 percent as you like." With several interactions, sophisticated programs can prompt a user to find the most effective words for a query. Marcus maintains that this extra effort "can make all the difference between getting almost nothing and getting everything you want."

Before computers were in wide use, librarians indexed documents with a few key words, the ones that appeared in a card catalog. Such limited indexing "is not very good for detailed analysis of articles and documents," Marcus says, "because a few terms won't cover all of their information." Moreover, unless the wording of an indexed portion of some text (often the title or abstract) is restricted to terms in a thesaurus, an indexer might employ words that a later searcher wouldn't think to use.

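
The vocabulary mismatch between indexer and searcher can be illustrated with a small sketch. It is not drawn from any of the systems described here; the thesaurus entries, document ids, and function names are all hypothetical. It only shows why restricting index terms to a thesaurus helps: both the indexer's wording and the searcher's wording are mapped to one preferred term before they are compared.

```python
# Hypothetical mini-thesaurus: each variant wording maps to one preferred
# index term. Entries are illustrative, not taken from a real thesaurus.
THESAURUS = {
    "heart attack": "myocardial infarction",
    "cardiac infarction": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}


def preferred_term(phrase):
    """Return the thesaurus's preferred term, or the phrase itself if unlisted."""
    key = phrase.strip().lower()
    return THESAURUS.get(key, key)


def build_index(documents, controlled):
    """Map each index term to the set of document ids filed under it."""
    index = {}
    for doc_id, indexer_terms in documents.items():
        for term in indexer_terms:
            key = preferred_term(term) if controlled else term.strip().lower()
            index.setdefault(key, set()).add(doc_id)
    return index


def search(index, query, controlled):
    """Look up a query, normalizing it through the thesaurus if requested."""
    key = preferred_term(query) if controlled else query.strip().lower()
    return sorted(index.get(key, set()))


# Hypothetical documents, each with the key words an indexer chose for it.
DOCUMENTS = {
    "doc1": ["heart attack"],
    "doc2": ["Myocardial infarction"],
    "doc3": ["high blood pressure"],
}

free_index = build_index(DOCUMENTS, controlled=False)
controlled_index = build_index(DOCUMENTS, controlled=True)

# The searcher uses a wording that no indexer happened to choose.
print(search(free_index, "cardiac infarction", controlled=False))
print(search(controlled_index, "cardiac infarction", controlled=True))
```

With free-text key words the query finds nothing, because no indexer used that exact phrase; with the shared vocabulary it reaches both documents filed under the preferred term.
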
With computers, "you can now index all of the words in a document" for full-text querying, Marcus notes. Yet even this does not always prove satisfactory. If an author used the word "Cessna" in his text and a searcher attempted to retrieve it by asking for references to small planes, even a full-text search would miss what conceptually should have been a valid match.

"So our research over the past 20 years has been to make key-word use smarter" by getting the computer to suggest synonyms, Marcus says. Not only might it point out that a Cessna is a type of small plane, it might also ask whether it should expand the ongoing search to include other small planes, perhaps helicopters, even dirigibles.

Alternatively, the computer may attempt to narrow an overly broad search by soliciting feedback on its first few matches. The computer can then look for a pattern in what was rejected or ask the user why certain choices were rejected, then refine subsequent searches based on the response.

British computer scientist Steven Pollitt of the University of Huddersfield's Centre for Database Access Research is taking a similar tack. His computer-aided searches ask the user what terms he or she would like to begin with and use them as a departure for identifying related search terms, some broader, some narrower in focus.

If a searcher typed in Alzheimer's disease, for example, the computer would flash a list of related terms, such as Alzheimer's syndrome and Alzheimer fibrillary lesion. A number next to each term shows how many documents match it. The computer can also search simultaneously for texts fitting additional categories, such as a country (where clinical trials may have occurred), drugs, or other treatments (such as acupuncture), and count or display all texts that match the combination.

The key to making this approach work is a comprehensive list of index terms that have been organized into hierarchies, Pollitt explains. Degenerative disease, for instance, would contain a file of terms for Alzheimer's and other chronic illnesses. Choosing Alzheimer's would allow the computer to suggest broader terms, such as degenerative disease, or narrower ones.

For searching to work effectively, the developers of a database must have indexed all texts using an agreed-upon vocabulary, and the more specific the vocabulary, the better. The European Parliament has a list of 6,000 terms, known as EUROVOC, to index all subjects in its documents, from politics and law to science. Only a few dozen of these EUROVOC terms deal with medicine. In contrast, the National Library of Medicine has compiled a working list of more than 17,000 words for indexing articles cited in its MEDLINE database.

Searching success also improves, Pollitt notes, when each starting thesaurus is tailored to the vocabulary of a particular field, such as medicine or physics. This will limit confusion among terms common to both but having quite different meanings, such as plasma. To physicists, it's an ionized gas, whereas to biochemists it's blood minus its cellular components.

Belonogov, who is a linguist, has embedded 21 such thematically organized dictionaries (covering such subjects as ecology, geophysics, and foreign trade) within his thesauruses. To limit confusion further, the thesauruses treat as a single term many commonly used phrases up to 13 words long. In fact, about 75 percent of the English entries involve word combos, such as "bottom line," "ballistic missile," or "might be interested in."

When they surveyed the field last year, Oard and Maryland colleague Bonnie J. Dorr found few commercial systems that ranked potential matches. So if 20,000 potential matches are identified, a user must sift through them all to find the few that might be valuable.

Though the VINITI browser does rank its responses, "the drawback is that those responses are in Cyrillic," Hlava says.
Nonetheless, it can prove useful when coupled to VINITI's translator programs. Together, the pair can search and retrieve documents from Russia's scientific holdings, which include not only Russian documents but also those published by Russia's trading partners, such as the former Soviet republics, North Korea, Syria, Iran, and Iraq.

MIT's experimental system attempts to rank matched terms on the basis of how they were used or where they appeared. For instance, Marcus says, "we have demonstrated that the title words are most important." So if a queried term appears there, the document will be ranked higher than another in which the same term is buried in the text.

Pollitt has tested his searching system on a database of 600,000 medical citations written in a host of European languages. He has also tested it by querying and retrieving citations (in English or Japanese) from INSPEC, a British bibliographic database covering texts on physics, electronics, and computing. He says the system can now be developed commercially. Similarly, Marcus believes the system his team has developed is ready for commercialization.

Though VINITI's systems are still under development, working versions are available from the institute in Moscow and from Hlava. However, Hlava notes, money to refine them has all but dried up. The software programs that she marries into working systems still have a way to go before they offer "transparent" translation capabilities to both English and Russian readers, she says. "It breaks my heart," she told Science News, "that we can't get these technologies off the ground."

Hlava says $20,000 would enable the VINITI team to develop a version of the translation and searching programs that would be compatible with Microsoft Windows, the primary organizing software on desktop computers today. The Moscow researchers have no money to invest in it, however: Not only are they working without pay, they don't have money to heat their offices this winter.

Indeed, most of these programs suffer from a paucity of both financing and visibility. Oard hopes to counter the latter through a symposium he's organizing under the auspices of the American Association ...

Among the challenges, he says, are programs to revise thesauruses automatically as languages grow and change, to identify words in languages like Chinese and Vietnamese, which do not put spaces between words, and to insert verbs in languages, such as Arabic, that frequently use nouns in place of verbs.
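
The word-identification challenge for text written without spaces can be illustrated with a short sketch. It uses greedy forward maximum matching, a simple textbook baseline chosen here for illustration; the article does not say which method any of the researchers use. The function name and the tiny lexicon are hypothetical.

```python
def segment(text, lexicon, max_len=None):
    """Split unspaced text into words by greedy forward maximum matching.

    At each position the longest lexicon entry that fits is taken; a single
    character is emitted as a fallback when nothing in the lexicon matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon: machine, translation, machine translation,
# system, research.
LEXICON = {"机器", "翻译", "机器翻译", "系统", "研究"}

print(segment("机器翻译系统", LEXICON))
print(segment("新系统", LEXICON))
```

Because the longest match always wins, the phrase for "machine translation" is kept whole rather than split into "machine" and "translation", which mirrors the multiword thesaurus entries described earlier. The same greediness is also the method's weakness: it cannot back up when an early long match forces a bad split later, and characters missing from the lexicon fall out one at a time.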