Speaking in tongues: foreign language KM technologies.
This article is the first part in a two-part series in which I will introduce a number of different foreign language KM technologies. In this segment, I focus on unstructured text mining tools that provide users with natural language processing, language identification, transliteration and name normalization capabilities. In a follow-on article later this year, I will focus on speech-to-text and machine translation systems.
In previous pieces, I've probably sounded like a broken record when talking about metadata, and I am not going to stop now. Most of the metadata extraction tools I've discussed in the past are available in a variety of foreign language formats. As you might expect, those tools include document categorizers, clustering engines, classifiers, named entity extractors, summarizers and indexers, just to mention a few.
Many companies produce foreign language metadata extraction and generation technologies. Their products are the cornerstones of many modern KM systems, and many readers are probably already familiar with them. The technologies are important to this discussion because they lay the foundation for building more complex and sophisticated foreign language technologies. Table 1 on page 9 lists some of those KM technology companies and the types of products they sell.
When considering natural language processing, language identification, name normalization and transliteration, Basis Technology (basistech.com) tends to stand out in the crowd. Basis' linguistic support products provide excellent examples of the commercial state of the art in foreign language support and KM.
Basis Technology is well known in the linguistics and foreign language support community for its Rosette Linguistics Platform (RLP). RLP provides a multilanguage platform for large-scale text management and exploitation systems that identify, analyze, index, search and transliterate unstructured text in Asian, European and Middle-Eastern languages.
RLP provides a multifaceted toolkit, which can add internationalization services to existing software applications. It also provides a variety of analytic functions to build comprehensive and sophisticated foreign language text mining solutions. Basis' products help other companies that need multilingual software support for unstructured text processing, by providing the specific services shown in Figure 1 on page 9.
[FIGURE 1 OMITTED]
For the sake of this discussion and as shown in Figure 1, I'm going to group the capabilities of RLP into two categories: 1.) basic services and 2.) advanced services. Those are my own groupings and have nothing to do with Basis or its product names or marketing conventions per se. Rather, they are a way to help you view RLP in the context of more or less common capabilities in the marketplace.
RLP's language processing capabilities are built on top of a variety of basic linguistic services including 1.) core language support via Unicode, 2.) base linguistics and 3.) entity extraction.
Rosette Core Library for Unicode (RCLU)
Rosette Core Library for Unicode (RCLU) helps organizations with multiple language support requirements implement standard Unicode encoding for a variety of global languages. RCLU is a set of programming libraries written in C that lets software developers add Unicode support to their software rather than develop it all themselves. RCLU supports multiple computer platforms, including Windows, Linux, Mac OS and Unix.
Unicode is often referred to by the names of its two most common encoding forms, UTF-8 and UTF-16 (Unicode Transformation Format, built on 8-bit and 16-bit code units, respectively). Every character in a Unicode-enabled system is assigned a unique code point, which UTF-8 stores in one to four bytes and UTF-16 stores in one or two 16-bit units. Unicode is a standards-based digital encoding scheme for internationalization of software and computer systems, which allows software manufacturers to implement support for dozens of different scripts, including Roman, Cyrillic, Hebrew and Arabic, as well as the scripts used for Chinese, Japanese and Korean. In UTF-8, basic Roman letters take one byte, Cyrillic, Hebrew and Arabic letters take two, and most Chinese, Japanese and Korean characters take three; in UTF-16, nearly all of those characters take two bytes.
Unicode also provides a standardized means for operating system software vendors to present font- and language-based information to your computer and peripheral devices (e.g., on your screen and from your printer), and to accept input from keyboards and other language-dependent devices in a standardized, language-independent fashion.
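The relationship between a character, its code point and its encoded size can be seen in a few lines of Python. This is a minimal sketch using only Python's built-in string handling; it is not RCLU code, and the `describe` helper and the sample characters are chosen purely for illustration.

```python
# Each character has exactly one Unicode code point; UTF-8 and UTF-16 are
# two different ways of serializing that code point to bytes.
SAMPLES = {"Roman": "A", "Cyrillic": "\u042F", "Arabic": "\u0639", "Japanese": "\u8A9E"}

def describe(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))  # big-endian, no byte-order mark
    return (code_point, utf8_len, utf16_len)

for script, ch in SAMPLES.items():
    print(script, ch, describe(ch))
```

The code point never changes; only the number of bytes needed to store it depends on the encoding form chosen.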
Rosette Base Linguistics (RBL)
There are multiple technology approaches for building language support tools for the wide variety of languages currently spoken throughout the world. Those can be broken down into two principal approaches: 1.) statistical methods and 2.) natural language methods. My experience has shown that a deep understanding of natural language rules and heuristics more accurately identifies specific characteristics and detail within a language set than do statistical approaches, which are more generalized. Basis uses specific natural language approaches rather than statistical approaches for its RBL language processing capabilities. However, other support tools within Basis' portfolio of foreign language technologies do include statistical methods for foreign language processing.
Underlying RBL, Basis uses natural language-based morphological techniques in developing its core technology platform. That is, RBL understands the specific parts of a given source language in fine detail. That includes grammar, spelling, punctuation, parts of speech, word roots and variants, grammatical gender and other detailed rules that are often extremely nuanced for a given language. RBL also supports other linguistic methods for language analysis, including normalization of parts of speech, segmentation, decompounding, lexical stemming (i.e., reducing inflected or derived words to their stem) and handling of words with compound meaning.
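To give a feel for what stemming means in practice, here is a deliberately tiny Python sketch. It is a toy suffix-stripping routine for English only, not the morphological analysis RBL performs; the rule list and the `toy_stem` name are assumptions made up for this illustration, and a real analyzer relies on per-language lexicons and grammar rules.

```python
# Toy rules: (suffix to remove, text to put in its place). English only.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: len(w) - len(suffix)] + replacement
    return w

for word in ["names", "indexing", "Companies", "extracted"]:
    print(word, "->", toy_stem(word))
```

Even this small example shows why rule-based approaches take years to mature: every language needs its own rules, and each rule has exceptions that a short suffix list cannot capture.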
Software that is based on the rules of natural language tends to evolve and improve in accuracy only over time, and it requires a long-term commitment to investment in research, development and product enhancement. Basis' continued work on its RBL system demonstrates that kind of commitment. Figure 2 shows a screen shot of RBL identifying language, word, part of speech (POS) and stem.
[FIGURE 2 OMITTED]
Rosette Entity Extractor (REX)
The Rosette Entity Extractor (REX) understands noun phrases within sentences in multiple languages. It specifically extracts names, places, dates and other text components. Entity extraction is an important part of providing structure to unstructured text and is critical in text mining activities. REX is built using a statistical processing engine that learns by experience using large training sets of foreign language documents. Basis has designed REX so that it comes ready to use with a variety of core languages. Out of the box, REX supports Arabic, Chinese, Dutch, English, French, German, Italian, Japanese and Spanish.
It is also very easy to extend the named entity extraction service to other foreign language models, and Basis is continually working to add new off-the-shelf language support for important languages of interest to its customers. Figure 3 shows an example of how REX identifies each language in a document, categorizes each entity type by a color code, identifies the different scripts or writing within the document and identifies the digital encoding schemes used within the document.
[FIGURE 3 OMITTED]
RLP offers a number of advanced services built on top of its basic services, which I would categorize as follows: language and document encoding identification, name matching (aka name normalization) and transliteration services. As with the basic RLP services, these services were developed by combining a variety of core linguistic capabilities.
Rosette Language Identifier (RLI)
The Rosette Language Identifier (RLI) is a critical part of multilingual text analytics. For example, consider multilingual search engines whose indexing processes crawl Web sites in multiple languages all over the world. A search engine that supports multilingual indexing must be able to ingest any text it finds and quickly and accurately determine what language the indexing engine must use to process the data.
RLI has the unique capability to reliably and quickly recognize the language of a document, or multiple languages within a single document. At its core, RLI relies on the linguistic concept of an n-gram, a sequence of n consecutive characters taken from a piece of text. Breaking text into n-grams makes it possible to compare that text with other text.
For each language grouping, Basis builds a profile of the n-grams describing a particular language. Out of the box, RLI has 114 profiles that can recognize 43 discrete native languages and 33 native encodings, and it includes support for UTF-8 for every supported language.
The technology requires only plain text as an input and statistically ranks its findings with the most likely candidate first, followed by multiple matches in descending order. Figure 4 shows RLI detecting a language and then ranking its findings. As can be seen in Figure 4, the language profile that returns the largest number of n-gram matches from the input text is ranked the highest.
[FIGURE 4 OMITTED]
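The profile-and-rank idea can be sketched in a few lines of Python. This is only an illustration of n-gram matching as described above, not RLI's implementation: the three one-sentence "training" samples, the use of 3-grams, and the names `ngrams`, `PROFILES` and `rank_languages` are all assumptions made for the example, and a real profile would be built from far more text.

```python
def ngrams(text, n=3):
    """Character n-grams of text after lowercasing and removing whitespace."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# Tiny illustrative samples (assumptions, not real language profiles).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and the cat",
    "French": "le renard brun saute par dessus le chien paresseux et le chat",
    "German": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
PROFILES = {lang: set(ngrams(text)) for lang, text in SAMPLES.items()}

def rank_languages(text):
    """Return (language, match count) pairs, most n-gram matches first."""
    grams = ngrams(text)
    scores = {lang: sum(g in profile for g in grams)
              for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

print(ngrams("good morning")[:5])
print(rank_languages("the lazy cat"))
print(rank_languages("le chat brun"))
```

The ranking step mirrors the behavior described for Figure 4: the profile with the most n-gram matches comes first, and the remaining candidates follow in descending order.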
The technology requires approximately 128 bytes of data for 100 percent detection and can identify languages using very limited information, such as the title of an RSS feed, the title of an HTML document or the subject of an e-mail message. With the latest version of the technology, Version 5, Basis characterizes RLI as having an accuracy of 91 percent. That is, in a test of 800,000 HTML titles with an average of 39 characters in each title, it misidentified less than 9 percent of the languages in the corpus of documents.
Rosette Name Matcher (RNM)
One of the most difficult problems in linguistics, particularly for non-native speakers of a given language, is recognizing a person or place name and all of its possible variations. Rosette Name Matcher (RNM) helps solve a variety of related problems whereby a person or place may have more than one name. A given name may have multiple spellings in different parts of the world. There may be no international standard for spelling a person's name or a location name.
If you follow the news of the ongoing wars in Iraq and Afghanistan, you are likely to see person and place names given in Arabic that leave you confounded. Arabic and other Middle-Eastern languages often have complex naming conventions and rules that make it difficult for speakers of other languages, such as English, to recognize given entities or locations.
RNM solves the name permutation problem in multiple languages including Arabic, Chinese, Korean, Pashto, Persian and Urdu. It allows a user to use his or her language to look up names found in target foreign language documents. For example, for Arabic names in Arabic script in a database, an English-speaking user would be able to enter them using the English letter sounds that match the Arabic language sounds. That phonetic approach is easy for the user to understand and implement.
RNM analyzes the queries using fuzzy algorithms and can successfully align even a partial match. For example, I might look up the name Gaddafi (as in Colonel Qadhdhafi of Libya--I'm certain you remember who he is). If you think about it, there are many different ways I might spell his name. RNM will match my attempt to spell the name Gaddafi in English and give me the appropriate spelling in the native Arabic script. Basis maintains a lexical database containing the proper name spelling for most common entity names. Figure 5 on page 11 shows all the different ways in which I might attempt to spell Gaddafi phonetically in English and the actual resulting name given in Arabic as supplied by RNM.
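A rough sense of how many English spellings can collapse onto one name comes from the following Python sketch. It is not RNM's algorithm: the `phonetic_key` rules are a crude, made-up consonant skeleton, and the two-entry `LEXICON` is a hypothetical stand-in for a real lexical database.

```python
import re

def phonetic_key(name):
    """Reduce a romanized name to a crude consonant skeleton (toy rules)."""
    s = re.sub(r"[^a-z]", "", name.lower())
    s = re.sub(r"kh|gh|q|g", "k", s)   # treat q, g, kh and gh as one k-like sound
    s = re.sub(r"dh|th", "d", s)       # treat dh and th as a d-like sound
    s = re.sub(r"(.)\1+", r"\1", s)    # collapse doubled letters
    return re.sub(r"[aeiouy]", "", s)  # drop vowels, which vary the most

# Hypothetical lexicon: (reference romanization, native-script spelling).
LEXICON = [("Qadhdhafi", "القذافي"), ("Tikrit", "تكريت")]
INDEX = {phonetic_key(roman): native for roman, native in LEXICON}

def lookup(query):
    """Return the native-script spelling whose key matches the query, or None."""
    return INDEX.get(phonetic_key(query))

for attempt in ["Gaddafi", "Qadhdhafi", "Kadafi", "Khadafy", "Gathafi"]:
    print(attempt, phonetic_key(attempt), lookup(attempt))
```

Rules this blunt will also merge names that should stay distinct, which is one reason a production matcher combines phonetic rules with fuzzy scoring and a curated lexicon.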
[FIGURE 5 OMITTED]
Putting it all together, RLP uses its various linguistics components to build applications that use transliterations written in one language to approximate the sounds of the actual foreign language word. In fact, transliteration means "a systematic way to convert characters in one alphabet or phonetic sounds into another alphabet." Basis supplies a variety of interactive transliteration tools built on its various RLP components.
One of Basis' most recent products is the Transliteration Assistant (XA), which is a plug-in for Microsoft Word, Excel and Access. The company also has a variety of other custom transliteration tools in multiple languages that provide standardized ways to spell person and geographical location names. Basis' transliteration tools observe a variety of different transliteration standards depending on the language in question. For example, with regard to Arabic, there are four different transliteration standards. When U.S. government or intelligence community (IC) personnel perform transliteration services, they must follow congressionally mandated "IC standard" transliteration rules. In addition to the IC standard, Basis supports the U.S. Board on Geographic Names (BGN), the Standard Arabic Technical Transliteration System (SATTS) and Basis' own internal transliteration format.
The XA tools for MS-Word, Excel and Access are convenient, particularly for linguists who work in a standard Microsoft Office environment. The plug-ins simply add a "Transliteration" menu option to the menu bar of the standard MS-Word, Excel or Access application. The tool works the same way in all three Microsoft applications; however, I personally like the MS-Excel integration because I can use spreadsheet functions to build a spreadsheet of transliterated names in different formats.
In Figure 6, I show a spreadsheet I built containing a list of Arabic names. In the first column are unvocalized (without the vowel marks) Arabic names. In the second column, I've had the XA add the Arabic vocalizations. In the third column, I've had the XA convert the Arabic text into transliterated English using the IC standard for transliteration. In the fourth column, I use the BGN standard for transliteration, which produces slightly different English results.
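Why two standards produce different English spellings from the same Arabic text can be shown with a table-driven sketch in Python. The two mapping tables below are hypothetical and exist only for illustration; they are not the IC, BGN or SATTS tables, which cover the full alphabet and include context-dependent rules.

```python
# Arabic letters written as escapes so the code points are explicit.
TA, KAF, RA, YA, QAF = "\u062A", "\u0643", "\u0631", "\u064A", "\u0642"

# Hypothetical schemes that differ only in how two letters are rendered.
SCHEME_A = {TA: "t", KAF: "k", RA: "r", YA: "i", QAF: "q"}
SCHEME_B = {TA: "t", KAF: "k", RA: "r", YA: "y", QAF: "k"}

def transliterate(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

tikrit = TA + KAF + RA + YA + TA  # unvocalized spelling of Tikrit
print(transliterate(tikrit, SCHEME_A))
print(transliterate(tikrit, SCHEME_B))
```

Both outputs lack the short vowels because unvocalized Arabic does not write them, which is why adding the vocalizations first, as in the second column of the spreadsheet, matters before converting to English letters.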
[FIGURE 6 OMITTED]
As can be seen in Figure 6, the "Translate" menu function on the Excel application menu bar is self-contained and provides a variety of different tools and help features. I found the help feature comprehensive and useful as well.
Basis has used the different components of its RLP to build a variety of different tools and utilities. For example, it has created a geospatial mapping tool called GeoScope Map Viewer, shown in Figure 7. GeoScope uses a gazetteer of foreign place names and allows you to search a map using your native language equivalent of foreign language place names. In Figure 7, GeoScope loads a tourist map of Iraq and an Iraq gazetteer compiled from data from the old Ba'athist Iraqi Office of Tourism. All of the map place names are shown in Arabic script. In this example, I typed in a fuzzy search for Tikrit using the English search term "tekreet," and GeoScope placed a crosshair right at the location it labeled Quada' Tikrit.
[FIGURE 7 OMITTED]
Another useful set of utilities is offered as the Arabic Desktop Suite, which contains a Knowledge Center that lets the user search for heads of state using English transliteration. In Figure 8, I do a search on the term Karzai, which returns information telling me that Hamid Karzai is the president of Afghanistan.
[FIGURE 8 OMITTED]
Basis has built a variety of other tools on top of its RLP technology in multiple languages. Those tools include Arabic text editors in the Arabic Desktop Suite, name matchers in Korean, other GeoScope Map Viewer mappings and a number of other utilities and applications. Keep in mind that RLP is a linguistics platform with a diverse set of tools, libraries, scripts and applications that can be used to build any type of linguistic support application or service you might imagine, or to add foreign language support to any application a developer might build.
Doing the hard work
Foreign language tools for KM are essential to building systems that can accurately and completely support text mining either on the Web or within the enterprise. Basis provides a comprehensive set of foreign language tools, starting with Unicode libraries for multiple foreign languages to support internationalization of a developer's existing applications. RLP has a set of basic services, including base linguistic capabilities that use natural language processing techniques to provide highly accurate means of accessing parts of speech, performing indexing, entity extraction, stemming, normalization and other linguistic capabilities essential to text mining.
Basis builds on top of those basic services to deliver a range of advanced services, including language identification, name matching and transliteration. RLP should be thought of as a toolkit or framework that combines those basic and advanced services to allow developers to add a range of foreign language capabilities to existing applications (beyond plain old Unicode internationalization). Moreover, developers can use the tools to build even more advanced linguistic applications, as exemplified by GeoScope Map Viewer, Transliteration Assistant for MS-Office and the Basis Arabic Desktop Suite.
As a software developer myself, I rest easy in the knowledge that I don't have to develop my own Unicode extensions for the applications I write. I don't know if you have ever looked at the Unicode standards for developing internationalization, but that stuff looks really complicated and difficult. I thank my lucky stars that I know about companies like Basis Technology that have done all the hard work and more for me.
A Linguist's Dictionary
* Linguistics
The study of the nature, structure and variation of language, including phonetics, phonology, morphology, syntax, semantics, sociolinguistics and pragmatics.
* Morpheme
The smallest linguistic unit that has semantic meaning.
* Morphological Analysis
A technique developed by Fritz Zwicky (1966, 1969) for exploring all the possible solutions to a multidimensional, non-quantified problem complex. In linguistics, it refers to identification of a word stem from a full word form (see morpheme).
* Natural Language Processing
Natural language processing (NLP) is a convenient description for all attempts to use computers to process natural language.
* N-gram
An N-gram is a subsequence of n letters from a given string after removing all spaces. For example, the 3-grams that can be generated from "good morning" are "goo," "ood," "odm," "dmo," "mor" and so forth.
* POS or parts of speech
Identification of the parts of sentences made up of nouns, verbs, adverbs, adjectives, etc. A POS tagger is a program that identifies and tags text based on different parts of speech.
* Stemming
In linguistics, this is a technique for identifying the main part of a word to which prefixes or suffixes are added.
* Unicode
Unicode, a standard managed by the Unicode Consortium, provides a unique code number for every character, no matter what the platform, no matter what the program, no matter what the language.
Greg Pepus has more than 20 years of experience in advanced knowledge management technology as well as business operations and venture capital. He regularly works in the U.S. intelligence community helping the government develop policy and technology solutions in support of the intelligence analyst community, e-mail firstname.lastname@example.org.
Table 1--Foreign Language KM Tool Companies

Columns: Company, Product, Multiple Languages, Categorization/Clustering/Classification
IBM, Text Miner: * *
Basis Technology, Rosette: *
Inxight, Smart Discovery: * *
Vivisimo, Velocity: * *
Convera, Excalibur: * *
Autonomy, IDOL: * *
Kofax, Ascent: * *

Columns: Company, Named Entity Extraction, Search Indexing, Document Summarization
IBM: * * *
Basis Technology: * *
Inxight: * * *
Vivisimo: *
Convera: *
Autonomy: * * *
Kofax: *