
LINGOES: a linguistic ontology management system.

Abstract: The LINGuistic Ontology managEment System (LINGOES) is a framework that enables linguists to take full advantage of Semantic Web technologies. Together with OntoGloss, a text annotation tool, and an RDF database with versioning and querying capabilities, it allows a linguist to mark up any document with classes from one or more ontologies, down to the morpheme level. Textual documents can be in any language as long as they are accessible via a URI (Uniform Resource Identifier). The annotated data can be queried across these languages or used to annotate other documents. Saving the annotated data in an RDF repository with inference, querying and change management capabilities makes annotations in LINGOES accessible to machines and useful to the wider Semantic Web community.

Categories and Subject Descriptors: D.3.2 [Language Classifications]; H.2 [Database Management]; H.2.3 [Languages]

General Terms

Ontology, XML

Keywords: Ontology management system, ontology-based annotation tool

1. Introduction

For linguists, marking up a document is a way of preserving its content. This is more urgent for languages that are in danger of disappearing. Endangered languages can benefit greatly from an ontology-based annotation system. An ontology, as a way of formalizing knowledge, can help linguists resolve incompatibilities between markup data in a multilingual search.

LINGOES provides a framework for linguists to capture knowledge about a specific language and apply it to other languages. Its use of RDF (Resource Description Framework) as the main storage and exchange format ensures that knowledge in the field is saved in a form that is portable to other applications and readable by machines as well as humans. When a linguist annotates a section, paragraph, word or morpheme with concepts in the ontology, he/she is expressing knowledge that is relevant and applicable throughout the field. It is important for linguists to start using RDF instead of plain XML, which carries no semantics beyond what a small group of developers mutually agree on. The semantics inherent in the RDF and OWL constructs ensure that knowledge is transferable to a larger audience and to the Web community as a whole.

LINGOES consists of an annotator (OntoGloss), a Change Management module and an RDF repository. Many text annotators are available, both as open source and as commercial products [4, 9, 10, 13]. What distinguishes a linguistic annotator is that words in linguistics are broken up into morphemes, and OntoGloss is able to annotate the individual morphemes in a word. For example, if xxxabc is composed of xxx with a suffix -abc, a linguist using OntoGloss can annotate each morpheme separately. In the automatic annotation of new documents, when OntoGloss finds yyyabc, it can determine whether it has the same suffix [11] and annotate it with the same class in the ontology. Another contribution of LINGOES, which we are not aware of in any other system, is the management of changes and versioning in the underlying ontologies. Our main contributions are:

* A linguistic ontology-based annotator that annotates a document from the most general level down to the morpheme level

* Change Management that allows versioning in ontologies without rendering affected annotations inaccessible

* Using the linguistic knowledge gathered through annotations by the community to automatically annotate other documents at the morpheme level

* Using ontologies from different sources to mark up documents, so that the knowledge gathered in analyzing one language becomes applicable to other languages.
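
The suffix example from the introduction (xxxabc and yyyabc sharing the suffix -abc) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OntoGloss code: the function name `suggest_annotations`, the dictionary of known suffixes and the class label `ex:SuffixClass` are hypothetical, and real morphological analysis is far more involved than a plain string suffix test.

```python
from typing import Dict, List, Tuple


def suggest_annotations(word: str, known_suffixes: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (morpheme, ontology_class) suggestions for a word.

    known_suffixes maps a previously annotated suffix (without the
    leading hyphen) to the ontology class a linguist assigned to it.
    """
    suggestions: List[Tuple[str, str]] = []
    # Try longer suffixes first so the most specific match wins.
    for suffix in sorted(known_suffixes, key=len, reverse=True):
        # Require a non-empty stem in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            suggestions.append((suffix, known_suffixes[suffix]))
            break
    return suggestions


# Hypothetical knowledge gathered from an earlier annotation of "xxxabc".
known = {"abc": "ex:SuffixClass"}
print(suggest_annotations("yyyabc", known))
print(suggest_annotations("abc", known))
```

The result is only a suggestion: as in the system described here, the linguist would still confirm or change it.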

2. LINGOES System Description

Figure 1 shows the architecture of the LINGOES system. It consists of the following modules:

* OntoGloss. OntoGloss is an annotator for marking up documents with concepts from an ontology. Its user interface is designed for linguists. Its drag-and-drop functionality lets the user browse any textual document and easily annotate it with concepts from the available ontologies. The annotator can also automatically annotate words that have been encountered and annotated before. Each annotated document can be linked to a language code, so that one can extract all material on a particular language.

* RDF Repository. Annotated data is saved in an RDF repository. This repository uses a relational database for fast, scalable response. This module provides tools for exchanging, evolving and querying resource-related knowledge.

* Change Management. This module allows semi-automatic creation of migration rules that are used in migrating older annotations to their new classes in the new ontology. It uses structural comparison and a set of heuristics to compare the old and new versions of the ontology. After an expert verifies the generated rules, they can be applied to the older annotations. The result is the same as annotating the older documents with the new ontology.

* User Interface. This is a web interface for users to browse the annotated documents and run queries on documents in a multilingual environment.

2.1 OntoGloss

An ontology-based annotation tool can be described as a tool that uses pre-defined concepts in an ontology to mark up a document [8]. Figure 2 shows how data is annotated when an ontology like GOLD (General Ontology for Linguistic Description) [3] is used. Two numbers, 2004 on page 1 and 10 on page 2, are annotated as instances of the NumberValue class in GOLD.

OntoGloss has the following features:

* It can use different ontologies to markup documents, paragraphs, sentences, words and morphemes. It is independent of the selected ontology and can accommodate several ontologies at the same time.

* In addition to the selected text, a user can annotate general information like the name of the annotator, date and other information as specified in Dublin Core.

* It annotates documents through drag-and-drop operations.

* It automatically annotates new documents based on the knowledge accumulated incrementally in the system.

* It supports local and remote annotation servers. In the local mode, annotated data is saved locally and is used in annotating documents that are visited for the first time. In the remote, or shared annotation server, mode, a linguist can add his/her annotated data to a server for the community to use. A crawler searches the specified namespace for documents to annotate. Although it is not currently implemented, we envision a security scheme to ensure the integrity of the scientific work.

[FIGURE 1 OMITTED]

Classes in the ontology are color-coded, and annotated text takes the color of the class used to annotate it. This gives the linguist a visual clue to the type of markup. Moving the mouse over an annotated selection shows the type of the annotation. For every annotated page, a set of RDF triples is created and saved in the database. On the next visit to the same document, OntoGloss retrieves all the triples for the page from the database and marks all the annotated sections. As long as the structure of the document does not change dramatically (which is usually the case in linguistics), the same sections are marked as annotated. When a document is browsed for the first time, OntoGloss compares each word with all the annotated text in the database and assigns the same type of annotation to matching words. This serves as an initial suggestion from OntoGloss and can be changed by the linguist if needed. Here is a sample of the OntoGloss output. For brevity, the URI before the # sign is replaced with Page1 and Page2.

[FIGURE 2 OMITTED]
<rdf:Description rdf:about="Page1#2004">
 <rdf:type rdf:resource="GOLD:NumberValue"/>
</rdf:Description>
<rdf:Description rdf:about="Page2#10">
 <rdf:type rdf:resource="GOLD:NumberValue"/>
</rdf:Description>


2.2 RDF Repository

Storing and retrieving ontologies and the data committed to them differs from working with traditional data structures. In addition to a data model for encoding the knowledge and data, an inference engine is needed. We have used Sesame [1], an open source RDF database with inference and querying support that runs on top of a variety of storage systems, including the MySQL relational database. Sesame supports several query languages, including RQL, RDQL and SeRQL. It accepts triples in RDF and N-Triples formats. When backed by an RDBMS, it keeps several tables, including Class, SubClass, Instance, Triple and Resource. When it reads a new triple that is not already in the database, it parses the triple and distributes it among these tables while enforcing referential integrity between them.

As an example, we load the GOLD ontology and then use OntoGloss to annotate two numbers, 2004 and 10, on Web pages 1 and 2 respectively. These two numbers become instances of the NumberValue class in GOLD. Exploring a resource such as Page1#2004 through the Sesame user interface, we see this instance together with all the classes on its path to the root class. The resource Page1#2004 is of type GOLD:NumberValue, GOLD:MorphoSyntacticFeatureValue and so on, as Table 1 shows.
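
The type listing in Table 1 is what one would expect from following subclass links up to the root. The sketch below illustrates that idea only; it is not how Sesame implements inference. The function name `inferred_types` is hypothetical, and the hierarchy is the single chain shown in Table 1, with the simplifying assumption that each class has at most one parent.

```python
from typing import Dict, List

# Subclass chain copied from Table 1. Assumption: one parent per class;
# a real ontology may give a class several parents.
SUBCLASS_OF: Dict[str, str] = {
    "GOLD:NumberValue": "GOLD:MorphoSyntacticFeatureValue",
    "GOLD:MorphoSyntacticFeatureValue": "GOLD:FeatureValue",
    "GOLD:FeatureValue": "GOLD:Abstract",
    "GOLD:Abstract": "GOLD:Entity",
}


def inferred_types(asserted_class: str, subclass_of: Dict[str, str]) -> List[str]:
    """Return the asserted class followed by all of its ancestors."""
    types = [asserted_class]
    current = asserted_class
    # Walk towards the root; stop on a repeated class so that a
    # malformed (cyclic) hierarchy cannot loop forever.
    while current in subclass_of and subclass_of[current] not in types:
        current = subclass_of[current]
        types.append(current)
    return types


for cls in inferred_types("GOLD:NumberValue", SUBCLASS_OF):
    print("Page1#2004", "rdf:type", cls)
```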

We can write a query in SeRQL-S to return all the instances of the NumberValue class annotated in any document. Table 2 shows the result.

Q1. SELECT *
    FROM {subject} rdf:type {object}
    WHERE object like "*NumberValue"

As another example, the following query returns all the classes of which a resource (Page1#2004) is an instance. Table 3 shows the result, which lists NumberValue and its superclasses.

Q2. SELECT *
    FROM {subject} rdf:type {object}
    WHERE subject like "*2004"
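
To make the effect of the two wildcard patterns concrete, the following sketch imitates Q1 and Q2 over a small in-memory list of triples. It does not use Sesame or SeRQL: the `select` helper is hypothetical, the triples are a hand-written subset of the inferred types, and the `like` operator is approximated with case-sensitive glob matching from Python's standard `fnmatch` module, which is an assumption about its behavior.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hand-written subset of the asserted and inferred rdf:type triples.
TRIPLES: List[Triple] = [
    ("Page1#2004", "rdf:type", "GOLD:NumberValue"),
    ("Page1#2004", "rdf:type", "GOLD:MorphoSyntacticFeatureValue"),
    ("Page2#10", "rdf:type", "GOLD:NumberValue"),
    ("Page2#10", "rdf:type", "GOLD:MorphoSyntacticFeatureValue"),
]


def select(
    triples: List[Triple],
    subject_like: Optional[str] = None,
    object_like: Optional[str] = None,
) -> List[Triple]:
    """Keep rdf:type triples whose subject/object match the glob patterns."""
    rows: List[Triple] = []
    for s, p, o in triples:
        if p != "rdf:type":
            continue
        if subject_like is not None and not fnmatchcase(s, subject_like):
            continue
        if object_like is not None and not fnmatchcase(o, object_like):
            continue
        rows.append((s, p, o))
    return rows


q1 = select(TRIPLES, object_like="*NumberValue")  # analogue of Q1
q2 = select(TRIPLES, subject_like="*2004")        # analogue of Q2
print(q1)
print(q2)
```

Note that the pattern "*NumberValue" does not match GOLD:MorphoSyntacticFeatureValue, which is why the Q1 analogue returns only the directly asserted class for each resource.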

2.3 Change Management

The main purpose of an ontology management system is to manage ontology re-use, integration and change. Combining and relating ontologies, ontology versioning, and ontology storage and retrieval are among the main functions of such a system [2]. The number of ontologies has grown considerably in the last few years, although most are still small in size and scope. As new versions are introduced, the problem of managing ontologies and the annotated data committed to different versions becomes immediate. It is compounded by the fact that ontologies can be imported into other ontologies, and the imported ontology might itself be modified.

There are a number of strategies for keeping track of changes between versions [6]. The first is a change log, in the way that Protege [12] keeps a record of each change. One problem with a log is the difficulty of maintaining it in a decentralized environment like the Internet; another arises when more than one developer changes the same ontology. The second strategy is to examine the structural differences between two ontologies and save a mapping between them, as PROMPTDIFF [7] does. The third is to create conceptual relations between concepts in the old ontology and the new one, relating old concepts to new ones through relations such as SubClassOf and EquivalentTo; OntoView [5] uses this method. The fourth is a transformation set, in which the changes between two versions are recorded as a set of transformation operations [6]. This is somewhat like a log, with two major differences: a log records every change while a transformation set records only the end result, and the log is unique while there can be many transformation sets between two versions.

Ontology modifications add significantly to the complexity of document annotation. If a class is deleted or changes its position in the ontology graph, all the instances of that class might become inaccessible unless we make sure to move the instances along with the class (in the case of a change) or to other classes (in the case of a deletion). There are clearly two problems here. One is the problem of changes in the ontology itself. The other is the problem of associating all the instances of a deleted class with another class. An instance might move up the tree to become an instance of a parent or ancestor, or it might move to another sub-tree altogether.

In our approach, every triple in the ontology is associated with a version number, and triples are kept incrementally in the system. In shorthand notation, where the subscript is the version number, we have <Numeral subClassOf Quantifier>_1 in version 1, which is changed to <NumericValue subClassOf MorphoSyntacticFeatureValue>_2 in the new version. Both triples stay in the database and, depending on which version a document is committed to, the corresponding triple is the one that participates in inference and querying. When a class that has instances changes, we need to decide the fate of its instances. We have implemented a tool that takes two versions of an ontology and semi-automatically, with help from an expert, creates a set of rules that, when applied to instances, move them to the appropriate classes of the new version. After some pre-processing, including initial string matching between class names, the system recursively joins the triples in the two versions to obtain the difference between them. Only the subclass (is-a) relations are important here. Is-a relations are transitively closed, so joining the two sets of subClassOf triples from the two ontologies yields the difference between them, including deleted or changed classes. Based on this initial set of differences, the system performs structural tests, such as checking for the same siblings or the same parents, to determine how a class has changed and what rule should apply to its instances. To suggest these rules we use a set of heuristics; for example, if a middle node has been deleted, all its instances should move up to the parent of that node. These are suggestions to the user, who can either accept them or enforce new rules. After each rule is confirmed, the system creates a new triple for each instance (containing its type) and subscripts it with the new version. Therefore, if we used to have <Page1#2004 rdf:type GOLD:Numeral>_1, we would now also have <Page1#2004 rdf:type GOLD:NumericValue>_2. The first triple stays in the repository until all the triples belonging to its version are deleted.
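
The "deleted middle node" heuristic and the versioned migration of instances can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool described above: the function names `suggest_rules` and `migrate`, the `ex:` class names and the single-parent, acyclic hierarchy are all hypothetical simplifications, and the expert confirmation step is reduced to passing in a dictionary of accepted rules.

```python
from typing import Dict, List, Optional, Set, Tuple

# (subject, predicate, object, ontology version)
VersionedTriple = Tuple[str, str, str, int]


def suggest_rules(old_parent: Dict[str, str], new_classes: Set[str]) -> Dict[str, Optional[str]]:
    """Suggest a target class for every class missing from the new version.

    old_parent maps each class to its parent in the old version (assumed
    acyclic, one parent per class). A deleted class is mapped to its
    nearest ancestor that still exists, or None if there is none.
    """
    old_classes = set(old_parent) | set(old_parent.values())
    rules: Dict[str, Optional[str]] = {}
    for cls in sorted(old_classes):
        if cls in new_classes:
            continue
        target = old_parent.get(cls)
        while target is not None and target not in new_classes:
            target = old_parent.get(target)
        rules[cls] = target
    return rules


def migrate(
    instances: List[VersionedTriple],
    rules: Dict[str, Optional[str]],
    new_version: int,
) -> List[VersionedTriple]:
    """Add a new-version rdf:type triple for each instance covered by a rule.

    The old triples are kept, mirroring the incremental storage in the text.
    """
    added = [
        (s, p, rules[o], new_version)
        for s, p, o, _version in instances
        if p == "rdf:type" and rules.get(o) is not None
    ]
    return instances + added


# Hypothetical old hierarchy in which the middle node ex:B is deleted.
print(suggest_rules({"ex:C": "ex:B", "ex:B": "ex:A"}, {"ex:A", "ex:C"}))

# A rule supplied by the expert for the example in the text.
accepted = {"GOLD:Numeral": "GOLD:NumericValue"}
print(migrate([("Page1#2004", "rdf:type", "GOLD:Numeral", 1)], accepted, 2))
```

Keeping the version number on every triple is what lets queries against the old version keep working after migration.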

3. Related Work

Many text annotation tools are available, including Amaya [4] from W3C and KIM [9] from OntoText lab. Amaya provides RDF-based annotation, but it is limited to information about the Author, Type, Creator and Last modified date, or a text that the annotator provides. KIM is another general purpose annotation tool; it uses the KIM Ontology (KIMO) and a knowledge base of important general terms to annotate a document automatically. Although KIM's use of an ontology is similar to that of OntoGloss, the main differences are the ability of OntoGloss to use different ontologies and different versions of the same ontology, and its semi-automatic nature, which is warranted in a scientific field that needs an expert's input (in this case, a linguist's). Both KIM and OntoGloss use Sesame as their main RDF repository, although in our case this is still an area of active research and might change in the future.

OntoAnnotate [10] is another text annotation tool. OntoAnnotate keeps a local copy of the document in its document management system, along with the metadata that annotates the document. In our approach, documents stay where they are and we keep only the annotation triples in LINGOES. The other major difference is that in OntoGloss annotation is done at the morpheme level. No information is available about version control or an evolution mechanism in OntoAnnotate. We could benefit from OntoAnnotate's extraction-based approach to semi-automatic annotation.

MnM [13] trains the system on a set of documents, learning from initial manual annotation and subsequent Information-Extraction methods. The end result is a set of induced rules that can be used to extract information from text. The main difficulty in using MnM in the linguistic field is that, for most endangered languages, there are usually not many documents to learn from.

[FIGURE 3 OMITTED]

4. Implementation

We have implemented the major components of LINGOES, including OntoGloss and the Change Management module, on top of Sesame with a MySQL database back-end. OntoGloss (Figure 3) is implemented with Microsoft Access. Users can annotate documents with OntoGloss and run RQL or SeRQL-S queries on the annotated data using the Sesame Web interface. They can also annotate a document with two different versions of the same ontology and use the Change Management module to access instances of the changed or deleted classes. Currently, the annotated data moves from OntoGloss to Sesame through a text file in N-Triples format. In the future, OntoGloss will access the MySQL database directly in an integrated environment with a Web-based interface that replaces the current implementation of OntoGloss and includes an easier query interface.
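
As an illustration of the N-Triples hand-off, the sketch below writes annotation triples in that line-based format, one statement per line with each IRI in angle brackets and a terminating period. It is a minimal sketch, not the OntoGloss exporter: the `to_ntriples` helper and the `http://example.org/...` IRIs are hypothetical stand-ins for the abbreviated Page1, Page2 and GOLD names used in this paper, and escaping of special characters is not handled.

```python
from typing import Iterable, Tuple

# Standard IRI of the rdf:type property.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_ntriples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) IRI triples, one per line.

    All three positions are assumed to be absolute IRIs with no
    characters that need escaping; literals are not supported.
    """
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


# Hypothetical full IRIs for the two annotations used as examples.
annotations = [
    ("http://example.org/page1#2004", RDF_TYPE, "http://example.org/gold#NumberValue"),
    ("http://example.org/page2#10", RDF_TYPE, "http://example.org/gold#NumberValue"),
]
print(to_ntriples(annotations), end="")
```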

Received 30 Oct. 2004; Reviewed and accepted 30 Jan. 2005

[1] Broekstra, J. Kampman, A. Van Harmelen, F (2002). Sesame: A generic architecture for storing and querying RDF and RDF schema. In: the Proceedings of the 1st International Semantic Web Conference, Sardinia, Italia.

[2] Ding, Y. Fensel, D. Klein, M. Omelayenko, B (2001) Ontology Management: survey, requirements and directions, IST 1999-10132 Ontoknowledge Project, Deliverable 4.

[3] Farrar, S. Langendoen, T (2003). A Linguistic Ontology for the Semantic Web, GLOT International 7(3) 97-100.

[4] Kahan, J. Koivunen, M. Prud'Hommeaux, E. Swick, R (2001). Annotea: An Open RDF Infrastructure for Shared Web Annotations. In: Proceedings of WWW10, Hong Kong.

[5] Klein, M (2004). Change Management for Distributed Ontologies. PhD Thesis, Department of Computer Science, Vrije Universiteit, Amsterdam.

[6] Klein, M. Noy, N (2003). A component-based framework for ontology evolution. Technical Report IR-504, Department of Computer, Vrije Universiteit, Amsterdam.

[7] Noy, N. Musen, M (2002). PromtDiff: A Fixed-Point Algorithm for Comparing Ontology Versions. In: Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Alberta.

[8] OntoWeb Consortium. Ontology-based information exchange for knowledge management and electronic commerce--IST-2000-29243. http://www.ontoweb.org

[9] Popov, B. Kiryakov, A. Ognyanoff, D. Manov, D. Kirilov, A. Goranov, M. (2003). KIM--Semantic Annotation Platform. The 2nd International Semantic Web Conference (ISWC2003). LNAI Vol 2870, 484-499 Springer-Verlag.

[10] Staab, S. Maedche, A. Handschuh, S (2001). An Annotation Framework for the Semantic Web. In: Proc. 1st Int. Workshop on MultiMedia Annotation, Tokyo.

[11] The Linguist's Shoebox. www.sil.org/computing/shoebox

[12] The Protege project. http://protege.stanford.edu

[13] Vargas-Vera, M. Motta, E. Domingue, J. Lanzoni, M. Stutt, A. Ciravegna, F(2002). MnM: Ontology Driven Semi-automatic and Automatic Support for Semantic Markup. In: Proceedings of EKAW.

F. Mostowfi (1) F. Fotouhi (1) S. Lu (1) A. Aristar (2)

(1) Computer Science Department, Wayne State University, Detroit, Michigan, USA. {fmostowfi, Fotouhi, shiyong}@wayne.edu

(2) Department of English, Wayne State University, Detroit, Michigan, USA. aristar@linguistlist.org

Table 1. Class hierarchy for an instance

Subject      Predicate   Object

Page1#2004   rdf:type    GOLD:NumberValue
Page1#2004   rdf:type    GOLD:MorphoSyntacticFeatureValue
Page1#2004   rdf:type    GOLD:FeatureValue
Page1#2004   rdf:type    GOLD:Abstract
Page1#2004   rdf:type    GOLD:Entity

Table 2. Results for Q1

Subject      Predicate   Object

Page2#10     rdf:type    GOLD:NumberValue
Page1#2004   rdf:type    GOLD:NumberValue

Table 3. Results for Q2

Subject      Predicate   Object

Page1#2004   rdf:type    GOLD:NumberValue
Page1#2004   rdf:type    GOLD:MorphoSyntacticFeatureValue
Page1#2004   rdf:type    GOLD:FeatureValue
Page1#2004   rdf:type    GOLD:Abstract
Page1#2004   rdf:type    GOLD:Entity
COPYRIGHT 2005 Digital Information Research Foundation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2005 Gale, Cengage Learning. All rights reserved.

Author:Mostowfi, F.; Fotouhi, F.; Lu, S.; Aristar, A.
Publication:Journal of Digital Information Management
Date:Dec 1, 2005