
Mining semantics from structured medical corpora to build and enrich domain ontologies.

1 INTRODUCTION

Ontologies have been developed to capture the knowledge of a real world domain. "Ontology is defined as a formal and explicit specification of a shared conceptualization of a domain. They provide a shared and common understanding of a particular domain of interest" [22]. A domain ontology defines a vocabulary of concepts and their relationships for a given domain.

The most important prerequisite for the success of semantic Web research is the construction of reliable domain ontologies. For large and complex application domains, this task can be costly, time consuming and error-prone. Efforts have therefore been made to facilitate the ontology engineering process, in particular the construction and enrichment of ontologies from domain texts; it is thus highly advisable to rely, when constructing an ontology, on the documents available in the studied field. Ontology learning has recently become a major research focus whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to produce an ontology for a new domain or to enrich existing ones. It is a research field focused on constructing ontologies automatically (or semi-automatically) from a set of relevant text documents [24].

Several barriers must still be overcome once ontologies are put to practical use. A critical issue is the task of updating and maintaining them; for large and complex application domains this task can be costly, time consuming and error-prone. To reduce cost and time, it is highly advisable to rely, when constructing or updating an ontology, on the documents available in the field, and text-mining tools may be of great help in this task. Ontologies capture the knowledge of a real-world domain and are applied in open and dynamic environments where the domain knowledge evolves and the user needs which they have to answer change. Therefore, ontologies must be regularly updated to take into account:

1. The evolutions of the conceptualized domain knowledge.

2. The change needs for the diverse human actors or software.

3. The integration of new concepts and new properties.

Enrichment aims to guarantee the consistency and coherence of the ontologies, so as to preserve the efficiency of the applications based on them.

Ontology building and enrichment are still time-consuming and complex tasks; they require a high degree of human supervision and remain a bottleneck in the development of semantic Web technology.

We propose the use of text mining techniques, especially mining the domain-specific texts, in order to find concepts which are related to each other. Such pairs of related concepts will help the developer to build or enrich the domain ontology.

The building problem to solve is to obtain all the elements needed to build a domain ontology. First, it is necessary to identify the relevant concepts in the corpus of texts. Second, we try to find relevant or semantic relations between them.

The enrichment problem to solve is to extract relevant concepts from existing documents in a medical domain, arrange them in sub-hierarchies, and detect relations among such concepts. The enrichment behaves like a classification of the extracted terms into the existing ontology, attaching them to its hierarchy.

In this paper we present an extraction, building and enrichment process. As input, our process receives a corpus of documents related to medical reports in the gynaecology domain. As output in the building case, the process delivers a hierarchical ontology composed of a set of concepts and a set of non-hierarchical relations between those concepts. The developed ontology offers a good structuring, organization and modelling of the knowledge of the domain. It represents the particular meanings of terms as they apply to the gynaecology domain, and provides information and knowledge for a better health informatics service. As output in the enrichment case, the process delivers an adapted ontology, i.e. its taxonomy is enriched with new domain-specific concepts extracted from the corpus.

The rest of the paper is organized as follows: Section 2 presents the knowledge engineering field. Section 3 outlines the knowledge extraction approaches. Section 4 focuses on the ontology learning domain; the ontology building methodologies are presented and some related work on ontology enrichment is given. Section 5 describes our proposed process and details its three major phases: knowledge extraction, ontology building and ontology enrichment. Finally, Section 6 concludes and sketches future work.

2 KNOWLEDGE ENGINEERING

Knowledge Engineering finds its origins in the Artificial Intelligence domain. It is increasingly concerned with the problems of knowledge acquisition and modelling, and offers a scientific method to analyze and treat them. It provides concepts, methods and techniques for modelling, formalizing and acquiring knowledge in organizations, for purposes of structuring or management [1].

In the literature, a large number of Knowledge Engineering methods were originally dedicated to developing knowledge-based systems. Among them, we can cite: the SAGACE method [2], the CommonKADS method [3], the REX method [4], and the MASK method [1].

3 KNOWLEDGE EXTRACTION

Knowledge Extraction (KE) is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. KE is the process of discovering and interpreting regularities in data. We distinguish three (3) approaches to extraction:

3.1 Knowledge Extraction from the Data or Data Mining

The most well-known field of data mining is knowledge discovery, also known as knowledge discovery in databases. Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data usable for further discovery [6].

3.2 Knowledge Extraction from the Web or Web Mining

Web Mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. According to analysis targets, web mining can be divided into three different types, which are Web usage mining, Web content mining and Web structure mining [7].

3.3 Knowledge Extraction from the Text or Text Mining

Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. It usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output [8].

4 ONTOLOGIES LEARNING

Ontology learning has recently become a major focus for research whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to produce an ontology for a new domain or to enrich existing ones [24]. In this section, we present the ontology building methodologies and some related work on ontology enrichment.

4.1 Ontology Building Methodologies

When a new ontology is going to be developed, several basic questions arise related to the methodologies, tools and languages to be used in ontology development as reported in [23].

Building a well-developed and usable ontology represents a significant challenge. A range of methods and techniques have been reported in the literature regarding ontology construction methodologies [23]. Mike Uschold's methodology [9], Michael Gruninger and Mark Fox's methodology [10], and Methontology [11] are the most representative. These methodologies have in common that they start from the identification of the ontology's purpose and the need for domain knowledge acquisition. However, having acquired a significant amount of knowledge, Uschold's methodology and Gruninger and Fox's methodology propose coding in a formal language and Methontology proposes expressing the idea as a set of intermediate representations. These representations bridge the gap between how people think about a domain and the languages in which ontologies are formalized. Thus, Methontology enables experts and ontology makers unfamiliar with implementation environments to build ontologies.

Regarding ontology evaluation, Uschold's methodology includes this activity but does not state how to carry it out. Gruninger and Fox propose identifying a set of competency questions. Evaluation in Methontology occurs throughout the ontology development.

For our purpose, we have chosen Methontology for building the application ontology. It enables the construction of ontologies at the knowledge level, and includes the identification of the ontology development process and a life cycle based on evolving prototypes.

Manual ontology construction is costly, time consuming, error-prone and inflexible to change; it is hoped that an automated ontology building process will result in more effective and more efficient construction, and create ontologies that better represent a specific domain [24]. KE has recently become a major focus for research whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to build an ontology for a new domain. Developing appropriate methods to support or automate the ontology building process becomes an increasingly important goal.

Several disciplines have contributed to facilitate and expedite the construction of ontologies, especially natural language processing, data and text mining, clustering and machine learning [13].

For this purpose, we collect a set of medical reports in the gynaecology domain, mine them, and build a new ontology following these phases: specification, conceptualization, formalization, implementation and validation.

4.2 Ontology Enrichment

Ontologies are applied in open and dynamic environments where both the knowledge they model and the user needs which they have to answer evolve. Therefore, they must be regularly updated or risk becoming obsolete.

The process of domain ontology enrichment has two inputs: an existing ontology, which plays the role of background knowledge, and a domain text corpus. The aim of our work is to automatically adapt the given ontology according to a domain-specific corpus. We enrich the hierarchical backbone of the existing ontology, i.e. its taxonomy, with new domain-specific concepts extracted from the corpus [17].

The enrichment process is divided into two main phases: a learning phase, to look for new concepts and relations, and a placement phase, to insert these concepts and relations while keeping the consistency of the ontology and protecting its coherence. This enrichment process is invoked at each evolution need [29], [30].

Many research works have been carried out in this context to guarantee the consistency and coherence of the ontology with regard to the evolution of the data or the domain. There are two main categories of approaches for ontology enrichment: methods based on distributional similarity and the classification of terms into an existing taxonomy on the one hand, and approaches using lexico-syntactic patterns on the other hand. Our enrichment approach belongs to the first category [6], [17], [18], [29].

We propose the use of text mining techniques, especially mining the domain-specific texts, in order to find groups of concepts which are related to each other. We evaluate and update the existing ontology when those concepts are already defined in it, or enrich it, for maintenance purposes, when they are not.

5 PROPOSED PROCESS

The objective of our process is to facilitate ontology engineering from texts in real-world settings through several information extractions. Thus, we had to address (i) the discovery of relevant concepts, (ii) their organization into a taxonomy, and (iii) the non-taxonomic relationships between concepts.

Ontology construction is the process whereby relations are defined between terms extracted from domain documents and terms irrelevant to the domain are filtered out, in order to build a new ontology.

In this paper, we propose an extraction and building process which includes four main phases [31]: the specification, the linguistic study and knowledge extraction, the ontology building and, finally, the enrichment of the developed ontology (Fig. 1).

The proposed process begins with the ontology specification phase, which identifies the knowledge domain and the purpose of the ontology, including the operational goal, the intended users and the scope of the ontology, i.e. the set of terms to be represented and their characteristics. The second phase is the linguistic study and knowledge extraction. It comprises the following steps: corpus pre-processing, extraction of terms, cleaning and filtering, and finally classification. The third phase is the construction of the ontology, with the following steps: conceptualization, formalization, implementation and validation test. The last phase concerns the ontology enrichment, to guarantee the consistency and coherence of the ontology and to follow the evolution of the data and/or the domain.

5.1 Ontology Specification

This phase aims at supplying a clear description of the studied problem and at establishing a requirements specification document. Thus, we need to answer some fundamental questions: Which domain will the ontology cover? For which purpose will the ontology be used?

We describe the ontology to be built through the following five (5) aspects:

a). Knowledge Domain

The study domain concerns the speciality of gynaecology in medicine.

b). Operational Goal

The objective is to structure the terms of the domain and to help health agents in their work.

c). Users

Health agents (doctors, nurses, gynaecologists, midwives, health officers, ...) in hospital maternity services and doctors' offices can use our ontology.

d). Information Sources

It is a question of selecting, among the various sources of knowledge, those which will answer the goal of the study.

We extract new relevant vocabulary terms from medical reports collected from some doctors' offices and hospital maternity services. We also drew terms from some related work on building ontologies [6], [8], [12].

e). Ontology Scope

Among the terms, we quote: patient, consultation, treatment, pregnancy, childbirth, abortion, newborn child, disease, doctor, etc.

5.2 Linguistic Study and Extraction

After the pre-processing of the corpus, the sentences are treated by a morpho-syntactic analyzer to extract the terms. Afterward, some cleaning operations are necessary, such as the removal of stop words, changing upper case characters to lower case and eliminating irrelevant terms and abbreviations. The syntactic analyzer TreeTagger (API) [15] annotates a text with part-of-speech information (kind of word: noun, verb, adjective ...) and lemmatization information. Then, the filtering operation serves to build our dictionary of terms. Next, the classification step distributes the terms into two (2) lists, one for the relations and the other for the candidate concepts, using the lemmatization information. We present in detail below the four (4) steps of the linguistic study and extraction phase.

5.2.1 Corpus Pre-processing

The pre-processing aims to define a strategy to treat missing data. It consists in normalizing the text to obtain coherent results and, as far as possible, correcting human errors with the assistance of linguistic experts. This stage serves to normalize the diverse manners of writing the same word, to correct obvious spelling mistakes or typographic inconsistencies, and to clarify certain lexical information expressed implicitly in the texts. The textual or linguistic analysis of the corpus means systematizing, and making more effective, the search for terms in the texts. We also used a spell-checker [16] to avoid errors in the corpus. Then, the text is divided into a set of sentences to allow the use of the morpho-syntactic analyzer TreeTagger [15].

5.2.2 Extraction of Terms

Term Extraction is the basic task in the ontology learning research. Its objective is to obtain terms, which may be considered as linguistic realizations of domain specific concepts [25].

The extraction of terms then aims at listing all the terms contained in a corpus. To achieve this goal, we use two tools: R.TeMiS, version 10.6 [14] and TreeTagger, version 3.2 [15].

R.TeMiS 10.6 is a tool used to create, process, analyze and extract terms from a corpus of texts. As input, the corpus must be organized into a set of sentences and stored in a file with the .txt extension. The advantage of this tool is that it computes the number of occurrences of each term, which is used as a filter to choose the most relevant terms for building our medical ontology. R.TeMiS also shows the stop words, which facilitates their deletion [14].

TreeTagger 3.2 is a tool for morpho-syntactic labelling and lemmatization. It assigns to each term in the corpus its morpho-syntactic category (noun, verb, adjective, article, proper noun, pronoun, abbreviation, etc.) and gives for each term its lemma. As input, the corpus must be organized into a set of sentences and stored in a file with the .txt extension. TreeTagger is used to classify the extracted terms (concepts/relations) using the annotation and lemmatization information [15].

5.2.3 Cleaning and Filtering

After mining any text, we perform certain cleaning operations, such as removing the stop words, changing the upper case characters to lower case and removing the irrelevant terms and abbreviations.
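As a minimal sketch, the cleaning operations above could look as follows; the stop-word list and the length threshold used to drop short abbreviation-like tokens are illustrative assumptions, not the paper's actual resources.

```python
import re

# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "is", "was"}

def clean_terms(text, min_length=4):
    # Change upper case characters to lower case, keep alphabetic tokens only.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # Remove stop words and short, abbreviation-like terms.
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]

print(clean_terms("The patient was admitted in the maternity service (CHU)."))
# -> ['patient', 'admitted', 'maternity', 'service']
```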

Within the framework of statistical methods, several measures are usually used to select the candidate terms. Among them, we can quote the number of appearances of a term within a corpus, as well as more complex measures such as mutual information and tf-idf, which are used in statistical distribution methods [17]. However, all these techniques only detect new concepts; they do not determine the exact place of an extracted concept in the ontology hierarchy. Methods based on syntactic analysis use the grammatical dependencies of a word or a group of words in a sentence, under the hypothesis that grammatical dependencies reflect semantic dependencies [18]. Other approaches use syntactic patterns [19] to extract concepts and the relations between them.

In our work, we use the number of occurrences of the terms as a first filter. The filtering serves to select the candidate terms suited to our domain ontology. Then, we focus on the semantic similarity between concepts through their properties. The similarity measure compares the number of common properties with regard to the total number of properties. We choose a method where similarity is grounded in set theory; it produces a measure of similarity which is not necessarily metric. The ratio model of Tversky [27] takes into account the number of common properties and the differences between the two concepts, like the Jaccard coefficient:

SimTve(C1, C2) = f(C1 ∩ C2) / (f(C1 ∩ C2) + α·f(C1 − C2) + β·f(C2 − C1)) (1)

Where

C1: set of properties of concept c1.

C2: set of properties of concept c2.

f: an increasing monotonic function, and α ≥ 0, β ≥ 0 are parameters which balance the differences.

The difference C1 − C2 corresponds to the properties possessed by C1 but absent from C2, and C2 − C1 to the properties possessed by C2 but absent from C1.
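A minimal sketch of the ratio model of formula (1), taking f to be set cardinality; the property sets and the weights α = β = 0.5 are illustrative assumptions (α = β = 1 yields the Jaccard coefficient):

```python
def sim_tversky(p1, p2, alpha=0.5, beta=0.5):
    # p1, p2: property sets of the two concepts; f is set cardinality here.
    common = len(p1 & p2)
    denom = common + alpha * len(p1 - p2) + beta * len(p2 - p1)
    return common / denom if denom else 0.0

# Invented property sets for two medical concepts (assumptions for the example).
childbirth = {"date", "place", "assisted_by", "delivery_mode"}
abortion = {"date", "place", "cause"}
print(round(sim_tversky(childbirth, abortion), 3))
# -> 0.571
```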

We define a second filter using the semantic similarity measure of Jiang and Conrath to better select the relevant terms [32]. In order to compute the total semantic distance between any two concepts in the taxonomy, Jiang and Conrath's measure uses the sum of the individual distances between the nodes on the shortest path and the amount of information which the two concepts share in common. The semantic distance between any two concepts C1 and C2 in the taxonomy is given by:

SimJ&C(C1, C2) = IC(C1) + IC(C2) − 2 × IC(C) (2)

Where

IC is the informational content of a concept C. It is defined as:

IC(C) = -log (P(C)) (3)

P(C) denotes the probability of occurrence of instances of concept C. This probability is calculated as frequency(C)/N, where N is the total number of concepts.

C is the smallest concept that subsumes both C1 and C2.
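Formulas (2) and (3) can be sketched as follows; the toy frequencies and concept names are invented for illustration, and the least common subsumer is passed in explicitly rather than computed from a taxonomy:

```python
import math

# Invented concept frequencies (assumptions for the example).
frequency = {"person": 10, "doctor": 4, "midwife": 2}
N = sum(frequency.values())

def ic(c):
    # Formula (3): information content IC(C) = -log(P(C)).
    return -math.log(frequency[c] / N)

def sim_jc(c1, c2, lcs):
    # Formula (2); lcs is the smallest concept subsuming both c1 and c2.
    return ic(c1) + ic(c2) - 2 * ic(lcs)

print(round(sim_jc("doctor", "midwife", "person"), 3))
```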

The values of the partial similarities obtained previously are weighted by coefficients. For example, the coefficient associated with SimTve is given by the following formula:

TveW = e^SimTve (4)

The other coefficients are calculated in the same way. Therefore, the global similarity (GSim) is given by:

GSim(C1, C2) = (TveW × SimTve(C1, C2) + J&CW × SimJ&C(C1, C2)) / (TveW + J&CW) (5)

The filtering operation allows determining the most relevant terms among the significant terms.
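The weighting scheme of formulas (4) and (5) can be sketched as follows; the partial similarity values passed in are placeholders:

```python
import math

def global_sim(sim_tve, sim_jc):
    # Formula (4): each coefficient is the exponential of its partial similarity.
    tve_w = math.exp(sim_tve)
    jc_w = math.exp(sim_jc)
    # Formula (5): weighted average of the partial similarities.
    return (tve_w * sim_tve + jc_w * sim_jc) / (tve_w + jc_w)

# Placeholder partial scores for a pair of candidate concepts.
print(round(global_sim(0.571, 0.8), 3))
```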

5.2.4 Classification

In this step, we classify the terms extracted in the previous step into two categories: the concepts and the relations. For that purpose, we use the information provided by the TreeTagger tool to classify each term according to its morpho-syntactic category. Terms tagged as nouns (including proper nouns) are classified as concepts, and terms tagged as verbs as relations.
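As an illustrative sketch, this classification could be implemented over TreeTagger-style output (token, tag and lemma separated by tabs); the sample tagged lines and the tag prefixes tested are assumptions for the example:

```python
# Invented sample of tagger output: "token <TAB> tag <TAB> lemma" per line.
tagged = """patient\tNN\tpatient
examines\tVVZ\texamine
doctor\tNN\tdoctor"""

concepts, relations = [], []
for line in tagged.splitlines():
    token, tag, lemma = line.split("\t")
    if tag.startswith("N"):      # nouns and proper nouns -> candidate concepts
        concepts.append(lemma)
    elif tag.startswith("V"):    # verbs -> candidate relations
        relations.append(lemma)

print(concepts, relations)
# -> ['patient', 'doctor'] ['examine']
```

Note that the lemma, not the surface form, is kept, so inflected verbs such as "examines" are stored as the relation "examine".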

5.3 Ontology Building

Our medical ontology to be built concerns the gynaecology domain.

5.3.1 Conceptualization

After the acquisition of the majority of the knowledge in the first phase, we must organize and structure it using semi-formal or intermediate representations which are easy to understand and independent of any implementation language. This phase contains several steps: build the glossary of terms; build the concepts hierarchy; build the binary-relations diagram; build the concepts dictionary; build the binary-relations tables; build the attributes tables; build the logical-axioms table; build the instances table; and build the assertions tables.

a) Build the Glossary of Terms

This glossary contains the definition of all terms relating to the domain (concepts, instances, attributes, relations) which will be represented in the domain ontology. It describes all the candidate terms extracted and filtered in the extraction phase, see TABLE 1.

b) Build the Concepts Hierarchy

The concepts hierarchy shows the organization of the ontology concepts in a hierarchical order which expresses the sub-class and super-class relations.

We use the relation Sub-Class-Of between the classes to define their classification: class C1 is a sub-class of class C2 if any instance of C1 is an instance of C2. We follow a top-down development process: we start with a definition of the general concepts of the domain and then continue by specializing the concepts. For example, we can start by creating classes for the general concepts: Consultation, Treatment, Childbirth, Abortion, Disease, Medical-supervision, Pregnancy, Health-structure and Person. Figure 2 shows the hierarchy of the Childbirth concept.

c) Build the Binary Relations Diagram

We build our diagram in two principal steps: initially, we determine the organization of the concepts, then we connect the concepts by relations where necessary.

We represent the binary relations between classes by a diagram in figure 3. In this diagram the classes are represented by rectangles and the relations by arrows (domain towards codomain) labelled by the name of the relation. We enrich this diagram by adding dotted arrows (sub-class towards class) to illustrate the organization between concepts.

d) Build the Concepts Dictionary

The concept dictionary contains some of the domain concepts, instances of such concepts, class and instance attributes of the concepts, relations whose source is the concept and, optionally, concept synonyms and acronyms (TABLE 2).

e) Build the Binary Relations Tables

The binary relations are represented in the form of properties which attach one concept to another. For each relation whose source is in the tree of the concepts classification, we define: its name, the name of the source concept, the name of the target concept, the cardinality and the name of the inverse relation if it exists (TABLE 3).

f) Build the Attributes Tables

The attributes are properties which take their values in predefined types (String, Integer, Boolean, Date...). For each attribute appearing in the concepts dictionary, we specify its name, its type, the interval of its possible values and its cardinality, as presented in TABLE 4.

g) Build the Logical Axioms Table

We define the ontology concepts by using logical expressions which are always true. In this table, we define for each axiom its description in natural language, the name of the concept to which the axiom refers, the attributes used in the axiom and the logical expression. For example, TABLE 5 gives the logical expressions of the concepts Doctor and Abortion respectively.

h) Build the Instances Table

In this table, we describe the instances identified in the concepts dictionary. For each instance, it is necessary to specify the instance name, the name of the concept it belongs to, the attributes and their values.

i) Build the Assertions Tables

We will present a description of some ontology instances, for that, we will specify the instances' names and values of the attributes for each one of them.

5.3.2 Formalization

In this phase, we use the DL (Description Logic) formalism [20] to formalize the conceptual model obtained in the conceptualization phase.

DL forms a family of knowledge representation languages; it represents knowledge relating to a specific area using "descriptions", which can be concepts, relations and instances. The subsumption relation allows organizing concepts and relations into hierarchies; classification and instantiation are the basic operations of reasoning in description logic, or terminological reasoning. Classification permits determining the position of a concept or a relation in its respective hierarchy.

The DL consists of two parts: a terminological language, the TBOX, in which we define concepts and relations, and an assertion language, the ABOX, in which we introduce the instances.

-TBOX construction: We define here the concepts and relations relating to our domain, by using the constructors provided by description logic to give structured descriptions of concepts and relations; for example, a childbirth is the birth of a newborn child, it is assisted by a midwife and it can be by vaginal way or by caesarean.

-ABOX construction: The assertion language is dedicated to the description of facts, by specifying the instances (with their classes) and the relations between them.
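As an illustrative sketch (the concept and role names below are our own, not necessarily those of the application ontology), the childbirth example could be written as DL axioms:

TBOX: Childbirth ⊑ ∃birthOf.NewbornChild ⊓ ∃assistedBy.Midwife ⊓ (VaginalChildbirth ⊔ CaesareanChildbirth)

ABOX: Childbirth(childbirth_01), Midwife(amina), assistedBy(childbirth_01, amina)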

5.3.3 Implementation

The implementation deals with building a computable model. The effort is concentrated on the suitability of OWL DL [20]. For checking, we use the inference services provided by systems such as RACER [21], which works well with large ontologies. RACER makes it possible to read an OWL file and to convert it into a DL knowledge base, and it also provides inference services. We use it to manipulate the application ontology, together with PROTEGE OWL [21], which offers a user-friendly graphical interface. Additionally, PROTEGE OWL provides facilities to impose constraints on concepts and relations.

To evaluate correctness and completeness of domain ontology, we use query and visualization provided by PROTEGE OWL.

We use the built-in query engine for simple query searches and query plug-in to create more sophisticated searches. We also use visualization plug-ins to browse the application ontology and ensure its consistency.

5.3.4 Validation Test

The goal of the validation phase is to test and validate the ontology, as well as its environment and its documentation. The problems of coherence, correctness and completeness are then verified using the RACER inference engine.

Once the constructed ontology is validated, it is ready to be invoked by users' requests using nRQL formal language [21].

For example

Request in natural language:

Who is the doctor who examined the patient Majri_Sonia?

Request in nRQL language :

<Doctor, Examine, patient_Majri_Sonia>

(?x|Doctor‖patient_Majri_Sonia‖Examine|).

RACER system treats the request and gives the following answer:

(((?x |Doctor_Abdelli_Riadh|))).

5.4 Ontology Enrichment

The ontology enrichment tries to find new concepts and relations and to place them in the ontology [18].

The arrival of a new corpus of texts expresses the need for enrichment to maintain our domain ontology. The extraction process is invoked for text mining the domain-specific corpus in order to find concepts which are related to each other. We evaluate and update the existing ontology in the case where those concepts are already defined in the ontology, or enrich it, for maintenance purposes, in the case where they are not [5].

The enrichment phase is divided into two main tasks: learning, to look for new concepts and relations, and placement of these concepts and relations, to keep the consistency of the ontology and protect its coherence. The enrichment phase is invoked at each evolution need.

5.4.1 Retro-engineering Module

Retro-engineering, or reverse design, is the activity which consists in studying an object to determine its internal functioning or its design method. The objective of retro-engineering here is to understand the internal structure of the ontology by using tools that map the OWL code [21] to the ontology hierarchy. The ontology structure permits checking the existence of the new concepts and of the relations between them, in order to locate them in the ontology hierarchy. For this purpose, we rely on the Protege editor for reading and representing the elements of the ontology and their properties, while the Java API ensures the translation from the RDF language to the Java language [21].

5.4.2 Ontology Enrichment

This step uses the results obtained by the other modules to select the candidate concepts and the relations between them, and to check their existence in order to place them correctly in the ontology hierarchy.

a) Concepts Selection

After the extraction task, we obtain as a result a set of significant, lemmatized and labelled terms. The selection activity relies on the frequency computation over the inverted index: the selected terms are the most frequent ones (those having the highest number of appearances).

Example:

D1: doctor (works for) Hospital.

D2: doctor (works in) Service.

D3: doctor (works in) Laboratory.

Inverted index: "doctor" appears in D1, D2 and D3 (frequency 3); "works in" appears in D2 and D3 (frequency 2); "works for" appears in D1 (frequency 1). Hence: doctor {works in (2), works for (1)}.

b) Check of the Elements Existence

To check the existence of the concepts in the ontology, we focus on the semantic similarity between concepts and their properties. The similarity measure serves to compare the number of common properties with regard to the total number of properties. We propose a similarity model which computes the similarity between ontology entities and extracted elements by combining various measurement strategies. The first strategy computes the similarity between concepts using the Jaro-Winkler distance [26]. The second strategy focuses on the ratio model of Tversky [27], which takes into account the number of common properties and the differences between the two concepts, like the Jaccard coefficient. The third strategy concentrates on the acquisition of similarity based on Jiang and Conrath's taxonomy measure [32].

The Jaro-Winkler distance is a variant of the Jaro distance metric, a measure of similarity between two strings.

The Jaro metric for two strings s1 and s2 is:

Dj(s1, s2) = 1/3 (c/|s1| + c/|s2| + (c - t)/c) (6)

Where

- |s1| and |s2| are the string lengths;

- c is the number of common characters of the two strings: characters s1[i] and s2[j] are common if s1[i] = s2[j] and |i - j| <= floor(max(|s1|, |s2|)/2) - 1;

- t is the number of transpositions: half the number of common characters that appear in a different order in the two strings.

The Jaro-Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning over a common prefix of length l. Given two strings s1 and s2, their Jaro-Winkler distance Dw is:

Dw(s1, s2) = Dj + l * p * (1 - Dj) (7)

Where:

- Dj is the Jaro distance between s1 and s2;

- l is the length of the common prefix at the start of the strings, up to a maximum of 4 characters;

- p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25, otherwise the distance can exceed 1. The standard value for this constant in Winkler's work is p = 0.1.
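Equations (6) and (7) translate directly into code. The following sketch (function names are ours) can be used to compare concept labels:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro distance, equation (6): c common characters, t transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters are common if equal and no farther apart than the window.
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    c = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # t is half the number of common characters that are out of order.
    k = t = 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (c / len(s1) + c / len(s2) + (c - t) / c) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler distance, equation (7), with prefix length capped at 4."""
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

The classic "MARTHA"/"MARHTA" pair gives c = 6, t = 1, a common prefix of 3 characters and hence a score of about 0.961.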

The global similarity is computed using formula (5).

The result of the retro-engineering module, obtained through the Java API [28], is then used to compute the similarity between the properties of the new concepts and those of the concepts already present in the ontology hierarchy.
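The property-overlap strategy (Tversky's ratio model, which reduces to the Jaccard coefficient when both weights are 1) can be sketched as follows; the property sets used here are hypothetical:

```python
def tversky(a: set, b: set, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky's ratio model over property sets.

    With alpha = beta = 1 this is the Jaccard coefficient:
    |A ∩ B| / (|A ∩ B| + |A - B| + |B - A|).
    """
    if not a and not b:
        return 1.0  # two empty property sets are trivially identical
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))


# Hypothetical property sets: one from a concept already in the ontology,
# one from an extracted candidate concept.
ontology_props = {"name", "speciality", "works_in"}
extracted_props = {"name", "speciality", "examines"}

print(round(tversky(ontology_props, extracted_props), 2))  # 0.5
```

Two common properties out of four distinct ones give a similarity of 0.5; in the full model this score is combined with the label similarity of equation (7) and the taxonomy-based measure.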

c) Placement of the Concepts in the Ontology

This activity is semi-automatic, because human intervention is required to make selections before a new concept is placed. For example, for a source concept X, a target concept Y is selected together with the relation linking them. The insertion of the new terms is carried out with the Jena Java API, which locates the appropriate place for each pair of candidate concepts in the ontology. Several cases can be distinguished:

* The extracted terms are instances of concepts

* The target concept and the source one exist in the ontology but the relation between them does not exist.

* Only one of the two concepts exists in the ontology.

* The two concepts do not exist in the ontology.
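The case analysis above can be sketched as follows. This is a simplified illustration in which plain Python sets stand in for the Jena-managed hierarchy; the instance case is omitted and all names are hypothetical:

```python
def placement_case(source: str, target: str, relation: str,
                   concepts: set, relations: set) -> str:
    """Classify a candidate (source, relation, target) triple against
    the ontology, following the cases listed above. `concepts` holds
    the concept names and `relations` the existing triples."""
    src_in, tgt_in = source in concepts, target in concepts
    if src_in and tgt_in:
        if (source, relation, target) in relations:
            return "already present"
        return "add relation between existing concepts"
    if src_in or tgt_in:
        return "add missing concept, then the relation"
    return "add both concepts and the relation"


# Hypothetical fragment of the ontology hierarchy.
concepts = {"Doctor", "Patient", "Hospital"}
relations = {("Doctor", "examine", "Patient")}

print(placement_case("Doctor", "Hospital", "works for", concepts, relations))
# add relation between existing concepts
```

In the actual system the "add" outcomes trigger the corresponding Jena API calls after the human expert has validated the selected source concept, target concept and relation.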

5.4.3 Consistency Evaluation

A validation test is run on the enriched ontology to ensure its coherence and consistency. In this approach we exploit the tests supplied by the Racer reasoner [21].

6. CONCLUSION

The goal of this research is to extract knowledge by mining a corpus of texts in the medical domain of gynaecology, in order to build a new ontology or to enrich the vocabulary of an existing domain ontology. This article deals with knowledge extraction using a text-mining approach. More precisely, we concentrate on an extraction and construction process comprising four main phases: the specification, the linguistic study and knowledge extraction, the ontology building and, finally, the enrichment of the developed ontology. The aim is to extract relevant concepts from existing documents in a medical domain, arrange them into sub-hierarchies, and detect relations among these concepts. The inverted index and the similarity measure have been used to select the best new concepts and the relations between them, and to place them in a medical-domain ontology.

We also use terminological extraction tools such as R.TeMiS and TreeTagger for the morpho-syntactic labelling, and Protege OWL for the implementation of the ontology.

The proposed process is an end-to-end process. It includes the necessary phases to extract, build and maintain the domain ontology.

Some perspectives can be addressed in future work. First, we need to concentrate on enriching the ontology with new concepts so that it becomes more meaningful and truly reflects the modelled domain. Second, we plan to develop an algorithm based on semantic similarity for the contextual filtering step of the knowledge extraction phase. Third, we will investigate integrating existing resources such as WordNet [24] as a core resource in the extraction phase. Fourth, we will check the consistency and coherence of the enriched ontology against related ontologies (propagation effects). Finally, we want to integrate the developed ontology with other medical-domain ontologies, for example in the endocrinology domain, to offer value-added services.

REFERENCES

[1] J.L.Ermine, <<Management et Ingenierie des Connaissances : Modeles et Methodes>>, Hermes-Lavoisier journees pp 212, hal-00986764, 2008.

[2] P.J. SAGACE, <<Une Representation des Connaissances pour la Supervision de Procedes Continus>>, 10iemes journees internationales, les systemes experts et leurs applications, Avignon, 1990.

[3] J. Breuker, <<CommonKADS Library for Expertise Modelling, Reusable Problem Solving Components>>, IOS Press, Amsterdam, 1994.

[4] P. Malvache, P. Prieur, << Mastering Corporate Experience with the REX method: Management of Industrial and Corporate Memory>>, actes de la conference International Symposium on the Management of Industrial and Corporate Knowledge, Compiegne, 1993.

[5] R. Driouche, F. Sahnoun, S. Aftis "A Text Mining Approach to Enrich Domain Ontology" , Proceedings of the 3rd International Conference on Software Engineering and New Technologies, Hammamet, Tunisia, 2014.

[6] U. Fayyad and G. Piatetsky-Shapiro, <<From Data Mining to Knowledge Discovery in Databases>>, Artificial Intelligence Magazine, Vol 17(3), pp 37-54, 1996.

[7] G. Shrivastava. K. Sharma, V. Kumar" Web Mining: Today and Tomorrow", In Proceedings of 3rd International Conference on Electronics Computer Technology (ICECT), pp.399-403, April 2011.

[8] C. Grouin, A. Rosier, O. Dameron, and P. Zweigenbaum, << Une Procedure d'Anonymisation a Deux Niveaux pour Creer un Corpus de Comptes Rendus Hospitaliers >>. In M. Fieschi, P. Staccini, O. Bouhaddou, and C. Levis, C. Editeurs Risques, technologies de l'information pour les pratiques medicales, chapitre 17, pp 23-34. Springer-Verlag, Nice, France, 2009.

[9] M. Uschold and M. Gruninger, "Ontologies: Principles, Methods and Applications", Knowledge Engineering Review, Vol 11, No. 2, pp 93-155, 1996.

[10] Gruninger M. and Fox M Methodology for the Design and Evaluation of Ontologies. In proceeding of Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, pp 1-10, 1995.

[11] M. Fernandez, A. Gomez-Perez and N. Juristo, "Methontology: from Ontological Art towards Ontological Engineering", In Proceedings of the AAAI97 Spring Symposium Series on Ontological Engineering, Stanford, USA, pp 33-40, 1997.

[12] B. Bachimont, Engagement semantique et engagement ontologique : conception et realisation d'ontologies en ingenierie des connaissances, In J. Charlet, M. Zacklad, G. Kassel & D. Bourigault Editeurs, Ingenierie des connaissances, evolutions recentes et nouveaux defis. Paris: Eyrolles.2000.

[13] C. D. Manning, P. Raghavan, and H. Schutze, An "Introduction to Information Retrieval", Cambridge University Press, 2008.

[14] https://www.projet-plume.org/fiche/rtemis.

[15] http://www.cis.unimuenchen.de/~schmid/tools/TreeTagger/.

[16] http://www.reverso.net/.

[17] P. Velardi P. Fabriani and M. Missikoff, <<Using text processing techniques to automatically enrich a domain ontology >>. In Proceedings of Proceedings of the international conference on Formal Ontology in Information Systems, ACMFOIS, New York, pp 270-284, 2001.

[18] R. Bendaoud, <<Construction et enrichissement d'une ontologie a partir d'un corpus de textes>>, Actes des Rencontres des Jeunes Chercheurs en Recherche d'Information, Lyon, pp 353-358, 2006.

[19] M.A. HEARST, << Automatic acquisition of hyponyms from large text corpora >>, Proceedings of 14th International Conference on Computational Linguistics, Association for Computational Linguistics, USA, Vol 2, pp 539-545, 1992.

[20] F. Baader, D. Calvanese, D. L. McGuiness, D. Nardi, P. F. Patel-Schneider, The description logic handbook:Theory, Implementation and Applications, Cambridge University Press, Cambridge, UK, 2003.

[21] http://protege.stanford.edu/

[22] T. Gruber, "Towards Principles for the Design of Ontologies Used for Knowledge Sharing", International Journal of Human-Computer studies, Vol 43, pp 907-928, 1995.

[23] M. F. Lopez, and A. G. Perez, "Overview and Analysis of Methodologies for Building Ontologies", Knowledge Engineering Review, 17(2), 129-156, 2002.

[24] H. Luong, S. Gauch, and Q. Wang, "Ontology Learning Using WordNet Lexical Expansion and Text Mining", http://dx.doi.org/10.5772/5l14L

[25] J. I. Toledo-Alvarado, A. Guzman-Arenas, G. L. Martinez-Luna "Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools" Journal of Applied Research and Technology, pp 398-404, Vol. 10 No.3, June 2012.

[26] W. E. Winkler, "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage", Proceedings of the Section on Survey Research Methods (American Statistical Association), pp 354-359, 1990.

[27] A. Tversky, "Features of Similarity". Psychological Review, Vol 84(4), pp327-352, 1977.

[28] www.findjar.com/.../jars/org-netbeans-api-java1.4jar.html.

[29] L. Benghezaiel, C. Latiri, M. Benahmed, N. Gouider-Khouja, " Enrichissement d'Ontologie par une Base Generique Minimale de Regles Associatives", Laboratoire de recherche RIADI GDL, ENSI, Campus Universitaire La Manouba, Tunis, 2010.

[30] L. Stojanovic, A. Maedche, B. Motik N. Stojanovic, "User Driven Ontology Evolution Management", 13th International Conference on Knowledge Engineering and Knowledge Management, Ontologies and the Semantic Web, EKAW '02, London, pp 285-300, 2002.

[31] R. Driouche, H. Bensassi, N. Kemcha, " Domain Ontology Building Process Based on Text Mining From Medical Structured Corpus ", Proceedings of the International Conference on Digital Information Processing, Data Mining and Wireless Communications, Dubai, 2015.

[32] J. Jiang, D. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy", Proceedings of ROCLING X, 1997.

Driouche Razika

Lire Laboratory, Computer Science Department, NTIC Faculty, Abdelhamid Mehri University of Constantine, Route Ain El Bey, Algeria.

razika.driouche@gmail.com

TABLE 1: Table of Terms Glossary.

Concept      Description

Patient      A female person characterized by the age, the cycle,
             the parity, etc.

Foetus       The phase of prenatal development. It begins in the
             third month of pregnancy, succeeds the embryo stage
             and ends at birth.

Caesarean    A surgical operation to extract a child from the
             maternal womb by incision of the uterine wall.

...          ...

TABLE 2: Table of Concepts Dictionary.

Concept      Synonym        Acronym   Attribute            Instance   Relation

Childbirth   Delivery       -         Date Type            -          Produce,
                                                                      Assisted by

Foetus       Embryo, Germ   -         Age,                 -          Require,
                                      Cranial-perimeter,              Carried by
                                      Cardiac-activity

...          ...            ...       ...                  ...        ...

TABLE 3: Table of Binary-Relations.

Relation   Source Concept   Source Cardinality   Target Concept   Target Cardinality   Inverse Relation

Examine    Doctor           (1, N)               Patient          (1, N)               Examined by
Assist     Midwife          (1, N)               Childbirth       (1, N)               Assisted by
Realize    Doctor           (1, N)               Consultation     (1, N)               Realized by
...        ...              ...                  ...              ...                  ...

TABLE 4: Attributes-Table.

Attribute          Type     Values range   Cardinality

Name-patient       String   -              (1,1)
Date-childbirth    Date     -              (1,1)
...                ...      ...            ...

TABLE 5: Logical-Axioms Table.

Concept    Description           Logic Expression

Doctor     A doctor works in     ∀(X). Doctor(X) ⊑
           a health-structure,   ∃(Y). Health-structure(Y) ∧ Work(X, Y)
           examines patients,    ∧ ∃(Z). Patient(Z) ∧ Examine(X, Z)
           prescribes            ∧ ∃(W). Treatment(W) ∧ Prescribe(X, W)
           treatment and
           detects diseases

Abortion   Abortion is an        ∀(X). Abortion(X) ⊑
           interruption of       ∃(Y). Pregnancy(Y) ∧ Interrupt(X, Y)
           a pregnancy
COPYRIGHT 2015 The Society of Digital Information and Wireless Communications

Author: Driouche Razika
Publication: International Journal of Digital Information and Wireless Communications
Article Type: Report
Date: Apr 1, 2015