
Finding semantic relationship among associated medical terms.


A huge and growing volume of data is available on the internet in the form of research papers and web documents, and the medical literature in particular continues to grow and specialize. Interest in biomedical research is increasing over time because of frequent changes in the human genome, and the result is information overload in the form of online publications. Finding relevant information in this domain is still very problematic, because most of the data on the internet is poorly structured and amorphous and is therefore hard to process algorithmically. Most of this data is contained in medical and biological journals, which makes this kind of text mining a central problem. This has required researchers to develop increasingly sophisticated information management and retrieval tools. Advanced information retrieval techniques help not only in retrieving the facts stated explicitly in papers but also in finding the implicit, semantic relationships among the medical terms they contain. Useful information to extract from published papers includes gene-gene interactions, gene-disease interactions, disease-medicine interactions, etc. In this paper, we focus on extracting disease-medicine co-occurrence relationships from the text of the literature.

Finding the relationship between diseases and medicines demands laborious examination of several homogeneous and heterogeneous factors, hundreds of times over. The Medline library contains a collection of more than 14 million abstracts and full-text papers that are always available to online users, and it has been growing since 2002 [1] at a pace of more than 10,000 abstracts a week. The textual data that needs to be examined is not well structured, as PubMed results show, so an unstructured biomedical collection has to be analyzed to get the maximum benefit for biomedical applications. Patients and their families are becoming much more proactive in searching the medical literature for information related to their problem, although the content of such journals and their articles is intimidating for beginners to read.

Automatically identifying, from medical records, the relationships between a disease and external factors would be a very valuable contribution to public health, because it would support the process of diagnosis.

Unfortunately, very little effort has been made in this domain to solve this problem effectively and efficiently. Most people face serious problems in finding and extracting useful information for clinical support with the currently available search engines and other tools.

Data mining techniques are used to extract interesting information or patterns that are previously unknown, non-trivial, and potentially useful from large information repositories; this is the spirit of the Knowledge Discovery in Databases (KDD) process. Discovered patterns take various forms, such as classification rules and association rules [15]. Given a text corpus, text mining techniques precede data analysis with several processes such as tokenization, parsing, pattern recognition, and syntactic and semantic analysis. This helps to evaluate results, and new, previously unknown knowledge emerges. Several techniques, such as Natural Language Processing and Neural Networks, are used to analyze text and to form text-based hypotheses.

The aim of information extraction is to process natural language text and to retrieve occurrences of specialized events and object classes, together with the relationships among them, by using syntactic parsing of the text.

In this paper we present a methodology for extracting useful information from Medline papers. We apply data mining techniques to extract useful patterns from a huge corpus of clinical data.

The system tries to identify the relationships of an active disease and to extract the relevant medicine for the patient. For this problem, the domain of text mining offers many techniques, including Information Extraction, Information Retrieval, Categorization, Concept Linkage, and Topic Tracking. The most relevant of these are Information Retrieval and Information Extraction. The major difference between the two, which explains why we adopt Information Extraction, is as follows.

The core objective of the proposed system is to find and extract only those documents that the user is looking for from a huge repository of documents or some other collection of facts and figures. Information retrieval uses keyword matching and statistical techniques that treat text as a bag of words, so no textual analysis or interpretation is required; information extraction, by contrast, can only be applied when NLP is used to analyze sentences and text. Given this difference in processing methodology, information retrieval is not relevant to our domain, which leaves information extraction as the relevant paradigm.

The paper is organized as follows. Section two discusses related work on information extraction in the biomedical domain. The third section defines the system architecture and the role and functions of each component. The fourth section presents experimental results that validate our methodology, and the fifth section presents the conclusion and our plans for future work.


The purpose of the system is to provide a methodology for finding an accurate preventive measure for the disease in question by supplying the most recent and detailed information from medical journal articles, without spending a huge amount of time searching the literature.


The first attempts at text mining of the biomedical literature date back to 1998. Previous work on text mining of medical records has focused on text classification or named entity recognition (e.g., finding keywords to classify the records, or finding the terms for a disease or syndrome). The NegEx system [2] identifies diseases and detects fever patients using records from articles in medical journals. Heinze et al. [3] describe lessons learned from using a Natural Language Processing (NLP) engine to mine transcriptions of dictated clinical records [17-18], and the approach was applied to cancer-related reports [4]. DeGeL [5] is a distributed framework for clinical-guideline specification, retrieval, application, and quality assessment. Several researchers have used ontologies for information extraction and knowledge discovery in the biomedical literature. BioPatentMiner, by Mukherjea et al. [6], is a system that facilitates information retrieval from biomedical patents. It finds medical terms, tracks the relationships among them, and integrates this information from the patents with biomedical ontologies to create a semantic web. Rindflesch et al. [7] proposed a method for investigating how a domain ontology interacts with, and can improve, natural language processing. Leroy and Chen [8] developed a tool that maps terms to synonyms and semantically related concepts. Textpresso [9] is another system; it uses a specially created ontology to combine keywords and concept co-occurrence at the sentence level in the biological literature. BioRAT [10] extracts biological information from full-length papers.


Our paper focuses on developing a framework for extracting relationships between diseases and medicines. The challenge for biomedical text mining is to assert its usefulness both for acquiring information with a quality that approaches (or surpasses) hand-curated data and for reaching the widest coverage for system-wide analysis (e.g., characterizing complex diseases [11]). Some applications in biomedical text mining have mirrored those of text mining at large, such as document classification, data integration, literature-based discovery, and literature analysis (e.g., scientific trends and emerging topics [12]). Others have been more specific to biomedicine, such as biomedical annotation, phenome/phenotype analysis, public health informatics (e.g., news analysis [13], hospital rankings [14]), clinical informatics, and nursing informatics. The most flourishing areas, however, may be loosely defined as those closely linked to systems biology [16] and medical text mining. Depending on the type of analysis, a document may be a paper, an essay, a book, a chapter, a single paragraph, or even a single sentence.

4. System Architecture

4.1 Rules

The rule set consists of specific keywords that help us extract useful, interesting, and related information. Our findings show that whenever we scan a pertinent document for tokenization, we have to look for these specific keywords. By looking for these keywords we can reduce the search space, and using them as a constraint helps keep the data relevant. Other words were introduced as rules for identifying irrelevant content, and some keywords yield both useful and irrelevant information.


In the current system, however, we used only the keywords that yield relevant information. Our findings showed that this reduces the search space by almost 90%. Applying all the constraints, i.e. the rules that reveal relevant information, the rules that reveal irrelevant information, and the keywords that fall in both categories, would give the best results, but the solution would become much more exhaustive. The sets of relevant and irrelevant rules or keywords are shown below.

Relevant keywords/rules: {Conclusion, showed, our findings, experimental results proved, strongly recommended, proposed, responded, prove}

Irrelevant keywords/rules: {comparison, compared, occurred, tested, range of value, report a case, no. of patients, recent literature, conducted, inhibited, frequency, no. of occurrences, symptoms, in the study, periodic information, clinical course}
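The relevant-rule constraint described above can be sketched in Python as a simple sentence filter. This is only an illustration under stated assumptions: the keyword set is copied from the list above, while the function names and the whole-word, case-insensitive matching are our own choices and not the system's actual implementation.

```python
import re

# Keywords copied from the "relevant" rule set above, lower-cased.
RELEVANT_RULES = {
    "conclusion", "showed", "our findings", "experimental results proved",
    "strongly recommended", "proposed", "responded", "prove",
}

def matches_relevant_rule(sentence: str) -> bool:
    """Return True if the sentence contains any relevant keyword or phrase."""
    text = sentence.lower()
    # Word boundaries stop a keyword from matching inside a longer word.
    return any(
        re.search(r"\b" + re.escape(rule) + r"\b", text)
        for rule in RELEVANT_RULES
    )

def filter_sentences(sentences):
    """Keep only the sentences that trigger at least one relevant rule."""
    return [s for s in sentences if matches_relevant_rule(s)]
```

With whole-word matching, a sentence containing "approved" does not trigger the "prove" rule, which is one way to keep the reduced search space from filling up with accidental matches.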

4.2 Tokenization

Tokenization is the process of splitting text into smaller units, i.e. paragraphs, sentences, phrases, and single words. A common delimiter is the space or tab, but a delimiter may also be a string of arbitrary length.

In our case tokenization is done at two levels: at the sentence (full stop) level and at the paragraph level. Sentence-level tokenization is used to find the actual relation of a medical term to a disease, and paragraph-level tokenization shows the context in which the term is used. Compound terms were also split using space-level tokenization to disambiguate the root of the word.
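A minimal Python sketch of the tokenization levels described above follows. The function names and the regular expressions are assumptions made for illustration; in particular, splitting on a full stop followed by whitespace is deliberately naive and would mis-split abbreviations such as "e.g.".

```python
import re

def split_paragraphs(text: str):
    """Paragraph-level tokens: blocks separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph: str):
    """Sentence-level tokens: split after a full stop followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", paragraph) if s.strip()]

def split_words(sentence: str):
    """Space-level tokens, used to break compound terms into single words."""
    return sentence.split()
```

For example, `split_words("codeine phosphate")` breaks the compound term into its two words, while `split_sentences` keeps each disease-medicine statement as one unit.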

4.3 Stopword Removal/ Stemming

Before text analysis, a stop word list is developed for removing semantically insignificant words; such lists vary in size.

For our technique we have a stop word list that includes common words, phrases, and characters. Stopwords are high-frequency terms that are ignored because they give no useful information for our scenario. The most common stopwords in our case are 'a', 'the', 'of', etc.

Stemming, which is closely related to lemmatization, reduces a word to its root. For example, 'reader' and 'reading' are reduced to 'read' so that related terms can be detected as similar. Stemming does not depend on the domain but does depend on the language of the text. Our findings show, however, that stemming affects the semantics of a term.
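The two steps can be illustrated with a toy Python sketch. The stopword set contains only the three examples named above, and the suffix list is hypothetical; this is a crude suffix stripper written for illustration, not an established stemmer such as Porter's and not the one used in the system.

```python
# Only the three stopwords named in the text; a real list is much longer.
STOPWORDS = {"a", "the", "of"}

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "er", "s")

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

Here `naive_stem("reader")` and `naive_stem("reading")` both give "read", but `naive_stem("fever")` gives "fev" and `naive_stem("diabetes")` gives "diabete", which illustrates how stemming can damage the semantics of medical terms.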

4.4 Medical Dictionary

After basic pre-processing of the text, we want to keep only the required medical terms. For this purpose we match all terms against the Oxford Medical Dictionary. This leaves only the medical terms in each sentence, whether they occur once or several times. All non-medical terms are pruned, which filters the text heavily and limits the domain of work.
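As a rough sketch, dictionary matching amounts to a membership test against a term list. The small vocabulary below is a stand-in chosen for illustration; the system itself matches against the Oxford Medical Dictionary, and the function name is an assumption.

```python
# Tiny stand-in vocabulary; not the Oxford Medical Dictionary.
MEDICAL_TERMS = {
    "malaria", "diabetes", "asthma", "metformin", "loperamide", "salbutamol",
}

def keep_medical_terms(tokens, dictionary=MEDICAL_TERMS):
    """Return the tokens found in the dictionary, preserving order and repeats."""
    return [t for t in tokens if t.lower() in dictionary]
```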

4.5 POS Tagger

The medical terms from the previous stage are passed to a POS tagger, which labels each term with its syntactic category, such as noun, verb, adjective, or pronoun. For our task we use the Stanford University NLP POS Tagger, which is an open-source program.

From all the text we consider only four parts of speech: noun, pronoun, verb, and adjective. Nouns are used because each entity in our domain, e.g. dengue or malaria, is tagged as a noun. Pronouns are used because in most paragraphs a term first appears as a named entity, which the tagger labels as a noun, and is referred to by pronouns in the remainder of the paragraph. Each pronoun is resolved by moving backward through the text to find the noun it points to. Verbs show the relationship that links the nouns, and adjectives show the strength of the relation, e.g. severe, low, high. After tagging, the required terms are selected and passed to the next stage for further processing.
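The selection and the backward pronoun lookup can be sketched in Python on tokens that have already been tagged. The sketch assumes Penn Treebank style tags (NN*, PRP*, VB*, JJ*), which is the tag set the Stanford tagger uses for English; it does not call the tagger itself. Replacing a pronoun with the nearest preceding noun is a crude heuristic chosen for illustration, and the function names are assumptions.

```python
# Tag prefixes for noun, pronoun, verb, and adjective (Penn Treebank style).
KEPT_PREFIXES = ("NN", "PRP", "VB", "JJ")

def keep_required_pos(tagged):
    """tagged: list of (word, tag) pairs, as a POS tagger would return."""
    return [(w, t) for w, t in tagged if t.startswith(KEPT_PREFIXES)]

def resolve_pronouns(tagged):
    """Replace each pronoun with the nearest noun that precedes it."""
    resolved, last_noun = [], None
    for word, tag in tagged:
        if tag.startswith("NN"):
            last_noun = word
        if tag.startswith("PRP") and last_noun is not None:
            resolved.append((last_noun, "NN"))
        else:
            resolved.append((word, tag))
    return resolved
```

A pronoun with no earlier noun is left unchanged, since there is nothing to point back to.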

4.6 Association Rule Mining

An association rule (AR) is an implication of the form X → Y, where X and Y are two item sets.

In our case the two item sets are two medical terms that have some relationship. For our task we took the item sets with the maximum association frequency. The three item sets with the highest frequency, in descending order, are presented to the user to act on.
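A frequency-based version of this step can be sketched in Python by counting how often a disease and a medicine occur in the same sentence and keeping the three most frequent pairs. The function name, the set-based inputs, and the per-sentence counting are assumptions made for illustration; support and confidence thresholds, as used in classical association rule mining, are not shown.

```python
from collections import Counter

def top_associations(sentences, diseases, medicines, k=3):
    """Count disease-medicine co-occurrences per sentence; return the k most frequent.

    sentences: list of token lists; diseases, medicines: sets of lower-case terms.
    """
    counts = Counter()
    for tokens in sentences:
        present = {t.lower() for t in tokens}
        for disease in diseases & present:
            for medicine in medicines & present:
                counts[(disease, medicine)] += 1
    return counts.most_common(k)
```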

4.7 Disambiguation

The word sense disambiguation problem is about finding the most probable meaning of a polysemous word. A common approach is to consider the context in which the term is used. Disambiguation can be supervised or unsupervised. The supervised approach is carried out by means of a dictionary or thesaurus. In unsupervised disambiguation the different senses of a word are not known in advance. The Yarowsky model is a well-known unsupervised approach with high accuracy. In our case, for example, terms like 'ill', 'sick', and 'in poor health' are disambiguated and mapped to a single root, 'ill'.
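The synonym example above can be sketched in Python as a lookup table that maps each variant to its root. This is a plain dictionary-based normalization written for illustration, not the Yarowsky algorithm; the table holds only the two variants named in the text, and the function name is an assumption.

```python
import re

# Variants from the example above, each mapped to the root term "ill".
SYNONYM_ROOTS = {"in poor health": "ill", "sick": "ill"}

def normalize_synonyms(text: str, table=SYNONYM_ROOTS) -> str:
    """Replace each known synonym with its root, longest phrases first."""
    result = text.lower()
    for phrase in sorted(table, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, table[phrase], result)
    return result
```

Processing the longest phrases first keeps a multi-word variant from being partly rewritten by a shorter entry.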

4.8 Expert Evaluation

Given the critical nature of the patterns and rules that are generated, the whole output is presented to a human expert, who examines the patterns before they are presented to the user community. The useful, verified patterns constitute our knowledge, or simply the result of our methodology.


Following is the algorithm used:
         Input. Disease, Rules.
         Output. Medicine, Semantic Relationship.

             1. For any disease do
                  Extract paper from Medline.
                  Extract abstract of paper.
             2. Find any rule/keyword from Rules catalogue.
                  For all rules r in catalogue R,
                        where r ∈ R
                  For each r_i (i = 1, ..., n)
             3. Tokenize particular sentence.
             4. Remove all stopwords.
             5. Perform stemming.
             6. From tokenized sentence
                a. Extract sentence having at least one medicine
                   and one disease.
             7. Filtered sentence is passed through POS tagger to
                separate required parts of speech.
             8. Medicine/Disease and their actions are
                extracted semantically.
             9. Medicines are associated and ranked based on
                frequency and superiority.
             10. Multiple synonym actions are replaced.
             11. Identified rules are validated by expert.
             12. Verified rules and semantic relationships are
                 then presented to user.
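A compressed Python sketch of steps 1 to 3, 6, and 9 of the algorithm above is given below. It is a hedged illustration only: the abstracts are passed in as plain strings rather than fetched from Medline, rule matching is a simple substring test, ranking uses frequency alone, and stopword removal, stemming, POS tagging, synonym replacement, and expert validation (steps 4, 5, 7, 10, and 11) are omitted. The function and parameter names are hypothetical.

```python
import re
from collections import Counter

def rank_medicines(abstracts, disease, medicines, rules):
    """Rank medicines that co-occur with `disease` in rule-matching sentences.

    abstracts: list of strings; disease: lower-case term;
    medicines, rules: sets of lower-case terms or phrases.
    """
    counts = Counter()
    for abstract in abstracts:                              # step 1
        for sentence in re.split(r"(?<=\.)\s+", abstract):  # step 3
            lowered = sentence.lower()
            if not any(rule in lowered for rule in rules):  # step 2
                continue
            words = set(re.findall(r"[a-z]+", lowered))
            if disease not in words:                        # step 6a
                continue
            for medicine in medicines & words:              # step 9
                counts[medicine] += 1
    return counts.most_common()
```

The result is a list of (medicine, frequency) pairs in descending order of frequency, which would then go to the expert for validation.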



We applied the technique to 50 papers extracted from Medline; only the abstracts were scanned. Using our keyword extraction rule set and then applying the further analysis steps of the architecture, we found the relevant medicines for each disease based on semantics. We selected common diseases such as malaria, diarrhoea, asthma, blood pressure, diabetes, and hypertension. The superiority of a medicine is based either on its frequency or on the recommendation made by the authors in the paper. The medicines found for the above-mentioned diseases are shown in figure 3.
Diarrhea            Depression   Type 2 Diabetes  Malaria

Codeine             Citalopram   Repaglinide      Atovaquone
Diphenoxylate       Duloxetine   Acarbose         Sulfadoxine
Steroid Budesonide  Serotonin    Metformin        Pyrimethamine
Loperamide                       Repaglinide &    Artemether Lumefantrine
                                                  Atovaquone &

Diarrhea            Diabetes               Asthma                       Hypertension

Codeine Phosphate   Glyburide              Beclomethasone Dipropionate  Propranolol
Diphenoxylate       Intensive insulin      Ipratropium Bromide          Angiotensin
Steroid Budesonide  Enalapril              Aminophylline                Chlorthalidone
Loperamide          Insulin                Isoprenaline                 Nisoldipine
                    Repaglinide & Insulin  Salbutamol



Future work involves expanding the project to find the root cause of a disease and then, by taking the patient's history or condition into account, to recommend a dose accordingly. The idea is to examine the composition of a medicine and, by applying it to the patient's report, to identify whether it suits the patient.


[1.] Qiankun Zhao, Sourav S. Bhowmick, Association Rule Mining: A Survey, Technical Report, Center for Advanced Information Systems (CAIS), Nanyang Technological University, Singapore, 2003.

[2.] W.W. Chapman, J.N. Dowling, and M.M. Wagner, "Fever detection from free-text clinical records for biosurveillance.", Journal of Biomedical Informatics, Volume 37, Issue 2(April 2004), pp.120-127.

[3.] D.T. Heinze, M.L. Morsch, and J. Holbrook, "Mining Free-Text Medical Reports.", Proceedings of the AMIA Annual Symposium, 2002, pp. 254-258.

[4.] B.W. Mamlin, D.T. Heinze, and C.J. McDonald, "Automated Extraction and Normalization of Findings from Cancer-Related Free-Text Radiology Reports.", Proceedings of the AMIA 2003 Annual Symposium 2003, pp. 420-424.

[5.] Y. Shahar, O. Young, E. Shalom, A. Mayaffit, R. Moskovitch, A. Hessing, and M. Galperin, "DEGEL: A Hybrid, multiple-ontology framework for specification and retrieval of clinical guidelines." Proceedings the Ninth Conference on Artificial Intelligence in Medicine Europe (AIME-03), Protaras, Cyprus, 2003, pp. 122-131.

[6.] S. Mukherjea, B. Bamba, P. Kankar, "Information Retrieval and Knowledge Discovery Utilizing a BioMedical Patent Semantic Web." Knowledge and Data Engineering, IEEE Transactions on Volume 17, Issue 8, Aug. 2005, pp. 1099 - 1110

[7.] T.C. Rindflesch, M. Fiszman, "The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.", Journal of Biomedical Informatics 36(6), 2003, pp. 462-477.

[8.] G. Leroy and H. Chen, "Meeting medical terminology needs-the ontology-enhanced Medical Concept Mapper." IEEE Transactions on Information Technology in Biomedicine 5(4), 2001, pp. 261-270.

[9.] H. Muller, E.E. Kenny, and P.W. Sternberg, "Textpresso: An ontology-based information retrieval and extraction system for biological literature", PLoS Biol 2(11), 2004

[10.] D.P. Corney, B.F. Buxton, W.B. Langdon, and D.T. Jones, "BioRAT: Extracting Biological Information from Full-Length Papers.", Bioinformatics, Vol. 20, No. 17, 2004, pp. 3206-3213.

[11.] Hettne K M, Mos M, Bruijn AG., et al. Applied information retrieval and multidiscipl research: new mechanistic hypotheses in complex regional pain syndrome J Biomed D Collab. 2007;2:2

[12.] Rebholz-Schuhman D, Cameron G, Clark D, et al. SYMBIOmatics: synergies in Medi Informatics and Bioinformatics-exploring current scientific literature for emerging topic Bioinformatics. 2007;8 Suppl 1:S18

[13.] Cerrito P. Inside text mining. Text mining provides a powerful diagnosis of hospital q rankings Health Manag Technol. 2004;25:28-3.

[14.] Ananiadou S, Kell D B, Tsujii J. Text mining and its potential applications in systems Trends Biotechnol. 2006;24:571-9.

[15.] D. Magdalene Delighta Angeline, I. Samuel Peter James, "Association Rule Generation Using Apriori Mend Algorithm for Student's Placement", Int. j. emerg. sci. 2006. 2(1): 78-86

[16.] Roberts P M. Mining literature for systems biology Brief Bioinform. 2006;7:399-406.

[17.] Kareem, S., Bajwa, I.S. A Virtual Telehealth Framework: Applications and Technical Considerations. IEEE International Conference on Emerging Technologies 2011 (ICET 2011) NUST Pakistan

[18.] Kareem, S., Bajwa, I.S. Clinical Decision Support System based Virtual Telemedicine In: 3rd International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC 2011) pp:16-21 Hangzhou, China


Muhammad Tanvir Afzal studied Computer Science at Graz University of Technology, Austria and was awarded Ph.D. with distinction in 2010. He received his master's degree in Computer Science from Quaid-i-Azam University, Islamabad, Pakistan and secured Gold Medal in 2004. Currently, he is Assistant Professor and group leader for an active research group "Centre for Distributed and Semantic Computing" at Mohammad Ali Jinnah University, Islamabad, Pakistan, and adjunct professor at Institute for Information Systems and Computer Media at Graz University of Technology, Austria. He worked in software houses, R&D institutes, and universities at various levels. He worked on Context-aware systems for Journal of Universal Computer Science (J. UCS). His research is live in JUCS since 2007. He authored more than 35 publications in international journals and conferences. He is/was serving as editor, reviewer, session-chair for various reputable international journal and conferences. His research areas include personalized services, Digital Libraries, Sentiment analysis, Semantic Web, web/text mining, social web and social network analysis.

Malik Waqar Hussain received the MS(CS)-Network (2011) from Muhammad Ali Jinnah University, Islamabad, Pakistan and the BS(CS) from The University of Azad Jammu & Kashmir, Muzaffarabad, Pakistan in 2006. He has worked as a faculty member in the Department of Computer Sciences and Information Technology, AJKU. His areas of research include Vehicular Ad-Hoc Networks, Data Mining and Semantic Web.

Malik Waqar Hussain, Muhammad Tanvir Afzal, Muhammad Waqas

Centre for Distributed and Semantic Computing (CDSC), Department of Computer Sciences, Muhammad Ali Jinnah University, Islamabad Expressway, Near Kaakpul Kahuta Road, Zone-V, Islamabad (1), (2), (3)
COPYRIGHT 2012 Springfield Publishing Corporation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2012 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Hussain, Malik Waqar; Afzal, Muhammad Tanvir; Waqas, Muhammad
Publication:International Journal of Emerging Sciences
Article Type:Report
Geographic Code:9INDI
Date:Jun 1, 2012