Extraction of malignancy status and validation of pathological classification of breast cancer using machine learning approach.


Breast cancer is one of the most common causes of death in women worldwide, and statistics reveal that India ranks highest in the world in breast cancer deaths. This alarming fact necessitates early detection and treatment of the disease. Development of automated systems to process and analyze regional breast cancer data would help Medical experts to understand the severity and spread of the disease in the patient population. Most Medical reports in India are written in natural language in a highly unstructured and heterogeneous format, and processing natural language text poses many challenges, as it requires handling varied report formats, language styles, and different representations of data. In India, medical data are rarely available for analysis and research. Hence many research works use online medical data such as the Wisconsin Breast Cancer data, SEER data, etc. for their study. Developing an automated system for the medical domain also necessitates frequent and extensive support from Medical experts for meaningful implementation, analysis and inference from the results. The work presented in this paper was done with regional data and in consultation with domain experts.

The corpus for this work is a set of breast cancer pathology reports obtained from a hospital in South India. A list of malignant and benign conditions relating to breast cancer was obtained from standard medical documents, and these conditions were extracted from the natural language text using pattern-matching rules. The Gold standard values to evaluate the extraction process were obtained through manual scrutiny of the reports by the Pathologists. The system compares the extracted data with the Gold standard values to derive the evaluation parameters, namely Precision, Recall, Accuracy, and F-Measure.

In addition to extracting the details required to determine the malignancy status of patients from the natural language text, the system applies a Machine learning approach to validate the Pathological classification in the report. As part of this work, the Expectation-Maximization algorithm and the Simple K-Means algorithm were used to cluster the extracted data. A classification model was built by the system by applying Decision-tree based classification algorithms. The system predicts the presence or absence of the pathological Tumour-Lymph node-Metastasis (pTNM) classification using the classification model built. The association between the malignancy status and the pTNM classification is also analyzed by the system.

The aim of the work in extracting both the malignant and benign conditions in patients and mining the same is to provide an overall pathological summary of the patients' conditions to the Medical experts and also to validate the pTNM classification which was derived by the Pathologist through manual scrutiny of reports.

The paper is organized as follows: Section 2 describes Related Works that process Medical reports applying Information Extraction, and Data Mining tasks on different datasets. Section 3 explains the Materials and Methods used, Section 4 presents the Results Analysis and Interpretation and Section 5 presents the Conclusion.

2. Related Works:

Extraction of disease-related information and mining of the details from breast cancer reports in natural language text have been done in the past. David S Carrell et al. applied NLP techniques to find breast cancer recurrence. Efficient natural language processing on cancer documents requires SNOMED CT codes and a UMLS annotator to recognize the medical terms. Nguyen et al. developed a rule-based cancer stage classification system using GATE. The system also used text to SNOMED CT mapping to identify the medical terms.

Mining of breast cancer data has been performed in many studies. Marafino et al. applied classification to clinical text. Classification of breast cancer data was performed by Chen Y, Abraham et al., Gonzalez Otal R et al., Lavanya D and K. Usha Rani, and Paulin F and A. Santhakumaran. Rajkamalkaur Grewal et al. performed a two-level diagnosis of breast cancer using data mining techniques. Arul Murugan et al. analyzed breast cancer data using Data Mining tools. Ruijuan Hu performed association analysis on medical data, and Murat Karabatak et al. developed an expert system to detect breast cancer based on association rules.

Mining tools have been used in the study of breast cancer data. Zehra K. Senturk and Resul Kara used Rapid Miner for breast cancer diagnosis. Research work applying mining techniques to the Wisconsin Breast Cancer dataset has been done by Megha Rathi and Chetna Gupta; Ronak Sumbaly, N. Vishnusri and S. Jeyalatha; Vikas Chaurasia and Saurabh Pal; and Zehra Karapinar Senturk and Resul Kara. Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M and Razavi AR have used online breast cancer data such as the ICBC registry, and Dursun Delen et al. have worked on SEER data. Ronak Sumbaly, N. Vishnusri and S. Jeyalatha have diagnosed breast cancer using data mining techniques on the Wisconsin dataset. Onur Inan et al. applied Association rules and Principal Component Analysis (PCA) on the Wisconsin Breast Cancer dataset. Tintu P B and R. Paulin worked on detection of breast cancer using the Wisconsin Breast Cancer dataset. Gouda I Salama et al. used three databases in the Wisconsin Breast Cancer dataset and applied different classifiers on them.

The automated system developed as part of this work processes locally obtained highly unstructured textual Breast Cancer Pathological data, extracts the required information to determine the malignancy status of patients, and validates the Pathological classification manually determined by Medical experts, using machine learning approach.


3.1 The Dataset:

The corpus consists of 150 Breast Cancer Pathology reports obtained from a hospital in South India. The domain experts were consulted to ensure understanding of Medical perspectives so that the technical implementation is precise and the results and inferences are relevant for decision-making. The de-identified pathology reports have the following sections: Demographic information (with Report No. and Patient-Id only), Specimen, Clinical, Gross, Micro, and Impression.

The system can operate in batch mode to process multiple reports or process individual reports. Processing bulk data would provide a quick summary of the patient population in terms of criticality of the disease as revealed through malignancy conditions and the pathological classification (pTNM). The workflow of the automated system is shown in Figure 1.


3.2 Pre-processing of the Dataset and its contents:

The system loads the corpus consisting of multiple Pathology reports as a PDF or text file. Preprocessing, information extraction and mining are performed on the pathology reports in the corpus. The first preprocessing step on the dataset is Report Segregation, which splits a single file with multiple reports into individual reports. The Section Segmentation process then extracts the various sections of each report and stores them in a database. The section headings are used in the report segregation and section segmentation tasks. Several preprocessing steps such as white space removal, normalization of measures, standardization of numerals in various forms, handling of abbreviations, handling of variations in spellings of words and handling of typographical errors are performed on the textual content of the reports. This standardizes and homogenizes the corpus for easy extraction.
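The report segregation and section segmentation steps could be sketched as follows. This is only a minimal illustration, not the system's implementation: it assumes that every report starts with a "Report No." line and that each section heading sits at the start of a line followed by a colon. Both layout assumptions, the function names and the sample text are hypothetical.

```python
import re

# Section headings named in Section 3.1; the "Heading:" line layout is an assumption.
SECTION_HEADINGS = ["Specimen", "Clinical", "Gross", "Micro", "Impression"]


def segregate_reports(corpus_text):
    """Split one file holding several reports into individual report strings.

    Assumes (hypothetically) that every report starts with a 'Report No.' line.
    """
    starts = [m.start() for m in re.finditer(r"(?m)^Report No\.", corpus_text)]
    ends = starts[1:] + [len(corpus_text)]
    return [corpus_text[s:e].strip() for s, e in zip(starts, ends)]


def segment_sections(report_text):
    """Return a dict mapping each section heading found to its text."""
    heading_re = re.compile(r"(?m)^(%s)\s*:" % "|".join(SECTION_HEADINGS))
    matches = list(heading_re.finditer(report_text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        body = report_text[m.end():end]
        sections[m.group(1)] = " ".join(body.split())  # collapse white space
    return sections


# Made-up sample text, not taken from the corpus.
sample = (
    "Report No. 1\nSpecimen: Left breast lump\nImpression: Fibrosis\n"
    "Report No. 2\nSpecimen: Right breast\nImpression: Invasive  Ductal Carcinoma\n"
)
reports = segregate_reports(sample)
sections = [segment_sections(r) for r in reports]
```

In this sketch the white space removal is folded into the segmentation step; a real pipeline would probably keep the raw section text as well.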

The important preprocessing tasks performed relating to this work are handling of abbreviations, variations in spelling and typographical errors. For example, Ductal Carcinoma in situ is referred to as DCIS in reporting. The system expands such common medical abbreviations using a standard list of medical abbreviations obtained online. The extraction process is applied to the entire content of the report.
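Abbreviation expansion of this kind might look like the sketch below. Only the DCIS entry comes from the text; the table structure and function name are illustrative assumptions, and a real system would load the full abbreviation list instead of a one-entry dictionary.

```python
import re

# Only DCIS is taken from the text; a real table would hold the full standard list.
ABBREVIATIONS = {
    "DCIS": "Ductal Carcinoma in situ",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace whole-word abbreviations with their expansions."""
    if not table:
        return text
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, table)))
    return pattern.sub(lambda m: table[m.group(1)], text)


expanded = expand_abbreviations("Sections show DCIS, high grade.")
```

The word-boundary anchors keep the rule from firing inside longer tokens that merely contain the abbreviation.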

3.3 Extraction of Malignancy Status:

Extraction of malignant and / or benign conditions from the reports of patients is vital to pathological classification. Pattern-matching rules are used to extract the malignant and benign conditions from the natural language text in the reports. Five malignant conditions and six benign conditions short-listed in consultation with the domain experts are extracted from the text. The five malignant conditions extracted by the system are Ductal Carcinoma in situ, Infiltrating Ductal Carcinoma, Invasive Ductal Carcinoma, Invasive Lobular Carcinoma and Invasive Papillary Carcinoma, which are denoted by M1 to M5. The six benign conditions derived by the system, denoted by B1 to B6, are Fibrosis, Simple cyst, Hyperplasia, Fibro adenoma, Phyllodes tumour, and Fat necrosis and oil cyst. The presence or absence of a malignant condition Mi and a benign condition Bi is indicated by a Boolean value. The malignancy status and the benign status are determined as the logical OR (∨) of the individual conditions, using the following formulas.

M-status = M1 ∨ M2 ∨ M3 ∨ M4 ∨ M5

B-status = B1 ∨ B2 ∨ B3 ∨ B4 ∨ B5 ∨ B6
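As a rough sketch of how the condition flags and the two status values could be derived, the following uses one regular expression per condition. The condition names come from the text, but the regular expressions are illustrative assumptions and not the authors' pattern-matching rules. This naive version also ignores negated mentions such as "no evidence of ...", which a real extractor would have to handle.

```python
import re

# Condition names are from the text; the patterns themselves are assumptions.
MALIGNANT_PATTERNS = {
    "M1": r"ductal\s+carcinoma\s+in\s+situ",
    "M2": r"infiltrating\s+ductal\s+carcinoma",
    "M3": r"invasive\s+ductal\s+carcinoma",
    "M4": r"invasive\s+lobular\s+carcinoma",
    "M5": r"invasive\s+papillary\s+carcinoma",
}
BENIGN_PATTERNS = {
    "B1": r"fibrosis",
    "B2": r"simple\s+cyst",
    "B3": r"hyperplasia",
    "B4": r"fibro\s*-?\s*adenoma",
    "B5": r"phyllodes\s+tumou?r",
    "B6": r"fat\s+necrosis|oil\s+cyst",
}


def extract_flags(text, patterns):
    """Return a Boolean flag per condition: True if its pattern occurs in the text."""
    return {
        key: bool(re.search(pattern, text, re.IGNORECASE))
        for key, pattern in patterns.items()
    }


def status(flags):
    """Logical OR over the condition flags, as 0 or 1."""
    return int(any(flags.values()))


# Made-up report text for illustration.
text = "Impression: Invasive ductal carcinoma with adjacent fibrosis."
m_flags = extract_flags(text, MALIGNANT_PATTERNS)
b_flags = extract_flags(text, BENIGN_PATTERNS)
m_status, b_status = status(m_flags), status(b_flags)
```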

The malignant and benign status values derived manually from the text constitute the Gold standard used to validate the correctness of the extracted details. The system automatically generates a Discrepancy report when there is a mismatch between the Gold standard and the extracted values of M-status and B-status.
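The comparison behind such a Discrepancy report can be sketched as below. The dictionary layout, the report identifiers and the function name are hypothetical; the source does not describe the report's actual format.

```python
def discrepancy_report(gold, extracted):
    """List (report_id, field, gold_value, extracted_value) for every mismatch.

    Both arguments map a report id to {"M-status": 0 or 1, "B-status": 0 or 1}.
    A report missing from `extracted` is reported with an extracted value of None.
    """
    rows = []
    for report_id in sorted(gold):
        for field in ("M-status", "B-status"):
            gold_value = gold[report_id][field]
            extracted_value = extracted.get(report_id, {}).get(field)
            if gold_value != extracted_value:
                rows.append((report_id, field, gold_value, extracted_value))
    return rows


# Made-up values for illustration only.
gold = {
    "R001": {"M-status": 1, "B-status": 0},
    "R002": {"M-status": 0, "B-status": 1},
}
extracted = {
    "R001": {"M-status": 1, "B-status": 0},
    "R002": {"M-status": 1, "B-status": 1},
}
mismatches = discrepancy_report(gold, extracted)
```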

3.4 Validation of Pathological Classification:

The application interfaces with WEKA and provides the extracted details in the database as comma separated values for the tool to perform the mining tasks namely Clustering, Classification, Prediction and Association analysis. Clustering is performed on Patient Status to identify the number of patients affected with breast cancer and those that are not, using M-Status and B-Status. The clustering process also groups the pathology reports as correct or incorrect based on M-Status and the pTNM Classification to validate the correctness of the pathological classification given by the domain expert.
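Handing the extracted details to the mining tool as comma separated values could be done along these lines. The column names and record layout are assumptions made for illustration; the source does not specify the export schema.

```python
import csv
import io

# Hypothetical column names; the actual export schema is not given in the text.
FIELDS = ["report_id", "m_status", "b_status", "ptnm"]


def to_csv(records):
    """Serialize extracted records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records for illustration.
csv_text = to_csv([
    {"report_id": "R001", "m_status": 1, "b_status": 0, "ptnm": 1},
    {"report_id": "R002", "m_status": 0, "b_status": 1, "ptnm": 0},
])
```

Writing the text to a file with a .csv extension gives a file that a CSV-capable mining tool can load; whether the 0/1 columns are then treated as numeric or nominal depends on the tool's import settings.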

Classification is an important component of machine learning algorithms to extract rules and patterns from data that could be used for prediction. Feature selection is an important step in building a classification model as it selects the distinguishing features from a set of features and eliminates the irrelevant ones. The features selected for the Classification task are the M-Status, B-Status and the pTNM classification. A Classification model is built to train the system for the prediction of the presence or absence of the pathological classification in the report. J48, Naive Bayes and Random Forest algorithms are applied for the classification process. The system uses a training set of 100 reports to build the Classification model and a test set of 50 for prediction. The rules used to build the Classification model are given in Table 1.
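To illustrate the train-then-predict flow without depending on WEKA, the sketch below swaps in a much simpler stand-in for J48, Naive Bayes and Random Forest: a lookup table that stores the majority pTNM label for each (M-Status, B-Status) combination seen in training. It is not the algorithm used in the paper, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_lookup(rows):
    """Learn the majority pTNM label for each (m_status, b_status) pair.

    `rows` is an iterable of (m_status, b_status, ptnm) tuples.
    """
    votes = defaultdict(Counter)
    for m_status, b_status, ptnm in rows:
        votes[(m_status, b_status)][ptnm] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def predict(model, m_status, b_status, default=None):
    """Predict pTNM presence (1) or absence (0); `default` for unseen pairs."""
    return model.get((m_status, b_status), default)


# Hypothetical toy data, not the study's 100 training and 50 test reports.
train = [(1, 0, 1)] * 3 + [(1, 1, 1)] * 2 + [(0, 1, 0)] * 2
test = [(1, 1, 1), (0, 1, 0), (1, 0, 1)]

model = train_lookup(train)
accuracy = sum(predict(model, m, b) == y for m, b, y in test) / len(test)
```

With only two binary input features, any classifier can distinguish at most four input combinations, which is why such a small table is enough for the illustration.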

The validity of the pTNM classification is based on the rules in Table 1. The rules can be interpreted as follows.

i. M-status and B-status both being 0 is erroneous and impossible, since a report will have at least one of these conditions after pathological examination by the domain expert.

ii. pTNM classification is 0 if B-status alone is 1.

iii. The pTNM classification is 1 if M-Status is 1.

iv. M-Status and B-Status are not mutually exclusive from the Medical perspective as a patient may have both malignant and benign conditions. Hence B-status can be 1 with a valid pTNM classification.

The system checks the association between the Malignancy status and pTNM in the reports to validate the correctness of the pathological classification.
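The rules in Table 1 can be written as a small check. This is a sketch of the stated rules only, with hypothetical function names; it is not the classifier itself.

```python
def expected_ptnm(m_status, b_status):
    """Expected pTNM flag under Table 1, or None for the impossible (0, 0) case."""
    if m_status == 0 and b_status == 0:
        return None  # rule i: a report must carry at least one condition
    return 1 if m_status == 1 else 0  # rules ii to iv


def is_valid(m_status, b_status, ptnm):
    """True when the reported pTNM flag agrees with Table 1."""
    expected = expected_ptnm(m_status, b_status)
    return expected is not None and ptnm == expected


table_1_rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
validity = [is_valid(m, b, p) for m, b, p in table_1_rows]
```

The four rows reproduce the "Validity of Classification" column of Table 1: the first row is rejected and the other three are accepted.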

4. Result Analysis and Interpretation:

This section presents the results of the extraction and mining tasks performed by the automated system. To evaluate the process that extracts the malignant and benign conditions, the system derives the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) by comparing the Gold standard values and the extracted values. The analysis parameters are derived using the formulas listed below.

Precision = TP / (TP+FP)

Recall = TP / (TP+FN)

Accuracy = (TP+TN)/(TP+FP+FN+TN)

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
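The four formulas translate directly into code. The sketch below is illustrative: the counts passed in are made up, the zero-denominator guards are an added assumption, and the values are returned as fractions, not percentages.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluation_metrics(tp, tn, fp, fn):
    """Return precision, recall, accuracy and F-measure as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure(precision, recall),
    }


# Made-up counts for illustration; not the study's confusion matrix.
metrics = evaluation_metrics(tp=9, tn=5, fp=1, fn=1)

# Consistency check against the DCIS row of Table 2 (precision 97.5%, recall 85.2%).
dcis_f = f_measure(0.975, 0.852)
```

Applied to the DCIS row of Table 2, the F-measure formula gives approximately 0.9094, which agrees with the tabulated value.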

The analysis reports for the extraction of malignant and benign conditions are presented in Table 2 and Table 3.

The precision in extracting the malignant conditions is lower because the type of carcinoma (infiltrating / invasive) may be expressed indirectly in the textual content, which requires a much finer search.

The extraction process achieved 100% precision for benign conditions because these are simple medical terms that are mentioned directly in the reports.

The corpus has 150 reports, and the extraction process eliminated two duplicate reports at the preprocessing stage, leaving 148 reports. Clustering of the patient population using the EM and Simple K-Means algorithms showed that 134 patients had malignancy and 14 were free of breast cancer. Clustering to check the correctness of the reports indicated that 134 reports had a pathological classification and 14 had no classification. The classification algorithms J48, Naive Bayes and Random Forest were applied on the Gold standard data and the extracted data. The classifier model identified 84 reports with pTNM classification and 16 with no pathological classification.

The Random Forest algorithm yielded the best results in predicting the presence or absence of pTNM classification based on the Classification model built on a training set of 100 reports. The prediction of pTNM classification on the test set of 50 yielded 100% accuracy.


The work presented here successfully extracted 5 malignant conditions and 6 benign conditions from pathology reports. The extraction showed an average precision of 98.36% for malignant conditions and 100% for benign conditions. In the future, an exhaustive list of malignant and benign conditions associated with breast cancer could be incorporated into the system. The developed system provides useful automated support by producing the pathological summary of the patient population and validating the manual pathological classification in the natural language reports.

A few areas of enhancement in the system are identified for future work. The Pathological classification is not available in some reports. Handling these exceptions would make the system more robust. The vast and varying terms used in the reports necessitate the use of a Medical Thesaurus to process natural language reports. Also, the system could be tested only on a small dataset of 150 reports, which is a major constraint in deriving all the associations with the mining tool.


The authors would like to thank the Department of Pathology, Christian Medical College and Hospital, Vellore for providing them with the sample data for their study. Our special thanks to Dr. Marie Therese Manipadam, Department of Pathology, CMC, Vellore for sharing her domain expertise for the development of the automated system. We also thank Dr. Gunadalalshitha, Department of Pathology, CMC, Vellore for the manual scrutiny of the pathology reports and providing the Gold standard for evaluation. The authors would like to appreciate and acknowledge Mrs. R. Sreeja for developing the automated system.


[1.] Ahmad, L.G., A.T. Eshlaghy, A. Poorebrahimi, M. Ebrahimi and A.R. Razavi, 2013. Using Three Machine Learning Techniques for Predicting Breast Cancer Recurrence, J Health Med Inform, 4: 2.

[2.] Arul Murugan, S., M. Kannan, 2013. A Novel Approach for Analysis of Breast Cancer and Mental Health using various Data Mining Tools, International Journal of Advanced Research in Computer and Communication Engineering, 2: 7.

[3.] Chen, Y., A. Abraham, B. Yang, 2006. Feature Selection and Classification using Flexible Neural Tree, Journal of Neurocomputing, 70(1-3): 305-313.

[4.] David S. Carrell, Scott Halgrim, Diem-Thy Tran, Diana S.M. Buist, Jessica Chubak, Wendy W. Chapman and Guergana Savova, 2014. Using Natural Language Processing to Improve Efficiency of Manual Chart Abstraction in Research: The Case of Breast Cancer Recurrence, American Journal of Epidemiology Advance Access.

[5.] Dursun Delen, Glenn Walker, Amit Kadam, 2005. Predicting breast cancer survivability: a comparison of three data mining methods, Artificial Intelligence in Medicine, 34: 113-127.

[6.] Edge, S.B., D.R. Byrd, C.C. Compton, et al., 2010. AJCC Cancer Staging Manual, 7th ed. New York, NY: Springer, pp: 347-76.

[7.] Gonzalez Otal, R., J.L. Lopez Guerra, C.L. Parra Calderon, A. Martinez Garcia, V. Suarez Gironzini, J. Peinado Serrano, A. Moreno Conde, M.J. Ortiz Gordillo, 2013. Application of Artificial Intelligence in Tumors Sizing Classification for Breast Cancer, IWBBIO Proceedings Granada, pp: 18-20.

[8.] Gouda I. Salama, M.B. Abdelhalim and Magdy Abd-elghany Zeid, 2012. Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers, International Journal of Computer and Information Technology, (2277 - 0764) 01: 01.

[9.] Lavanya, D., K. Usha Rani, 2011. Analysis of Feature Selection with Classification: Breast Cancer Datasets, Indian Journal of Computer Science and Engineering (IJCSE), 2(5): 756-763.

[10.] Marafino, B.J., J.M. Davies, N.S. Bardach, et al., 2014. N-gram support vector machines for scalable procedure and diagnosis classification, with applications to clinical free text data from the intensive care unit, J Am Med Inform Assoc., 2: 871-875.

[11.] Megha Rathi and Chetna Gupta, 2014. An Approach to predict Breast Cancer and Drug Suggestion using Machine Learning Techniques, ACEEE Int. J. on Information Technology, 4: 1.

[12.] Murat Karabatak, M. Cevdet Ince, 2009. An expert system for detection of breast cancer based on association rules and neural network, Elsevier, Expert Systems with Applications, 36: 3465-3469.

[13.] Nguyen, D.H.M., J.D. Patrick, 2014. Supervised machine learning and active learning in classification of radiology reports, J Am Med Inform Assoc. 21: 893-901.

[14.] Onur Inan, Mustafa Serter Uzer and Nihat Yılmaz, 2013. A New Hybrid Feature Selection Method Based on Association Rules and PCA for Detection of Breast Cancer, International Journal of Innovative Computing, Information and Control, 9(2): 727-739.

[15.] Paulin, F., A. Santhakumaran, 2011. Classification of Breast cancer by comparing Back propagation training algorithms, International Journal on Computer Science and Engineering (IJCSE), 3(1): 327-332.

[16.] Rajkamalkaur Grewal, Babita Pandey, 2014. Two Level Diagnosis of Breast Cancer Using Data Mining, International Journal of Computer Applications (0975-8887) 89: 18.

[17.] Ravi Kumar, G., G.A. Ramachandra, K. Nagamani, 2013. An Efficient Prediction of Breast Cancer Data using Data Mining Techniques, International Journal of Innovations in Engineering and Technology (IJIET), ISSN: 2319-1058, 2(4): 139-144.

[18.] Ronak Sumbaly, N. Vishnusri, S. Jeyalatha, 2014. Diagnosis of Breast Cancer using Decision Tree Data Mining Technique, International Journal of Computer Applications (0975-8887) 98: 10.

[19.] Ruijuan Hu, 2010. Medical Data Mining Based on Association Rules, Computer and Information Science, 3: 4.

[20.] Tintu, P.B., R. Paulin, 2013. Detect Breast Cancer using Fuzzy C means Techniques in Wisconsin Prognostic Breast Cancer (WPBC) Datasets, International Journal of Computer Applications Technology and Research, ISSN: 2319-8656, 2(5): 614-617.

[21.] Vikas Chaurasia, Saurabh Pal, 2014. Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability, International Journal of Computer Science and Mobile Computing, 3(1): 10-22.

[22.] Vikas Chaurasia, Saurabh Pal, 2014. A Novel Approach for Breast Cancer Detection using Data Mining Techniques, International Journal of Innovative Research in Computer and Communication Engineering, 2: 1.

[23.] Zehra Karapinar Senturk, Resul Kara, 2014. Breast Cancer Diagnosis via Data Mining: Performance Analysis of Seven different algorithms, Computer Science & Engineering: An International Journal (CSEIJ), 4: 1.





(1) Johanna Johnsi Rani G, (2) Dennis Gladis, (3) Joy John Mammen

(1) Dept. of Computer Science, Madras Christian College, Tambaram, Chennai 600 059, INDIA

(2) Dept. of Computer Science, Presidency College, Chennai 600 005, INDIA

(3) Dept. of Immunohematology & Transfusion Medicine, Christian Medical College, Vellore 632 004, INDIA

Received 27 May 2016; Accepted 28 June 2016; Available 12 July 2016 Address For Correspondence:

Johanna Johnsi Rani G, Dept. of Computer Science, Madras Christian College, Tambaram, Chennai 600 059, INDIA
Table 1: Rules for Validation of pTNM Classification

S. No.    M-Status    B-Status    pTNM    Validity of Classification

1         0           0           0       Erroneous & impossible
2         0           1           0       Valid
3         1           0           1       Valid
4         1           1           1       Valid

Table 2: Analysis--Extraction of Malignant Conditions

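Malignant Conditions             Precision (%)   Recall (%)   Accuracy (%)   F-measure

Ductal Carcinoma in situ         97.5            85.2         83.7           0.9094
Infiltrating Ductal Carcinoma    97.8            94.4         92.5           0.9607
Invasive Ductal Carcinoma        96.5            98.5         95.2           0.9749
Invasive Lobular Carcinoma       100.0           98.6         98.6           0.9929
Invasive Papillary Carcinoma     100.0           98.6         98.6           0.9929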

Table 3: Analysis--Extraction of Benign Conditions

Benign Conditions                Precision (%)   Recall (%)   Accuracy (%)   F-measure

Fibrosis                         100.0           100.0        100.0          1.0000
Simple cyst                      100.0           100.0        100.0          1.0000
Hyperplasia                      100.0           100.0        100.0          1.0000
Fibro adenoma                    100.0           100.0        100.0          1.0000
Phyllodes tumour                 100.0           100.0        100.0          1.0000
Fat necrosis and oil cyst        100.0           100.0        100.0          1.0000

COPYRIGHT 2016 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2016 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Rani G., Johanna Johnsi; Gladis, Dennis; Mammen, Joy John
Publication:Advances in Natural and Applied Sciences
Date:Jun 30, 2016