
An automated diagnosis of breast cancer using farthest first clustering and decision tree J48 classifier.


Breast cancer (BC) is the most common invasive cancer in females worldwide. BC is the second leading cause of cancer mortality in women, next to lung cancer [1]. It develops from breast cells, primarily in the milk ducts (ductal carcinoma) or glands (lobular carcinoma), and begins with the formation of a small tumor or mass [2]. A mass with a smooth, well-defined border is non-cancerous (benign); a mass with an irregular border or with spiculations may be cancerous (malignant) [48]. Even though the causes of BC are not fully understood, every woman needs to be aware of her own chances of developing BC in order to be proactive about risk reduction strategies and for better management of the disease. Independent studies have identified a number of factors that either increase or decrease the chances of developing breast cancer [3-15].

Data mining has become a popular technology in current research and for medical domain applications [17]. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Its main goal is to discover new patterns and to interpret them so as to provide meaningful and useful information to users. Data mining is applied to find useful patterns that help in the important tasks of medical diagnosis and treatment. An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. An unsupervised learning approach is employed in this model; usually, the result of unsupervised learning is a new explanation or representation of the observation data, which then leads to improved future responses or decisions. This paper proposes an automated diagnosis of breast cancer as either benign or malignant using the data mining techniques of farthest first clustering and J48 decision tree classification.

The rest of the paper is organized as follows: the study of related research on breast cancer is presented in Section II. Section III gives a brief explanation of the methods used in outlier detection, i.e., farthest first clustering and the J48 classifier. The proposed system is presented in Section IV. The experimental results and data set are described in Section V. Finally, Section VI gives the conclusion of the paper.

II. Related Work:

Li, Peng, and Liu (2013) discussed quasiconformal kernel common locality discriminant analysis for dimensionality reduction and reported a classification accuracy of 97.26%. Zheng, Yoon, and Lam (2014) achieved a classification accuracy of 97.38% with a model based on the K-means algorithm and a support vector machine. Ozsen and Ceylan (2014) studied the performance of an artificial immune system as a data reduction algorithm; to estimate its data reduction performance, it was compared against the fuzzy c-means clustering algorithm. Both data reduction methods were combined with an artificial neural network classifier to obtain the classification results. They achieved a classification accuracy of 97.80% for the artificial immune system and artificial neural network combination, whereas the fuzzy c-means clustering and artificial neural network combination obtained a classification accuracy of 90.04%. Chen (2014) proposed a combined model for breast cancer diagnosis that can work in the absence of labeled training data; hence, this work explains feature selection methods in unsupervised learning models. The model combines clustering and feature selection, and their work indicates that selecting a subset of relevant features, instead of using all the features in the original data set, can improve the interpretability of clustering results. Nguyen, Khosravi, Creighton, and Nahavandi (2015a) obtained a classification accuracy of 97.88% with a medical classification model that integrates wavelet transformation and an interval type-2 fuzzy logic system; these mechanisms are combined in order to handle dimensionality and uncertainty properly. The interval type-2 fuzzy logic system consists of fuzzy c-means clustering based unsupervised learning and genetic algorithm based parameter tuning. These mechanisms have high computational costs, and the wavelet transform serves to reduce them.
Nguyen, Khosravi, Creighton, and Nahavandi (2015b) reached a classification accuracy of 97.40% with a medical classification model based on the fuzzy standard additive model and a genetic algorithm. In this scheme, rule initialization is handled by adaptive vector quantization clustering; a genetic algorithm is used for rule optimization and a gradient descent algorithm for parameter tuning. Lee, Anaraki, Ahn, and An (2015) presented a classification approach based on fuzzy-rough feature selection and multi-tree genetic programming for intention pattern recognition using brain signals. The breast cancer diagnosis approach of this paper differs from the other studies on a number of points. First, the proposed classification approach utilizes the fuzzy-rough instance selection method in order to omit useless or erroneous instances based on fuzzy-rough set concepts. Secondly, a recent search approach, the re-ranking search method, is integrated with consistency-based subset evaluation to reduce the number of wrapper evaluations in feature selection. Finally, an optimal training set is obtained via fuzzy-rough instance selection and consistency-based feature selection, and this set is used to build a classification approach based on the fuzzy-rough nearest neighbor classifier. Rodger (2014a) proposed a statistical model based on fuzzy nearest neighbor, regression, and fuzzy logic for improving energy cost savings in buildings. Rodger (2014b) presented a fuzzy feasibility Bayesian probabilistic estimation model for a supply chain. Pena-Reyes and Sipper (1999) proposed an approach that combined fuzzy systems and evolutionary algorithms and achieved a classification accuracy of 97.80%. Chou, Lee, Shao, and Chen (2004) combined artificial neural networks with multivariate adaptive regression splines and achieved 98.25% classification accuracy.
Karabatak and Ince (2009) obtained a classification accuracy of 97.4% with an association rule for dimension reduction and a neural network for classification. Shiv Shakti Shrivastava (2013) [12] focused on the classification of breast cancer data using data mining techniques and software. Large amounts of medical records are stored in databases, and data mining is a relatively new field of research whose major objective is to acquire knowledge from such data. Decision trees are powerful classification algorithms that have become more popular with the growth of data mining in the field of information systems. Deepshree A. Vadeyar (2014) [2] observed that a website can be designed easily, but efficient user navigation is not an easy task: user behavior keeps changing, and the developer's view is quite different from what the user wants. One way to improve navigation is reorganization of the website structure. For reorganization, the proposed strategy is the farthest first traversal clustering algorithm, applied to two numeric parameters, with the Apriori algorithm used for finding frequent traversal paths of users, so that reorganization requires fewer changes in the website structure. E. Venkatesan (2015) [13] compared four decision tree algorithms on prediction accuracy for breast cancer data. Classification assigns elements to a predefined set of classes according to their features. Breast cancer classification accuracy can be increased by pruning, which helps reduce the cost of medicine for patients and aids in improving clinical studies and analyzing results. They concluded that, among the four algorithms, J48 shows the better performance.


III. Methods:

In this section, farthest first clustering is used to cluster the data into a number of clusters. Small clusters are declared outliers. The remaining outliers are detected using LOF: if the LOF value of a point is greater than a threshold, the point is declared an outlier. Finally, the data set is classified as either benign or malignant using the decision tree J48 algorithm.

A. Classification Technique:

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. Usually, classification is a preliminary data analysis step that examines a set of cases to see if they can be grouped based on "similarity" to each other. The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data. Given a classification and a partial observation, one can always use the classification to make a statistical estimate of the unobserved attribute values, and as a departure point for constructing new models based on the user's domain knowledge.

B. Decision Tree Classifier: J48:

J48 is the Weka implementation of Quinlan's C4.5 algorithm and generates a pruned C4.5 decision tree. At each node, the data are split into smaller subsets based on a decision [4]. J48 examines the normalized information gain that results from splitting the data on each attribute; the attribute with the highest normalized information gain is used to make the decision. The algorithm then recurses on the smaller subsets, and splitting stops when all instances in a subset belong to the same class. J48 constructs a decision node using the expected values of the class. The J48 decision tree can handle categorical attributes, missing attribute values, and differing attribute costs. Its accuracy can be increased by pruning.

The algorithm proceeds as follows:

Step 1: If all the instances belong to the same class, the leaf is labelled with that class.

Step 2: For every attribute, the potential information is calculated, and the information gain from a test on the attribute is computed.

Step 3: Finally, the best attribute is selected based on the current selection criterion.
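The normalized information gain of Step 2 can be sketched as follows. This is a minimal illustration of the C4.5 gain-ratio measure for a discrete attribute, not Weka's actual J48 code:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Information gain from splitting `labels` on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5's normalized information gain: gain divided by split entropy."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0
```

At each node, J48 selects the attribute with the highest gain ratio: a perfectly separating attribute scores 1.0, while an uninformative one scores 0.0.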

C. Clustering:

Clustering is the technique of partitioning data objects into groups based on similar properties. A cluster is a collection of data objects having similar properties: data objects within the same cluster have a strong association with each other and a weak association with the data objects of other clusters. Clustering is unsupervised learning, so no labeled training samples are available to partition the data. In this paper we focus on the farthest first clustering technique.

Farthest First Algorithm:

The farthest first algorithm follows the same general procedure as k-means: it also chooses centroids and allocates the objects to clusters [2]. The initial seed is the value at the largest distance from the mean of the values, and each further centroid is the point at maximum distance from the centroids already chosen. The cluster assignment differs from k-means in that the clusters are produced in order, so cluster-0 typically collects more objects (e.g., links with a high session count in [2]) than cluster-1, and so on.

next centroid = arg max_{P_i} min{dist(P_i, C_1), dist(P_i, C_2), ...}

The algorithm takes a point P_i as a centroid and then chooses as the next centroid the point at maximum distance from the centroids already chosen, where p_1, p_2, ..., p_n are the points or objects of the dataset belonging to a cluster. Farthest first actually solves the k-center problem and is very efficient for large data sets. Unlike k-means, the farthest first algorithm does not compute a mean to find a centroid: the first centroid is chosen arbitrarily, and each subsequent centroid is the point whose distance from the existing centroids is maximum; cluster assignment then follows from the centroids. Performing outlier detection on our dataset then reveals which objects are outliers.
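The max-min selection rule above can be sketched as follows. This is a minimal illustration; Weka's FarthestFirst clusterer differs in details such as seeding:

```python
import numpy as np

def farthest_first_centers(X, k, first=0):
    """Choose k centres by farthest-first traversal: start from one point,
    then repeatedly pick the point whose minimum distance to the centres
    chosen so far is largest (the max-min rule)."""
    X = np.asarray(X, dtype=float)
    centers = [X[first]]
    while len(centers) < k:
        # Distance of every point to its nearest already-chosen centre.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    return np.array(centers)

def assign_clusters(X, centers):
    """Assign each object to its nearest centre."""
    X = np.asarray(X, dtype=float)
    d = np.stack([np.linalg.norm(X - c, axis=1) for c in centers])
    return d.argmin(axis=0)
```

Because each new centre is as far as possible from the existing ones, the traversal gives a 2-approximation to the k-center problem, which is why the method scales well to large data sets.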

D. Local Outlier Factor (LOF):

We use the farthest first method to cluster the data, which helps identify the dense regions as well as the sparse regions. Then we apply the LOF algorithm, first over the safe (dense) region: although we call it the safe region, we still apply LOF over it, because clustering alone cannot identify local outliers, and applying LOF over the entire data yields the most accurate results. The same LOF algorithm is then applied to the sparse region, which is the partition most likely to contain outliers.
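The LOF computation itself can be sketched from its standard definition (k-distance, reachability distance, local reachability density). This is an illustrative NumPy sketch assuming Euclidean distance and distinct points, not the exact procedure of the paper:

```python
import numpy as np

def lof_scores(X, k=3):
    """Local Outlier Factor for every row of X.

    A score near 1 means the point is about as dense as its neighbours;
    scores well above 1 flag local outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # k nearest neighbours of each point (column 0 is the point itself).
    knn = np.argsort(d, axis=1)[:, 1:k + 1]
    # k-distance: distance of each point to its k-th nearest neighbour.
    k_dist = d[np.arange(n), knn[:, -1]]
    # Reachability distance reach(p, o) = max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    # Local reachability density: inverse mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF(p): average lrd of p's neighbours relative to lrd(p).
    return lrd[knn].mean(axis=1) / lrd
```

A point whose score exceeds a chosen threshold is declared an outlier, matching the thresholding step described above.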

E. Pruning:

Pruning is an important step for the result because of outliers. All data sets contain a small subset of instances that are not well defined and differ from the others in their neighbourhood. After the complete creation of the tree, which must classify all the instances in the training set, the tree is pruned. This reduces classification errors caused by over-specialization on the training set and makes the tree more general. When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.

IV. Proposed System:

This paper proposes an automated breast cancer diagnosis using farthest first clustering and the decision tree J48 classifier. The proposed algorithm for automated breast cancer diagnosis is outlined in Table 1.
Table 1: Proposed algorithm for automated breast cancer diagnosis.

Step 1: Construct Clusters:

Cluster the entire dataset into k clusters using the farthest first algorithm.

Step 2: Clusters having fewer points:

If a cluster contains fewer points than the required number of
outliers, radius pruning is skipped for that cluster.

Step 3: Pruning points inside each cluster: Calculate the distance of
each point in a cluster from the centroid of the cluster. If the
distance of a point is less than the radius of the cluster, the point
is pruned.

Step 4: Detecting outlier points:

We use the notion of the local outlier factor (LOF), which captures
the relative degree of isolation of a point. LOF is then calculated
for all the points left unpruned in all the clusters. If a point's
outlier factor is greater than the threshold, it is declared an
outlier; otherwise it is not.

Step 5: Classification

The preprocessed data set is classified as either benign or malignant
using the decision tree J48 classifier.
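Steps 2 and 3 of the algorithm above can be sketched as follows. The paper does not define the cluster radius precisely, so the mean distance to the centroid is used here as an assumption, and the minimum cluster size is a hypothetical parameter:

```python
import numpy as np

def prune_cluster(points, min_size=5):
    """Radius pruning for one cluster (Steps 2 and 3).

    Points closer to the centroid than the cluster radius are treated as
    safe inliers and pruned; the remaining boundary points are kept as
    outlier candidates for the LOF stage. The radius is taken here as the
    mean distance to the centroid -- an assumption, since the paper does
    not define it. Clusters smaller than min_size skip pruning (Step 2)."""
    points = np.asarray(points, dtype=float)
    if len(points) < min_size:        # Step 2: too few points, keep all
        return points
    centroid = points.mean(axis=0)
    dist = np.linalg.norm(points - centroid, axis=1)
    radius = dist.mean()
    return points[dist >= radius]     # Step 3: prune points inside radius
```

Only the surviving boundary points are passed to the LOF computation of Step 4, which is what makes the overall detection efficient.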

V. Experimental Results:

A. Data Set:

The breast cancer data are collected from the UCI machine learning repository. The data set has 699 instances, 2 classes (benign and malignant types of cancer), and 9 integer-valued attributes, as shown in Table 2. All experiments described in this paper were performed using the Weka machine learning environment [3]. Weka is an ensemble of tools for data clustering and classification; WEKA version 3.7.13 was used as the data mining tool to estimate the performance of the breast cancer diagnosis, because the WEKA tool offers a well-defined framework for experimenters and developers to create and analyse various clustering and classification techniques.

After removing outliers from the original data set, the resulting data contains 574 instances and 14 attributes. The detailed description of the data after outlier removal is outlined in Figure 1.


From Table 3, the classification accuracy on the breast cancer data set is 98.4%. The experimental results show that the proposed model achieves better accuracy than existing research on the same data set.


VI. Conclusion:

Today's clinical databases store detailed information about patient diagnoses, lab test results, and details of patient treatments: a virtual gold mine of information for medical researchers. Utilizing data mining techniques on medical treatment data is a virtually unexplored frontier. Studying the extraordinary behavior of outliers helps uncover the valuable knowledge hidden behind them and aids decision makers in improving health care services. In this paper we proposed an automated diagnosis of breast cancer using farthest first clustering and the decision tree J48 classifier. The Wisconsin breast cancer dataset from UCI is used in this research. Our experimental results show the effectiveness of the proposed model: the obtained classification accuracy of 98.4% is a very promising result compared to existing breast cancer diagnosis approaches on the same data set.


[1.] Bo Liu, Yanshan Xiao, Philip S. Yu, Zhifeng Hao and Longbing Cao, 2014. "An Efficient Approach for Outlier Detection with Imperfect Data Labels", IEEE Transactions On Knowledge And Data Engineering, 26(7): 1602-1616.

[2.] Deepshree A. Vadeyar and H.K. Yogish, 2014. "Farthest First Clustering in Links Reorganization", International Journal of Web & Semantic Technology (IJWesT) 5: 3.

[3.] Delshi Howsalya Devi, R. and P. Deepika, 2015. "Performance Comparison of Various Clustering Techniques for Diagnosis of Breast Cancer", IEEE International Conference on Computational Intelligence and Computing Research, pp: 400-404.

[4.] Gaganjot Kaur and Amit Chhabra, 2014. "Improved J48 Classification Algorithm for the Prediction of Diabetes", International Journal of Computer Applications, 98(22).

[5.] Jinshan Tang, Rangaraj M. Rangayyan, Jun Xu, Issam El Naqa and Yongyi Yang, 2009. "Computer-Aided Detection and Diagnosis of Breast Cancer With Mammography: Recent Advances", IEEE Transactions on Information Technology in Biomedicine, 13(2): 236-251.

[6.] Karthikeyan Ganesan, U. Rajendra Acharya, Chua Kuang Chua, Lim Choo Min, K. Thomas Abraham and Kwan-Hoong Ng, 2013. "Computer-Aided Breast Cancer Detection Using Mammograms: A Review", IEEE Reviews in Biomedical Engineering, 6: 77-98.

[7.] Linqi Song, William Hsu and Mihaela van der Schaar, 2015. "Using contextual learning to improve diagnosis accuracy: Application in breast cancer screening," IEEE Journal of Biomedical and Health Informatics, pp: 2168-2194.

[8.] Priyanka Sharma, 2015. "Comparative Analysis of Various Clustering Algorithms Using WEKA", International Research Journal of Engineering and Technology (IRJET), 02: 04.

[9.] Qian Xiao, Lisa B. Signorello, Louise A. Brinton, Sarah S. Cohen, William J. Blot and Charles E. Matthews, 2016. "Sleep duration and breast cancer risk among black and white women", Sleep Medicine, 20: 25-29.

[10.] Ronak Sumbaly, N. Vishnusri and S. Jeyalatha, 2014. "Diagnosis of Breast Cancer using Decision Tree Data Mining Technique", International Journal of Computer Applications, 98(10).

[11.] Shah, 2013. "Comparison of Data Mining Classification Algorithms for Breast Cancer Prediction", IEEE Fourth International Conference on Computing, Communication and Networking Technologies.

[12.] Venkatesan and T. Velmurugan, 2015." Performance Analysis of Decision Tree Algorithms for Breast Cancer Classification", Indian Journal of Science and Technology, 8(29).

[13.] Nguyen, T., A. Khosravi, D. Creighton and S. Nahavandi, 2015b. Classification of healthcare data using genetic fuzzy logic system and wavelets. Expert Systems with Applications, 42: 2184-2197.

[14.] Nahato, K.B., K.N. Harichandran and K. Arputharaj, 2015. Knowledge mining from clinical datasets using rough sets and backpropagation neural network. Computational and Mathematical Methods in Medicine, pp: 1-13.

[15.] Adem Kalinli, Fatih Sarikoc, Hulya Akgun, Figen Ozturk, 2013. Performance comparison of machine learning methods for prognosis of hormone receptor status in breast cancer tissue samples. Elsevier: computer methods and programs in biomedicine, 110: 298-30.

[16.] Chen, A.H. and C. Yang, 2012. The improvement of breast cancer prognosis accuracy from integrated gene expression and clinical data. Expert Systems with Applications, 39: 4785-4795.

[17.] Chen, C.-H., 2014. A hybrid intelligent model of analyzing clinical breast cancer data using clustering techniques with feature selection. Applied Soft Computing, 20: 4-14.

[18.] Cheng, H.D., Juan Shan, Wen Ju, Yanhui Guo and Ling Zhang, 2010. "Automated breast cancer detection and classification using ultrasound images: A survey", Elsevier: Pattern Recognition, 43: 299-317.

[19.] Noel Perez, 2015. "Improving the Mann-Whitney statistical test for feature selection: An approach in breast cancer diagnosis on mammography", Elsevier: Artificial Intelligence in Medicine, 63: 19-31.

[20.] Karabatak, M. and M.C. Ince, 2009. An expert system for detection of breast cancer based on association rules and neural networks. Expert Systems with Applications, 36: 3465-3469.

[21.] Sahan, S., K. Polat, H. Kodaz and S. Gunes, 2007. A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis. Computers in Biology and Medicine, 37: 415-423.

[22.] Mu, T. and A.K. Nandi, 2007. Breast cancer detection from FNA using SVM with different parameter tuning systems and SOM-RBF classifier. Journal of the Franklin Institute, 344: 285-311.

[23.] Akay, M.F., 2009. Support vector machines combined with feature selection for breast diagnosis. Expert Systems with Applications, 36: 3240-3247.

[24.] Hassan, M.R., M.M. Hossain, R.K. Begg, K. Ramamohanarao and Y. Morsi, 2010. Breast-cancer identification using HMM-fuzzy approach. Computers in Biology and Medicine, 40: 240-251.

[25.] Chen, H.-L., B. Yang, J. Liu and D.-Y. Liu, 2011. A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis. Expert Systems with Applications, 38: 9014-9022.

[26.] Marcano-Cedeno, A., J. Quintanilla-Dominguez and D. Andina, 2011. WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Systems with Applications, 38: 9573-9579.

(1) R. Delshi Howsalya Devi and (2) P. Deepika

(1) Assistant Professor, K.L.N College of Engineering, Madurai, India.

(2) M.E Scholar K.L.N College of Engineering, Madurai, India.

Received 27 May 2016; Accepted 28 June 2016; Available 12 July 2016

Address For Correspondence:

R. Delshi Howsalya Devi, Assistant Professor, K.L.N College of Engineering, Madurai, India.
Table 2: Breast cancer data set

Attribute                      Min    Max    Mean     StdDev

Clump Thickness                1      10     4.418    2.816
Uniformity of Cell Size        1      10     3.134    3.051
Uniformity of Cell Shape       1      10     3.207    2.972
Marginal Adhesion              1      10     2.807    2.855
Single Epithelial Cell Size    1      10     3.216    2.214
Bare Nuclei                    1      10     3.545    3.644
Bland Chromatin                1      10     3.438    2.438
Mitoses                        1      10     1.589    1.715
Class                          2 = benign (count 458, weight 458);
                               4 = malignant (count 241, weight 241)

Table 3: Classification result of breast cancer diagnosis

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances            565       98.4321 %
Incorrectly Classified Instances           9         1.5679 %
Kappa statistic                          0.9536
Mean absolute error                      0.0293
Root mean squared error                  0.1236
Relative absolute error                 8.6899 %
Root relative squared error            30.1199 %
Coverage of cases (0.95 level)         98.9547 %
Mean rel. region size (0.95 level)     55.4007 %
Total Number of Instances                 574
Publication: Advances in Natural and Applied Sciences
Date: Jun 30, 2016