Effective diagnosis of heart disease through stacking approach.
Heart disease is the deadliest disease worldwide. The World Health Organization (WHO) has estimated that 12 million deaths occur worldwide every year due to heart diseases. In the 25-69 age group, 25% of deaths are due to heart diseases. Heart ailments account for 32.8% of deaths in urban areas and 22.9% of deaths in rural areas. Mortality attributed to heart disease is reported to be over 80% around the world. According to WHO estimates, heart disease will cause almost 23.6 million deaths by 2030. The diagnosis of diseases is a significant and tedious task. Risk factors related to heart disease include age, sex, blood pressure, cholesterol and fasting blood sugar. Many machine learning techniques are available for the prediction of heart disease, such as artificial neural networks, support vector machines and naive Bayes. Machine learning is a type of artificial intelligence that enables machines to learn from data and to make predictions based on what they have learned. There are two main types of machine learning task: supervised and unsupervised learning.
Ensemble methods use several learning algorithms to obtain better predictive performance than any of the individual learning algorithms alone. An ensemble is itself a supervised learning algorithm, because it is trained and then used to make predictions. There are many types of ensemble, such as bootstrap aggregating (also known as bagging), Bayesian parameter averaging, boosting and stacking. There are two main approaches to model ensembling. The first combines similar classifiers. The second combines different classifiers using model stacking. Model ensembling increases accuracy by combining the predictions of multiple models. The objective of this work is to predict heart disease using machine learning algorithms.
The paper is organized as follows: Section II describes related work, Section III provides an overview of the dataset and methods used, Section IV explains the proposed methodology, Section V presents the experimental results and analysis, and the conclusion is given in Section VI.
A number of approaches have been used to predict the risk of heart disease. Some of them are listed below.
V. Subha et al. [4] discussed SVM and ensemble classifiers such as bagging, boosting and random subspace for the prediction of heart disease. Their analysis of the different algorithms shows that the boosting ensemble classifier performs well in terms of accuracy, sensitivity, specificity, PPV (Positive Predictive Value) and NPV (Negative Predictive Value).
Saba Bashir et al. [5] used a combination of three classifiers: naive Bayes, decision tree and support vector machine. These classifiers are combined using a majority voting scheme. The accuracy obtained by this method is 81.8%.
Kathleen H. Miao et al. [6] used the adaptive boosting algorithm for classification and prediction of heart disease. It is an ensemble learning method and was applied to different datasets. The adaptive boosting algorithm adds weak classifiers iteratively, which results in a strong ensemble classifier.
Shashikant Ghumbre et al. [7] proposed a decision support system for the diagnosis of heart disease using a radial basis function network structure and a support vector machine. Their results indicate that SVM with sequential minimal optimization is as good as the radial basis function network in the diagnosis of heart disease.
S. Vijiyarani et al. [8] proposed predicting heart disease using classification tree techniques. The classification tree algorithms used and tested in that work are decision stump, random forest and the LMT tree algorithm. The objective was to compare the performance of the different classification techniques on a heart disease dataset. The work was done with the WEKA tool.
Sonam Nikhar et al. [19] applied naive Bayes and decision tree classifiers to the prediction of heart disease. Their results show that the decision tree performs better than the Bayesian classifier.
Jaymin Patel et al. [20] applied different algorithms to heart disease prediction: J48, the logistic model tree algorithm and the random forest algorithm. Their results show that J48 performs best among these algorithms.
A. Data Collection:
The heart disease dataset was taken from the UCI Machine Learning Repository [9]. It is made up of 76 raw attributes, of which a subset of 14 has been used in work published by various researchers. These attributes are vital in the diagnosis of heart disease. The dataset has 303 instances. The 12 attributes considered in this research work (11 predictors and the diagnosis attribute num) are stated below. The description of the Cleveland heart disease dataset is tabulated in Table 1.
B. R Language:
R is an open source programming language, and RStudio is an Integrated Development Environment (IDE) for it. R is used to manipulate and analyse the data in datasets. It can produce a wide variety of plots, and it is used for software development in data mining, machine learning and other fields. It is an effective, extensible and comprehensive environment for statistical computation and graphics. One of the key features of R is that it supports user-created packages, and data can be imported from a variety of file formats such as CSV (Comma Separated Values), XML (Extensible Markup Language) and binary files.
R has various data structures, including vectors, matrices, arrays, data frames (similar to tables in a relational database) and lists. Many packages are available for R, and a package can be loaded when needed with the library(package_name) command. Several interfaces are available for R; among them, RStudio is commonly used.
C. Random Forest (RF):
Random forests, also called random decision forests, are an ensemble learning method for tasks such as classification and regression. At training time, a random forest constructs a large number of decision trees, and it outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests are an improvement over bagged decision trees.
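The two ingredients named above, bootstrap sampling and combining the trees' outputs, can be illustrated with a minimal sketch. It is written in Python for illustration only (the work itself was carried out in R), and the function names, the toy records and the five votes are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Classification: return the most common class among the trees' votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
toy_rows = list(range(10))                # stand-in for training records
sample = bootstrap_sample(toy_rows, rng)  # each tree would be grown on such a sample

# Hypothetical votes from five trees for one patient.
votes = ["presence", "absence", "presence", "presence", "absence"]
forest_class = majority_vote(votes)

# Regression: the forest output is the mean of the trees' predictions.
forest_value = mean([1.0, 2.0, 3.0])
```

A full random forest additionally restricts each split to a random subset of the attributes, which this sketch omits.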
D. Generalized Boosted Regression Modeling (GBM):
GBM is a powerful machine learning algorithm for regression, classification and ranking. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function.
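The stage-wise idea can be sketched as follows, assuming squared-error loss, for which the negative gradient is simply the residual. This is an illustrative Python toy on one-dimensional regression data, not the R gbm configuration used in this work; fit_stump, boost and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs (needs >= 2 distinct xs)."""
    best = None
    for t in sorted(set(xs))[:-1]:            # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Add one weak learner per stage, each fitted to the current residuals."""
    base = sum(ys) / len(ys)                  # stage 0: constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_stages):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
mse_baseline = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each stage shrinks the remaining training error, which is why many weak learners together form a strong one.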
E. Linear Discriminant Analysis (LDA):
Linear discriminant analysis (LDA) is a method for finding a linear combination of predictors or variables that separates two classes or targets. It can also separate more than two classes. This method is widely used in machine learning, statistics and pattern recognition. In machine learning applications, it is used as a dimensionality reduction technique, and it is also used for data classification.
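For the two-class case, the linear combination can be written down explicitly. The expression below is the standard textbook form, assuming both classes share a covariance matrix and have equal prior probabilities; the symbols are introduced here for illustration and do not appear elsewhere in this paper.

```latex
\mathbf{w} \propto \boldsymbol{\Sigma}^{-1}\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0),
\qquad
\text{assign } \mathbf{x} \text{ to class 1 if }
\mathbf{w}^{\top}\mathbf{x} > \tfrac{1}{2}\,\mathbf{w}^{\top}(\boldsymbol{\mu}_0 + \boldsymbol{\mu}_1)
```

Here \boldsymbol{\mu}_0 and \boldsymbol{\mu}_1 are the class mean vectors and \boldsymbol{\Sigma} is the shared covariance matrix, all estimated from the training data.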
F. Support Vector Machine (SVM):
The support vector machine is a machine learning algorithm that belongs to the family of generalized linear classifiers. It can be used for regression as well as classification tasks. SVM uses a hyperplane to separate the data set into two classes. Support vectors are the data points nearest to the hyperplane. These points are considered the critical elements of the data set. SVM uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fitting to the data.
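How a hyperplane separates two classes, and why the nearest points matter, can be seen in a small Python sketch. The weight vector w and offset b below are hypothetical values chosen by hand; in an actual SVM they are learned so that the distance to the nearest points (the margin) is as large as possible.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w||."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 = 10 in a two-attribute space.
w, b = [1.0, 1.0], -10.0

far_point = (8, 8)    # well inside the positive side
near_point = (5, 6)   # close to the boundary: a candidate support vector

label_far = classify(w, b, far_point)
label_other = classify(w, b, (2, 3))
d_far = distance_to_hyperplane(w, b, far_point)
d_near = distance_to_hyperplane(w, b, near_point)
```

Moving far_point slightly would not change the boundary, whereas the points closest to it determine where the maximum-margin hyperplane lies.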
Stacking is also called stacked generalization. It involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data. Then a combiner algorithm is trained to make a final prediction using the predictions of the other algorithms as additional inputs. Stacking has been used successfully on supervised learning tasks such as classification, regression and distance learning. It can also be used for unsupervised learning such as density estimation. The point of stacking is to explore a space of different models for the same problem. It can improve overall performance, and the resulting model is often better than any of the individual models.
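The two levels of stacking can be sketched on toy data. This is an illustrative Python sketch rather than the R implementation used in this work: the base learners are single-attribute threshold rules, and the combiner is a simple lookup table standing in for the random forest combiner of the proposed method. All names and data values are hypothetical.

```python
from collections import Counter, defaultdict

def fit_stump(rows, labels, feature):
    """Level-0 learner: best threshold rule on one attribute."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        for hi_label in (0, 1):               # label predicted when value > t
            preds = [hi_label if r[feature] > t else 1 - hi_label for r in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, t, hi_label)
    _, t, hi_label = best
    return lambda r: hi_label if r[feature] > t else 1 - hi_label

def fit_combiner(base_models, rows, labels):
    """Level-1 learner: majority label for each pattern of base predictions."""
    table = defaultdict(Counter)
    for r, y in zip(rows, labels):
        table[tuple(m(r) for m in base_models)][y] += 1
    default = Counter(labels).most_common(1)[0][0]

    def predict(r):
        key = tuple(m(r) for m in base_models)
        return table[key].most_common(1)[0][0] if key in table else default
    return predict

# Toy data: the label is 1 only when both attributes are large.
base_rows = [(2, 2), (2, 8), (8, 2), (8, 8), (3, 7), (7, 3), (9, 9), (7, 7)]
base_labels = [0, 0, 0, 1, 0, 0, 1, 1]
holdout_rows = [(1, 1), (1, 9), (9, 1), (6, 6), (8, 9), (2, 6), (6, 2)]
holdout_labels = [0, 0, 0, 1, 1, 0, 0]
test_rows = [(9, 2), (8, 7), (1, 8)]

# Step 1: train the base models. Step 2: train the combiner on their
# predictions for records the base models have not seen.
base_models = [fit_stump(base_rows, base_labels, f) for f in (0, 1)]
stacked = fit_combiner(base_models, holdout_rows, holdout_labels)
stacked_predictions = [stacked(r) for r in test_rows]
single_model_prediction = base_models[0](test_rows[0])
```

On the record (9, 2) the first base model alone predicts class 1, while the combiner, which sees both base predictions, returns class 0. Training the combiner on held-out records is a common precaution against over-fitting; whether the authors did so is not stated.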
The main objective of this research is to develop a prediction system for heart disease using data mining techniques such as random forest, generalized boosted regression modeling, linear discriminant analysis and support vector machines. Existing works used 13 attributes for prediction.
The proposed work uses only 11 attributes for prediction. The important attributes are extracted using random forest (mean decrease in Gini index) and sorted by variable importance.
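The quantity behind this ranking can be made concrete. The sketch below, in Python for illustration only, computes the Gini impurity of a node and the decrease produced by one split; a random forest's mean decrease in Gini for an attribute accumulates such decreases over every split made on that attribute and averages them over the trees. The label lists are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]

# A split that separates the classes perfectly removes all impurity.
informative = gini_decrease(parent, [1, 1, 1], [0, 0, 0])

# A split that leaves both children as mixed as the parent removes none.
uninformative = gini_decrease(parent, [1, 0], [1, 1, 0, 0])
```

Attributes whose splits repeatedly produce large decreases rank high in importance; those that rarely help rank low and are candidates for removal.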
The following steps are used for heart disease prediction.
1. Load the dataset.
2. Split the dataset into training and testing sets (75% training, 25% testing).
3. Find the important attributes using a random forest fit.
4. Train different models: random forest, generalized boosted regression modeling, linear discriminant analysis and support vector machine.
5. Run the models on the testing data.
6. Stack the models together and combine them with a random forest.
7. Run stacked model on testing data.
8. Compare the performance measures of individual models with the stacked models.
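Step 2 above can be sketched as a seeded random partition. The sketch is in Python for illustration only; the exact partitioning procedure used in this work is not described, so the function name, the seed and the use of plain shuffling are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    rng = random.Random(seed)     # fixed seed so the split is repeatable
    shuffled = rows[:]            # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(303))        # stand-in for the 303 instances
train, test = train_test_split(records)
```

Fixing the seed matters when several models are compared, since every model must be evaluated on the same testing records.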
Experimental Results And Analysis:
A. Data Source:
This dataset was taken from the UCI Machine Learning Repository. The Cleveland heart disease dataset is made up of 76 raw attributes, of which a subset of 14 (13 predictors and the diagnosis attribute) has been used in published work. In this work only 11 predictor attributes were used. The statistics of the heart disease dataset are tabulated in Table 2.
B. Performance Measures:
The performance of the system is evaluated using three measures: accuracy, specificity and sensitivity.
Accuracy = (TP+TN) /(TP+TN+FP+FN)
Specificity = TN/(TN+FP)
Sensitivity = TP/(TP+FN)
TP = number of samples classified as true that were actually true.
TN = number of samples classified as false that were actually false.
FN = number of samples classified as false that were actually true.
FP = number of samples classified as true that were actually false.
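The three measures can be checked against the confusion matrices in Table 3. The Python sketch below is illustrative only; it assumes that the rows of Table 3 are predicted classes, the columns are actual classes, and "Absence" is treated as the positive class, a reading under which the values in Table 4 are reproduced.

```python
def measures(tp, tn, fp, fn):
    """Return accuracy, specificity and sensitivity as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return tuple(round(100 * m, 2) for m in (accuracy, specificity, sensitivity))

# Counts read from Table 3 under the assumption stated above:
# tp = predicted Absence / actual Absence, fp = predicted Absence / actual Presence,
# fn = predicted Presence / actual Absence, tn = predicted Presence / actual Presence.
confusion = {
    "Random Forest": dict(tp=32, fp=7, fn=3, tn=32),
    "GBM":           dict(tp=34, fp=9, fn=1, tn=30),
    "LDA":           dict(tp=33, fp=9, fn=2, tn=30),
    "SVM":           dict(tp=33, fp=10, fn=2, tn=29),
    "Stacked Model": dict(tp=34, fp=5, fn=1, tn=34),
}

results = {name: measures(**counts) for name, counts in confusion.items()}
for name, (acc, spec, sens) in results.items():
    print(f"{name}: accuracy {acc}, specificity {spec}, sensitivity {sens}")
```

Each matrix sums to 74 testing records, so one additional correct prediction changes accuracy by about 1.35 percentage points.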
The proposed stacked model was tested on the Cleveland heart disease dataset using 11 attributes. The method was implemented in R. The performance of the proposed stacked model is compared with that of the individual models: random forest, generalized boosted regression modeling, linear discriminant analysis and support vector machine.
Table 4 shows that the stacked model achieves the highest accuracy and specificity, and that its sensitivity equals that of the best individual model (GBM). The accuracy obtained by the stacked model is 91.89%, the sensitivity is 97.14% and the specificity is 87.18%.
Fig. 2 shows that the stacked model performs better than the other models.
The objective of our work is to accurately predict the presence of heart disease with a reduced list of attributes. Originally, 13 attributes were involved in predicting heart disease. In our work, random forest is used to find the important attributes that contribute most to the diagnosis of heart disease, which indirectly reduces the number of tests the patient must take. The 13 attributes are reduced to 11 using a random forest fit. Subsequently, random forest, generalized boosted regression modeling, linear discriminant analysis, support vector machine and a stacked model are used to predict the diagnosis of patients, with higher accuracy than was obtained before the number of attributes was reduced. The observations also show that the stacked model outperforms the other four methods. We intend to extend our work by incorporating fuzzy learning models and genetic algorithms to evaluate the intensity of heart disease.
[1] Suganya, R., S. Rajaram, A. Sheik Abdullah and V. Rajendran, 2016. "A Novel Feature Selection method for predicting heart diseases with Data Mining Techniques", Asian Journal of Information Technology, pp: 1314-1321.
[2] Beant Kaur, Williamjeet Singh, 2015. "Analysis of heart attack prediction system using genetic algorithm", International Journal of Advanced Technology in Engineering and Science, 3: 87-94.
[3] Randa El Bialy, Mostafa A. Salama, Omar Karam, 2016. "An ensemble model for Heart disease data sets: A generalized model", ACM, pp: 191-196.
[4] Subha, V., M. Revathi, D. Murugan, 2015. "Comparative Analysis of Support Vector Machine Ensembles for Heart Disease Prediction", pp: 386-390.
[5] Saba Bashir, Usman Qamar, M. Younus Javed, 2014. "An ensemble based decision support framework for intelligent heart disease diagnosis", International Conference on Information Society, pp: 259-264.
[6] Kathleen H. Miao, Julia H. Miao and George J. Miao, 2016. "Diagnosing Coronary Heart Disease Using Ensemble Machine Learning", pp: 30-39.
[7] Shashikant Ghumbre, Chetan Patil and Ashok Ghatol, 2011. "Heart disease diagnosis using support vector machine", International Conference on Computer Science and Information Technology (ICCSIT), Pattaya.
[8] Vijiyarani, S., S. Sudha, 2013. "An Efficient Classification Tree Technique for Heart Disease Prediction", International Conference on Research Trends in Computer Technologies (ICRTCT-2013), pp: 6-9.
[9] The UCI Machine Learning Repository [online]. https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[10] Benish Fida, Muhammed Nazir, Nawazish Naveed, Sheeraz Akram, 2011. "Heart disease classification ensemble optimization using genetic algorithm", IEEE 14th International Multitopic Conference, pp: 19-24.
[11] Zinzendoff Okwonu and Abdul Rahman Othman, 2012. "A Model Classification Technique for Linear Discriminant Analysis for Two Groups", International Journal of Computer Science Issues, pp: 125-128.
[12] Gerard Biau, 2012. "Analysis of a Random Forests Model", Journal of Machine Learning Research, pp: 1063-1095.
[13] Leo Breiman, 1996. "Stacked Regressions", Machine Learning, Kluwer Academic Publishers, 24: 49-64.
[14] Obenshain, M.K., 2004. "Application of Data Mining Techniques to Healthcare Data", Infection Control and Hospital Epidemiology, 25(8): 690-695.
[15] Purusothaman, G. and P. Krishnakumari, 2015. "A Survey of Data Mining Techniques on Risk Prediction: Heart Disease", Indian Journal of Science and Technology, 8(12).
[16] Jian Yu, Min-Shen Yang, 2007. "A Generalized Fuzzy Clustering Regularization Model with Optimality Tests and Model Complexity Analysis", IEEE Transactions on Fuzzy Systems, 15(5): 904-915.
[17] Das, R., I. Turkoglu, A. Sengur, 2009. "An Effective diagnosis of heart disease through neural network ensembles", Expert Systems with Applications, 36(4): 7675-7680.
[18] Cristianini, N., J. Shawe-Taylor, 2000. "An Introduction to Support Vector Machines", Cambridge University Press, Cambridge.
[19] Sonam Nikhar, A.M. Karandikar, 2016. "Prediction of Heart Disease Using Machine Learning Algorithms", International Journal of Advanced Engineering, Management and Science (IJAEMS), pp: 617-621.
[20] Jaymin Patel, Tejal Upadhyay, Samir Patel, 2015. "Heart Disease Prediction Using Machine learning and Data Mining Technique", International Journal of Computer Science & Communication, pp: 129-137.
(1) Mrs. K. UmaMaheswari, (2) Dr. A. Valarmathi, (3) Ms. J. Jasmine
(1) Assistant Professor, Department of Information Technology, Anna University, BIT Campus, Tiruchirappalli-24.
(2) Head & Assistant Professor, Department of Computer Applications, Anna University, BIT Campus, Tiruchirappalli-24.
(3) Student (M.E. CSE), Anna University, BIT Campus, Tiruchirappalli-24.
Received 12 May 2017; Accepted 5 July 2017; Available online 28 July 2017
Address For Correspondence:
(1) Mrs. K. UmaMaheswari, Assistant Professor, Department of Information Technology, Anna University, BIT Campus, Tiruchirappalli-24.
Caption: Fig. 1: Variable Importance using Random Forest Fit
Caption: Fig. 2: Comparison of Performance measures of various techniques
Table 1: Description of dataset

Sl.No  Attribute Name  Description
1      Age             Age in years
2      Sex             Sex (1 = male; 0 = female)
3      cp              Chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
4      trestbps        Resting blood pressure (in mm Hg)
5      chol            Serum cholesterol (in mg/dl)
6      thalach         Maximum heart rate
7      exang           Exercise induced angina (1 = yes; 0 = no)
8      oldpeak         ST depression induced by exercise relative to rest
9      slope           The slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
10     ca              Number of major vessels (0-3) colored by fluoroscopy
11     thal            3 = normal; 6 = fixed defect; 7 = reversible defect
12     num             Diagnosis of heart disease (angiographic disease status 0: < 50% diameter narrowing, 1: > 50% diameter narrowing)

Table 2: Statistics of heart disease dataset

Sl.No  Dataset                          Training dataset  Testing dataset
1      Cleveland heart disease dataset  75%               25%

Table 3: Confusion matrix of proposed stacked model with other techniques

Technique                                Class     Absence  Presence
Random Forest                            Absence   32       7
                                         Presence  3        32
Generalized Boosted Regression Modeling  Absence   34       9
                                         Presence  1        30
Linear Discriminant Analysis             Absence   33       9
                                         Presence  2        30
Support Vector Machine                   Absence   33       10
                                         Presence  2        29
Stacked Model                            Absence   34       5
                                         Presence  1        34

Table 4: Accuracy, sensitivity and specificity (in %) for various classifiers

Classifiers                                    Accuracy  Sensitivity  Specificity
Random Forest (RF)                             86.49     91.43        82.05
Generalized Boosted Regression Modeling (GBM)  86.49     97.14        76.92
Linear Discriminant Analysis (LDA)             85.14     94.29        76.92
Support Vector Machine (SVM)                   83.78     94.29        74.36
Stacked Model (SM)                             91.89     97.14        87.18
Author: UmaMaheswari, K.; Valarmathi, A.; Jasmine, J.
Publication: Advances in Natural and Applied Sciences
Date: Jul 1, 2017