
A comparative study of data mining algorithms for decision tree approaches using WEKA tool.

INTRODUCTION

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software provides a number of analytical tools for analyzing data. It allows users to analyze data from many different angles and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among many fields in large relational databases. WEKA is one of the best open source tools for prediction and analysis; in this study the classifiers are trained and then evaluated with the test options of 10-fold cross-validation and a 66 percent training/test split. In data mining, classification is a form of data analysis that can be used to extract models describing important data classes. There are many classification techniques, but the decision tree is the most commonly used classification algorithm because it is easy to implement and easier to understand than other classification algorithms.

The structure of the paper is as follows: Section 2 discusses related research on the accuracy analysis of various data mining algorithms. Section 3 describes how accuracy was calculated and compared for the seven data mining algorithms considered in the experiment. The results and discussion are given in Section 4. Finally, the conclusion and future scope are given in Section 5.

2. Related Works:

Classification techniques can be compared on the basis of predictive accuracy, speed, robustness, scalability and interpretability. Data mining is a useful tool for discovering knowledge from large data, and different methods and algorithms are available for the task. Classification is the most common method used for finding mining rules in large databases. The decision tree method is generally used for classification because of its simple hierarchical structure. Various classification algorithms are available based on Artificial Neural Networks, the Nearest Neighbour rule and Bayesian classifiers, but decision tree mining is the simplest one [1]. Tree induction begins with a root node that represents the entire given dataset and recursively splits the data into subsets by testing a given attribute at each level of the tree. The author of [2] observes that agricultural organizations work with large amounts of data, and that processing and retrieving significant data from this abundance of agricultural information is necessary. The use of information and communication technology enables the automated extraction of significant data in an effort to obtain knowledge and trends.

Data mining is the process that results in the discovery of new patterns in large data sets. The goal of the data mining process is to extract knowledge from an existing data set and transform it into a human-understandable format for further use [3]. The authors of [4] reported Mean Absolute Errors and Root Mean Squared Errors of data mining techniques in the scenarios they considered. A correlation review of classification algorithms has been carried out using freely available data mining and knowledge discovery tools such as WEKA, RapidMiner, Tanagra, Orange, and KNIME; the accuracy of classification algorithms such as Decision Tree, Decision Stump, K-Nearest Neighbour and Naive Bayes was compared using all five tools [5]. A system has also been proposed that helps farmers in all manners, that is, in education, weather forecasting, crop analysis and understanding agriculture more clearly [6].

In the generic description of the C4.5 algorithm, all tree induction methods begin with a root node that represents the entire given dataset and recursively split the data into smaller subsets by testing a given attribute at each node [7]. Quinlan [8] summarizes an approach to synthesizing decision trees that has been used in a variety of systems and describes one such system, ID3, in detail. Results from various studies show the ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. Decision trees are among the most extensively researched models in knowledge discovery. Besides advantages such as the ability to explain the decision procedure and low computational cost, decision trees also usually produce relatively good results in comparison with other machine learning algorithms. Although the best decision tree induction algorithms, such as J48, were developed some time ago, they continue to be used regularly for solving everyday classification tasks [9]. The ID3 and C4.5 algorithms [8, 10], for example, use the information gain and gain ratio measures respectively, but differ in regard to the tests performed on their attributes. Decision tree induction is one of the most widely employed methods for extracting knowledge from data, since its representation of knowledge is very intuitive and easily understandable by humans.

Data stream classification performance can be measured by factors such as accuracy, computational speed, memory usage and processing time. Such classification algorithms have only a short time span in which to examine the data and construct a model, possibly in a single pass with limited resources [11]. Data mining techniques are increasingly applied to medical data to discover useful trends or patterns that are used in diagnosis and decision making; techniques including clustering, classification, regression, association rule mining and CART (Classification and Regression Trees) are widely used in healthcare and other applicable domains [12]. Analyzing raw data manually and finding the correct information in it is a difficult process, but data mining techniques automatically detect the relevant patterns or information in the raw data using data mining algorithms. Among data mining algorithms, decision trees are the best and most commonly used approach for representing the data [18].

In [19], a comparison of existing prediction algorithms is presented and a generic framework for a hybrid prediction model is proposed; such a model would find new and interesting patterns. That work also gives an overview of prediction methods used in different applications and lists prediction algorithms. Although prediction using unstructured data is an emerging research topic and its results have relatively low accuracy, it has created a new way to collect, extract and utilize the wisdom of crowds in an objective manner with low cost and high efficiency [19]. The main objective of [20] is to provide the performance appraisal report of an employee using a decision tree algorithm. Data mining classification methods such as decision trees, rule mining and clustering can be applied to evaluate employee data for identifying stress, remedies to relieve the stress, and career advancement. To provide a solution that reduces employee stress, the historical data stored in tables are subjected to learning using the decision tree algorithm, and the performance is found by testing the attributes of an employee against the rules generated by the decision tree classifier.

Another work presents data mining techniques such as classification, clustering and association analysis, including decision tree algorithms, and uses correlation-based attribute evaluation for feature selection and dimensionality reduction together with a t-test for comparing the performance of different classifiers before and after dimensionality reduction. The classifiers used in that work are Naive Bayes (NB), k-Nearest Neighbour (kNN), Classification Tree (CT) and CN2. Empirical results show that the CN2 classifier is best for the multi-dimensional thyroid dataset when the classification accuracies of the different classifiers are compared, and that there is no significant difference in performance before and after dimensionality reduction for the four classifiers [24]. Finally, data mining supports the process of predicting customer behaviour and selecting actions to influence that behaviour to benefit the banking industry; it provides the technology for the banking industry to analyze massive volumes of data and detect hidden patterns in bank data, converting raw data into valuable information. The potential value and applications of data mining tools for effective customer relationship marketing in the banking industry are discussed in [25].

3. System Overview:

Decision trees are great tools for classification and prediction in data mining. A decision tree represents rules and is a classifier in the form of a tree structure in which each node is either a leaf node, indicating a class of instances, or a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

3.1 J48:

C4.5 tree induction begins with a root node that represents the entire given dataset and recursively splits the data into smaller subsets by testing a given attribute at each node. The subtrees denote the partitions of the original data set that satisfy the specified attribute value tests. This process typically continues until the subsets are "pure", that is, until all instances in a subset fall in the same class, at which point the growth of that branch is terminated.

C4.5, designed by J. Ross Quinlan, is so named because it is a descendant of the ID3 approach to inducing decision trees. A decision tree is a series of questions systematically arranged so that each question queries an attribute and branches based on the value of that attribute. At the leaves of the tree are placed predictions of the class variable.
Algorithm:

1: Tree = {}, Input: an attribute-valued dataset D
2: if D is "pure" OR other stopping criteria met then
3:   terminate
4: end if
5: for all attributes a ∈ D do
6:   Compute information-theoretic criteria if we split on a
7: end for
8: a_best = Best attribute according to the above computed criteria
9: Tree = Create a decision node that tests a_best in the root
10: D_v = Induced sub-datasets from D based on a_best
11: for all D_v do
12:   Tree_v = C4.5(D_v)
13:   Attach Tree_v to the corresponding branch of Tree
14: end for
15: return Tree
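As an illustration, a minimal Python sketch of the information-gain step in lines 5 to 8 of the pseudocode is given below; it is only a sketch of the criterion, not WEKA's J48 implementation, and the helper names (entropy, information_gain, best_attribute) are our own. The full algorithm additionally uses the gain ratio, threshold splits on numeric attributes, missing-value handling and pruning.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, equation used by the split criterion.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain obtained by splitting `rows` (a list of dicts) on `attribute`.
    n = len(labels)
    gain = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    for subset in groups.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def best_attribute(rows, labels, attributes):
    # Line 8 of the pseudocode: pick the attribute with the best criterion value.
    return max(attributes, key=lambda a: information_gain(rows, labels, a))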


3.2 Random Tree (RT):

A random tree usually refers to a randomly built tree that has nothing to do with machine learning. The popular machine learning framework WEKA, however, uses the term for a decision tree built by considering only a random subset of the attributes at each split. A Random Tree on its own tends to be too weak, so it is normally included in an ensemble algorithm; in addition to Random Forest, Bagging (bagged random trees) or AdaBoost (boosted random trees) can be used to make it strong enough.
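The node-level randomization that distinguishes a random tree from plain C4.5-style induction can be sketched as follows; this is an illustration of the idea rather than WEKA's RandomTree code, it reuses the hypothetical information_gain helper from Section 3.1, and k is an assumed parameter for the number of randomly chosen attributes.

import random

def best_random_attribute(rows, labels, attributes, k):
    # Consider only k randomly chosen attributes at this node, then pick the
    # best of them by information gain (helper sketched in section 3.1).
    candidates = random.sample(list(attributes), min(k, len(attributes)))
    return max(candidates, key=lambda a: information_gain(rows, labels, a))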

3.3 Decision Stump (DS):

The Decision Stump generates a decision tree with only a single split. The resulting tree can be used for classifying unseen examples. Decision stumps can be very efficient when boosted with operators such as AdaBoost. The examples in the given example set have several attributes and every example belongs to a class (such as yes or no). The leaf nodes of a decision tree contain the class name, whereas a non-leaf node is a decision node; the decision node is an attribute test with every branch of the node being a possible value of the attribute. Decision stumps may appear very simple, but when boosted they yield good classifiers in practice. The algorithm illustrated here can also be used as a basic building block for more complex base learners, such as trees or products [13].
Algorithm:

1.  γ0 ← Σ_{i=1}^{n} w_i y_i                       // edge of the constant classifier h0(x) ≡ 1
2.  γ* ← γ0                                        // best edge so far
3.  for j ← 1 to d do                              // all numeric features
4.    γ ← γ0                                       // edge of the constant classifier
5.    for i ← 2 to n do                            // all points in order x(j)_1 ≤ ... ≤ x(j)_n
6.      γ ← γ − 2 w_{i−1} y_{i−1}                  // update edge of the positive stump
7.      if x(j)_{i−1} ≠ x(j)_i then                // no threshold if identical coordinates
8.        if |γ| > |γ*| then                       // found a better stump
9.          γ* ← γ                                 // update best edge
10.         j* ← j                                 // update index of best feature
11.         θ* ← (x(j)_i + x(j)_{i−1}) / 2         // update best threshold
12. if γ* = γ0 then                                // did not beat the constant classifier
13.   return sign(γ0) · h0                         // ± constant classifier
14. return the best stump, defined by j*, θ* and the sign of γ*   // best stump
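A minimal Python sketch of a weighted decision stump learner in the spirit of the pseudocode above is given below; the function names and the exact return structure are our own illustration rather than the code of [13], and the "edge" update follows the same idea of sweeping a threshold across each sorted feature.

import numpy as np

def train_stump(X, y, w):
    # X: (n, d) numeric features, y: labels in {-1, +1}, w: example weights.
    n, d = X.shape
    gamma0 = float(np.sum(w * y))              # edge of the constant classifier h0 = +1
    best = (abs(gamma0), None, None, 1.0 if gamma0 >= 0 else -1.0)
    for j in range(d):                         # all numeric features
        order = np.argsort(X[:, j])            # points in order x(j)_1 <= ... <= x(j)_n
        xs, ys, ws = X[order, j], y[order], w[order]
        gamma = gamma0                         # threshold below all points
        for i in range(1, n):
            gamma -= 2.0 * ws[i - 1] * ys[i - 1]      # point i-1 is now predicted as -1
            if xs[i - 1] != xs[i] and abs(gamma) > best[0]:
                theta = (xs[i - 1] + xs[i]) / 2.0     # threshold between the two points
                best = (abs(gamma), j, theta, 1.0 if gamma >= 0 else -1.0)
    return best                                # (|edge|, feature index, threshold, sign)

def predict_stump(stump, X):
    _, j, theta, sign = stump
    if j is None:                              # did not beat the constant classifier
        return np.full(len(X), sign)
    return sign * np.where(X[:, j] >= theta, 1.0, -1.0)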


3.4 Logistic Model Tree (LMT):

The Logistic Model Tree (LMT) algorithm builds a tree that handles binary and multiclass target variables, numeric attributes and missing values, combining a decision tree with logistic regression. LMT produces its output in the form of a tree containing binary splits on numeric attributes; a logistic model tree basically consists of a standard decision tree structure with logistic regression functions at the leaves, much as a model tree is a regression tree with regression functions at the leaves. The authors of [14] presented the ideas leading to the following algorithm for constructing logistic model trees.

Algorithm:

1. Tree growing starts by building a logistic model at the root using the LogitBoost algorithm. The number of iterations is determined using fivefold cross-validation: the data is split into training and test sets five times, LogitBoost is run on every training set up to a maximum number of iterations, and the error rates on the test set are logged for every iteration and summed over the different folds. The number of iterations with the lowest sum of errors is then used to train LogitBoost on all the data. This gives the logistic regression model at the root of the tree.

2. A split of the data at the root is constructed using the C4.5 splitting criterion. Both binary splits on numeric attributes and multiway splits on nominal attributes are considered. Tree growing continues by sorting the appropriate subsets of data into those nodes and building the logistic models of the child nodes in the following way: the LogitBoost algorithm is run on the subset associated with the child node, but starting with the committee, weights and probability estimates of the last iteration performed at the parent node. Again, the optimum number of iterations to perform is determined by fivefold cross-validation.

3. Splitting continues in this fashion as long as more than 15 instances are at a node and a useful split can be found by the C4.5 splitting routine.

4. The tree is pruned using the CART pruning algorithm as outlined.
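A much-simplified Python sketch of the structure an LMT produces is given below: a shallow decision tree whose leaves each hold a logistic regression model. This is only an approximation of the idea; the real LMT of [14] builds the leaf models incrementally with LogitBoost and prunes with CART, and the use of scikit-learn and all class and parameter names here are our own assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

class SimpleLogisticLeafTree:
    # Shallow tree with a logistic regression model fitted in each leaf.
    def __init__(self, max_depth=2, min_leaf=15):
        self.tree = DecisionTreeClassifier(max_depth=max_depth,
                                           min_samples_leaf=min_leaf)
        self.leaf_models = {}

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.tree.fit(X, y)
        leaf_ids = self.tree.apply(X)            # leaf index of every training instance
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            if len(np.unique(y[mask])) > 1:      # fit a leaf model if several classes reach it
                self.leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
            else:                                # otherwise store the constant class
                self.leaf_models[leaf] = y[mask][0]
        return self

    def predict(self, X):
        X = np.asarray(X)
        leaf_ids = self.tree.apply(X)
        predictions = []
        for x, leaf in zip(X, leaf_ids):
            model = self.leaf_models[leaf]
            predictions.append(model.predict(x.reshape(1, -1))[0]
                               if isinstance(model, LogisticRegression) else model)
        return np.array(predictions)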

3.5 Hoeffding Tree (HT):

The Hoeffding tree algorithm is an incremental decision tree induction algorithm that is capable of learning from very large data streams, assuming that the distribution generating the examples does not change over time. Hoeffding trees exploit the fact that a small sample can often be enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations needed to estimate a statistic within a prescribed precision. In the Hoeffding option tree induction algorithm below, nmin is the grace period, G is the split criterion function, R is the range of G, τ is the tie-breaking threshold, δ is the confidence for the initial splits, δ′ is the confidence for additional splits and maxOptions is the maximum number of options reachable by a single example [15].
Algorithm:

1: Let HOT be an option tree with a single leaf (the root)
2: for all training examples do
3:   Sort the example into option nodes L using HOT
4:   for all option nodes l of the set L do
5:     Update sufficient statistics in l
6:     Increment n_l, the number of examples seen at l
7:     if n_l mod nmin = 0 and the examples seen at l are not all of the same class then
8:       if l has no children then
9:         Compute G_l(X_i) for each attribute
10:        Let X_a be the attribute with the highest G_l
11:        Let X_b be the attribute with the second-highest G_l
12:        Compute the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2 n_l))
13:        if G_l(X_a) − G_l(X_b) > ε or ε < τ then
14:          Add a node below l that splits on X_a
15:          for all branches of the split do
16:            Add a new option leaf with initialized sufficient statistics
17:          end for
18:        end if
19:      else
20:        if l.optionCount < maxOptions then
21:          Compute G_l(X_i) for existing splits and (non-used) attributes
22:          Let S be the existing child split with the highest G_l
23:          Let X be the (non-used) attribute with the highest G_l
24:          Compute the Hoeffding bound ε = sqrt(R² ln(1/δ′) / (2 n_l))
25:          if G_l(X) − G_l(S) > ε then
26:            Add an additional child option to l that splits on X
27:            for all branches of the split do
28:              Add a new option leaf with initialized sufficient statistics
29:            end for
30:          end if
31:        else
32:          Remove the attribute statistics stored at l
33:        end if
34:      end if
35:    end if
36:  end for
37: end for
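A minimal Python sketch of the Hoeffding-bound test used on lines 12 and 13 of the pseudocode is given below: a split is made only when the observed gain difference between the two best attributes exceeds ε, or when ε falls below the tie-breaking threshold τ. The function names are our own; only the formula for the bound comes from the algorithm description.

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, value_range, delta, n, tau=0.05):
    eps = hoeffding_bound(value_range, delta, n)
    return (gain_best - gain_second) > eps or eps < tau

# Example: information gain has range R = log2(c) for c classes; with c = 2,
# delta = 1e-7 and 1000 examples seen at the leaf, eps is about 0.09, so
# should_split(0.25, 0.12, 1.0, 1e-7, 1000) returns True.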


3.6 Reduced Error Pruning (REP):

The simplest form of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class; if the prediction accuracy is not affected, the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed. Traversing the internal nodes from the bottom to the top of the tree, the REP algorithm checks for every internal node whether replacing it with the most frequent class reduces the accuracy of the tree. If it does not, the node is pruned. The procedure continues until any further pruning would reduce the accuracy.

The Reduced Error Pruning Tree ("REPT") [16] is a fast decision tree learner that builds a decision tree based on information gain or variance reduction and prunes it using reduced error pruning with backfitting. It sorts the values of numeric attributes only once and handles missing values with the fractional-instance method of C4.5. The algorithm therefore combines methods taken from C4.5 with basic REP in its process.
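A minimal Python sketch of reduced error pruning over a toy tree structure is given below: each internal node is replaced by its majority class whenever doing so does not hurt accuracy on a held-out pruning set. The dict-based tree format, with a stored "majority" label per node, is an assumption made for illustration and is not WEKA's REPTree data structure.

def predict(node, x):
    # Internal nodes are dicts {attribute, children, majority}; leaves are labels.
    while isinstance(node, dict):
        child = node["children"].get(x[node["attribute"]])
        if child is None:                        # unseen attribute value
            return node["majority"]
        node = child
    return node

def rep_prune(node, pruning_data):
    # Return a reduced-error-pruned copy of `node` using held-out pruning data.
    if not isinstance(node, dict):
        return node
    children = {}
    for value, child in node["children"].items():
        subset = [(x, y) for x, y in pruning_data if x[node["attribute"]] == value]
        children[value] = rep_prune(child, subset)      # prune bottom-up
    pruned = {"attribute": node["attribute"], "children": children,
              "majority": node["majority"]}
    if not pruning_data:                                # nothing to judge by: keep subtree
        return pruned
    errors_subtree = sum(predict(pruned, x) != y for x, y in pruning_data)
    errors_leaf = sum(node["majority"] != y for _, y in pruning_data)
    # Collapse the subtree to its majority class unless that reduces accuracy.
    return pruned if errors_subtree < errors_leaf else node["majority"]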

3.7 Random Forest (RF):

The Random Forest classifier is one of the best classification techniques and is able to classify large amounts of data with high accuracy. Random Forests are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. Random Forests are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The basic principle is that a group of "weak learners" can come together to form a "strong learner". Random Forests are a useful tool for making predictions because, by the law of large numbers, they do not tend to overfit. Introducing the right kind of randomness makes them accurate classifiers and regressors. They are a simple tool to use without having a model, or to produce a reasonable model quickly and efficiently.

Random Forests are easy to learn and use for both professionals and lay people, with little research and programming required, and they may be used by people without a strong statistical background. Simply put, one can safely make more precise predictions without most of the basic mistakes common to other methods. The Random Forests algorithm was developed by Leo Breiman and Adele Cutler. Random Forests grow many classification trees.

Algorithm:

1. Once a node is split on the best eligible splitter, the process is repeated in its entirety on each child node.

2. A new list of eligible predictors is selected at random for each node.

3. With a large number of predictors, the eligible predictor set will be quite different from node to node, so the important variables eventually make it into the tree.

4. This explains in part why the trees must be grown out to their absolute maximum size.

5. The aim is for terminal nodes containing a single data record (see the sketch below).
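A brief sketch of the same idea using scikit-learn is shown below (the library choice is our assumption; the paper itself uses WEKA's Random Forest): many fully grown trees, each built on a bootstrap sample with a random subset of predictors considered at every node, with the final class decided by majority vote.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,        # number of classification trees grown
    max_features="sqrt",     # random predictor subset evaluated at each node (step 2)
    max_depth=None,          # grow trees out to maximum size (steps 4 and 5)
    bootstrap=True,          # each tree sees a bootstrap sample of the data
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]))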

RESULTS AND DISCUSSIONS

The effective use of data mining tools enables us to find the important parameters of a particular data set, here a small weather (game) data set. C4.5 is an algorithm for classification problems in machine learning and data mining. It is targeted at supervised learning: given an attribute-valued dataset where instances are described by collections of attributes and belong to one of a set of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes that can be applied to classify new, unseen instances.

For instance, in Table 1 the rows denote specific days, the attributes denote weather conditions on the given day and the class denotes whether the conditions are conducive to playing golf. Thus, each row denotes an instance, described by values for attributes such as Outlook (a ternary-valued random variable), Temperature (continuous-valued), Humidity (also continuous-valued) and Windy (binary), and by the Boolean PlayGolf? class variable. All the data are used for training and are supplied to the algorithms.

C4.5 has additional features such as tree pruning, improved use of continuous attributes, missing-value handling and rule-set induction. Such design decisions are faced by any tree-based classification approach, and similar or other reasonable choices are made by most tree induction algorithms. The practical utility of C4.5, however, comes from the set of features that build upon the basic tree induction algorithm.

Figure 1 shows the J48 pruned tree and Figure 2 shows the random tree; both trees are derived from the given weather data set. Let us observe why the first attribute chosen for the decision tree is the Outlook attribute. To find out, we first estimate the entropy of the class random variable (PlayGolf?). This variable takes two values with probabilities 9/14 (for "Yes") and 5/14 (for "No"). The entropy of a class random variable that takes on c values with probabilities p1, p2, ..., pc is given by,

$-\sum_{i=1}^{c} p_i \log_2 p_i$ (1)

The entropy of Play is thus

= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.409776 + 0.53051 = 0.940

On average, 0.940 bits must be transmitted to communicate information about the PlayGolf? random variable. The goal of C4.5 tree induction is to ask the right questions so that this entropy is reduced. We consider each attribute in turn to assess the improvement in entropy that it affords. For a given random variable, say Outlook, the improvement in entropy, denoted Gain(Outlook), is calculated as,

$\text{Entropy}(\text{PlayGolf? in } D) - \sum_{v} \frac{|D_v|}{|D|}\, \text{Entropy}(\text{PlayGolf? in } D_v)$ (2)

where v ranges over the set of possible values (in this case, the three values of Outlook), D denotes the entire dataset, Dv is the subset of the dataset for which attribute Outlook has value v, and | · | denotes the size of a dataset (in number of instances).

Suppose D is a set of 14 instances in which one of the attributes is Outlook; the value of Outlook can be Sunny, Overcast or Rainy.

Gain (D, Outlook) = Entropy (D) - (5/14) * Entropy (Dsunny) - (4/14) * Entropy (Dovercast) - (5/14) * Entropy (DRainy) (3)

The other attributes of D are Windy, Humidity and Temperature: the value of Windy can be classified as weak or strong, the value of Humidity as normal or high, and the value of Temperature as hot, mild or cold. For every attribute the gain is calculated, and the attribute with the maximum gain is used in the decision node. To find which attribute will be the root node of the decision tree, the gain is calculated based on equations 1, 2 and 3 for all four attributes.

Gain (D, Outlook) = 0.236

Gain (D, Windy) = 0.048

Gain (D, Humidity) = 0.015

Gain (D, Temperature) = 0.028

Here the Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node. Since Outlook has three possible values, the root node has three branches: sunny, overcast and rainy. Next, we determine which attribute should be tested at the sunny branch node.

Gain (Dsunny, Temperature) = 0.571

Gain (Dsunny, Windy) = 0.02

Gain (Dsunny, Humidity) = 0.971

Comparing the above three values, Humidity has the highest gain, so it is used as the decision node under the sunny branch. The next level of the tree is then formed based on the highest gain among the remaining attributes. This process continues until all the data are classified perfectly or the attributes are exhausted.
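The arithmetic of equations (1) and (2) can be reproduced with the small Python sketch below, reusing the hypothetical entropy and information_gain helpers from Section 3.1 on the nominal attributes of Table 1. Temperature and Humidity are numeric in Table 1, so J48 would evaluate threshold splits on them rather than the multiway split shown here, which is why only Outlook and Windy are recomputed; the Outlook gain computed directly from the class counts comes out near 0.247, slightly above the value quoted in the text.

labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
windy = ["F", "T", "F", "F", "F", "T", "T", "F", "F", "F", "T", "T", "F", "T"]

rows = [{"Outlook": o, "Windy": w} for o, w in zip(outlook, windy)]
print(round(entropy(labels), 3))                            # 0.94, as in equation (1)
print(round(information_gain(rows, labels, "Outlook"), 3))  # about 0.247
print(round(information_gain(rows, labels, "Windy"), 3))    # about 0.048, matching the text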

The author of [17] proposed, in a technical report, evaluating the results of machine learning experiments using Recall, Precision and F-Measure. In the medical sciences, Receiver Operating Characteristic (ROC) analysis has been borrowed from signal processing to become a standard for evaluation and standard setting, comparing the True Positive Rate and the False Positive Rate.

Accuracy:

Accuracy is the proportion of the total number of predictions that are correct.

Accuracy (A) = (TP + TN) / N (4)

where N is the total number of classified instances, True Positive (TP) is the number of correctly predicted positive instances and True Negative (TN) is the number of correctly predicted negative instances.

Precision:

Precision is used to measure exactness. It is the ratio of the predicted positive cases that were correct to the total number of predicted positive cases.

Precision (P) = TP / (TP + FP) (5)

where True Positive (TP) is the number of correctly predicted positive instances and False Positive (FP) is the number of negative instances wrongly predicted as positive.

Recall:

Recall measures completeness: it is the proportion of positive cases that were correctly identified out of the total number of positive cases. It is also known as sensitivity or the true positive rate (TPR).

Recall (R) = TP / (TP + FN) (6)

where True Positive (TP) is the number of correctly predicted positive instances and False Negative (FN) is the number of positive instances wrongly predicted as negative.

F-Measure:

The F-Measure is the harmonic mean of precision and recall. It is an important measure because it gives equal importance to precision and recall.

F-Measure = 2 * Precision * Recall / (Precision + Recall) (7)

ROC Area:

ROC is a comparison of two operating characteristics, the True Positive Rate (TPR) and the False Positive Rate (FPR); the resulting plot is known as the receiver operating characteristic curve. A receiver operating characteristic curve is a graphical representation that interprets the performance of a classification algorithm as its decision threshold is varied. It is obtained by plotting the true positive rate against the false positive rate at various threshold settings, and it displays the tradeoff between TPR and FPR: TPR is plotted along the y-axis and FPR along the x-axis, and the performance of each algorithm is represented as a point on the ROC curve. Recall (or sensitivity) estimates the probability that the prediction is 1 given all the samples whose true class label is 1, that is, how many of the positive samples have been identified as positive; analogously, the probability that the prediction is 0 given all the samples whose true class label is 0 measures how many of the negative samples have been recognized as negative. The rates are defined as:

TP Rate = TP / (TP + FN) * 100, FP Rate = FP / (FP + TN) * 100 (8)

PRC Area:

PRC stands for the precision-recall characteristic curve. It is a comparison of two operating characteristics (PPV and sensitivity) as the decision criterion changes. A PRC curve is a graphical representation that illustrates the performance of a binary classifier as its discrimination threshold is varied. PPV (Positive Predictive Value) is the fraction of true positives out of all positive test outcomes, while sensitivity is the fraction of true positives out of all actually positive cases.
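A minimal Python sketch of equations (4) to (8) computed from raw confusion-matrix counts is given below; the counts used in the usage example at the bottom are made up for illustration and are not taken from the paper's experiments.

def classification_metrics(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n                                   # equation (4)
    precision = tp / (tp + fp)                                 # equation (5)
    recall = tp / (tp + fn)                                    # equation (6), the TP rate
    f_measure = 2 * precision * recall / (precision + recall)  # equation (7)
    fp_rate = fp / (fp + tn)                                   # x-axis of the ROC curve
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "tp_rate": recall, "fp_rate": fp_rate}

# Hypothetical counts, for illustration only:
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))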

Data mining techniques automatically detect the relevant patterns or information in raw data using data mining algorithms. Among data mining algorithms, decision trees are the best and most commonly used approach for representing data, because the data can be presented in a highly visual form. Many different decision tree algorithms can be applied, and each algorithm produces a unique decision tree from the same input data. In this experiment the algorithms were applied to a single input weather dataset, and the Random Tree algorithm proved the most efficient at predicting accurate results compared with the other algorithms. The result obtained from the Random Tree (RT) is more accurate than those of J48, Decision Stump (DS), Logistic Model Tree (LMT), Hoeffding Tree (HT), Reduced Error Pruning (REP) and Random Forest (RF).

The numbers and percentages of correctly and incorrectly classified instances are given in Table 2. Out of the seven classification algorithms, the Random Tree algorithm outperforms the others, correctly classifying 85.714% of the 14 instances and incorrectly classifying 14.286%, as depicted in Figure 3. Table 3 shows the test accuracy based on the entropy information score for all seven algorithms. Four algorithms, namely J48, RT, DS and REP, produce positive values, while the remaining algorithms yield negative values. The Random Tree algorithm gives the best performance scores of 980.0194, 9.5155 and 0.6797, graphically represented in Figure 4.

Table 4 summarizes the test accuracy based on mean absolute error, root mean squared error, relative absolute error and root relative squared error, shown graphically in Figure 5; the Random Tree algorithm gives the best performance by producing the smallest errors. The weighted averages of accuracy by class based on True Positive rate, False Positive rate, Precision, Recall, F-Measure, ROC area and PRC area are calculated using equations 4, 5, 6, 7 and 8 respectively, and the values are given in Table 5. Figure 6 shows the graph of the weighted average accuracy for all seven algorithms.

Conclusion and Future Scope:

In this paper, we have conducted an experiment to analyze the accuracy of seven data mining classification algorithms for various decision tree approaches using the weather data set. The open source data mining tool WEKA was used for the experiment. Precision, Recall and F-Measure were calculated to find the accuracy of the classification algorithms, and the ROC and PRC areas were also calculated to improve the accuracy measurement. The experiment shows that the Random Tree algorithm performs best, with an accuracy of 85.714% on the weather data set. In future, the proposed method will be extended to other data sets from areas such as agriculture, medicine, banking and the stock market.

REFERENCES

[1.] Hssina, B., A. Merbouha, H. Ezzikouri, M. Erritali, 2014. A Comparative Study of Decision Tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, Special Issue on Advances in Vehicular Ad Hoc Networking and Applications, pp: 13-19.

[2.] Milovi, B., V. Radojevi, 2015. Application of Data Mining in Agriculture, Bulgarian Journal of Agricultural Science, 21(1): 26-34.

[3.] Patel Hetal., Patel Dharmendra, 2014. A Brief survey of Data Mining Techniques Applied to Agricultural Data, International Journal of Computer Applications, 95(9): 6-8.

[4.] Monique Pires Gravina de Oliveira, Felipe Ferreira Bocca, Luiz Henrique Antunes Rodrigues, 2017. From spreadsheets to sugar content modeling: A data mining approach, Science Direct, Computers and Electronics in Agriculture, 132: 14-20.

[5.] Naik Amrita., Samant Lilavati, 2016. Correlation Review of Classification Algorithm Using Data Mining Tool: WEKA, Rapidminer, Tanagra, Orange and Knime, Science Direct, Procedia Computer Science, 85: 662-668.

[6.] Vinciya, P., A. Valarmathi, 2016. Agriculture Analysis for Next Generation High Tech Farming in Data Mining, 6(5): 481-488.

[7.] Xindong Wu., Vipin Kumar, 2009. The top ten Algorithms in Data Mining, CRC Press Taylor & Francis Group.

[8.] Quinlan, J.R., 1986. Induction of Decision Trees, Machine Learning, 1(1): 81-106.

[9.] Kapoor Prerna., Rani Reena., 2015. Efficient Decision Tree Algorithm Using J48 and Reduced Error Pruning, International Journal of Engineering Research and General Science, 3(3): 1613-1621.

[10.] Quinlan, J.R., 1993. C4.5: Programs for Machine Learning, Morgan Kaufmann.

[11.] Tusharkumar Trambadiya, Praveen Bhanodia, 2012. A Comparative Study of Stream Data Mining Algorithms, International Journal of Engineering and Innovative Technology, 2(3): 49-154.

[12.] Mohammed Abdul Khaleel, Sateesh Kumar Pradham, G.N. Dash, 2013. A Survey of Data Mining Techniques on Medical Data for Finding Locally Frequent Diseases, International Journal of Advanced Research in Computer Science and Software Engineering, 3(8): 149-153.

[13.] Balazs Kegl., 2009. Introduction to AdaBoost, https://users.lal.in2p3.fr/kegl/teaching/stages/notes/tutorial

[14.] Landwehr, N., M. Hall, E. Frank, 2005. Logistic model trees, Machine Learning, 59(1-2): 161-205.

[15.] Bernhard Pfahringer, Geoffrey Holmes, Richard Kirkby., 2007. New Options for Hoeffding Trees, Springer-Verlag Berlin Heidelberg, pp: 90-99.

[16.] Haizan W. Nor, Mohamed W., Mohd Najib Mohd Salleh, Abdul Halim Omar, 2012. A Comparative Study of Reduced Error Pruning Method in Decision Tree Algorithms, IEEE International Conference on Control System, Computing and Engineering, pp: 23-25.

[17.] Powers, D.M.W., 2007. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation, School of Informatics and Engineering, Flinders University, Australia, Technical Report.

[18.] Sathyadevan Shiju Nair, R. Remya, 2014. Comparative Analysis of Decision Tree Algorithms: ID3, C4.5 and Random Forest, Computational Intelligence in Data Mining, 1: 549-562.

[19.] Elangovan, K., T. Sethukarasi, 2016. Knowledge Enrichment of prediction Using Machine Learning Algorithms for Data Mining and Big Data: a Survey. Advances in Natural and Applied Sciences, 10(15): 23-30.

[20.] Gokulakannan, E., K. Venkatachalapathy, 2015. Comparison Study of Severaldata Mining Algorithms to Predict Employee Stress Advances in Natural and Applied Sciences, 9(8): 7-14.

[21.] WEKA: http//www.cs.waikato.ac.nz/ml/weka.

[22.] Kohavi, Ronny, Quinlan J. Ross, 2002. Data Mining Tasks and Methods: Classification: Decision Tree Discovery, In Handbook of Data Mining and Knowledge Discovery, pp: 267-276.

[23.] Han, J., M. Kamber, 2004. Data Mining: Concept and Techniques, Morgan Kaufmann Publishers.

[24.] Senthilkumar, D., N. Sheelarani, S. Paulraj, 2015, Classification of Multi-dimensional Thyroid Dataset Using Data Mining Techniques: Comparison Study, Advances in Natural and Applied Sciences, 9(6): 2428.

[25.] Ogwueleka., Francisca Nonyelum, 2009. Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry, Advances in Natural and Applied Sciences, 3(1): 73-78.

(1) Rajesh P and (2) Karthikeyan M

(1) Assistant Professor/Programmer, Department of Computer and Information Science, Faculty of Science, Annamalai University, Annamalai Nagar--608 002, Tamil Nadu, India,

(2) Assistant Professor/Programmer, Department of Computer and Information Science, Faculty of Science, Annamalai University, Annamalai Nagar--608 002, Tamil Nadu, India,

Received 12 May 2017; Accepted 5 July 2017; Available online 28 July 2017

Address For Correspondence:

P. Rajesh, Assistant Professor/Programmer, Department of Computer and Information Science, Faculty of Science, Annamalai University, Annamalai Nagar--608 002, Tamil Nadu, India,

E-mail: rajeshdatamining@gmail.com

Caption: Fig. 1: J48 Pruned Tree

Caption: Fig. 2: Random Tree

Caption: Fig. 3: Correctly and Incorrectly Classified Instance

Caption: Fig. 4: Test accuracy based on Entropy Information Score

Caption: Fig. 5: Test accuracy based on mean absolute error.

Caption: Fig. 6: Weighted average of Accuracy
Table 1: Weather Data Set

Day   Outlook    Temperature   Humidity   Windy   Play Golf

1      sunny         85           85        F        No
2      sunny         80           90        T        No
3     overcast       83           78        F        Yes
4      rainy         70           96        F        Yes
5      rainy         68           80        F        Yes
6      rainy         65           70        T        No
7     overcast       64           65        T        Yes
8      sunny         72           95        F        No
9      sunny         69           70        F        Yes
10     rainy         75           80        F        Yes
11     sunny         75           70        T        Yes
12    overcast       72           90        T        Yes
13    overcast       81           75        F        Yes
14     rainy         71           80        T        No

Table 2: Correctly and Incorrectly Classified Instance

Sl. No.   Algorithms   Correctly    Incorrectly   Correctly    Incorrectly
                       Classified   Classified    Classified   Classified
                       (%)          (%)           (Nos.)       (Nos.)

1         J48          64.2857      35.7143        9            5
2         RT           85.7143      14.2857       12            2
3         DS           35.7143      64.2857        5            9
4         LMT          42.8571      57.1429        6            8
5         HT           57.1429      42.8571        8            6
6         REP          64.2857      35.7143        9            5
7         RF           64.2857      35.7143        9            5

Table 3: Test accuracy based on Entropy Information Score

Sl.   Algorithms   Relative      Information    Information
No.                Information   Score (bits)   Score
                   Score (%)                    (bits/instance)

1     J48          501.6781      4.8710         0.3479
2     RT           980.0194      9.5155         0.6797
3     DS           88.3530       0.8579         0.0613
4     LMT          -86.8869      -0.8436        -0.0603
5     HT           -74.2873      -0.7213        -0.0515
6     REP          13.2017       0.1282         0.0092
7     RF           -8.9462       -0.0869        -0.0062

Table 4: Test accuracy based on mean absolute error, root mean
squared error, relative absolute error and root relative squared
error.

Sl.   Algorithms   Mean       Root mean   Relative    Root relative
No.                absolute   squared     absolute    squared error
                   error      error       Error (%)   (%)

1     J48          0.2857     0.4818      60.0000     97.6586
2     RT           0.1429     0.3780      30.0000     76.6097
3     DS           0.4256     0.5127      89.3750     103.9181
4     LMT          0.4804     0.5917      100.8822    119.9233
5     HT           0.4929     0.5096      103.5107    103.2992
6     REP          0.4725     0.4958      99.2308     100.4896
7     RF           0.4725     0.5210      99.9192     105.6016

Table 5: Weighted average of Accuracy by Class based on TP, FP,
Precision, Recall, F-Measure, ROC Area and PRC Area

Sl.   Algorithms   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   PRC Area
No.

1     J48          0.643     0.465     0.629       0.643    0.632       0.789      0.808
2     RT           0.857     0.168     0.857       0.857    0.857       0.844      0.808
3     DS           0.357     0.713     0.381       0.357    0.367       0.456      0.642
4     LMT          0.429     0.673     0.429       0.429    0.429       0.489      0.659
5     HT           0.571     0.683     0.396       0.571    0.468       0.100      0.427
6     REP          0.643     0.643     0.413       0.643    0.503       0.178      0.470
7     RF           0.643     0.465     0.629       0.643    0.632       0.400      0.555