# Experimental comparisons of multi-class classifiers.

The multi-class classification algorithms are widely used by many areas such as machine learning and computer vision domains. Nowadays, many literatures described multi-class algorithms, however there are few literature that introduced them with thorough theoretical analysis and experimental comparisons. This paper discusses the principles, important parameters, application domain, runtime performance, accuracy, and etc. of twelve multi-class algorithms: decision tree, random forests, extremely randomized trees, multi-class adaboost classifier, stochastic gradient boosting, linear and nonlinear support vector machines, K nearest neighbors, multi-class logistic classifier, multi-layer perceptron, naive Bayesian classifier and conditional random fields classifier. The experiment tested on five data sets: SPECTF heart data set, Ionosphere radar data set, spam junk mail filter data set, optdigits handwriting data set and scene 15 image classification data set. Our major contribution is that we study the relationships between each classifier and impact of each parameters to classification results. The experiment shows that gradient boosted trees, nonlinear support vector machine, K nearest neighbor reach high performance under the circumstance of binary classification and minor data capacity; Under the condition of high dimension, multi-class and big data, however, gradient boosted trees, linear support vector machine, multi-class logistic classifier get good results. At last, the paper addresses the development and future of multi-class classifier algorithms.Keywords: multi-class classifier, tree classifier, boost classifier, support vector machines, combining boost classifier

Povzetek: V prispevku je podan pregled klasifikatorjev z vec razredi.

1 Introduction

Multi-classification problem [1] is that of supposing a set of training data([x.sub.1], [c.sub.1]), ..., ([x.sub.n], [c.sub.n]), where x [member of] [R.sup.p] are finite set of input features, [c.sub.i] [member of]{1, 2, ..., K) are class numbers of output variables. The purpose of multi-classification task is to find a classifying rules based on the training sample, then given a new features, outputting a classifying category. Today multi-classifier algorithms are applied to a variety of application areas such as: radar signal classification, character recognition, remote sensing, medical diagnostics, expert systems, voice recognition domains and etc.

Multiple classifiers has a long history. Selfridge et al. [2] first propose a multi-classification system based on 'winners get all' solution which chooses the optimal solution as a multi-classifier output. Kanal and Minsky [3, 4] play an important role in multi-classifiers development. They claim that any classifying algorithm does not solve all problems. We need to design specific classifying algorithm for different problems. Multilayer perceptron [5] is an artificial neural network model that can resolve this kind of nonlinear data. Decision tree is an ancient non-parametric classification algorithm that classifies the samples according to the classifying rules. Leo Breiman[6] proposes random forest as a good solution to the scalability issues of single decision tree. Adboost algorithm is proposed by Yoav Freund and Robert Schapire[34] is a meta-classification algorithm which can be combined with other classification algorithms to enhance its performance. Multi-class logistic classifier proposed by Jerome Friedman et al. [7] is another important improvement of enhancing the basic boosting algorithm. K nearest neighbor [5] classifies samples based on the adjacent spatial relationships of features. In 1980s with the rise of data fusion and learning model in statistics and management science, Bayesian expert [8-10] system is widely used. Since the 1990s, Vapnik proposes support vector machines, transforming the feature from low dimensional space into high dimensional space, which is a better ways to classify the features. Nonlinear support vector machine gets a great success, however it is not ideal in some cases i.e. the original features are already high dimensional space, so people propose a linear support vector machine[11] for these cases. Because of the complexity of the data, a single classifier is often difficult to obtain good performance for specific applications, it is a growing tendency to improve classifying performance by a combination of classification methods [12].

How to solve the multi-classification problems is challenging. There are two ways to deal with it [13]. Nilsson et al. [14] first use combination of binary classifiers to solve the multi-classification problems. The other way is that directly extends binary classifier into multiple classifier.

The main contributions of this paper are as following:

1. We compare twelve most commonly used algorithms for multi-classification in several aspects such as principle, important parameters, running time performance, and etc.

2. This paper considers the impact of differences of type and size in data sets. We chose binary classification and multi-classification, a small amount of data and a large size set as the evaluation data sets.

3. This paper gives in-depth analysis of multicategory classification for each class.

4. In this paper, we investigate the relationship between the single classifier and a combination of single classifier.

This paper discusses the 12 multiple classifiers: decision trees, random forests, extremely randomized trees, multi-class adaboost, stochastic gradient boosting, support vector machines (including linear and nonlinear), K nearest neighbors, multi-class logistic classifier, multilayer perceptron, naive Bayesian classifier and conditional random fields classifier.

The paper is organized as follows: section 1 is an introduction of studying content. In section 2 we discuss the related works. Overview of algorithms are discussed in section 3. Section 4 presents an experimental setup and parameters settings. In Section 5, we explain experimental results. Section 6 concludes the paper.

2 Related works

There are few comprehensive comparisons of multi-classification algorithms. King et al. [15] is the most comprehensive and earliest study of multi-classification algorithms including CART, the traditional algorithm C4.5, Naive Bayes, K nearest neighbor, neural networks, and etc. However after that a few emerging algorithms such as support vector machines, random forests have been widely used. Further, data sets they used are too small while comparing to current big data. Then again their evaluation criteria is simple. Bauer et al. [16] thoroughly compare voting classification algorithms including bagging, boosting and its improved versions, but it is only comparing these two types of voting algorithm. LeCun et al. [17] use accuracy, rejecting rates and running time as the evaluation criteria. They compare algorithms: K-nearest neighbor, linear classifier, the main ingredient and polynomial classifiers on handwritten recognition data set. But only one data set used, it is not sufficient to evaluate different application scenarios. Ding et al. [18] use neural networks and support vector machines for protein test data set. They do detailed comparison of the accuracy of different parameters, but this comparison is relatively simple and data set is small. Tao et al. [19] study the decision trees, support vector machines, K nearest neighbor classification of gene sequences of organization application. However, this comparison only discusses single dataset. Foody et al [20] study multi-class image classification by support vector machines. But they only compare support vector machines, even linear support vector machine is not involved. Chih-Wei Hsu et al. [21] also study multi-class support vector machines for a more in-depth theoretical analysis and comparison, but is limited to multi-class support vector machine. Caruana et al. [22] study supervised learning algorithm (support vector machines, neural networks, decision trees, and etc.) in 9 different criteria such as ROC area, F evaluation and etc., but the literature is only discussed two classification data sets. Krusienski et al. [23] compare the Pearson correlation method, Fisher linear discriminant, stepwise linear discriminant, linear support vector machines, and Gaussian kernel support vector machines on P300 Speller data set. However, the data set is relatively simple, and small size of data is often difficult to compare running performance of support vector machine between linear and non-linear scenarios.

Table 1 lists the current situation of classifier comparisons.

Table 1 shows that the majority of current surveys focus on multi-class classifiers in a particular field, such as medicine, biology, remote sensing images, and etc. The data sets and methods used for evaluation is relatively simple. King et al's evaluation [15] is more comprehensive, however the comprising algorithms are classical. After that new algorithms emerge. Our comparing algorithms are the newest and most representative of the current tendency in a variety of application domains.

3 Overview of Algorithms

3.1 Brief introduction of algorithms

In order to better understand various classifiers to compare, we briefly introduce multiple classifier algorithms.

3.1.1 Decision tree

In machine learning, decision trees [24] is a predictive model. A decision tree is a flowchart-like structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label, and then decision is taken after computing all attributes. A path from root to leaf represents classification rules. Decision tree is actually an adaptive basis function model [25], and can be expressed as follows

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (1)

Where [R.sub.m] is the m'th region, [w.sub.m] is the mean response in this region, and v., encodes the choice of variable to split on, and the threshold value, on the path from the root to the m'th leaf.

Classification tree is an ancient method, it has various variants, typically such as[26] ID3 (Iterative Dichotomiser 3) proposed by Ross Quinlan, is a greedy approach that in each iteration choose the best attribute value to split the data, but this method has the problem of local optimum. C4.5 also proposed by Ross Quin lan, is an improved ID3, can be used for classification. CART(classification and regression Trees)[27] proposed by Breiman has same process with C4.5 algorithm. except that the C4.5 uses an information entropy rather than Geni coefficient used by CART.

3.1.2 Random forests

Random forests are proposed by Leo Breiman and Adele Cutler[28], It is an ensemble learning method for classification (and regression) that is built by constructing a multitude of decision trees at training time and outputting the class by voting of individual trees.

Random forests are a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging [29],

Random forests have the advantages of computing efficiency, improving the prediction accuracy without significantly increase of computational cost. Random forest can be well predicted up to thousands of explanatory variables [30], known as one of the best current algorithms [31].

3.1.3 Extremely randomized trees

Extremely randomized trees have been introduced by Pierre Geurts, Damien Ernst and Louis Wehenkel[32], The algorithm of growing extremely randomized trees is similar to random forest, but there are two differences:

1. Extremely randomized trees don't apply the bagging procedure to construct a set of the training samples for each tree. The same input training set is used to train all trees.

2. Extremely randomized trees pick a node split very extremely (both a variable index and variable splitting value are chosen randomly), whereas random forests find the best split (optimal one by variable index and variable splitting value) among random subset of variables.

3.1.4 Multi-class adaboost classifier

Boosting has been a very successful technique for solving the two-class classification problem [33], In going from two class to multi-class classification, most boosting algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems [34]. 3.1.5 Stochastic gradient boosting

Gradient boosting proposed by Friedman [35] is a method to improve basic boosting algorithm. The traditional boosting method is adjusted weights to correct classification samples and error samples based on gradient descend at each iteration. The major difference of gradient boosting from the traditional boosting method is that purpose at each iteration is not to reduce the losses, but in order to eliminate the loss. The new model at each iteration is based on the residuals of former process. Inspired by Breiman[6]'s randomized bagging idea, Friedman introduces stochastic gradient boosting by randomized down-sampling to train basic classifier.

3.1.6 Support vector machines

Support vector machines [36, 37] is the method that mapped feature vector into a higher dimensional vector space, where a maximum margin hyper-plane is established in this space. So we choose the hyper-plane so that the distance from it to the nearest data point on each side is maximized. The greater the distance between the nearest data of different classes is, the smaller the total error classification is. The multi-class support vector machines [21] can be defined as follows

Supposing l groups sample: ([x.sub.1], [c.sub.1]) ..., ([x.sub.1], [c.sub.1]), where [x.sub.i] [member of] [R.sup.n], i = 1, ..., l, [c.sub.1] [member of]{l, ..., k} is the type of [x.sub.i]. The i - th support vector machine solve this problem.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

[[xi].sup.i.sub.j] [greater than or equal to] 0, j = 1, ..., l (4)

Where training sample [x.sub.i] is mapping into high dimensional space by function [phi] and regularization parameters c. Minimizing (i/2)[([w.sup.i]).sup.T] [w.sup.i] means that we have to minimize 2/[parallel][w.sup.i][parallel], the margins between two group data. The penalty C[t.summation over (j=1)][[xi].sup.i.sub.j][([w.sup.i]).sup.T] is to reduce the number of training error. The core concept of support vector machine is to seek a balance between the regularization term (1/2)[([w.sup.i]).sup.T][w.sup.i] and the training errors.

We get k decision functions after solving formula (4).

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (5)

After that, we have the largest value of decision functions as the predictive class of x.

3.1.7 Linear support vector machines

SVM uses a nonlinear mapping that converts low-dimensional feature space into a high-dimensional space to get better discriminative. However, in other applications, the input feature itself is a high-dimensional space, if more is mapped to more high latitudes, it may not be able to get better performance. Their own space can be directly used as identification. The linear support vector machine SVM [38] is suitable for this scenario. For multi-classification, Crammer et al. [39] propose this method to solve the problem. We define the original question as follow

Supposing the training data set T = j([x.sub.1], [c.sub.1]), ([x.sub.2], [c.sub.2]), ..., ([x.sub.N], [c.sub.N])}, where, [x.sub.i] [member of] [chi] [subset equal to] [R.sup.n] is feature vector, [y.sub.i] [member of] Y = (1, 2, ..., k) is the type of instance, i = 1, 2, ..., l. The multi-class problem can be formulated as the following primal problem.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (6)

where, C > 0 is the regularization parameter, [w.sub.m] is the weight vector associated with class m, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] if [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Note that in (7), the constant m = [c.sub.i] corresponds to the non-negativity constraint: [[xi].sub.i] [greater than or equal to] 0. The decision function is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (7)

3.1.8 K nearest neighbors

K nearest neighbours algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour.

3.1.9 Multi-class logistic classifier (maximum entropy classifier)

In some cases, multi-class logistic regression well fits features. It has the formula [25]

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (8)

Where, p(y=c|x, W) is the predictive probability, y is the class type of totally c. [w.sub.c] is the weight of class c, and approximated by maximum posterior probability. With this, the log-likelihood has the form

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (9)

Where, [[mu].sub.ic] = p([y.sub.i] = c|[x.sub.i], W). This model can be optimized by L-BFGS[41],

3.1.10 Multilayer perceptron

A multilayer perceptron (MLP) is a feed forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. A MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function.

MLP[57] has evolved over the years as a very powerful technique for solving a wide variety of problems. Much progress has been made in improving performance and in understanding how these neural networks operate. However, the need for additional improvements in training these networks still exists since the training process is very chaotic in nature.

3.1.11 Naive Bayesian classifier

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model"

Assuming sample X, belongs to type [C.sub.i]. The class-conditional density is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (10)

Where, f(*|[C.sub.j]) is the maximum likelihood. Input features is x, c is class type.

A simpler alternative to generative and discriminative learning is to dispense with probabilities altogether, and to learn a function, called a discriminant function, that directly maps inputs to outputs. The decision function of naive Bayesian classifier is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (11)

3.1.12 Conditional random fields classifier

CRFs (Conditional random fields) are a class of statistical modelling method often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples.

We define a CRF on observations x and random variables Y as follows:

Let G = (V, E) be a graph such that Y = [([Y.sub.v]).sub.v[member of]V], so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when the random variables Yu, conditioned on X, obey the Markov property with respect to the graph:

p([Y.sub.v] |X, [Y.sub.w], m [not equal to] v) = p([Y.sub.v]| X, [Y.sub.w], w - v) (12)

Where w [??] v means that w and v are neighbors in G. What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets X and Y, the observed and output variables, respectively; the conditional distribution P(y|x) is then modeled. For classification problem, we compute the maximum conditional probabilistic distribution.

3.2 Comparison of algorithms

With reference to [45] for the multi-classification, we made a comparison of 12 algorithms as shown in table 2.

We analyze and summarize in Table 2 as follow:

1. Algorithms type

Aside from Naive Bayes, others are discriminant model. Bayesian algorithm by learning the joint distribution P(C,X), then obtains the conditional probability P(X \o. The classifying prediction is achieved by maximizing likelihood approximation. However the discriminant method directly makes prediction by the discriminant function or conditional probability.

2. Algorithms trait

Decision trees, random forest, extreme random tree, multi-class adaboost upgrade and stochastic gradient boosting all belong to model with adaptive basis functions that can be grouped into common additive model[25], as shown in Equation 13.

f(x) = [alpha] + [f.sub.1]([x.sub.1]) + ... + [f.sub.D]([x.sub.D]) (13)

Where [f.sub.i] is the sub-model obtained through training sample. fix) is the superposition of sub-models. Decision tree is the basic sub-model for tree-like algorithms. Upon whether the use of all samples for training, these kinds of algorithms can be divided into random forests and extremely random tree. Random forests is to build submodel through random bagging sampling. However, extremely random tree obtains sub-model using all training samples, but randomly selecting the splitting threshold. Multi-class adaboost and stochastic gradient boosting is a linear combination of sub-models (weak classifiers). The difference lies in their learning algorithms.

Multilayer Perceptron, linear and non-linear support vector machines can be classified as kernel methods [46]. The unified formula has the form

f(x) = w x [PHI](x) + b (14)

where w is real weight, b [member of] R is bias. [PHI](x) function is the type of the classifiers, for MPL [PHI](*) = ([[PHI].sub.1], [[PHI].sub.1](*), ..., [[PHI].sub.N](*)), the i-ih hidden is defined as [[PHI].sub.1](x) = h([v.sub.i]x + [d.sub.i]) x h is the mapping function, generally a hyperbolic function or shape function B is chosen by MPL. For linear support vector machine <t>(.v) is linear function, rather than polynomial function, Gaussian kernel function, and etc. for non-linear support vector machine.

K nearest neighbor model is constructed according to the division of distance relationship of the training feature space, being classified by a majority voting.

Multi-class logistic regression (maximum entropy) is the probabilistic choice model with constraints that uncertain contents are treated with equal probability of using entropy maximization to represent.

Naive Bayes classifier is based on the conditional independence assumption of training samples, learning parameters with the joint probability distribution though Bayes' theorem, then classifying a given sample, by maximizing the posterior probability to get corresponding class.

3. Learning policy, loss and algorithms

Decision trees, random forests, extremely randomized trees belong to the maximum likelihood approximation of learning strategies, with the loss of log-likelihood function.

The decision tree's optimal strategy is to learn some features through a recursive process and split the tree in accordance with the feature of the training samples. In order to have a better generalization ability, the decision tree model has to be pruning for removal of over-fitting.

Random forests is based on random sampling based on the approach (bagging) to form more stars forest trees. Extremely randomized trees is randomly selecting the splitting value to build decision trees forests.

Multilayer Perceptron, linear and non-linear support vector machines are to separate hyper-plane. The difference is that multilayer perceptron is to minimize the error hyper-plane, however linear and non-linear support vector machine is a minimal loss of hinge page. Perceptron learning algorithm is stochastic gradient descent, linear support vector machine is a sequential dual method, and non-linear support vector machine is a sequential minimal optimization method.

K-nearest neighbor is based on distance of feature space.

Multi-class logistic classifier (maximum entropy) learning strategies can be seen as either maximum likelihood estimation, or a logical loss minimization. Loss of function is a linear logarithmic function. The model learning is the maximum likelihood estimation or regularized maximum likelihood estimation under certain training conditions. Optimization algorithm is L-BGFS.

Naive Bayesian probability is a typical generative model. The maximum likelihood estimation and Bayesian posterior probabilities is calculated based on the joint probability and conditional independence assumption.

4. Classification strategy

Decision tree uses the IF-THEN rules for classification decisions based on the value of the model learned. Random forests and extremely random tree is based on the voting of every single decision tree classification, then taking the highest vote as the final results.

Multilayer Perceptron, multi-class logistic regression (maximum entropy), linear and non-linear support vector machine have the same form of classification decisions

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (15)

The difference lies in the choice of [phi](x). Multilayer perceptron machine chooses B-shaped function, hyperbolic tangent function, and etc.; Log-linear functions is chosen for multi-class logistic regression; linear support vector machine choses a linear function; nonlinear support vector machine's choice is non-linear kernel function.

K nearest neighbor is a majority voting that output classification is determined by choosing K nearest voting in light of test sample's distance to the learned model.

Naive Bayesian decision strategy is the rule of maximizing the posterior probability.

4 Experimental Setup

In order to evaluate the performance of various types of classifiers, we implemented our comparisons based on Darwin [47], Opencv[56], Libsvm [48] and liblinear [11] in C++. This paper compared 12 kinds of algorithms.

4.1 Performance comparisons

Confusion matrix (Confusion Matrices) [49, 50] is a common performance evaluation method in pattern recognition classification, which characterizes the relationship between the type of real classes and the recognition classes. For multi-classification problem (For simplifying the representation, we take three categories as example) is illustrated in Table 2.

Where A, B and C are three classes, AA, BB and CC represent the correct prediction number of samples, the remaining number of samples is representative of the error prediction. AA represents the number of samples correctly identified as samples A. AB is predictive number that original Sample A which is incorrectly predicted as Sample B. The remaining items have the same meaning.

Total accuracy rate can be calculated based on confusion matrix as follows.

TA = (AA + BB + CC)/(AA + AB + ... + CC) (16)

Where AA + AB + ... + CC is the total number of sample. TA is the total accuracy.

Precision and recall are quantitative evaluation method. They are not only used to evaluate the accuracy of each class, but also the important standard to measure the performance of the classification system.

Precision is the fraction of retrieved instances that are relevant. Precision reflects the classification accuracy. In practical applications, the average precision are often used to evaluate multi-classification (taking categories as example), which is calculated as follows

avg precision = (AA/(AA + BA + C4) + BB/(BA + BB + BQ + CC/(AC + BC + CC)) (17)

Recall is the fraction of relevant instances that are retrieved. Recall reflects the classification comprehensiveness. In practical applications, the average recall are often used to evaluate multi-classification (taking categories as example), which is calculated as follows

avgrecall=(AA/(AA + AB + AC) + BB/(BA + BB + BC) + CC/(CA + CB + CC)) (18)

[F.sub.1] Measure is an integrated measurement method of the recall and precision. Higher values reflect the recall and precision better integrated. [F.sub.1] is defined as

[F.sub.1] = 2* [avgprecision * avgrecall/avgprecision + avgrecall (19)

Jaccard Coefficient [51] is used for comparing the similarity and diversity of sample sets. Taking class A as example, the formula is

[JC.sub.A] = AA/AA + AB + AC + BA + CA (20)

Average Jaccard coefficient reflects the average of the various categories Jaccard coefficient, which is calculated as

avgJ[C.sub.A] = ([AA/AA + AB + AC + BA + CA] + BB/AB + BB + CB + BA + BC] + [CC/CA + CB + CC + AC + BC])/3 (21)

Jaccard coefficient predicts more accurately reflects higher. Jaccard coefficient can evaluate multi-class classification in each class.

4.2 Data sets and data transformation

We evaluate the performance of these algorithms on five data sets that consists of three binary classification data sets and two multi-classification data sets.

4.2.1 SPECTE Heart data set

SPECTF[52] means single-photon emission computed tomography cardiac data sets. SPECTF is a new data set[53]. In cleaning process, the records with missing information, incomplete picture records are filtered, since the original image scale is not uniform, so the original image is transformed into gray image (0 to 255).

After cleaning, the data set contains 44 features that record 22 region of interest for the cardiac systolic and expanded state. The data type is an integer type (rang from 0 to 100). The data set has total 267 instances of which containing 80 training samples and 187 test samples. Each instance has two states: normal and abnormal. In experiment, we converted class label into class 0 (normal) and class 1 (abnormal). Histogram distribution of training samples showed normal shape is shown in Figure 1 (a).

4.2.2 Ionosphere data set

Ionosphere data set[52] includes is a set of radar data sets created by a military system acquisition in Labrador NATO airbase. Goose Bay, Canada.

The data set has total of 34 feature (integer, decimal type) that record 17 sub-pulses labels and values. The data set consists of 351 instances of which 100 instances are training samples, 251 instances are test samples. Each instance has two states: good and bad. In the experiment we converted class label into class 0 that represents the existence of the facts) and class 1 the non-existent facts. Feature type of the dataset is real type, range from -1 to a. Histogram of the distribution of the training samples is shown in Figure 1 (b).

4.2.3 Spam data set

Spam[52] is a spam filtering data set that has total of 57 data features. 48 consecutive real number is used to represent the percentage of messages of 48 words (i.e. make, address, all, conference) and 6 word occurrences (!, (, [,, $, #) in email.

The data set has a total of 4601 instances, where the training instances are 3065, testing instances are 1,536. Each instance has two states: 0 for not spam and 1 for junk mail. The data set is real type feature information, the range is 0 to 9090. The training sample distribution histogram is scattered, as shown in Figure 1 (c).

4.2.4 Optidigits data set

Optdigits[52] is the information collection extracted from 43 individuals (30 of them for training, 13 as a testing) by the handwritten Optical Character Recognition bit image Standardization American National Standards Institute of Technology (NIST).

The data set has a total of 5620 instances, where the training instances are 3823, test instances are 1797. The instances have ten classes, from 0 to 9 digits. Feature data type is a positive integer, ranging from 0 to 16. Histogram distribution of training samples is shown in Figure 1 (d).

4.2.5 Scene 15 data set

Scene 15[54] data set is a post-processing data set. The original data set has total of 15 classes, 4485 images. We randomly selected sample of 2250 as training samples, 2235 samples for testing.

We used the method of PHOW (Pyramid Histogram Of visual Word) [54] for feature extraction. 12000 descriptors were obtained for each image. In order to obtain a better classification within the features, we used method proposed by vedaldi[55] for features kernel transformation, finally we got 36,000 dimensional feature descriptors to characterize a single image. In the experiment we converted 15 classes into class 0 to 14, representing the bedroom, CALsuburb, industrial, kitchen, livingroom, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice and store respectively. Feature data type is a positive real number, the range is -0.3485 to 0.4851, further feature data is relatively concentrated, similar to normal training. Histogram distribution of training samples is shown in Figure 1 (e).

This data set has large dimensions of features and a large size of data. Selection of this data set is mainly to test algorithms adaptability for real application scenarios.

In Figure 1, we demonstrated our diversity of data sets. SPECIF, ionoispher and spam are small scale data sets. They are for binary classification. Optigits and Scene 15 are large scale data sets. They are for multiclass classification. At the same time, the range of our feature vector value are large.

The range of Figure 1 feature values is also huge. The range of SPECIF is from 0 to 100, ionoispher from 1 to 1, and spam from 0 to 1000, Optigits from 0 to 20 and Scene 15 from -0.5 to 0.5. The diversity for our comparing experiments are necessary. This shows that our experiments are rich and valuable.

4.2.6 Dataset selection

We select the data set is based on the following considerations.

1. Both binary classification and multi-classification data sets were taken into account for the purpose that some algorithm are superior to binary classification, while others are more suitable for multi-classification problems.

2. For better evaluation of the adaptability of the algorithms, we chosen both low dimensional data set with small size and high dimensional data set with large size.

3. We also paid attention to the data type differences, both pure integer and double data type were included. There were also real data type features, which had either wholly positive or negative features.

4. It can be seen in Figure 1 that the histogram of the distribution of the data we have chosen the difference is quite different. Some are concentrated, just like normal distribution, while others are relatively sparse.

4.3 Parameter setting and selection

4.3.1 Decision tree

1. The maximum depth of tree: The default value was set to 1. With the value increase, classification accuracy and running time will increase. We set maximum depth to 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20 respectively. We found the highest accuracy rate when the maximum depth is 10.

2. Splitting Criteria Consideration: We tested the wrong classification rule, entropy rule, and Geni coefficient rule respectively. We found that Geni coefficient got the best of all, and mistake classification rule got the worst of all.

3. The first split minimum number of samples were tested with value of 0, 2, 4, 6, 8 and 10 respectively. We found the minimum number of samples was increased, but performance degraded.

4. The maximum number of features of the training sample, the default was set to 1000. 4.3.2 Random forests

1. A maximum depth of random forests, as the same of decision tree, the optimal value was set to 10.

2. The maximum number of training samples feature, the default was set to 1000, the value should change with the number of sample features change, as long as the sample itself to adapt to the largest number of features.

3. The number of decision trees: we tested the value of 20, 40, 60, 80, 100, 120, 140, 160, 180 and 200 respectively. We found that the number increases, the execution time also increases, and after over the value of 100, the improvement of the accuracy rate is not obvious. So we set it to 100.

4. The accuracy of the random forest was used to control the iteration. We tested the value of 0.0001, 0.001, 0.01 and 0.1 respectively. We found that the smaller the value was, the longer the execution time was, and after over the value of 0.001, the accuracy had no substantially change. We set it to 0.001.

4.3.3 Extremely randomized trees

Parameters were consistent with the random forest.

4.3.4 Multi-class adaboost classifier

1. For boosting methods, we tested discrete boosting, gentle boosting and real boosting respectively, found that gentle boosting is the best of all.

2. Tree depth was initialized with value of 10.

3. Shrinkage factor: we set it with value of 0.1, 0.2, 0.1 and 0.4 respectively, and found that 0.1 was the best of all.

4. Maximum boosting number: we set it to the value of 10, 50, 100, 200 and 400 respectively. We found that after over the value of 100, the precision had no significant increase. So we set it to value of 100.

4.3.5 Stochastic gradient boosting

1. Maximum depth of the tree, same as in decision tree, was set it to 10.

2. Loss function type for classification problems was generally selected for cross-entropy loss.

3. Shrinkage factor was set to 0.1.

4. Subsampling percentage was used to control the sampling percentage for every single tree in the training process, we set it to 0.8.

5. Maximum lift was set to 100.

4.3.6 Support vector machines

1. Type: we set it to c - Support Vector Classification and v - support vector classification, it was found that v support vector classification is better than c - support.

Penalty factor for c - support vector classification was set to 1.

Penalty factor for v -support vector classification, value range was (0,n, we set it to 0.5, p was set to 0.1. 2, Kernel types are linear, radial basis, sigmoid-type function, POLY.

From table 4, we tested non-linear support vector machines total accuracy on four data sets. We selected v - Support Vector POLY for classification as a non-linear support vector machines kernel type.

4.3.7 Linear support vector machines

From table 5, we can see that L2R_L2LOSS_SVC for loss function got good accuracy, so we chose L2R_L2LOSS_SVC loss function as loss type of linear support vector machines.

4.3.8 K nearest neighbors

Nearest neighbor K was set to the value of 1, 3, 5, 7, 9, 11, 13 and 15 respectively. The accuracy was found to reduce with increasing K value, so we set it to 1.

4.3.9 Multi-class logistic classifier

According to our single test with variable regularization factor values of 1.0e-12, 1.0e-ll, 1.0e-10, 1.0e-9, 1.0e-8, 1.0e-7 and 1.0. we found that the best results is the regularization factor with value of 1.0e-9. So in our comparing experiments, the regularization factor (REG_STRENGTH) was set to 1.0e-9.

Similarly, we did same experiment, most optimum results were found with the iteration range from 800 to 980. So in our comparing experiments, maximum number of iterations was set to 1000 for avoidance of losing optimum values.

4.3.10 Multilayer perceptron

1. The propagation algorithm was backward and forward propagation respectively. Apparently backward propagation algorithm achieved significantly higher accuracy.

2. Gradient weight was typically set to 0.1.

3. The momentum, front weights reflecting differences in two iterations, was typically set to 0.1.

4.3.11 Naive Bayesian Classifier

No parameters needed to be set.

4.3.12 Conditional Random Field Classifier

We only need set the classification type in this experiments.

5 Experimental results and analysis

Firstly, we need define descriptions for the symbol in tables: DecisionTree represents decision tree classifier. RandomForest represents the random forests. ExtraTrees is extremely random tree. BoostedClassifier represents the multi-class adaboost classifier. GradientBoostTree represents stochastic gradient boosting, libSVM is support vector machines. libLinear represents linear support vector machine. Knearest represents K nearest neighbor classifier, MultiClassLogistic is multi-class logistic classifier. MultiLayerPerceptron represents multilayer perceptron.NormalBayesianNet represents the naive Bayesian classifier. CRF represents Conditional Random Fields classifier

5.1 Overall accuracy on five data sets

From table 6, on SPECTF data set, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value, following by nonlinear support vector machine, random forest, adaboost classifiers and CRF have achieved relatively good performance.

From table 7, on Ionoshere datasets, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value, following by adaboost classifier, extremely randomized trees, CRF and linear support vector machines.

From table 8, on the spam dataset stochastic gradient boosting accuracy achieved the highest overall average precision, recall and Jaccard value, following by multilass adaboost classifier, random forests. For binary classification problem but, under the low-dimensional data set, the stochastic gradient boosting and ensemble classifiers such as multi-class adaboost showed a high performance.

From table 9, on optdigits data set, for the multi-classification problem, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value. K-nearest neighbor, non-linear support vector machine, random forests, extremely randomized trees and CRF also achieved good overall performance. Linear support vector machines, many types of logic, multi-layer perceptron is manifested in ordinal performance, and decision trees and Naive Bayes classifier achieved the worst performance of all.

From table 10, due to the high dimensional feature data of Scene 15 data set, multi-layer perceptron, naive Bayesian classifier and nonlinear support vector machine were failed. This mainly caused by the intrinsic shortcomes of nonlinear SVM, Bayesian networks and multi-layer perceptron. They have too many inner loops and intermediate phased which need computation and storage. This situation is worse when feature data is high dimensional. Another reason is that our testing environment is limited. If we have enough memory and strong CPU capability, I think this phenomenon will disappear. In another view, this also reveals that they have tough condition for real application.

However, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value, following by linear support vector machines, multi-class logistic regression. Other algorithms got poor performance.

5.2 Inter class accuracy and jaccard coefficient evaluation on five data sets

The statistical results above reflected the overall performance of the algorithms. However, inter classes [F.sub.1] and Jaccard could reflect more detailed information. Accuracy of each class may vary greatly due to differences of the data. Overall each accuracy was determined by average of sum each class accuracy.

From table 11. on the data set SPECTF, class 0 represents normal, class 1 represents abnormal. Stochastic gradient boosting was fully recognized, so it had the highest [F.sub.1] and Jaccard coefficients in each sub class (both class 0 and class 1). Remains of algorithms' [F.sub.1] and Jaccard coefficients were not high in class 0, however there were high accuracy in class 1. This indicated that the overall accuracy was boosted by the accuracy of the class 1. Distribution trends of Jaccard coefficient was in accordance with that of [F.sub.1]. This meant that the higher [F.sub.1] was, the higher Jaccard coefficient was in class 1. Further, support vector machines, random forests and CRF also had high [F.sub.1] and Jaccard coefficient in class 1.

From table 12, on data set Ionosphere, class 0 represents the presence of the fact, class 1 represents no presence of the fact. Performance was higher in class 0 than that in class 1. Stochastic gradient boosting to achieve the highest value. Random forests, CRF and multi-class adaboost classifier also got good performance.

From table 14, on optdigits dataset, class 0 to class 9 represent ten digit of 0 to 9. All algorithms had relatively equal performance in each class. Stochastic gradient boosting, the non-linear support vector machines, random forests, Multilayer Perceptron, linear multi-class support vector machines and multi-class logistic classifier had high performance in each class.

From table 15 and table 16, on Scene15, class 0 to class 14 represented fifteen classes. Multilayer perception and non-support vector machines were failed because of computation cost, and naive Bayesian classifier was failed due to the huge storage. Stochastic gradient boosting, linear support vector machines achieved good performance, following by multi-class logistic classifier.

5.3 Running time performance on five data sets

From table 17. the running time of the 11 kinds by an algorithm on the five data sets can be seen that:

1. On a small data sets (SPECTE and Ionosphere), running time of multilayer perceptron was significantly slower than that of other algorithms, while other algorithms' running time were almost same. Linear support vector machines' (based on liblinear) running time was inversely lower than that of nonlinear support vector machines based on libsvm.

2. On the large data sets (spam, optidigits, scene 15), the differences of running time were significant. It is clear that the linear support vector machines were significantly faster than the non-linear support vector machines.

3. For tree classifiers, decision tree was the fastest of all, following by random forests. The slowest was extremely randomized trees.

4. For boosting methods, stochastic gradient boosting was slower than the multiclass adaboost.

5. Due to the large dimensionality of data, non-linear support vector machines, and Bayesian multi-layer perceptron did not succeed.

6. CRF running time is better than ExtraTrees, but slower than RandomForest.

7. In short, for running time efficiency, naive Bayes classifiers, K nearest neighbor and decision tree were basically fast, following by random forests, multi-class logic and linear support vector machines. The stochastic gradient boosting was the slowest of all.

6 Conclusion and future work

This article compares 12 kinds of commonly used multi-classification algorithm. In the experiments, we found that:

1. The same algorithm on different data sets showed different performance. It was the key to choose a more adaptive algorithm based on the data set.

2. Stochastic gradient boosting achieved the best classification accuracy in all test data sets, but its running time was slower than other algorithms except the multilayer perceptron.

3. The composite classifiers performed well than single classifier. For example, stochastic gradient boosting, random forest, extremely randomized trees were all better than the basic decision tree. However at the same time, the more complex combination model was, the longer running time was.

4. Linear support vector machines achieved good results of both accuracy and total execution on large data sets while compared with the nonlinear support vector machine.

There are still some deficiencies in our comparative study, further research is need:

1. We compare only the basic algorithm of 12 kinds of algorithms, every algorithm has its variants, which are better than the original algorithms.

2. How to choose the optimal parameter settings for algorithm is critical for its performance. There are still more works need to be done.

3. When we deal with large data sets, how much sample should choose for training? How to find the best balance between training time and accuracy is worthy of further exploration.

4. The combination of classifiers can often lead to higher accuracy, but as mentioned above, model training time will significantly increase. Stochastic gradient boosting were obtained good accuracy in our tests on five data sets, however the running time is longer. Actually stochastic gradient boosting have many sub routine which has sub-iteration, so this will elapse many running time. Because the main routine are highly correlate the sub-iteration, so it cannot directly parallel the sub-iteration.

How to improve the running time performance with a little bit of decrease in accuracy is a meaningful research, in other words, we need to find a balance between accuracy and running time performance.

Acknowledgement

This work was supported in part by Scientific Research

Fund of Sichuan Provincial Education Department of China (No. 15TD0038) and Fundamental Research Funds for the Central University.

References

[1] ZHU J, ROSSET S, ZOU H, et al. Multi-class adaboost [J]. Ann Arbor, 2006, 1001(48109): 1612.

[2] SEFFRIDGE O G. Pandemonium: a paradigm for learning in mechanization of thought processes [J]. 1958.

[3] KANAL L. Patterns in pattern recognition: 1968-1974 [J]. Information Theory, IEEE Transactions on, 1974, 20(6): 697-722.

[4] MINSKY M L. Logical versus analogical or symbolic versus connectionist or neat versus scruffy [J]. AI magazine, 1991, 12(2): 34.

[5] ROSENBLATT F: DTIC Document, 1961.

[6] BREIMAN L. Random forests [J]. Machine learning, 2001, 45(1): 5-32.

[7] FRIEDMAN J, HASTIE T, TIBSHIRANI R. Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors) [J]. The annals of statistics, 2000, 28(2): 337-407.

18] FRENCH S. Group consensus probability distributions: A critical survey [J]. Bayesian statistics, 1985, 2(183-202.

[9] BENEDIKTSSON J A, SWAIN P H. Consensus theoretic classification methods [J]. Systems, Man and Cybernetics, IEEE Transactions on, 1992, 22(4): 688-704.

[10] BERNARDO J M, SMITH A F. Bayesian theory [J]. Measurement Science and Technology, 2001, 12(2): 221.

[11] FAN R E, CHANG K W, HSIEH C J, et al. LIBLINEAR: A library for large linear classification [J]. The Journal of Machine Learning Research, 2008, 9(1871-4.

[12] DUIN R. TAX D. Experiments with classifier combining rules [J]. Multiple classifier systems, 2000, 16-29.

[13] ALY M. Survey on multi-class classification methods [J]. Neural networks, 2005, 1-9.

[14] NILLSON N. Learning machines: Foundations of trainable pattern classifying systems [MJ. McGraw-Hill, New York. 1965.

[15] KING R D, FENG C, SUTHERLAND A. Statlog: comparison of classification algorithms on large real-world problems [J]. Applied Artificial Intelligence an International Journal, 1995, 9(3): 289-333.

[16] BAUER E, KOHAVI R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants [J], Machine learning, 1998, 36(1): 2.

[17] LECUN Y, JACKEL L, BOTTOU L. et al. Comparison of learning algorithms for handwritten digit recognition; proceedings of the International conference on artificial neural networks, F, 1995 [C].

[18] DING C H, DUBCHAK I. Multi-class protein fold recognition using support vector machines and neural networks [J]. Bioinformatics, 2001, 17(2): 107-38.

[19] LI T, ZHANG C, OGIHARA M. A comparative study of feature selection and multi-class classification methods for tissue classification based on gene expression [J]. Bioinformatics, 2004, 20(15): 2429-37.

[20] FOODY G M, MATHUR A. A relative evaluation of multi-class image classification by support vector machines [J]. Geoscience and Remote Sensing, IEEE Transactions on, 2004, 42(6): 1335-43.

[21] HSU C-W, LIN C-J. A comparison of methods for multi-class support vector machines [J]. Neural Networks, IEEE Transactions on, 2002, 13(2): 415-25.

[22] CARUANA R, NICULESCU-MIZIL A. An empirical comparison of supervised learning algorithms; proceedings of the Proceedings of the 23rd international conference on Machine learning, F, 2006 [C], ACM.

[23] KRUSIENSKI D J, SELLERS E W, CABESTAING F, et al. A comparison of classification techniques for the P300 Speller [J]. Journal of neural engineering, 2006, 3(4): 299.

[24] WITTEN I H. FRANK E. Data Mining: Practical machine learning tools and techniques [M], Morgan Kaufmann, 2005.

[25] MURPHY K P. Machine Learning: a Probabilistic Perspective [J]. 2012,

[26] QUINLAN J R. Induction of decision trees [J]. Machine learning, 1986, 1(1): 81-106.

[27] BREIMAN L, FRIEDMAN J, STONE C J, et al. Classification and regression trees [M]. Chapman & Hall/CRC, 1984.

[28] HO T K. Random decision forests; proceedings of the Document Analysis and Recognition, 1995, Proceedings of the Third International Conference on, F, 1995 [C]. IEEE.

[29] BREIMAN L. Bagging predictors [J]. Machine learning, 1996, 24(2): 123-40.

[30] BREIMAN L. Statistical modeling: The two cultures (with comments and a rejoinder by the author) [J]. Statistical Science, 2001, 16(3): 199-231.

[31] IVERSON L R, PRASAD A M, MATTHEWS S N, et al. Estimating potential habitat for 134 eastern US tree species under six climate scenarios [J]. Forest Ecology and Management, 2008, 254(3): 390-406.

[32] GEURTS P, ERNST D, WEHENKEL L. Extremely randomized trees [J]. Machine learning, 2006, 63(1): 3-42.

[33] HASTIE T, TIBSHIRANI R, FRIEDMAN J, et al. The elements of statistical learning: data mining, inference and prediction [J]. The Mathematical Intelligencer, 2005, 27(2): 83-5.

[34] FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting [J]. Journal of computer and system sciences, 1997, 55(1): 119-39.

[35] FRIEDMAN J H. Stochastic gradient boosting [J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-78.

[36] BURGES C J C. A tutorial on support vector machines for pattern recognition [J]. Data mining and knowledge discovery, 1998, 2(2): 121-67.

[37] CORTES C, VAPNIK V. Support-vector networks [J]. Machine learning, 1995, 20(3): 273-97.

[38] KEERTHI S S, SUNDARARAJAN S, CHANG KW, et al. A sequential dual method for large scale multi-class linear SVMs; proceedings of the Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, F, 2008 [C]. ACM.

[39] CRAMMER K, SINGER Y. On the algorithmic implementation of multi-class kernel-based vector machines [J]. The Journal of Machine Learning Research, 2002, 2(265-92.

[40] COVER T, HART P. Nearest neighbor pattern classification [J]. Information Theory, IEEE Transactions on, 1967, 13(1): 21-7.

[41] LIU D C, NOCEDAL J. On the limited memory BFGS method for large scale optimization [J], Mathematical programming, 1989, 45(1): 503-28.

[42] RONSENBLATT F. The perception: a probabilistic model for information storage and organization in the brain [J]. Psychological review, 1958, 65(386-408.

[43] BISHOP C M. Pattern recognition and machine learning [M]. Springer New York, 2006.

[44] WASSERMAN P D, SCHWARTZ T. Neural networks. II. What are they and why is everybody so interested in them now? [J]. IEEE Expert, 1988, 3(1): 10-5.

[45] Hang Li. Statistic Learning [M]. Tsinghua press, 2012.

[46] COLLOBERT R, BENGIO S. Links between Perceptron, MLPs and SVMs; proceedings of the Proceedings of the twenty-first international conference on Machine learning, F, 2004 [C]. ACM.

[47] GOULD S. DARWIN: A Framework for Machine Learning and Computer Vision Research and Development [J]. Journal of Machine Learning Research, 2012, 13(12): 3499-503.

[48] CHANG C C, LIN C J. LIBSVM: a library for support vector machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 27.

[49] FAWCETT T. An introduction to ROC analysis [J]. Pattern recognition letters, 2006, 27(8): 861-74.

[50] DAVIS J, GOADRICH M. The relationship between Precision-Recall and ROC curves; proceedings of the Proceedings of the 23rd international conference on Machine learning, F, 2006 [C]. ACM.

[51] JACCARD P. The distribution of the flora in the alpine zone. 1 [J]. New Phytologist, 2006, 11(2): 37-50.

[52] ASUNCION A, NEWMAN D J. UCI machine learning repository [M], 2007.

[53] KURGAN L A, CIOS K J, TADEUSIEWICZ R, et al. Knowledge discovery approach to automated cardiac SPECT diagnosis [J], Artificial Intelligence in Medicine, 2001, 23(2): 149-69.

[54] LAZEBNIK S, SCHMID C, PONCE J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories; proceedings of the Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, F, 2006 [C]. IEEE.

[55] VEDALDT A, ZISSERMAN A. Efficient additive kernels via explicit feature maps [J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2012, 34(3): 480-92.

[56] Bradski G, Kaehler A (2008) Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media, Incorporated

[57] Delashmit WH, Manry MT (2005) Recent developments in multilayer perceptron neural networks. In: Proceedings of the seventh Annual Memphis Area Engineering and Science Conference, MAESC.

Lin Li

Institute of Intelligent Computing and Information Technology, Chengdu Normal University No.99, East Haike Road Wenjiang District, Chengdu, China

E-mail: lilin200909@gmail.com

Lin Li, Yue Wu and Mao Ye

School, of Computer Science and Engineering, University of Electronic Science and Technology of China No.2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu, China

Received: June 18, 2014

Table 1: Comparisons of multi-class classification algorithms. Comparison of Reference algorithms Data sets King et Symbolic Satellite al. [15] learning (CART, image, hand C4.5, New ID, written digits, [AC.sup.2]. vehicle, Cal5, CN2), segment, Credit statistic risk, Belgian learning( data, Shuttle Bayesian control data, network, K- Diabetes data, nearest, kernel Heart disease density, linear and head discrimination, injury, German quadratic credit data discrimination, logistic regression), neural network Bauer et Bagging, Segment, DNA, al. [16] boosting and chess, its variants waveform, sat-image, mushroom, nursery, letter, shuttle Ding et Support Protein test al. [18] decision dataset vector, neural network Tao et Support ALL, GCM, al. [19] decision SRBCT, vector, MLL-leukemia, Bayesian lymphoma, network, NCI60, HBC K-nearest, decision tree Foody et Support Airborne sensor al. [20] decision data vector, decision tree, discriminating analysis, neural network Chih-Wei Hsu Support Iris, wine, et al [21] decision vector glass, vehicle, segment, DNA, sat-image, letter, shuttle Caruana et Support RMS, MXE, CAL. al. [22] decision ADULT. vector, neural network, decision tree, memory based learning, bagged tree, boosted tree, boosted stumps Krusienski Pearson related P300 Speller et al. [23] methods, Fisher linear discrimination, stepwise linear discrimination, linear support decision vector, Gaussian kernel support decision vector Our method Decision tree, SPECTF, random forests, Ionosphere, extremely spam, optdigits randomized and Scene 15 trees, multi-class adaboost classifier, multi-class logistic classifier, stochastic gradient boosting, multilayer perceptron, K nearest neighbors, naive Bayesian classifier and support vector machines(including linear and nonlinear) Evaluation Reference criteria Research domains King et Accuracy General purpose al. [15] Bauer et Average error General purpose al. [16] rate, variance, bias Ding et Accuracy, Bioinformatics al. [18] Q-percentage, Accuracy Tao et Accuracy Bioinformatics al. [19] Foody et Accuracy, Remote imaging al. [20] Chih-Wei Hsu Accuracy, General purpose et al [21] running-time Caruana et Accuracy, General purpose al. [22] square error, inter-class cross entropy, ROC regions, F evaluation, recall and precision, average precision and recall, lift, probability calibration Krusienski ROC curve Medicine domain et al. [23] Our method Overall General purpose accuracy, average precision and recall, average Jaccard, inter-class F and Jaccard evaluation, running performance. Table 2: Summary of twelve multi-class methods. Algorithms Algorithms type Algorithms Learning policy characteristic Decision tree Discriminant Classification Regularized tree maximum likelihood estimation Random tree Discriminant Classification Regularized tree(based on maximum bagging) likelihood estimation Extremely Discriminant Classification Regularized random tree tree maximum likelihood estimation Multi-class Discriminant Linear Addition adaboost combination of minimization weak loss classifier(based on decision tree) Stochastic Discriminant Linear Addition gradient combination of minimization boosting weak loss classifier(based on decision tree) Non-linear Discriminant Super-plane Minimizing the Support vector separation, loss of regular machine (based kernel trick hinge, soft on libsvm) margin maximization Linear SVM Discriminant Super-plane Minimizing the (based on separation loss of regular liblinear) hinge, soft margin maximization K-nearest Discriminant Distance of feature space Multi-logistic Discriminant Conditional Regularized (Maximum probabilistic maximum entropy) distribution, likelihood Log-linear estimation model Multilayer Discriminant Super-plane Minimization of perceptron separation error separation distance point to the hyperplane Naive Bayesian Generative Joint Maximum classifier distribution of likelihood feature and estimation, class, Maximum conditional posterior independent probability assumption Conditional Discriminant Conditional Maximum Random Fields probabilistic likelihood distribution estimation, under observing Regularized sequence, maximum Log-linear likelihood model estimation Algorithms Loss of Learning Classification learning algorithms strategy Decision tree Logarithmic Feature IF-THEN policy likelihood loss selection, based on tree generation, spitting prune Random tree Logarithmic Building Sub-tree voting likelihood loss multidecision tree based on bagging of subsampling Extremely Logarithmic Building Sub-tree voting random tree likelihood loss multidecision tree Multi-class Exponent loss Forward Linear adaboost additive combination of algorithm weighted maximum weak classifiers Stochastic Exponent loss Stochastic Linear gradient gradient combination of boosting descent weighted algorithm maximum weak classifiers Non-linear Hinge loss Sequential Maximum class Support vector minimal of test samples machine (based optimization on libsvm) algorithm (SMO) Linear SVM Hinge loss Sequential dual Maximum (based on method weighted test liblinear) sample K-nearest \ \ Multiple voting, empirical loss minimization Multi-logistic Logistic loss L-BFGS Maximum (Maximum likelihood entropy) estimation Multilayer Error Random gradient Maximum perceptron separation decrease weighted test distance point sample to the hyper-plane Naive Bayesian Logarithmic Probabilistic Maximum classifier likelihood loss computation posterior probability Conditional Logarithmic Random gradient Maximum Random Fields likelihood loss decrease, likelihood quasi-newton estimation methods Table 3: Statistics of confusion matrices for samples classification. Prediction class True class A B C A AA AB AC B BA BB BC C CA CB CC Table 4: Overall accuracy of libSVM on four data sets. Total Accurac Method Data sets SPECTF Ionosphere C-SVM POLY 0.770 0.565 RFB 0.898 0.749 Sigmoid 0.080 0.693 function v-SVM POLY 0.748 0.733 RFB 0.903 0.796 Sigmoid 0.080 0.840 function Total Accurac Method Data sets Spam optdigits C-SVM POLY 0.618 0.978 RFB 0.772 0.562 Sigmoid 0.374 0.101 function v-SVM POLY 0.629 0.934 RFB 0.779 0.586 Sigmoid 0.648 0.101 function Table 5: Overall accuracy of liblinear on four data sets. Total accuracy Cost type Data sets SPECTF Ionosphere Spam Optdigits L2R L2 LOSS SVC 0.620 0.856 0.898 0.947 L2R L2LOSSSVC DUAL 0.577 0.856 0.617 0.939 L2R L1LOSS SVC DUAL 0.577 0.848 0.865 0.935 MCSVM CS 0.805 0.860 0.631 0.933 Table 6: Overall accuracy on SPECTF data set. Algorithm Overall Average accuracy precision DecisionTree 0.657 0.5421 RandomForest 0.759 0.5721 ExtraTrees 0.668 0.5537 BoostedClassifier 0.716 0.5847 GradientBoostTree 1.000 1.000 libSVM 0.770 0.5664 libLinear 0.620 0.5342 Knearest 0.604 0.5666 Multi-classLogistic 0.620 0.5342 MultiLayer 0.679 0.5377 Perceptron NaiveBayesianNet 0.588 0.510 CRF 0.683 0.543 Algorithm Average Average jaccard recall coefficient DecisionTree 0.631 0.381 RandomForest 0.686 0.456 ExtraTrees 0.667 0.394 BoostedClassifier 0.754 0.440 GradientBoostTree 1.000 1.000 libSVM 0.662 0.458 libLinear 0.611 0.356 Knearest 0.724 0.362 Multi-classLogistic 0.611 0.356 MultiLayer 0.612 0.391 Perceptron NaiveBayesianNet 0.532 0.327 CRF 0.652 0.439 Table 7: Overall accuracy on Ionosphere data set. Average Overall Average Average jaccard Algorithm accuracy precision recall coefficient DecisionTree 0.792 0.765 0.799 0.635 RandomForest 0.828 0.807 0.854 0.691 ExtraTrees 0.888 0.862 0.897 0.780 BoostedClassifier 0.900 0.876 0.899 0.798 GradientBoostTree 1.000 1.000 1.000 1.000 libSVM 0.749 0.735 0.775 0.584 libLinear 0.856 0.867 0.787 0.694 Knearest 0.844 0.822 0.804 0.691 Multi-classLogistic 0.768 0.732 0.750 0.593 MultiLayerPerceptron 0.768 0.742 0.775 0.603 NaiveBayesianNet 0.733 0.683 0.633 0.500 CRF 0.863 0.870 0.793 0.612 Table 8: Overall accuracy on spam data set. Average Overall Average Average jaccard Algorithm accuracy precision recall coefficient DecisionTree 0.905 0.905 0.896 0.819 RandomForest 0.930 0.936 0.918 0.862 ExtraTrees 0.863 0.897 0.832 0.738 BoostedClassifier 0.940 0.942 0.933 0.882 GradientBoostTree 0.981 0.981 0.980 0.962 libSVM 0.618 0.687 0.522 0.332 libLinear 0.898 0.892 0.901 0.811 Knearest 0.775 0.765 0.764 0.622 Multi-classLogistic 0.923 0.921 0.918 0.852 MultiLayerPerceptron 0.908 0.904 0.905 0.827 NaiveBayesianNet 0.602 0.801 0.500 0.301 CRF 0.784 0.778 0.790 0.653 Table 9: Overall accuracy on optdigits data set. Average Overall Average Average jaccard Algorithm accuracy precision recall coefficient DecisionTree 0.847 0.850 0.847 0.738 RandomForest 0.966 0.966 0.966 0.936 ExtraTrees 0.969 0.970 0.969 0.942 BoostedClassifier 0.870 0.880 0.870 0.778 GradientBoostTree 1.000 1.000 1.000 1.000 libSVM 0.978 0.979 0.978 0.959 libLinear 0.947 0.948 0.946 0.901 Knearest 0.979 0.980 0.979 0.961 Multi-classLogistic 0.943 0.944 0.943 0.893 MultiLayerPerceptron 0.946 0.947 0.946 0.897 NaiveBayesianNet 0.843 0.881 0.844 0.732 CRF 0.960 0.962 0.958 0.932 Table 10: Overall accuracy on Scene 15 data set. Average Algorithm Overall Average Average Jaccard accuracy precision recall coefficient DecisionTree 0.397 0.391 0.376 0.246 RandomForest 0.531 0.548 0.493 0.339 ExtraTrees 0.655 0.639 0.637 0.479 BoostedClassifier 0.422 0.502 0.424 0.274 GradientBoostTree 0.999 0.999 0.999 0.999 libSVM Invalid Invalid Invalid Invalid libLinear 0.815 0.813 0.8113 0.694 Knearest 0.565 0.602 0.5421 0.382 Multi-classLogistic 0.794 0.797 0.7915 0.665 Mu1tiLayerPerceptron Invalid Invalid Invalid Invalid NaiveBayesianNet Invalid Invalid Invalid Invalid CRF 0.650 0.625 0.621 0.456 Table 11: Inter-class accuracy on SPECTF data set. [F.sub.1] Jaccard coefficient Algorithm Class 0 Class 1 Class 0 Class 1 RandomForest 0.286 0.855 0.167 0.747 ExtraTrees 0.244 0.788 0.139 0.650 BoostedClassifier 0.312 0.822 0.185 0.697 GradientBoostTree 1.000 1.000 1.000 1.000 libSVM 0.271 0.863 0.157 0.760 libLinear 0.202 0.751 0.113 0.601 Knearest 0.260 0.730 0.149 0.575 Multi-classLogistic 0.202 0.751 0.113 0.601 MultiLayerPerceptron 0.211 0.799 0.118 0.665 NaiveBayesianNet 0.154 0.728 0.083 0.572 CRF 0.278 0.813 0.156 0.745 Table 12: Inter-class accuracy on Ionosphere data set. Fx Jaccard coefficient Algorithm Class 0 Class 1 Class 0 Class 1 RandomForest 0.919 0.840 0.850 0.724 ExtraTrees 0.916 0.835 0.844 0.717 BoostedClassifier 0.926 0.847 0.863 0.734 GradientBoostTree 1.000 1.000 1.000 1.000 libSVM 0.796 0.674 0.661 0.508 libLinear 0.903 0.723 0.824 0.566 Knearest 0.890 0.735 0.802 0.581 Multi-classLogistic 0.827 0.651 0.706 0.482 MultiLayerPerceptron 0.820 0.678 0.695 0.513 NaiveBayesianNet 0.822 0.464 0.698 0.302 CRF 0.908 0.838 0.846 0.721 Table 13: Inter-class accuracy statistics on Spam data set. Fx Jaccard coefficient Algorithm Class 0 Class 1 Class 0 Class 1 RandomForest 0.944 0.908 0.894 0.831 ExtraTrees 0.897 0.798 0.814 0.664 BoostedClassifier 0.952 0.923 0.908 0.857 GradientBoostTree 0.985 0.977 0.970 0.955 libSVM 0.757 0.107 0.609 0.056 libLinear 0.914 0.878 0.841 0.783 Knearest 0.815 0.715 0.687 0.557 Multi-classLogistic 0.937 0.903 0.881 0.823 MultiLayerPerceptron 0.924 0.886 0.859 0.795 NaiveBayesianNet 0.752 0.002 0.603 0.001 CRF 0.933 0.891 0.885 0.838 Table 14: Inter-class accuracy on optdigits data set. Algorithm Class 0 Class 1 Class 2 DecisionTree [F.sub.1] 0.933 0.830 0.813 Jaccard 0.875 0.710 0.685 RandomForest [F.sub.1] 0.989 0.965 0.983 Jaccard 0.978 0.933 0.966 ExtraTrees [F.sub.1] 0.992 0.963 0.994 Jaccard 0.983 0.928 0.989 BoostedClassifier [F.sub.1] 0.960 0.804 0.930 Jaccard 0.923 0.672 0.870 GradientBoostTree [F.sub.1] 1.000 1.000 1.000 Jaccard 1.000 1.000 1.000 libSVM [F.sub.1] 0.997 0.978 0.994 Jaccard 0.994 0.957 0.989 libLinear [F.sub.1] 0.992 0.914 0.983 Jaccard 0.983 0.842 0.966 Knearest [F.sub.1] 1.000 0.965 0.994 Jaccard 1.000 0.933 0.989 Multi-classLogistic [F.sub.1] 0.977 0.957 0.963 Jaccard 0.956 0.918 0.929 MultiLayerPerceptron [F.sub.1] 0.986 0.948 0.938 Jaccard 0.973 0.902 0.884 NaiveBayesianNet [F.sub.1] 0.942 0.914 0.940 Jaccard 0.890 0.842 0.886 CRF Algorithm Class 3 Class 4 Class 5 DecisionTree [F.sub.1] 0.817 0.803 0.891 Jaccard 0.690 0.671 0.804 RandomForest [F.sub.1] 0.962 0.981 0.975 Jaccard 0.926 0.962 0.952 ExtraTrees [F.sub.1] 0.956 0.986 0.975 Jaccard 0.915 0.973 0.952 BoostedClassifier [F.sub.1] 0.813 0.899 0.919 Jaccard 0.685 0.816 0.850 GradientBoostTree [F.sub.1] 1.000 1.000 1.000 Jaccard 1.000 1.000 1.000 libSVM [F.sub.1] 0.972 0.994 0.986 Jaccard 0.946 0.989 0.973 libLinear [F.sub.1] 0.942 0.964 0.941 Jaccard 0.889 0.930 0.889 Knearest [F.sub.1] 0.978 0.981 0.986 Jaccard 0.957 0.962 0.973 Multi-classLogistic [F.sub.1] 0.934 0.961 0.937 Jaccard 0.877 0.925 0.881 MultiLayerPerceptron [F.sub.1] 0.925 0.951 0.957 Jaccard 0.861 0.906 0.918 NaiveBayesianNet [F.sub.1] 0.891 0.011 0.967 Jaccard 0.804 0.006 0.936 CRF Algorithm Class 6 Class 7 Class 8 DecisionTree [F.sub.1] 0.940 0.836 0.777 Jaccard 0.887 0.719 0.636 RandomForest [F.sub.1] 0.986 0.972 0.920 Jaccard 0.973 0.945 0.851 ExtraTrees [F.sub.1] 0.989 0.972 0.938 Jaccard 0.978 0.945 0.883 BoostedClassifier [F.sub.1] 0.960 0.850 0.783 Jaccard 0.924 0.740 0.644 GradientBoostTree [F.sub.1] 1.000 1.000 1.000 Jaccard 1.000 1.000 1.000 libSVM [F.sub.1] 0.997 0.977 0.948 Jaccard 0.995 0.956 0.901 libLinear [F.sub.1] 0.983 0.954 0.890 Jaccard 0.967 0.912 0.801 Knearest [F.sub.1] 1.000 0.989 0.956 Jaccard 1.000 0.978 0.916 Multi-classLogistic [F.sub.1] 0.978 0.946 0.892 Jaccard 0.957 0.897 0.805 MultiLayerPerceptron [F.sub.1] 0.962 0.949 0.913 Jaccard 0.926 0.902 0.840 NaiveBayesianNet [F.sub.1] 0.949 0.762 0.793 Jaccard 0.902 0.616 0.656 CRF Algorithm Class 9 DecisionTree [F.sub.1] 0.832 Jaccard 0.712 RandomForest [F.sub.1] 0.933 Jaccard 0.875 ExtraTrees [F.sub.1] 0.934 Jaccard 0.877 BoostedClassifier [F.sub.1] 0.798 Jaccard 0.664 GradientBoostTree [F.sub.1] 1.000 Jaccard 1.000 libSVM [F.sub.1] 0.943 Jaccard 0.892 libLinear [F.sub.1] 0.911 Jaccard 0.837 Knearest [F.sub.1] 0.949 Jaccard 0.904 Multi-classLogistic [F.sub.1] 0.885 Jaccard 0.794 MultiLayerPerceptron [F.sub.1] 0.930 Jaccard 0.869 NaiveBayesianNet [F.sub.1] 0.883 Jaccard 0.791 CRF Table 15: Inter-class accuracy on Scene 15 data set. Algorithm Class 0 Class 1 Class 2 DecisionTree [F.sub.1] 0.238 0.618 0.718 Jaccard 0.135 0.447 0.561 RandomForest [F.sub.1] 0.848 0.654 0.852 Jaccard 0.736 0.486 0.742 ExtraTrees [F.sub.1] 0.805 0.816 0.859 Jaccard 0.673 0.689 0.752 BoostedClassifier [F.sub.1] 0.606 0.126 0.290 Jaccard 0.435 0.067 0.170 GradientBoostTree [F.sub.1] 1.000 1.000 1.000 Jaccard 1.000 1.000 1.000 libSVM [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid libLinear [F.sub.1] 0.959 0.860 0.931 Jaccard 0.921 0.754 0.871 Knearest [F.sub.1] 0.606 0.126 0.290 Jaccard 0.435 0.067 0.170 Multi-classLogistic [F.sub.1] 0.941 0.801 0.873 Jaccard 0.936 0.701 0.827 MultiLayerPerceptron [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid NaiveBayesianNet [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid CRF [F.sub.1] 0.835 0.647 0.843 Jaccard 0.736 0.477 0.729 Algorithm Class 3 Class 4 Class 5 DecisionTree [F.sub.1] 0.550 0.304 0.397 Jaccard 0.380 0.179 0.247 RandomForest [F.sub.1] 0.327 0.491 0.571 Jaccard 0.196 0.326 0.400 ExtraTrees [F.sub.1] 0.779 0.590 0.744 Jaccard 0.638 0.419 0.593 BoostedClassifier [F.sub.1] 0.415 0.531 0.399 Jaccard 0.262 0.362 0.249 GradientBoostTree [F.sub.1] 1.000 0.997 1.000 Jaccard 1.000 0.994 1.000 libSVM [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid libLinear [F.sub.1] 0.864 0.784 0.887 Jaccard 0.761 0.645 0.797 Knearest [F.sub.1] 0.415 0.531 0.399 Jaccard 0.262 0.362 0.249 Multi-classLogistic [F.sub.1] 0.820 0.736 0.866 Jaccard 0.721 0.617 0.712 MultiLayerPerceptron [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid NaiveBayesianNet [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid CRF [F.sub.1] 0.318 0.485 0.565 Jaccard 0.187 0.319 0.389 Algorithm Class 6 Class 7 DecisionTree [F.sub.1] 0.471 0.397 Jaccard 0.308 0.248 RandomForest [F.sub.1] 0.618 0.688 Jaccard 0.448 0.524 ExtraTrees [F.sub.1] 0.652 0.777 Jaccard 0.483 0.635 BoostedClassifier [F.sub.1] 0.447 0.608 Jaccard 0.287 0.437 GradientBoostTree [F.sub.1] 1.000 1.000 Jaccard 1.000 1.000 libSVM [F.sub.1] Invalid Invalid Jaccard Invalid Invalid libLinear [F.sub.1] 0.823 0.890 Jaccard 0.699 0.803 Knearest [F.sub.1] 0.447 0.608 Jaccard 0.287 0.437 Multi-classLogistic [F.sub.1] 0.770 0.871 Jaccard 0.656 0.776 MultiLayerPerceptron [F.sub.1] Invalid Invalid Jaccard Invalid Invalid NaiveBayesianNet [F.sub.1] Invalid Invalid Jaccard Invalid Invalid CRF [F.sub.1] 0.546 0.659 Jaccard 0.432 0.510 Table 16: Inter-class accuracy on Scenel5 data set Algorithm Class 8 Class 9 Class 10 DecisionTree [F.sub.1] 0.463 0.330 0.142 Jaccard 0.301 0.198 0.076 RandomForest [F.sub.1] 0.495 0.410 0.034 Jaccard 0.328 0.258 0.017 ExtraTrees [F.sub.1] 0.681 0.645 0.339 Jaccard 0.517 0.477 0.204 BoostedClassifier [F.sub.1] 0.611 0.447 0.342 Jaccard 0.440 0.288 0.206 GradientBoostTree [F.sub.1] 0.997 1.000 1.000 Jaccard 0.994 1.000 1.000 BoostedClassifier [F.sub.1] 0.623 0.459 0.309 Jaccard 0.452 0.298 0.183 libSVM [F.sub.1] Invalid Invalid Invalid Jaccard 0.000 0.000 0.000 libLinear [F.sub.1] 0.862 0.898 0.633 Jaccard 0.758 0.815 0.463 Knearest [F.sub.1] 0.611 0.447 0.342 Jaccard 0.440 0.288 0.206 Multi-classLogistic [F.sub.1] 0.847 0.898 0.721 Jaccard 0.889 0.668 0.775 MultiLayerPerceptron [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid NaiveBayesianNet [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid CRF [F.sub.1] 0.488 0.402 0.028 Jaccard 0.313 0.248 0.016 Algorithm Class 11 Class 12 Class 12 DecisionTree [F.sub.1] 0.234 0.220 0.326 Jaccard 0.133 0.123 0.195 RandomForest [F.sub.1] 0.197 0.071 0.282 Jaccard 0.109 0.037 0.164 ExtraTrees [F.sub.1] 0.286 0.409 0.496 Jaccard 0.167 0.257 0.330 BoostedClassifier [F.sub.1] 0.320 0.325 0.356 Jaccard 0.190 0.194 0.216 GradientBoostTree [F.sub.1] 1.000 1.000 1.000 Jaccard 1.000 1.000 1.000 BoostedClassifier [F.sub.1] 0.288 0.324 0.415 Jaccard 0.168 0.193 0.262 libSVM [F.sub.1] Invalid Invalid Invalid Jaccard 0.000 0.000 0.000 libLinear [F.sub.1] 0.645 0.701 0.663 Jaccard 0.476 0.539 0.495 Knearest [F.sub.1] 0.320 0.325 0.356 Jaccard 0.190 0.194 0.216 Multi-classLogistic [F.sub.1] 0.613 0.761 0.721 Jaccard 0.694 0.582 0.764 MultiLayerPerceptron [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid NaiveBayesianNet [F.sub.1] Invalid Invalid Invalid Jaccard Invalid Invalid Invalid CRF [F.sub.1] 0.188 0.066 0.277 Jaccard 0.100 0.036 0.155 Algorithm Class 14 DecisionTree [F.sub.1] 0.280 Jaccard 0.162 RandomForest [F.sub.1] 0.482 Jaccard 0.318 ExtraTrees [F.sub.1] 0.531 Jaccard 0.362 BoostedClassifier [F.sub.1] 0.485 Jaccard 0.320 GradientBoostTree [F.sub.1] 1.000 Jaccard 1.000 BoostedClassifier [F.sub.1] 0.528 Jaccard 0.359 libSVM [F.sub.1] Invalid Jaccard 0.000 libLinear [F.sub.1] 0.685 Jaccard 0.521 Knearest [F.sub.1] 0.485 Jaccard 0.320 Multi-classLogistic [F.sub.1] 0.655 Jaccard 0.625 MultiLayerPerceptron [F.sub.1] Invalid Jaccard Invalid NaiveBayesianNet [F.sub.1] Invalid Jaccard Invalid CRF [F.sub.1] 0.466 Jaccard 0.309 Table 17: Running time of twelve algorithms on five data sets (unit: millisecond except scene 15 data set). Binary data sets Algorithm SPECTF Ionosphere Spam DecisionTree 1 30.9 79.9 RandomForest 93.9 326.9 2.819 ExtraTrees 266.0 437.0 3.036 BoostedClassifier 202.9 451.9 1761 GradientBoostTree 437.0 656.0 6732 libSVM 46.9 108.9 915184 libLinear 389.9 749.0 3.165 Knearest 1 16.0 280.9 Multi-classLogistic 30.9 46.9 377.9 MultiLayerPerceptron 857.9 4306 33727 NaiveBayesianNet 30.9 93.0 63.0 CRF 168.9 235.2 1.522 Multi-class data sets Algorithm optdigits Scene 15 DecisionTree 202.9 19.562(second) RandomForest 8112 34(minute) ExtraTrees 12541 1 hour22 minute BoostedClassifier 2.697 477.873 (second) GradientBoostTree 53807 8hour42 minute libSVM 3276 Invalid libLinear 1006 2hourl0 minute Knearest 708.9 120.054 (second) Multi-classLogistic 2153 190.660 (second) MultiLayerPerceptron 172847 Invalid NaiveBayesianNet 375.9 Invalid CRF 6420 66 minute

Printer friendly Cite/link Email Feedback | |

Author: | Li, Lin; Wu, Yue; Ye, Mao |
---|---|

Publication: | Informatica |

Article Type: | Report |

Date: | Mar 1, 2015 |

Words: | 12821 |

Previous Article: | Swarm intelligence and its application in abnormal data detection. |

Next Article: | An efficient algorithm for mining frequent closed itemsets. |

Topics: |