# A multi-class SVM classifier utilizing binary decision tree.

1 Introduction

The recent results in pattern recognition have shown that support vector machine (SVM) classifiers often have superior recognition rates in comparison to other classification methods. However, the SVM was originally developed for binary decision problems, and its extension to multi-class problems is not straightforward. How to effectively extend it for solving multiclass classification problem is still an on-going research issue. The popular methods for applying SVMs to multiclass classification problems usually decompose the multi-class problems into several two-class problems that can be addressed directly using several SVMs.

For the readers' convenience, we introduce the SVM briefly in section 2. A brief introduction to several widely used multi-class classification methods that utilize binary SVMs is given in section 3. The Kernel-based clustering introduced to convert the multi-class problem into SVM-based binary decision-tree architecture is explained in section 4. In section 5, we discuss related works and compare SVM-BDT with other multi-class SVM methods via theoretical analysis and empirical estimation. The experimental results in section 6 are presented to compare the performance of the proposed SVM-BDT with traditional multi-class approaches based on SVM, ensemble of decision trees and neural network. Section 7 gives a conclusion of the paper.

2 Support vector machines for pattern recognition

The support vector machine is originally a binary classification method developed by Vapnik and colleagues at Bell laboratories , with further algorithm improvements by others . For a binary problem, we have training data points: {[x.sub.i], [y.sub.i]}, i=1, ..., l, [y.sub.i][member of]{-1, 1}, [x.sub.i][member of][R.sup.d]. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points x which lie on the hyperplane satisfy w x x + b = 0, where w is normal to the hyperplane, [absolute value of b]/[parallel]w[parallel] is the perpendicular distance from the hyperplane to the origin, and [parallel]w[parallel] is the Euclidean norm of w. Let [d.sub.+] ([d.sub.-]) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be [d.sub.+] + [d.sub.-]. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

[x.sub.i] x w + b [greater than or equal to] + 1 for [Y.sub.i] = + 1, (1)

[x.sub.i] x w + b [less than or equal to] + 1 for [Y.sub.i] = - 1, (2)

These can be combined into one set of inequalities:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

Now consider the points for which the equality in Eq. (I) holds (requiring that there exists such a point) is equivalent to choosing a scale for w and b. These points lie on the hyperplane [H.sub.1]: [x.sub.i] x w + b = 1 with normal w and perpendicular distance from the origin [absolute value of 1-b]/[parallel]w[parallel]. Similarly, the points for which the equality in Eq. (2) holds lie on the hyperplane H2: x;. w + b = -1, with normal again w and perpendicular distance from the origin [absolute value of -1-b]/[parallel]w[parallel]. Hence [d.sub.+] = [d.sub.-] = 1/[parallel]w[parallel] and the margin is simply 2/[parallel]w[parallel].

[FIGURE 1 OMITTED]

Note that [H.sub.1] and [H.sub.2] are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing [parallel]w[[parallel].sup.2], subject to constraints (3).

Thus we expect the solution for a typical two dimensional case to have the form shown on Fig. I. We introduce nonnegative Lagrange multipliers [[alpha].sub.i], i = 1, ..., l, one for each of the inequality constraints (3). Recall that the rule is that for constraints of the form [c.sub.i] [greater than or equal to] 0, the constraint equations are multiplied by nonnegative Lagrange multipliers and subtracted from the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives Lagrangian:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4)

We must now minimize [L.sub.p], with respect to w, b, and maximize with respect to all [[alpha].sub.i] at the same time, all subject to the constraints [[alpha].sub.i] [greater than or equal to] 0 (let's call this particular set of constraints C.). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following "dual" problem: maximize [L.sub.p], subject to the constraints that the gradient of [L.sub.p] with respect to w and b vanish, and subject also to the constraints that the [[alpha].sub.i] [greater than or equal to] 0 (let's call that particular set of constraints [C.sub.2]). This particular dual formulation of the problem is called the Wolfe dual . It has the property that the maximum of [L.sub.p], subject to constraints [C.sub.2], occurs at the same values of the w, b and [alpha], as the minimum of [L.sub.p], subject to constraints [C.sub.1].

Requiring that the gradient of [L.sub.p] with respect to w and b vanish gives the conditions:

w = [summation over i] [[alpha].sub.i][y.sub.i][x.sub.i], (5)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (6)

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (4) to give

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (7)

Note that we have now given the Lagrangian different labels (P for primal, D for dual) to emphasize that the two formulations are different: [L.sub.p] and [L.sub.D] arise from the same objective function but with different constraints: and the solution is found by minimizing [L.sub.p] or by maximizing Lb. Note also that if we formulate the problem with b = 0, which amounts to requiring that all hyperplanes contain the origin, the constraint (6) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.

Support vector training (lot the separable, linear case) therefore amounts to maximizing [L.sub.D] with respect to the [[alpha].sub.i], subject to constraints (6) and positivity of the [[alpha].sub.i], with solution given by (5). Notice that there is a Lagrange multiplier [[alpha].sub.i] for every training point. In the solution, those points for which [[alpha].sub.i] > 0 are called "support vectors", and lie on one of the hyperplanes [H.sub.1], [H.sub.2]. All other training points have [[alpha].sub.i] = 0 and lie either on [H.sub.1] or [H.sub.2] (such that the equality in Eq. (3) holds), or on that side of [H.sub.1] or [H.sub.2] such that the strict inequality in Eq. (3) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross [H.sub.1] or [H.sub.2]), and training was repeated, the same separating hyperplane would be found.

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (1) and (2), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables [e.sub.1]; i = 1, ..., l, in the constraints, which then become:

[x.sub.i] x w + b [greater than or equal to] + 1 - [e.sub.i] for [y.sub.i] = + l, (8)

[x.sub.i] x w + b [less than or equal to] + 1 - [e.sub.i] for [y.sub.i] = + l, (9)

[e.sub.i] > 0[for all]i. (10)

Thus, for an error to occur, the corresponding [e.sub.i] must exceed unity, so [[SIGMA].sub.i][e.sub.i] is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors.

How can the above methods be generalized to the case where the decision function (f(x) whose sign represents the class assigned to data point x) is not a linear function of the data? First notice that the only way in which the data appears in the training problem, is in the form of dot products, [x.sub.i] * [x.sub.j]. Now suppose we first mapped the data (Figure 2) to some other (possibly even infinite dimensional) Euclidean space H, using a mapping which we will call [SIGMA]:

[PHI]:[R.sup.d] [??] H, (11)

Then of course the training algorithm would only depend on the data through dot products in H, i.e. on functions of the form [PHI]([x.sub.i]) x [PHI]([x.sub.j]). Now if there were a "kernel function" K such that K([x.sub.i], [x.sub.j]) = [PHI]([x.sub.i]) * [PHI]([x.sub.j]), we would only need to use K in the training algorithm, and would never need to explicitly even know what [PHI] is. The kernel function has to satisfy Mercer's condition .One example for this function is Gaussian:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (2)

In this particular example, H is infinite dimensional, so it would not be very easy to work with [SIGMA] explicitly. However, if one replaces [x.sub.i] * [x.sub.j] by K([x.sub.i], [x.sub.j]) everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space. But how can we use this machine? After all, we need w, and that will live in H. But in test phase an SVM is used by computing dot products of a given test point x with w, or more specifically by computing the sign of

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (13)

where the [s.sub.i] are the support vectors. So again we can avoid computing [PHI](x) explicitly and use the K([s.sub.i], x) = [PHI]([s.sub.i]) * [PHI](x) instead.

[FIGURE 2 OMITTED]

3 An overview of widely used multiclass SVM classification methods

Although SVMs were originally designed as binary classifiers, approaches that address a multi-class problem as a single "all-together" optimization problem exist , but are computationally much more expensive than solving several binary problems.

A variety of techniques for decomposition of the multi-class problem into several binary problems using Support Vector Machines as binary classifiers have been proposed, and several widely used are given in this section.

3.1 One-against-all (OVA)

For the N-class problems (N>2), N two-class SVM classifiers are constructed . The ith SVM is trained while labeling the samples in the ith class as positive examples and all the rest as negative examples. In the recognition phase, a test example is presented to all N SVMs and is labelled according to the maximum output among the N classifiers. The disadvantage of this method is its training complexity, as the number of training samples is large. Each of the N classifiers is trained using all available samples.

3.2 One-against-one (OvO)

This algorithm constructs N(N-1)/2 two-class classifiers, using all the binary pair-wise combinations of the N classes. Each classifier is trained using the samples of the first class as positive examples and the samples of the second class as negative examples. To combine these classifiers, the Max Wins algorithm is adopted. It finds the resultant class by choosing the class voted by the majority of the classifiers . The number of samples used for training of each one of the OvO classifiers is smaller, since only samples from two of all N classes are taken in consideration. The lower number of samples causes smaller nonlinearity, resulting in shorter training times. The disadvantage of this method is that every test sample has to be presented to large number of classifiers N(N-1)/2. This results in slower testing, especially when the number of the classes in the problem is big .

3.3 Directed acyclic graph SVM (DAGSVM)

Introduced by Platt [l] the DAGSVM algorithm for training an N(N-1)/2 classifiers is the same as in one-against-one. In the recognition phase, the algorithm depends on a rooted binary directed acyclic graph to make a decision . DAGSVM creates a model for each pair of classes. When one such model, which is able to separate class [c.sub.1] from class [c.sub.2], classifies a certain test example into class [c.sub.1], it does not really vote "for" class [c.sub.1] rather it votes "against" class [c.sub.2], because the example must lie on the other side of the separating hyperplane than most of the class [c.sub.2] samples. Therefore, from that point onwards the algorithm ignores all the models involving the class [c.sub.2]. This means that after each classification with one of the binary models, one more class can be thrown out as a possible candidate, and after only N-1 steps just one candidate class remains, which therefore becomes the prediction for the current test example. This results in significantly faster testing, while achieving similar recognition rate as One-against-one.

3.4 Binary tree of SVM (BTS)

This method uses multiple SVMs arranged in a binary tree structure . A SVM in each node of the tree is trained using two of the classes. The algorithm then employs probabilistic outputs to measure the similarity between the remaining samples and the two classes used for training. All samples in the node are assigned to the two subnodes derived from the previously selected classes by similarity. This step repeats at every node until each node contains only samples from one class. The main problem that should be considered seriously here is training time, because aside training, one has to test all samples in every node to find out which classes should be assigned to which subnode while building the tree. This may decrease the training performance considerably for huge training datasets.

4 Support vector machines utilizing a binary decision tree

In this paper we propose a binary decision tree architecture that uses SVMs for making the binary decisions in the nodes. The proposed classifier

architecture SVM-BDT (Support Vector Machines utilizing Binary Decision Tree), takes advantage of both the efficient computation of the tree architecture and the high classification accuracy of SVMs. Utilizing this architecture, N-I SVMs needed to be trained for an N class problem, but only at most [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] SVMs are required to be consulted to classify a sample. This can lead to a dramatic improvement in recognition speed when addressing problems with big number of classes.

An example of SVM-BDT that solves a 7--class pattern recognition problem utilizing a binary tree, in which each node makes binary decision using a SVM is shown on Figure 3. The hierarchy of binary decision subtasks should be carefully designed before the training of each SVM classifier.

The recognition of each sample starts at the root of the tree. At each node of the binary tree a decision is being made about the assignment of the input pattern into one of the two possible groups represented by transferring the pattern to the left or to the right sub-tree. Each of these groups may contain multiple classes. This is repeated recursivly downward the tree until the sample reaches a leaf node that represents the class it has been assigned to.

There exist many ways to divide N classes into two groups, and it is critical to have proper grouping for the good performance of SVM-BDT.

For consistency between the clustering model and the way SVM calculates the decision hyperplane, the clustering model utilizes distance measures at the kernel space, rather than at the input space. Because of this, all training samples are mapped into the kernel space with the same kernel function that is to be used in the training phase.

[FIGURE 3 OMITTED]

The SVM-BDT method that we propose is based on recursively dividing the classes in two disjoint groups in every node of the decision tree and training a SVM that will decide in which of the groups the incoming unknown sample should be assigned. The groups are determined by a clustering algorithm according to their class membership.

Let's take a set of samples [X.sub.1], [X.sub.1] ..., [x.sub.M] each one labeled by [y.sub.i] [member of] {q, [c.sub.2] ..., [C.sub.N]} where N is the number of classes. SVM-BDT method starts with dividing the classes in two disjoint groups [g.sub.1] and [g.sub.2]. This is performed by calculating N gravity centres for the N different classes. Then, the two classes that have the biggest Euclidean distance from each other are assigned to each of the two clustering groups. After this, the class with the smallest Euclidean distance from one of the clustering groups is found and assigned to the corresponding group. The gravity center of this group is then recalculated to represent the addition of the samples of the new class to the group. The process continues by finding the next unassigned class that is closest to either of the clustering groups, assigning it to the corresponding group and updating the group's gravity center, until all classes are assigned to one of the two possible groups.

This defines a grouping of all the classes in two disjoint groups of classes. This grouping is then used to train a SVM classifier in the root node of the decision tree, using the samples of the first group as positive examples and the samples of the second group as negative examples. The classes from the first clustering group are being assigned to the first (left) subtree, while the classes of the second clustering group are being assigned to the (right) second subtree. The process continues recursively (dividing each of the groups into two subgroups applying the procedure explained above), until there is only one class per group which defines a leaf in the decision tree.

[FIGURE 4 OMITTED]

For example, Figure 4 illustrates grouping of 7 classes, while Figure 3 shows the corresponding decision tree of SVMs. After calculating the gravity centers for all classes, the classes [c.sub.2] and [c.sub.5] are found to be the furthest apart from each other, considering their Euclidean distance and are assigned to group [g.sub.1] and [g.sub.2] accordingly. The closest to group [g.sub.1] is class [c.sub.3], so it is assigned to the group [g.sub.1], followed by recalculation of the [g.sub.1]'s gravity center. In the next step, class [c.sub.1] is the closest to group [g.sub.2] so it is assigned to that group and the group's gravity center is recalculated. In the following iteration, class [c.sub.7] is assigned to [g.sub.1] and class c6 is assigned to [g.sub.2], followed by recalculating of group's gravity centers. Finally class c4 is assigned to [g.sub.1]. This completes the first round of grouping that defines the classes that will be transferred to the left and the right subtree of the root node. The SVM classifier in the root is trained by considering samples from the classes {[c.sub.2], [c.sub.3], [c.sub.4], [c.sub.7]} as positive examples and samples from the classes {[c.sub.1], [c.sub.5], [c.sub.6]} as negative examples.

The grouping procedure is repeated independently for the classes of the left and the right subtree of the root, which results in grouping [c.sub.7] and [c.sub.4] in [g.sub.1,1] and [c.sub.2] and [c.sub.3] in [g.sub.1,2] in the left node of the tree and [c.sub.1] and [c.sub.5] in [g.sub.2,1] and [c.sub.6] in [g.sub.2,2] in the right node of the tree. The concept is repeated for each SVM associated to a node in the taxonomy. This will result in training only N-1 SVMs for solving an N-class problem.

5 Related work and discussion

Various multi-class classification algorithms can be compared by their predictive accuracy and their training and testing times. The training time T for a binary SVM is estimated empirically by a power law  stating that T[approximately equal to][alpha][M.sup.d], where M is the number of training samples and [alpha] is a proportionality constant. The parameter d is a constant, which depends of the datasets and it is typically in the range [1, 2]. According to this law, the estimated training time for OvA is

[T.sub.OvA] [approximately equal to] N[alpha][M.sup.d], (11)

where N is the number of classes in the problem.

Without loss of generality, let's assume that each of the N classes has the same number of training samples. Thus, each binary SVM of OvO approach only requires 2M/N samples. Therefore, the training time for OvO is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (12)

The training time for DAGSVM is same as OvO.

As for BTS and SVM-BDT, the training time is summed over all the nodes in the [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] levels. In the ith level, there are [2.sup.i-l] nodes and each node uses 2M/N for BTS and M/[2.sup.i-l] for SVM-BDT training samples. Hence, the total training time for BTS is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (13)

and for SVM-BDT is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (14)

It must be noted that [T.sub.SVM-BDT] in our algorithm does not include the time to build the hierarchy structure of the N classes, since it consumes insignificant time compared to the quadratic optimization time that dominates the total SVM training time. On the other hand, in the process of building the tree, BTS requires testing of each trained SVM with all the training samples in order to determine the next step, therefore significantly increasing the total training time.

According to the empirical estimation above, it is evident that the training speed of SVM-BDT is comparable with OvA, OvO, DAGSVM and BTS.

In the testing phase, DAGSVM performs faster than OvO and OvA, since it requires only N-1 binary SVM evaluations. SVM-BDT is even faster than DAGSVM because the depth of the SVM-BDT decision tree is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] in the worst case, which is superior to N-1, especially when N>>2.

While testing, the inner product of the sample's feature vector and all the support vectors of the model are calculated for each sample. The total number of support vectors in the trained model directly contributes to the major part of the evaluation time, which was also confirmed by the experiments.

A multistage SVM (MSVM) for multi-class problem has been proposed by Liu et al. . They use Support Vector Clustering (SVC)  to divide the training data into two parts that are used to train a binary SVM. For each partition, the same procedure is recursively repeated until the binary SVM gives an exact label of class. An unsolved problem in MSVM is how to control the SVC to divide the training dataset into exact two parts. However, this procedure is painful and unfeasible, especially for large datasets. The training set from one class could belong to both clusters, resulting in decreased predictive accuracy.

There are different approaches for solving multi-class problems which are not based on SVM. Some of them are presented in the following discussion. However, the experimental results clearly show that their classification accuracy is significantly smaller than the SVM based methods.

Ensemble techniques have received considerable attention within the recent machine learning research [191. The basic goal is to train a diverse set of classifiers for a single learning problem and to vote or average their predictions. The approach is simple as well as powerful, and the obtained accuracy gains often have solid theoretical foundations   . Averaging the predictions of these classifiers helps to reduce the variance and often increases the reliability of the predictions. There are several techniques for obtaining a diverse set of classifiers. The most common technique is to use subsampling to diversify the training sets as in Bagging [211 and Boosting . Other techniques include the use of different feature subsets for every classifier in the ensemble , to exploit the randomness of the base algorithms , possibly by artificially randomizing their behavior , or to use multiple representations of the domain objects. Finally, classifier diversity can be ensured by modifying the output labels, i.e., by transforming the learning tasks into a collection of related learning tasks that use the same input examples, but different assignments of the class labels. Error-correcting output codes are the most prominent example for this type of ensemble methods .

Error-correcting output codes are a popular and powerful class binarization technique. The basic idea is to transform an N-class problem into n binary problems (n > N), where each binary problem uses a subset of the classes as the positive class and the remaining classes as a negative class. As a consequence, each original class is encoded as an n-dimensional binary vector, one dimension for each prediction of a binary problem (+1 for positive and -1 for negative). The resulting matrix of the form {-1, +1} Nxn is called the coding matrix. New examples are classified by determining the row in the matrix that is closest to the binary vector obtained by submitting the example to the n classifiers. If the binary problems are chosen in a way that maximizes the distance between the class vectors, the reliability of the classification can be significantly increased. Error-correcting output codes can also be easily parallelized, but each subtask requires the total training set.

Similar to binarization, some approaches suggest mapping the original multiple classes into three clsses. A related technique where multi-class problems are mapped to 3-class problems is proposed by Angulo and Catal'a . Like with pairwise classification, they propose generating one training set for each pair of classes. They label the two class values with target values +1 and -1, and additionally, samples of all other classes are labeled to a third class, with a target value of 0. This idea leads to increased size of the training set compared to the binary classification. The mapping into three classes was also used by Kalousis and Theoharis  for predicting the most suitable learning algorithm(s) for a given dataset. They trained a nearest-neighbor learner to predict the better algorithm of each pair of learning algorithms. Each of these pairwise problems had three classes: one for each algorithm and a third class named "tie", where both algorithms had similar performances.

Johannes Ftirnkranz has investigated the use of round robin binarization (or pair-wise classification)  as a technique for handling multi-class problems with separate-and-conquer rule learning algorithms (aka covering algorithms). In particular, round robin binarization helps Ripper  outperform C5.0 on multiclass problems, whereas C5.0 outperforms the original version of Ripper on the same problems.

6 Experimental results

In this section, we present the results of our experiments with several multi-class problems. The performance was measured on the problem of recognition of handwritten digits and letters.

Here, we compare the results of the proposed SVM-BDT method with the following methods:

1) one-against-all (OVA);

2) one-against-one (OvO);

3) DAGSVM;

4) BTS;

5) Bagging

6) Random Forests

7) Multilayer Perceptron (MLP, neural network)

The training and testing of the SVMs based methods (OvO, OvA, DAGSVM, BTS and SVM-BDT) was performed using a custom developed application that uses the Torch library . For solving the partial binary classification problems, we used SVMs with Gaussian kernel. In these methods, we had to optimize the values of the kernel parameter [sigma] and penalty C. For parameter optimization we used experimental results. The achieved parameter values for the given datasets are given in Table 1.

We also developed an application that uses the same (Torch) library for the neural network classification. One hidden layer with 25 units was used by the neural network. The number of hidden units was determined experimentally.

The classifications based on ensembles of decision trees  (Bagging and Random Forest) was performed by Clus, a popular decision tree learner based on the principles stated by Blockeel et al. . There were 100 models in the ensembles. The pruning method that we used was C4.5. The number of selected features in the Random Forest method was [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where M is the number of features in the dataset.

The most important criterion in evaluating the performance of a classifier is usually its recognition rate, but very often the training and testing time of the classifier are equally important.

In our experiments, four different multi-class classification problems were addressed by each of the eight previously mentioned methods. The training and testing time and the recognition performance were recorded for every method.

The first problem was recognition of isolated handwritten digits (10 classes) from the MNIST database. The MNIST database  contains grayscale images of isolated handwritten digits. From each digit image, after performing a slant correction, 40 features were extracted. The features are consisted of 10 horizontal, 8 vertical and 22 diagonal projections . The MNIST database contains 60.000 training samples, and 10.000 testing samples.

The second and the third problem are 10 class problems from the UCI Repository  of machine learning databases: Optdigit and Pendigit. Pendigit has 16 features, 7494 training samples, and 3498 testing samples. Optdigit has 64 features, 3823 training samples, and 1797 testing samples.

The fourth problem was recognition of isolated handwritten letters--a 26-class problem from the Statlog collection . Statlog-letter contains 15.000 training samples, and 5.000 testing samples, where each sample is represented by 16 features.

The classifiers were trained using all available training samples of the set and were evaluated by recognizing all the test samples from the corresponding set. All tests were performed on a personal computer with an Intel Core2Duo processor at 1.86GHz with the Windows XP operating system.

Tables 2 through 4 show the results of the experiments using 8 different approaches (5 approaches based on SVM, two based on ensembles of decision trees and one neural network) on each of the 4 data sets. The first column of each table describes the classification method. Table 2 gives the prediction error rate of each method applied on each of the datasets. Table 3 and table 4 shows the testing and training time of each algorithm, for the datasets, measured in seconds, respectively.

The results in the tables show that SVM based methods outperform the other approaches, in terms of classification accuracy. In terms of speed, SVM based methods are faster, with different ratios for different datasets. In overall, the SVM based algorithms were significantly better compared to the non SVM based methods.

The results in table 2 show that for all datasets, the one-against-all (OVA) method achieved the lowest error rate. For the MNIST, Pendigit and Optdigit datasets, the other SVM based methods (OvO, DAGSVM, BTS and our method--SVM-BDT) achieved higher, but similar error rates. For the recognition of handwritten letters from the Statlog database, the OvO and DAGSVM methods achieved very similar error rates that were about 1.5% higher than the OvA method. The BTS method showed the lowest error rate of all methods using one-against-one SVMs. Our SVM-BDT method achieved better recognition rate than all the methods using one-against-one SVMs, including BTS. Of the non SVM based methods, the Random Forest method achieved the best recognition accuracy for all datasets. The prediction performance of the MLP method was comparable to the Random Forest method for the 10-class problems, but noticeably worse for the 26-class problem.

The MLP method is the fastest one in terms of training and testing time, which is evident in Table 3 and Table 4. The classification methods based on ensembles of trees were the slowest in the training and the testing phase, especially the Bagging method. Overall, the Random Forest method was more accurate than the other non SVM based methods, while the MLP method was the fastest.

The results in Table 3 show that the DAGSVM method achieved the fastest testing time of all the SVM based methods for the MNIST dataset. For the other datasets, the testing time of DAGSVM is comparable with BTS and SVM-BDT methods and their testing time is noticeably better than the one-against-all (OVA) and one-against-one (OvO) methods. The SVM-BDT method was faster in the recognition phase for the Pendigit dataset and slightly slower than DAGSVM method for the Statlog dataset.

In terms of training speeds, it is evident in Table 4 that among the SVM based methods, SVM-BDT is the fastest one in the training phase. For the three 10-class problems the time needed to train the 10 classifiers for the OvA approach took about 4 times longer than training the 45 classifiers for the OvO and DAGSVM methods. Due to the huge number of training samples in the MNIST dataset (60000), SVM-BDT's training time was longer compared to other one-against-one SVM methods. The huge number of training samples increases the nonlinearity of the hyperplane in the SVM, resulting in an incresed number of support vectors and increased training time. Also, the delay exists only in the first level of the tree, where the entire training dataset is used for training. In the lower levels, the training time of divided subsets is not as significant as the first level's delay. In the other 10 class problems, our method achieved the shortest training time. For the Statlog dataset, the time needed for training of the 26 one-against-all SVMs was almost 7 times longer than the time for training the 325 one-against-one SVMs. The BTS method is the slowest one in the training phase of the methods using one-against-one SVMs. It must be noted that as the number of classes in the dataset increases, the advantage of SVM-BDT becomes more evident. The SVM-BDT method was the fastest while training, achieving better recognition rate than the methods using one-against-one SVMs. It was only slightly slower in recognition than DAGSVM.

7 Conclusion

A novel architecture of Support Vector Machine classifiers utilizing binary decision tree (SVM-BDT) for solving multiclass problems was presented. The SVM-BDT architecture was designed to provide superior multi-class classification performance, utilizing a decision tree architecture that requires much less computation for deciding a class for an unknown sample. A clustering algorithm that utilizes distance measures at the kernel space is used to convert the multi-class problem into binary decision tree, in which the binary decisions are made by the SVMs. The results of the experiments show that the speed of training and testing are improved, while keeping comparable or offering better recognition rates than the other SVM multi-class methods. The experiments showed that this method becomes more favourable as the number of classes in the recognition problem increases.

Received: July 27, 2008

References

 V. Vapnik. The Nature of Statistical Learning Theoo,, 2nd Ed. Springer, New York, 1999.

 C. J. C. Burges. A tutorial on support vector machine for pattern recognition. Data Min. Knowl. Disc. 2 (1998) 121.

 T. Joachims. Making large scale SVM learning practical, in B. Scholkopf, C. Bruges and A. Smola (eds). Advances in kernel methods-support vector learning, MIT Press, Cambridge, MA, 1998.

 R. Fletcher. Practical Methods of Optimization. 2nd Ed. John Wiley & Sons. Chichester (1987).

 J. Weston, C. Watkins. Multi-class support vector machines. Proceedings of ESANN99, M. Verleysen, Ed., Brussels, Belgium, 1999.

 V. Vapnik. Statistical Learning Theol. Wiley, New York, 1998.

 J. H. Friedman. Another approach to polychotomous classification. Technical report. Department of Statistics, Stanford University, 1997.

 P. Xu, A. K. Chan. Support vector machine for multi-class signal classification with unbalanced samples. Proceedings of the International Joint Conference on Neural Networks 2003. Portland, pp.1116-1119, 2003.

 Platt, N. Cristianini, J. Shawe-Taylor. Large margin DAGSVM's for multiclass classification. Advances in Neural Information Processing System. Vol. 12, pp. 547-553, 2000.

 B. Fei, J. Liu. Binary Tree of SVM: A New Fast Multiclass Training and Classification Algorithm. IEEE Transaction on neural networks, Vol. 17, No. 3, May 2006.

 X. Liu, H. Xing, X. Wang. A multistage support vector machine. 2nd International Conference on Machine Learning and Cybernetics, pages 13051308, 2003.

 A. Ben-Hur, D. Horn, H. Siegelmann, V. Vapnik. Support vector clustering. Journal of Machine Learning Research, vol. 2:125-137, 2001.

 J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods--Support Vector Learning. Pages 185-208, Cambridge, MA, 1999. MIT Press.

 R. Collobert, S. Bengio, J. Mariethoz. Torch: a modular machine learning software library. Technical Report IDIAP-RR 02-46, IDIAP, 2002.

 --, MNIST, MiniNIST, USA http://yann.lecun.com/exdb/mnist

 T. G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18(4): 97-136, Winter 1997.

 G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli (eds.) First International Workshop on Multiple Classifier Systems, pp. 1 15. Springer-Verlag, 2000a.

 D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169-198, 1999.

 E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105-169, 1999.

 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997.

 L. Breiman. Bagging predictors. Machine Learning, 24(2):123 140, 1996.

 T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

 S. D. Bay. Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis, 3(3): 191-209, 1999.

 J. F. Kolen and J. B. Pollack. Back propagation is sensitive to initial conditions. In Advances in Neural Information Processing Systems 3 (NIPS90), pp. 860 867. Morgan Kaufmann, 1991.

 T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000b.

 C. Angulo and A. Catal'a. K-SVCR. A multi-class support vector machine. In R. L'opez de M'antaras and E. Plaza (eds.) Proceedings of the 11th European Conference on Machine Learning (ECML-2000), pp. 31-38. Springer-Verlag, 2000.

 A. Kalousis and T. Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5):319-337, 1999.

 Johannes Ffirnkranz, Round robin classification, The Journal of Machine Learning Research, 2, p.721-747, 3/1/2002

 W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell (eds.) Proceedings of the 12th International Conference on Machine Learning (ML-95), pp. 115-123, Lake Tahoe, CA, 1995. Morgan Kaufmann.

 D. Kocev, C. Vens, J. Struyf and S. D'zeroski. Ensembles of multi-objective decision trees. Proceedings of the 18th European Conference on Machine Learning (pp. 624-631) (2007). Springer.

 H. Blockeel, J. Struyf. Efficient Algorithms for Decision Tree Cross-validation. Journal of Machine Learning Research 3:621-650, 2002.

 D. Gorgevik, D. Cakmakov. An Efficient Three-Stage Classifier for Handwritten Digit Recognition. Proceedings of 17th Int. Conference on Pattern Recognition, ICPR2004. Vol. 4, pp. 507-510, IEEE Computer Society, Cambridge, UK, 23-26 August 2004.

 C. Blake, E. Keogh and C. Merz. UCI Repository of Machine Learning Databases, (1998). Statlog Data Set, http://archive.ics.uci.edu/ml/datasets.html [Online]

 Statlog Data Set, http://archive.ics.uci.edu/ml/datasets/Letter+Recognition [Online]

Gjorgji Madzarov, Dejan Gjorgjevikj and Ivan Chorbev

Department of Computer Science and Engineering

Faculty of Electrical Engineering and Information Technology

Karpos 2 b.b., 1000 Skopje, Macedonia

```Table 1. The optimized value for [sigma] and C for the use datasets.

MNIST Pendigit Optdigit Statlog

[sigma] 2 60 25 1.1
C 100 100 100 100

Table 2. The prediction error rate (%) of each method for every dataset

Classifier MNIST Pendigit Optdigit Statlog

OvA 1.93 1.70 1.17 3.20
OvO 2.43 1.94 1.55 4.72
DAGSVM 2.50 1.97 1.67 4.74
BTS 2.24 1.94 1.51 4.70
SVM-BDT 2.45 1.94 1.61 4.54
R. Forest 3.92 3.72 3.18 4.98
Bagging 4.96 5.38 7.17 8.04
MLP 4.25 3.83 3.84 14.14

Table 3. Testing time of each method for every dataset measured in
seconds

Classifier MNIST Pendigit Optdigit Statlog

OvA 23.56 1.75 1.63 119.50
OvO 26.89 3.63 1.96 160.50
DAGSVM 9.46 0.55 0.68 12.50
BTS 26.89 0.57 0.73 17.20
SVM-BDT 25.33 0.54 0.70 13.10
R. Forest 39.51 3.61 2.76 11.07
Bagging 34.52 2.13 1.70 9.76
MLP 2.12 0.49 0.41 1.10

Table 4. Training time of each method for every dataset
measured in seconds

Classifier MNIST Pendigit Optdigit Statlog

OvA 468.94 4.99 3.94 554.20
OvO 116.96 3.11 2.02 80.90
DAGSVM 116.96 3.11 2.02 80.90
BTS 240.73 5.21 5.65 387.10
SVM-BDT 304.25 1.60 1.59 63.30
R. Forest 542.78 17.08 22.21 50.70
Bagging 3525.31 30.87 49.40 112.75
MLP 45.34 2.20 1.60 10.80
```
COPYRIGHT 2009 Slovenian Society Informatika
No portion of this article can be reproduced without the express written permission from the copyright holder.