# Fast Decision Tree Algorithm

I. INTRODUCTION

A pattern is a representation of an entity in a p-dimensional space E. Usually the pattern $x_i$ is a vector

$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ (1)

where $x_{ij}$ is feature j of the pattern $x_i$, $x_{ij} \in \mathbb{R}$, and $E \subset \mathbb{R}^p$.

Patterns that are similar to each other are grouped into equivalence classes. These classes are regions in the space E and, depending on the separation areas between them, class separability can be of two types: linear and nonlinear. Binary decision trees can be used successfully for pattern classification, especially in the case of nonlinear class separability. The method consists of several steps: the classes that can potentially be assigned to a pattern $x_i$ are rejected one by one until one of them is accepted. All these steps are preceded by one generic step, called the learning step, in which a training set is provided to the pattern recognition system:

$T = \{(x_i, \sigma_i) \mid x_i \in E_T, \sigma_i \in \{1, 2, \ldots, M\}\}$ (2)

where $E_T \subset E$ and M is the number of classes in $E_T$.

The pattern space is divided into several regions, one for each class (Fig. 1). Using these regions, a binary decision tree can be created that allows automatic classification of any "unseen" pattern. The sequence of decisions in this tree is applied to individual characteristics (attributes). For continuous attributes the tests are usually simple conditions of the form $x_j < t$, where t is a threshold value. For discrete attributes the tests are of the form $x_j \in \{v_{j1}, v_{j2}, \ldots, v_{jq}\}$, where $v_{ji}$ are the possible values of the attribute.

The root and the internal nodes of such a tree represent the decisions of the tree and the leaves represent the outcome (the class to which a pattern belongs). These trees are called ordinary binary classification trees (OBCTs).

The principle of pattern recognition is illustrated with the example from Fig. 1, where p = 2 and M = 3. Fig. 2 shows the decision tree that corresponds to the three regions delimited in Fig. 1 for classes C1 and C2. In a decision tree it is possible to classify a pattern without consulting the values of all attributes. Note that a decision tree built from a training set T is not unique. In the example from Fig. 1 the tree was built by visual examination of the geometry of pattern placement in the plane; obviously, this method cannot be applied for p > 3. A systematic method that works for p > 3 requires the following to be taken into account:

* the set of patterns E is associated with the root node

* each node t has a subset of patterns $F_t$ from E

* by evaluating a condition to true or false, the subset $F_t$ is split further into two disjoint subsets $F_{tY}$ and $F_{tN}$

In order to build the decision tree the following have to be defined:

* a criterion to decide the best way to split the set of patterns

* a rule to decide when to stop splitting the set of patterns; such a rule controls the tree growth and decides whether a certain node becomes a leaf

* a decision rule for classification in leaves

The set of tests is very important in the decision tree induction process. Let n be the number of patterns in E. Each attribute $x_k$, k = 1, ..., p, can have a number of distinct values $n_k \le n$. For each attribute there is a number $n_{kt}$ of threshold values, where each threshold is the midpoint between two consecutive values of attribute $x_k$ in the pattern space. The total number of tests is:

$\sum_{k=1}^{p} n_{kt}$ (3)

One key aspect of decision tree induction is to minimize the number of tests, which is equal to the number of nodes that are not leaves. Classification is a very important task in data mining, and decision trees are probably the most used algorithms for solving classification problems. They can be used as standalone algorithms or combined with other algorithms such as AdaBoost, random trees, etc.

II. THE DATA COMPRESSION ALGORITHM

One of the most challenging tasks for the data mining community is to develop classifiers that can mine very large datasets. This basically means extending the existing algorithms or developing new ones that can scale up to large datasets. Decision trees are widely used in various fields such as banking, statistics, gas and oil exploration, astronomy, and speech recognition, to name just a few. Many algorithms for learning decision trees [6-11] have been developed: ID3, C4.5, C5.0, CART, CHAID, SLIQ, SPRINT, QUEST. C4.5 and C5.0 were developed by Ross Quinlan and are considered the de facto standard when evaluating new decision tree algorithms. A comparison of the above algorithms was done by Quinlan and the results were published on his web site (http://www.rulequest.com).

An excellent survey of the scalability of decision tree algorithms was done by Provost. Apparently the most common reason for scaling up is that increasing the number of training patterns will often result in a classifier with better accuracy. If we are to ask the question "How large is very large?", one could answer as follows:

"Somewhere around data sizes of 100 megabytes or so, qualitatively new, very serious scaling problems begin to arise, both on the human and on the algorithmic side."

One may ask the following question: why not use a well-known compression/decompression algorithm such as gzip? Unfortunately, such algorithms cannot be used, as we need to read and decompress (unpack) data from random offsets within the dataset during the decision tree induction.

We propose a fast data compression algorithm that compresses/decompresses data at dataset record level and can be used in pretty much any decision tree algorithm.

A good decision tree algorithm should be able to handle patterns with two types of attributes: discrete and continuous. A discrete attribute can have a finite set of values as follows:

$v_i = (v_{i1}, v_{i2}, \ldots, v_{iq})$ (4)

where i denotes the i-th attribute and q is the number of possible values for this attribute. We can encode each of these values using a number of bits as follows:

$bits = \lceil \log_2(q) \rceil$ (5)

As the continuous (numeric) attributes can have a very large set of possible values, we cannot represent them using bits as we do for the discrete ones. Instead, we have to use a discretization mechanism: we split the possible values into intervals and then use a number of bits to encode each interval. The simplest method of dividing these values into intervals, called equal-width intervals, is based on information gain and consists of the following steps: sort the values in ascending order, split them into a number of equal-length intervals, and for each two adjacent intervals $(t_{m-1}, t_m)$ and $(t_m, t_{m+1})$ calculate the information gain against the current dataset for the splitting point $t_m$. The threshold $t_m$ that maximizes the information gain is chosen, so the splitting criterion for attribute $a_i$ can be represented as:

$v_{ij} \le t_m, \quad v_{ik} > t_m \mid k \ne j$ (6)

These calculations have to be done at each splitting node, against the dataset that reaches it, which can be time consuming, especially for large datasets. As we need to discretize the attributes with continuous values beforehand, we cannot use this approach for our algorithm. Kerber proposed a discretization algorithm based on the chi-square statistic that calculates the splitting intervals just once for the whole dataset. This approach is also used in CHAID and related algorithms and usually yields better results than the equal-width intervals method in terms of classification accuracy.

Once all attribute values are discretized, we calculate the number of bits required to store all distinct values of each attribute. We can represent each attribute value by a number between 0 and q-1 and pack it into a sequence of bits within a primitive data type such as int or long, depending on the operating system and processor: for a 32-bit operating system we use an int (4 bytes), for a 64-bit operating system a long (8 bytes), in order to maximize the data transfer rate between memory and processor. Each attribute has an offset that identifies the position of its value within the packed data. Depending on how many attributes the dataset has and the number of bits required for each attribute, we may need more than one int or long to store a record from the dataset. Packing and unpacking data reduce to simple bit operations:

The variable a.bits is the number of bits required to encode the values of a particular attribute and a.offs is the attribute's offset. Our experiments show that the amount of memory required to store the dataset in packed format is about one order of magnitude lower than in its unpacked format. This means that we can build decision trees against much larger datasets that otherwise would not fit in the computer's memory. The packing operations for a dataset can easily be parallelized, as every thread can read and pack data at different offsets within the dataset, so the memory write operations will not clash. A drawback of this method is that we use a fixed offset for every attribute value that is encoded. For example, if we use a multiple of 64 bits (the long type) for a pattern that requires 65 bits to be encoded, then 63 bits are wasted. The solution is to pack the data more tightly, so that the first bit of the next pattern is stored next to the last bit of the previous pattern; this will be a topic of our future research.

III. THE PRUNING ALGORITHM

Pruning is a very important step in decision tree induction: it should reduce the size of a decision tree without reducing its predictive accuracy, and it also reduces the risk of overfitting the training data. Several methods of pruning decision trees exist in the literature: the minimum description length principle, multiple comparison (MC) analysis, VC dimension, Bayesian methods, forest tree pruning, and genetic algorithms. A comparison of reduced error pruning methods has also been published. Reduced error pruning is known to be one of the fastest pruning methods. It uses a bottom-up approach and consists of replacing a node with a leaf whose value is one of the possible values of the attribute linked to that particular node. The process is iterative and stops when no more accuracy improvement is achieved. The node to be replaced is the one whose replacement by a leaf maximizes the tree accuracy. We propose a very fast pruning algorithm in which the calculations required to decide which node is to be replaced with a leaf are done incrementally. Every node in the tree has three extra attributes:

* hits = number of patterns that reach the node

* correct = number of patterns classified correctly that reach the node

* cls[] = class distribution for the patterns that reach the node

Each pattern from the test dataset is classified by the tree just once and the variables hits, correct and cls[] are updated accordingly for every node. Next, a list of nodes that have only leaves as children is built. For each node in the list we calculate the number of patterns classified correctly by the node (among the ones that reach it) if we were to replace that node with a leaf whose value equals the most common class in the cls[] vector.

If the following condition is true:

diff = max(node.cls[]) − node.correct ≥ 0 (7)

then we add that node to a sorted map in which the keys are kept in ascending order. The key is the diff value and the value is a vector of nodes. While the sorted map is not empty, the following steps are executed:

* get the last key from the map (the biggest one)

* read the vector of nodes for that key

* replace the last node in the vector with a leaf whose value is the most common class in node.cls[]

* add the key value to leaf.correct

* remove the node from the vector

* if the vector is empty, remove the key from the map

* for every parent node up to the root, add the key value to its correct value

* apply formula (7) to the parent node of the current node and, if diff ≥ 0, add the parent node to the map

In Fig. 3 one may see the results for the Iris dataset. The Iris dataset has a total of 150 patterns, of which 75 are used as training data and the remaining 75 as test data.

As one may see, it is very easy to calculate the tree's classification accuracy using the following formula:

accuracy = root.correct / root.hits · 100 (8)

In Fig. 3 the first pair (75/73) on the root node represents the (hits/correct) variables and the numbers enclosed in square brackets [25,25,25] are the class distribution vector of the root node. Obviously, all patterns from the dataset reach the root node. The algorithm is shown in pseudo code:

IV. EXPERIMENTAL RESULTS

In order to validate the proposed algorithms, three datasets from the UCI machine learning repository have been used:

* sleep - 105,908 patterns, 6 numeric attributes, 6 classes

* adult - 48,842 patterns, 6 numeric attributes, 8 discrete attributes, 2 classes

* poker - 1,000,000 patterns, 10 discrete attributes, 10 classes

All datasets have been split in half: the odd records have been used as training data and the even ones as test data. We used the same training and test data for C5.0 and for our algorithm. The test application is written in C++ and compiled with the GNU C++ compiler on a 64-bit Linux platform. Our algorithms have been tested against Quinlan's C5.0 open source implementation, which was compiled from sources on the same platform. For every test we compared the classification accuracy, the tree size and the execution time. Each test was also repeated against a ten times bigger dataset (every dataset concatenated to itself ten times).

The values in bold represent the better values. As one may see in Table I, our algorithm outperforms C5.0 in classification accuracy in all three tests and is slightly faster in two of them. However, C5.0 produces much smaller decision trees. We repeated the experiments with the same datasets, but ten times larger.

One may see in Table II that the difference between execution times widens in favor of our algorithm. The pruning algorithm complexity depends on two factors. The first one is the complexity of the classify() function. If m is the number of attributes and n is the number of patterns in the test dataset, then the complexity of the classify() function can be expressed as O(mn), because the maximum depth of the tree is equal to m. On the other hand, the complexity of the foreach and while loops can be expressed as O(sm), where s is the number of nodes in the tree before pruning. We have included m in this expression because of the inner while loop that incrementally updates the correct value for all parent nodes up to the root when a node is replaced with a leaf. We have observed in our tests that the number of nodes in the tree before pruning has the same order of magnitude as the size of the test dataset, so we can approximate the complexity of the pruning algorithm with O(mn).

V. CONCLUSION

We proposed a very fast algorithm for building decision trees that can be used successfully against large datasets. Some parts of it can easily be parallelized (e.g. the discretization of continuous attributes), which should shorten the overall execution time. Further research is also required on the data packing algorithm in order to reduce the amount of memory required.

REFERENCES

[1] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005.

[2] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, Aug. 1997.

[3] S. Chakrabarti, Data Mining: Know It All. Burlington, MA: Elsevier/Morgan Kaufmann Publishers, 2009.

[4] E. C. Vasconcellos, R. R. de Carvalho, R. R. Gal, F. L. LaBarbera, H. V. Capelato, H. F. C. Velho, M. Trevisan, and R. S. R. Ruiz, "Decision tree classifiers for star/galaxy separation," The Astronomical Journal, vol. 141, no. 6, p. 189, Jun. 2011.

[5] K.-H. Kim and J.-H. Kim, "Domain independent vocabulary generation and its use in category-based small footprint language model," Advances in Electrical and Computer Engineering, vol. 11, no. 1, pp. 77-84, 2011.

[6] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, Mar. 1986.

[7] S. L. Salzberg, "C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993," Machine Learning, vol. 16, no. 3, pp. 235-240, Sep. 1994.

[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, 1999.

[9] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: a fast scalable classifier for data mining," in Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, London, UK, 1996, pp. 18-32.

[10] J. C. Shafer, R. Agrawal, and M. Mehta, "SPRINT: a scalable parallel classifier for data mining," in Proceedings of the 22nd International Conference on Very Large Data Bases, San Francisco, CA, USA, 1996, pp. 544-555.

[11] W.-Y. Loh and Y.-S. Shih, "Split selection methods for classification trees," 1997.

[12] F. Provost and V. Kolluri, "A survey of methods for scaling up inductive algorithms," Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 131-169, Jun. 1999.

[13] P. Huber, "From large to huge: a statistician's reactions to KDD & DM," 1997, p. 304.

[14] R. Kerber, "ChiMerge: discretization of numeric attributes," in Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, California, 1992, pp. 123-128.

[15] J. Ouyang, N. Patel, and I. K. Sethi, "Chi-square test based decision trees induction in distributed environment," in IEEE International Conference on Data Mining Workshops (ICDMW '08), 2008, pp. 477-485.

[16] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle," Information and Computation, vol. 80, no. 3, pp. 227-248, Mar. 1989.

[17] D. Jensen and M. D. Schmill, "Adjusting for multiple comparisons in decision tree pruning," in KDD, 1997, pp. 195-198.

[18] M. Kearns and Y. Mansour, "A fast, bottom-up decision tree pruning algorithm with near-optimal generalization," in Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 269-277.

[19] W. Zhang and Y. Li, "A post-pruning decision tree algorithm based on Bayesian," in 2013 Fifth International Conference on Computational and Information Sciences (ICCIS), 2013, pp. 988-991.

[20] H. Guo, M. Fan, and Y. Ye, "Forest pruning based on tree-node order," in 2011 IEEE International Conference on Computer Science and Automation Engineering (CSAE), 2011, vol. 3, pp. 71-76.

[21] J. Chen, X. Wang, and J. Zhai, "Pruning decision tree using genetic algorithms," in International Conference on Artificial Intelligence and Computational Intelligence (AICI '09), 2009, vol. 3, pp. 244-248.

[22] W. N. H. W. Mohamed, M. N. M. Salleh, and A. H. Omar, "A comparative study of reduced error pruning method in decision tree algorithms," in 2012 IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2012, pp. 392-397.

Vasile PURDILA, Stefan-Gheorghe PENTIUC

Stefan cel Mare University of Suceava, 720229, Romania

vasilep@eed.usv.ro, pentiuc@eed.usv.ro

This work was supported by the project "Improvement of the doctoral studies quality in engineering science for development of the knowledge based society-QDOC" contract no. POSDRU/107/1.5/S/78534, project co-funded by the European Social Fund through the Sectorial Operational Program Human Resources 2007-2013.

Digital Object Identifier 10.4316/AECE.2014.01010
TABLE I. EXPERIMENTAL RESULTS

                      C5.0                               Our algorithm
Dataset    Records    Errors %  Leaves  Time (sec)   Errors %  Leaves   Time (sec)
sleep        105,908  26.80      2,071  1.10         23.18      10,214  0.60
poker      1,000,000  20.60     14,067  3.40         16.47     104,920  2.90

TABLE II. EXPERIMENTAL RESULTS (DATASET × 10)

                       C5.0          Our algorithm
Dataset    Records     Time (sec)    Time (sec)
sleep      1,059,080   15.30         5.40