# A novel method of data stream classification based on incremental storage tree.

1. IntroductionNowadays, tremendous amounts of data streams are generated in many applications such as e-commerce, wireless sensor network, network monitor, telecommunication system, stock market and meteorological data. The continuous arrival of data streams in multiple, rapid, time-varying, and potentially unbounded streams raises new challenges and research problems. Indeed, it is usually not feasible to simply store the arriving data in a traditional database management system in order to perform operations on that data later on. Rather, data stream must generally be processed in an online, incremental manner so as to guarantee that results are up-to-date and that queries can be answered with small time delay. One important data mining problem which has been studied in the context of data streams is that of classification [1]. The main thrust on data stream mining in the context of classification has been that of one-pass mining [2,3].In general, the use of one-pass mining does not recognize the changes which have occurred in the model since the beginning of the stream construction process. While the work in works on time changing data streams, the focus is on providing effective methods for incremental updating of the classification model.

Existing data stream classification approaches [4,5,6,7,8,9,10,11] are largely classified into distance, decision theory, and structure-based approaches. The distance-based approach [10,11] uses a distance metric to choose a class that is closest to the query. Well-known distance metrics such as Euclidean distance and Dynamic Time Warping (DTW) are commonly used. The decision theory-based approach [4, 5, 9] generally uses a statistical or machine learning classifier, such as Bayesian classifiers, Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), genetic algorithms, micro-clusters, etc. These methods select the maximum likelihood class using prior probabilities and distribution statistics. Finally, the structure-based approach [6, 7, 8] builds classification rules in the form of a tree or graph structure, and then chooses the class with the most similar structure.

This paper proposes a Bayesian classification data mining algorithm based on incremental storage tree. Use sliding window to process data stream and divide it into several basic unit, apply Principal component analysis (PCA) to compress the data from window and produce dynamic incremental storage tree, use the second power strategy to continuously update several incremental storage tree. Then use multi-classifier integration technology combines with Bayesian classification to produce Bayesian classifier. At last, adjust the weight of classifier by cyclic test, and get high classification accuracy. Detailed simulation analysis demonstrates that the presented algorithm is of high efficiency of space and time and is more stable.

A review of background is given in section 2. In section 3, we introduce the structure and update method of incremental storage tree. In section 4, we introduce the structure of Bayesian classifier based on incremental storage tree. The experimental results of comparing the algorithm proposed in this paper with other algorithms are also presented in section 5. Finally, our work of this paper is summarized in the last section.

2. Background

2.1 Related work

At present, more algorithms have been used in the data stream classification. Zhang and Zhou [12] proposed a novel frame work that first categorized diverse training examples into four types and assigned learning priorities to them. Then, they derived four learning cases based on the proportion and priority of the different types of training examples. Shaker et al [13] proposed a novel model class called fuzzy pattern trees (FPTs) for machine learning. They considered the problem of learning fuzzy pattern trees for binary classification from data streams. Apart from its practical relevance, this problem is also interesting from a methodological point of view. First, the aspect of efficiency plays an important role in the context of data streams, since learning has to be accomplished under hard time (and memory) constraints. Moreover, a learning algorithm should be adaptive in the sense that an up-to-date model is offered at any time, taking new data items into consideration as soon as they arrived and perhaps forgetting old ones that have become obsolete due to a change of the underlying data generating process. To meet these requirements, they developed an evolving version of fuzzy pattern tree learning, in which model adaptation is realized by anticipating possible local changes of the current model, and confirming these changes through statistical hypothesis testing. Seo et al [14] proposed new classification models that exploited temporal relations among features. They showed that consideration of such dependencies does significantly improve the classification accuracy. Another benefit of employing temporal relations is the improved interpretability of the resulting classification models, as the set of temporal relations can be easily translated to a rule using a sequence of inter-dependent events characterizing the class. Zhang et al [15] proposed two novel learning models, transfer semi-supervised SVM(TS3VM) and relational k-means-based TS3VM (RK-TS3VM), for training.

2.2 Data stream and sliding window

Let t denote any of the timestamp; [s.sub.t] denote the data elements arrived at t, then, data stream can be represented as infinite collection {... [s.sub.t-1], [s.sub.t], [s.sub.t+1] ...}. When extend to multiple data streams, let {[x.sub.1], [x.sub.2],... [x.sub.n]} denote n data streams at t, and in which [x.sub.1] = {... [s.sub.t-1], [s.sub.t], [s.sub.t+1], ...}.

Sliding window is suitable to the application which only interests in current data, the algorithm only keeps and mines data in fixed window breadth, the window position slides as data stream flows.

2.3 Principal component analysis model

Principal component analysis [16] is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

The 1st-principal component: define vector [a.sub.1] under the constraint conditions of [a.sub.1.sup.T] [a.sub.1] = 1, make Var ([Y.sub.1]) = [a.sub.1.sup.T] [summation] [a.sub.1] to be the maximum, that is, [Y.sub.1] = [a.sub.1] X, in which [SIGMA] is the covariance matrix.

The kth-principal component: if [Y.sub.1], [Y.sub.2], ..., [Y.sub.k-1] is not enough to reflect the information of raw variable, then construct further: [Y.sub.k] = [a.sup.T.sub.k] X = [a.sub.k1], [x.sub.1] + [a.sub.k2] [x.sub.2] + ... + [a.sub.kp], [x.sub.p].

Define vector [a.sub.k] under the constraint conditions of [a.sup.T.sub.k] [a.sub.k] = 1 and Cov ([Y.sub.k], [Y.sub.i]) = [a.sup.T.sub.k] [summation] [a.sub.i] = 0, (i = 1, 2,..., k -1) make Var ([Y.sub.k]) = [a.sup.T.sub.k] [SIGMA] [a.sub.k] to the maximum, that is [Y.sub.k] = [a.sub.k]X.

3. The structure and update of incremental storage tree

According to characteristic of Bayesian algorithm, when we calculate P(X | [C.sub.i]), the algorithm uses the product of classification and certain attribute, that is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Then we only need to save the data as the tree to classify category and each attribute, and achieve the storage requirement of Bayesian algorithm.

3.1. Dynamic incremental storage tree

Definition 1. suppose there are m classes and n attributes in training sample, they are [C.sub.1], [C.sub.2], ..., [C.sub.m], and [A.sub.1], [A.sub.2], ..., [A.sub.n], according to the characteristic of Bayesian classification, construct a storage tree. The tree has four layers, root is denoted as root node, attribute [A.sub.i] is the second layer, the value of attribute [A.sub.ij] is the third layer, and class [C.sub.i] is leaf node. The structure is shown in Figure 1.

(1) Apply PCA to compress the data in sliding window.

(2) Initialize incremental storage tree [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and root node is ([K.sub.i]_root, 0).

(3) When new data arrives, seek along the incremental storage tree, if the data exists in the storage tree, then rc + 1, [n.sub.i] + 1, [r.sub.j] + 1, [m.sub.i] + 1; if the data doesn't exist in the storage tree, then produce attribute [A.sub.i], the value of attribute A., class leaf node C and initialize to get rc = 1, [n.sub.i] = 1, [r.sub.j] = 1, [m.sub.i] = 1

[FIGURE 1 OMITTED]

Note: Two regions root_name (RN) and root_count (rc) make up Rt point in the storage tree, RN represents tree name, rc represents the record count of training sample, [A.sub.i] is made up by attributename (AN) and [n.sub.i] AN represents attribute name, [n.sub.i] represents the record count of certain attribute, [A.sub.ij] is made up by value name (VN) and [r.sub.j], VN represents the name of attribute value, [r.sub.j] represents the record count of certain value of certain attribute, [C.sub.m] is made up by class_name (CN) and [m.sub.k], CN represents class name, [m.sub.k] represents the record count of certain class of certain value of certain attribute.

3.2 The structure and update of dynamic incremental storage tree

This paper introduces the idea of the second power in the process of updating storage trees, it doesn't update all storage trees when new data window p arrives, but updates the first i storage trees meeting conditions 2 i- 1 < p < [2.sup.i], while update all storage trees when p = k *[2.sup.n], in which k is natural number. This update strategy reduces the cost of memory and CPU in the process of updating, also guarantees that the storage trees meeting conditions always store latest data, that is, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] constantly update, and smaller the subscript of storage tree is, more frequently update storage tree is.

Suppose the number of data stored by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], the number of data stored by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is the new data, let [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] denote the stores tree set. The second power update algorithm of store tree is as follows: Update (T)

(1) p = p mod [2.sup.n-1];

(2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] records the second i new storage tree

(3) If (p % 2==0)

(4) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(6) for (i = 1, [2.sup.i-1] < p, i ++) BEGIN

(7) Find j, [2.sup.j-1] < = p && [2.sup.i] > p ;

(8) Find m (0 < = m < =j), to make [2.sup.m] as the greatest common divisors of p.

(9) If (m = 0)

(10) return; // Storage trees don't update.

(11) Else

(12) for (i = 1, i < = m, i + +) BEGIN

(13) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII];

(14) If (i + 1 < m)

(15) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Update first m + 1 storage trees.

(16) END

Figure 2 shows the process of dynamic structure and update of incremental storage tree.

Incremental storage tree breaks traditional storage ways and greatly reduce memory space. Suppose each value of attribute can be discretized as q, then the space complexity of each storage tree is 0(m * n * q).

[FIGURE 2 OMITTED]

4. The structure of integrated Bayesian classifier EBC

Set integrated Bayesian classifier [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. In which, M corresponds to child Bayesian classification, i represents the ith child classification. When EBC classifies categorical data, we take the data of incremental storage tree [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] as training set. [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is the predicted probability got from classification [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. In the process of classification, we give every child classification a weight [w.sub.i] and the final classification prediction result is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. And in this paper, we introduce weights attenuation coefficients a(a [member of] (0,1)), [w.sub.i] = a * [w.sub.i] to control the precision error of macroforecast.

The classification steps of integrated Bayesian classifier is as follows:

(1) Each unclassified is represented as a n dimensional characteristic vector X = {[x.sub.1], [x.sub.2], ... [x.sub.n]} which describe n sample attributes [A.sub.1], [A.sub.2], ..., [A.sub.n].

(2) Take the data of incremental storage tree [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] as the sample data of training set.

(3) Calculate the [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] of unclassified data in each incremental storage tree, prior' probability of class can be calculated by, in which, [s.sub.i] is the number of training sample in class [C.sub.i], S is the total training samples.

(4) Calculate the P (X | [C.sub.i]) of unclassified data in each incremental storage tree.

(5) Calculate the of P ([C.sub.i] | X) = P(X|[C.sub.i])P([C.sub.i])/P(X) unclassified data in each incremental storage tree.

(6) Calculate the probability of unknown sample [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], in which, w is the weight of each classification and changes along with training sample.

5. Experimental results and analysis

The program is written in Matlab and java under the Matlab 7.9 running on Windows 2008, because Matlab do well in calling java programs. The tests were performed on a Core(TM) i7 2.67GHz with 4 GMB Memory and 500GB Hard disk. Experimental data contains two parts: one is synthetic datasets got from Weka and the other is real datasets got from bank credit card system. We evaluated the accuracy, memory consumption and execution time of algorithm using synthetic datasets, tested the parameter influence using real data. The number of data entering into sliding window each time is 5000 and the number of data in every child window is 1000.

5.1. The comparison of algorithms

This paper compares EBC algorithm with classical SVM and KNN on the index of accuracy, execution time and memory consumption. Figure 3 shows that the algorithm presented in this paper has higher accuracy on different data sets, and EBC ensemble classifier's function of automatic adjusting of weights makes it.

[FIGURE 3 OMITTED]

Figure 4 shows the memory consumption and execution time of these three algorithms and we can see the algorithm presented in this paper is better.

5.2 The influence of parameters

Experiment 1, set the weights parameters [w.sub.i] in the algorithm as 0.02, and the result showed in figure 5 is that smaller attenuation coefficient a of data selection algorithm causes larger cost of memory. As smaller a is, more the attributes of sample data are, then when we use storage tree to save data, the memory will be larger and the cost of memory will be bigger when the algorithm executes.

[FIGURE 4 OMITTED]

[FIGURE 5 OMITTED]

Experiment 2, set the weights parameters w. in the algorithm as 0.02, and the result showed in figure 6 is that smaller attenuation coefficient a of data selection algorithm causes more execution time. As smaller a is, larger the storage tree generated, then when we execute the algorithm, the time of ergodic search will be more and the cost of time is larger.

[FIGURE 6 OMITTED]

6. Conclusions

The characteristics of data stream bring many difficulties to classification mining: 1) the number of data is indefinite, while classification algorithm can only process the limited amount of data; 2) we think no database can store such mega data, and algorithm usually scans the data once. For the problems mentioned above, this paper proposes a Bayesian classification data mining algorithm based on incremental storage tree. Use sliding window to process data stream and divide it into several basic unit, apply PCA to compress the data from window and produce dynamic incremental storage tree, use the second power strategy to continuously update several incremental storage tree. Then use multi-classifier integration technology combines with Bayesian classification to produce Bayesian classifier. At last, adjust the weight of classifier by cyclic test, and get high classification accuracy. Experimental results on synthetic and real datasets show that our algorithm has high accuracy, efficiency and stability. For a future work, this paper suggests a study on classification precision of complex construction datasets.

Received: 10 March 2012, Revised 29 May 2012, Accepted 13 June 2012

7. Acknowledgments

This research was supported by National Natural Science Foundation of China (Grant No. 71071140, 71071141 and 60905026), Natural Science Foundation of Zhejiang Province (Grant No.Y6110628 and LQ12G01007).

References

[1] Zhang P., Zhu X., Tan J., Guo, L. KIF: a data imputation framework for concept drifting data streams.

[2] Domingos, P, Hulten, G. (2000). Mining High-Speed Data Streams. In: Proc. of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'00), p. 71-80. Association for Computing Machinery, Apr.

[3] Hulten, G., Spencer, L., Domingos, P. (2001). Mining Time Changing Data Streams. In: Proc. of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'01), p. 97-106, Association for Computing Machinery, May.

[4] Aggarwal C. C., Han J., Wang J., Yu, P. S. (2006). A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering 18 (5) 577-589.

[5] Carvalho, D. R., Freitas, A. A. (2004). A hybrid decision tree/genetic algorithm method for data mining. Information Sciences, 163 (1-3) 13-35.

[6] Duda, R.O, Hart, P. E, Stork, D. G. (2000). Pattern Classification, second ed., Wiley-Inter-science.

[7] Hu, Y. (2005). Finding useful fuzzy concepts for pattern classification using genetic algorithm. Information Sciences 175 (1-2) 1-19.

[8] Jin, R., Agrawal, G. (2003). Efficient decision tree construction on streaming data. In: Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD03), p. 571-576, Association for Computing Machinery, Apr.

[9] Kadous M. W., Sammut C. (2005). Classification of multivariate time series and structured data using constructive induction. Machine Learning 58 (2-3) 179216.

[10] Keogh, E., Kasetty, S. (2002). On the need for time series data mining benchmarks: a survey and empirical demonstration. Journal Data Mining and Knowledge Discovery, 7 (4) 349-371.

[11] Lin, J., Keogh, E., Lonardi, S., Chiu, B. (2003). A symbolic representation of time series with implications for streaming algorithms. In: Proc. of the Eighth ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD '03), p. 2-11, Association for Computing Machinery, May.

[12] Zhang, Z. H., Zhou, J. (2010). Transfer estimation of evolving class priors in data stream classification, Pattern Recognition, 43 (9) 3151-3161.

[13] Shaker, A., Senge, R., Hullermeier, E. (2012). Evolving fuzzy pattern trees for binary classification on data streams. Information Sciences, In: Press.

[14] Seo, S., Kang, J., Ryu K. H. (2009). Multivariable stream data classification using motifs and their temporal relations, Information Sciences, 179 (20) 3489-3504.

[15] Zhang, P., Gao, B., Liu, P., Shi, Y., Guo, L. (2012). A framework for application-driven classification of data streams. Neurocomputing, In: Press.

[16] Mei, C. L., Fan, J. C. (2006). Data analysis method. BeiJing:Higher Education Press.

Chonghuan Xu

Business Administration college of Zhejiang Gongshang University

Center for Studies of Modern Business Zhejiang Gongshang University

Hangzhou, 310018 China

talentxch@163.com

Author biography

Chonghuan Xu received his B.S. and M.S. degrees in Computer and Information Engineering from Zhejiang Gongshang University, Hangzhou. He has been a teacher in ZheJiang Gongshang University. His research interests include electronic commerce, data mining. He has published over 10 publications in academic journals and conference proceedings.

Printer friendly Cite/link Email Feedback | |

Author: | Xu, Chonghuan |
---|---|

Publication: | Journal of Digital Information Management |

Article Type: | Report |

Date: | Oct 1, 2012 |

Words: | 3276 |

Previous Article: | An intelligent communication path planning method of metallurgical equipment multi-dimensional information space. |

Next Article: | Applications of numerical calculation in engineering design. |

Topics: |