High dimensional data partitioning with an adaptive ensemble construction and analysis scheme.
Clustering is an important technique of exploratory data mining, which divides a set of objects into several groups in such a way that objects in same group are more similar with each other in some sense than with the objects in other groups. It has been widely used in different disciplines and applications, such as machine learning, pattern recognition, data compression, image segmentation , time series analysis , information retrieval, spatial data analysis  and biomedical research . Moreover, as data's variety and scale increase rapidly, and the prior knowledge about the data is usually limited, clustering has been a challenging task.
The most popular example of density-based clustering is DBSCAN in which only the objects whose density is greater than the given thresholds are connected together to form a cluster. However, the proper threshold setting varies with different data sets, there is still no effective method to preassign these thresholds. The spectral clustering based algorithm does not make assumptions on the forms of the clusters; it utilizes the spectrum of the similarity matrix of the data to map the data into a lower dimensional space in which the objects can be easily clustered by traditional clustering techniques. Comparing to the traditional algorithms, such as kMeans and single-linkage, this kind of clustering algorithm is useful in non-convex boundaries and performs empirically very well . The first few eigenvalues can be used to determine the number of clusters and reduce the dimension of data these first eigenvectors cannot successfully cluster objects that contain structures with different sizes and densities.
Generally, data stream mining refers to the mining tasks that are conducted on a sequence of rapidly arriving data records. As the environment where the data are collected may change dynamically, the data distribution may also change accordingly. This phenomenon, referred to as concept drift , , is one of the most important challenges in data stream mining. A data stream mining technique should be capable of constructing and dynamically updating a model in order to learn dynamic changes of data distributions, i.e., to track the concept drift.
Concept drift is formally defined as the change of joint distribution of data, i.e., p(x, y), where x is the feature vector and y is the class label. Over the past few decades, concept drift has been widely studied . The majority of the previous works focus on the concept drift caused by the change in class-conditional probability distribution, i.e., p(xjy). In comparison, class evolution, which is another factor that induces concept drift, has attracted relatively less attention. Briefly speaking, class evolution is concerned with certain types of change in the prior probability distribution of classes, i.e., p(y) and usually corresponds to the emergence of a novel class and the disappearance of an outdated class. Class evolution occurs frequently in practice. For example, new topics frequently appear on Twitter and outdated topics are forgotten with time.
The number of classes may change when class evolution happens; the model needs to be adapted not only to capture the distribution of existing classes, but also to identify that of the novel classes. At the same time, the effects of disappeared classes need to be removed from the model. Hence, in comparison to the change of classconditional probability, class evolution brings additional challenges to data stream mining. In literature, a few approaches have been proposed to address class evolution problems, e.g., Learn++.NC, ECSMiner  and CLAM . Although they have shown promising performance, they implicitly assume that classes emerge or disappear in a transient manner.
The ensemble clustering technique has recently been drawing increasing attention due to its ability to combine multiple clusterings to achieve a probably better and more robust clustering , , , , . The relationship between objects lies not only in the direct connections, but also in the indirect connections. The key problem here is how to exploit the global structure information in the ensemble effectively and efficiently and thereby improve the final clustering results. An ensemble clustering approach is constructed with sparse graph representation and probability trajectory analysis.
Microclusters are applied as a compact representation for the ensemble data, which is able to greatly reduce the problem size and facilitate the computation. How are the relationships between x1 and x2 and between x2 and x3 in the context of (a) a graph with three nodes, (b) a graph with six nodes and (c) a graph with eight nodes? To deal with the uncertain links, the k-nearest neighbor like strategy termed elite neighbor selection (ENS) is employed to identify the uncertain links and build a sparse graph termed K-elite neighbor graph (KENG) that preserves only a small number of probably reliable links. Two microclusters are elite neighbors if the link between them has a large weight. The use of a small number of probably reliable links is capable of reflecting the overall structure of the graph and may lead to much better and more robust clustering results than using all graph links without considering their reliability.
Microcluster Similarity Graph (MSG) is constructed with the MCA matrix. Then, the ENS strategy is performed on the MSG and the sparse graph K-ENG is constructed by preserving a small number of probably reliable links. The random walks are conducted on the K-ENG graph and the PTS similarity is obtained by comparing random walk trajectories. Having computed the new similarity matrix, any pair-wise similarity based clustering methods can be used to achieve the consensus clustering. Typically, the system uses two novel consensus functions, termed PTA and PTGP, respectively. Note that PTA is based on agglomerative clustering, while PTGP is based on graph partitioning. The PTA and PTGP methods exhibit a significant advantage in clustering robustness over the baseline methods.
Cluster Ensemble Approaches:
Cluster ensemble approaches are gaining more and more attention, due to its useful applications in the areas of pattern recognition, data mining, bioinformatics and so on. When compared with traditional single clustering algorithms, cluster ensemble approaches are able to integrate multiple clustering solutions obtained from different data sources into a unified solution and provide a more robust, stable and accurate final result.
Conventional cluster ensemble approaches have several limitations: (1) they do not consider how to make use of prior knowledge given by experts, which is represented by pairwise constraints. Pairwise constraints are often defined as the must-link constraints and the cannot-link constraints. The must-link constraint means that two feature vectors should be assigned to the same cluster, while the cannot-link constraints means that two feature vectors cannot be assigned to the same cluster. (2) Most of the cluster ensemble methods cannot achieve satisfactory results on high dimensional datasets. (3) Not all the ensemble members contribute to the final result.
The random subspace based semi-supervised clustering ensemble framework (RSSCE) integrates the random subspace technique the constraint propagation approach and the normalized cut algorithm into the cluster ensemble framework to perform high dimensional data clustering. The incremental semi-supervised clustering ensemble framework (ISSCE) is designed to remove the redundant ensemble members. When compared with traditional semi supervised clustering algorithm, ISSCE is characterized by the incremental ensemble member selection process based on a global objective function and a local objective function, which selects ensemble members progressively.
The local objective function is calculated based on a newly designed similarity function which determines how similar two sets of attributes are in the subspaces. Next, the computational cost and the space consumption of ISSCE are analyzed theoretically. Multiple semi-supervised clustering ensemble approaches are analyzed over different datasets. The experiment results show the improvement of ISSCE over traditional semi supervised clustering ensemble approaches or conventional cluster ensemble methods on 6 real-world datasets from UCI machine learning repository and 12 real-world datasets of cancer gene expression profiles.
The contributions of the system are fourfold. An incremental ensemble framework for semi-supervised clustering in high dimensional feature spaces. A local cost function and a global cost function are applied to incrementally select the ensemble members. The similarity function is adopted to measure the extent to which two sets of attributes are similar in the subspaces. Non-parametric tests are used to compare multiple semi supervised clustering ensemble approaches over different datasets.
The data partitioning methods are applied to group the relevant records. The transactions are compared with attribute values. The ensembles are constructed to improve the cluster results. The ensemble members are identified with relationship information. The ensemble member based relationship analysis for the transactions are used in the clustering process. The Incremental Semi-Supervised Cluster Ensemble (ISSCE) framework is employed to perform data clustering with ensembles. The Incremental Ensemble Membership Selection (IEMS) scheme is adapted to fetch the ensemble members for the clusters. The Similarity Function (SF) is applied to compute the similarity values between the transactions. The following issues are identified from the ISSCE Scheme. Structure independent ensemble selection is not supported. Data type complexity levels are not considered in the system. The system produces limited clustering accuracy levels. Dynamic membership selection process is not supported.
Adaptive Ensemble Construction And Analysis Scheme:
The cancer data clustering system is designed to perform data partitioning on the cancer diagnosis data values. The Incremental Semi-Supervised Cluster Ensemble (ISSCE) scheme is applied for the clustering process. Incremental Ensemble Membership Selection (IEMS) scheme is used for the cluster ensemble selection process. The relationship levels are estimated with the similarity functions. The Dynamic Ensemble Membership Selection (DEMS) is used to perform the ensemble selection for structure and complexity independent data values. The DEMS scheme is integrated with Partition Around Medoids (PAM) clustering algorithm.
The system is designed to perform multi model data clustering with efficient similarity analysis model. The link based similarity model is tuned to analyze multi model data values. Ensemble based clustering approach is used in the system. The system consists of six modules. They are Data cleaning process, Ensemble selection, Local Similarity Analysis, Global Similarity Analysis, ISSCE Clusters and PAM Clusters with DEMS.
The data cleaning module is designed to update noise data values. The ensemble selection module is designed to identify the cluster initial ensembles. The local similarity estimation process is carried out with the ensembles that are identified with the incremental model. The global similarity estimation process is carried out with the dynamic ensemble member selection model based data values. The Incremental Semi-Supervised Cluster Ensemble (ISSCE) approach is used in the ISSCE clustering process. The DEMS based PAM clustering approach is adapted in the dynamic membership based clustering process.
The cancer diagnosis details are imported from textual data files. The textual data contents are parsed and categorized with its property. Redundant and noise records are identified and maintained separately. The data values are parsed and transferred into the Oracle database. Redundant data values are removed from the database. Missing elements are assigned using aggregation based substitution mechanism. Cleaned data values are referred as optimal data sets. Cluster ensembles are selected from the transaction collections. Cluster count is collected from the user. The ensemble selection is carried out with the Incremental Ensemble Membership Selection (IEMS) scheme. The Similarity Function is used to compare the transactions. The ensembles are identified for each cluster levels.
The local similarity analysis module is designed to perform the transaction similarity estimation process. Incremental ensemble based model is adapted to the similarity estimation process. The continuous data values are converted into categorical data. Median values are used in the conversion process. Similarity function is used in the relationship analysis. The similarity values are updated into the dissimilarity matrix. The similarity analysis is performed to estimate the transaction relationship. Independent data similarity is designed for binary, categorical and continuous data types. Similarity function is tuned to find similarity for all type of data values. Vector and link models are integrated for relationship analysis. The Dynamic Ensemble Membership Selection (DEMS) scheme is used in the ensemble member identification process. The similarity estimation is performed with the finalized ensemble values. Structure independent ensemble membership selection is used in the system.
The Incremental Semi-Supervised Cluster Ensemble (ISSCE) approach is adapted to perform the data clustering process. The Incremental Ensemble Membership Selection (IEMS) algorithm is used in the ensemble member selection process. The Similarity Function (SF) is applied to estimate the transaction similarity values. Local relationships are considered in the similarity estimation process. The Similarity matrix is composed with incomplete similarity details. The similarity intervals are used to partition the data values. The clustering process is performed with the user provided cluster count values. The cluster list shows the list of clusters with the transaction count. The cluster details form shows the cluster name and its associated transactions.
The Partition Around Medoids (PAM) algorithm is used for the clustering process. The dissimilarity is minimized in the PAM algorithm. The Dynamic Ensemble Membership Selection (DEMS) scheme is employed to select the ensemble members with structure independent mechanism. Data set complexity is also considered in the DEMS scheme. The Similarity functions are also tuned for the dynamic ensemble member selection process. The Dynamic Ensemble Membership Selection (DEMS) scheme is integrated with the PAM clustering algorithm. The clustering process is carried out with the cluster count specified by the user.
The medical data analysis system is developed to partition the breast cancer patient diagnosis details. The Incremental Semi-Supervised Cluster Ensemble (ISSCE) approach is used for the data clustering process. The Incremental Ensemble Membership Selection (IEMS) mechanism is used to select the members for the cluster ensembles. The Similarity Function (SE) is used to find out the relationship between the transactions. The Dynamic Ensemble Membership Selection (DEMS) scheme is build to identify the ensembles with structure and complexity independency. The DEMS scheme is integrated with the Partition Around Medioids clustering algorithm to produce better cluster results. The system performance is evaluated with Incremental SemiSupervised Cluster Ensemble (ISSCE) and Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) clustering schemes. The system is tested with three performance parameters to measure the cluster quality levels. They are F-measure, purity and separation index levels. The system is tested with different data intervals.
The clustering system is analyzed using the breast cancer patient data sets collected from the University of California Irwin (UCI) machine learning repository. The dataset is downloaded from http://archive.ics.uci.edu/ml/datasets.html. The diagnosis details are collected from patients from different countries. The dataset contains 1000 transactions with 15 attributes. Missing values are replaced in preprocess. Aggregation based data substitution mechanism is used for the data preprocess. Redundant transactions are removed from the datasets during the preprocess.
The F-measure is a harmonic combination of the precision and recall values used in information retrieval. Each cluster obtained can be considered as the result of a query, whereas each pre classified set of documents can be considered as the desired set of documents for that query. Thus, the system can calculate the precision P(i, j) and recall R(I, j) of each cluster j for each class i.
If [n.sub.i] is the number of members of the class i, [n.sub.j] is the number of members of the cluster j and [n.sub.ij] is the number of members of the class i in the cluster j, then P(i, j) and R(i, j) can be defined as n
P(i, j) = [n.sub.ij]/[n.sub.j], (1)
R(i, j) = [n.sub.ij]/[n.sub.i] (2)
The corresponding F-measure F(i, j) is defined as
F(i, j) = 2 * P(i, j) * R(i, j)/P(i, j) + R(i, j) (3)
Then, the F-measure of the whole clustering result is defined as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4)
where n is the total number of documents in the data set. In general, the larger the F-measure is, the better the clustering result is (2).
The F-measure analysis between the Incremental Semi-Supervised Cluster Ensemble (ISSCE) and Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) clustering schemes is shown in figure 6.1.. The analysis result shows that the Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) scheme increases the f-measure level 15% than the Incremental Semi-Supervised Cluster Ensemble (ISSCE) clustering scheme.
The purity of a cluster represents the fraction of the cluster corresponding to the largest class of documents assigned to that cluster; thus, the purity of the cluster j is defined as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (5)
The overall purity of the clustering result is a weighted sum of the purity values of the clusters as follows:
Purity = [summation over (j)] [n.sub.j]/n Purity(j) (6)
In general, the larger the purity value is, the better the clustering result is (6).
The Purity analysis between the Incremental Semi-Supervised Cluster Ensemble (ISSCE) and Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) clustering schemes are shown in figure 6.2.. The analysis result shows that the Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) scheme increases the purity level 10% than the Incremental Semi-Supervised Cluster Ensemble (ISSCE) clustering scheme.
D. Separation Index:
Separation Index (SI) is another cluster validity measure that utilizes cluster centroids to measure the distance between clusters, as well as between points in a cluster to their respective cluster centroid.
It is defined as the ratio of average within-cluster variance to the square of the minimum pairwise distance between clusters:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
Where [m.sub.i] is the centroid of cluster [c.sub.i], and distmin is the minimum pairwise distance between cluster centroids. Clustering solutions with more compact clusters and larger separation have lower Separation Index, thus higher values indicate better solutions. This index is more computationally efficient than other validity indices, such as Dunn's index, which is also used to validate clusters that are compact and well separated. In addition, it is less sensitive to noisy data.
The Incremental Semi-Supervised Cluster Ensemble (ISSCE) and Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) clustering schemes are shown in figure 6.3.. The analysis result shows that the Dynamic Ensemble Membership Selection based Partition Around Medoids (DEMS based PAM) scheme increases the inter cluster distance level 30% than the Incremental SemiSupervised Cluster Ensemble (ISSCE) clustering scheme.
The cancer data clustering system is built to partition the breast cancer diagnosis data values. The Incremental Semi-Supervised Cluster Ensemble (ISSCE) approach is used for the data clustering process with ensemble models. The ensemble identification process is performed with Incremental Ensemble Membership Selection (IEMS) scheme. The Similarity Function (SF) is applied to estimate the relationship values. The Dynamic Ensemble Membership Selection (DEMS) mechanism is applied to identify the cluster ensembles with structure and data complexity independent models. The Partition Around Medoids (PAM) clustering scheme is integrated with Dynamic Ensemble Membership Selection (DEMS) mechanism.
[1.] Yang, Y. and K. Chen, 2011. "Temporal data clustering via weighted clustering ensemble with different representations," IEEE Trans. on KDE, 23(2): 307-320.
[2.] Iam-On, N., T. Boongoen, S. Garrett and C. Price, 2011. "A link-based approach to the cluster ensemble problem," IEEE Trans. on PAMI, 33(12): 2396-2409.
[3.] Gama, J., I. Z" liobaite', A. Bifet, M. Pechenizkiy and A. Bouchachia, 2014. "A survey on concept drift adaptation," ACM Comput. Surv., 46(4): 44: 1-44: 37.
[4.] Minku, L., A. White and X. Yao, 2010. "The impact of diversity on online ensemble learning in the presence of concept drift," Knowledge and Data Engineering, IEEE Transactions on, 22(5): 730-742.
[5.] Minku, L. and X. Yao, 2012. "DDD: A new ensemble approach for dealing with concept drift," Knowledge and Data Engineering, IEEE Transactions on, 24(4): 619-633.
[6.] Al-Khateeb, T., M. Masud, L. Khan, C. Aggarwal, J. Han and B. Thuraisingham, 2012. "Stream classification with recurring and novel class detection using class-based ensemble," in 2012 IEEE 12th International Conference on Data Mining, pp: 31-40.
[7.] Lai, C.-P., P.-C. Chung and V.S. Tseng, 2010. "A novel two-level clustering method for time series data analysis," Expert Systems with Applications, 37(9): 6319-6326.
[8.] Shama and S. Phadikar, 2014. "Automatic color image segmentation using spatial constraint based clustering," in Emerging Trends in Computing and Communication. Springer, pp: 113-121.
[9.] Zhang, Y. and T. Li, 2011. "Extending consensus clustering to explore multiple clustering views," in SDM, pp: 920-931.
[10.] Iam-On, N., T. Boongeon, S. Garrett and C. Price, 2012. "A link-based cluster ensemble approach for categorical data clustering," IEEE Trans. on KDE, 24(3): 413-425.
[11.] Xiankun Yang, W.C., 2010. "A novel spatial clustering algorithm based on delaunay triangulation," J. Software Engineering & Applications, 3: 141-149.
[12.] Masud, M., J. Gao, L. Khan, J. Han and B. Thuraisingham, 2011. "Classification and novel class detection in concept-drifting data streams under time constraints," Knowledge and Data Engineering, IEEE Transactions on, 23(6): 859-874.
[13.] RuiXu, Donald C. Wunsch, 2010. "Clustering algorithms in biomedical research: a review," IEEE Reviews in Biomedical Engineering, 3: 120-154.
[14.] Guangtao Wang and Qinbao Song, 2016. "Automatic Clustering via Outward Statistical Testing on Density Metrics", Transactions on Knowledge and Data Engineering.
[15.] Dennis Ebenezer and M. Suganya, 2016. "A Survey On Data Partitioning And Cluster Ensemble Techniques", Elysium Journal of Engineering Research and Management, 3(6): 1-6.
(1) M. Suganya and (2) D. Dennis Ebenezer
(1) PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Oddanchatram, Tamilnadu-624619, India.
(2) Assistant Professor, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Oddanchatram, Tamilnadu-624619, India.
Received 18 January 2017; Accepted 22 March 2017; Available online 28 March 2017
Address For Correspondence:
M. Suganya, PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Oddanchatram, Tamilnadu-624619, India.
Fig. 1: F -measure Analysis of ISSCE and DEMS based PAM Schemes F-measure Analysis of ISSCE and DEMS based PAM Schemes Transactions F-Measure ISSCE DEMIS based PAM 200 0.791 0.943 400 0.81 0.955 600 0.783 0.962 800 0.829 0.972 1000 0.868 0.989 Note: Table made from bar graph. Fig. 2: Purity analysis of ISSCE and DEMS based PAM Schemes Purity analysis of ISSCE and DEMS based PAM Schemes Transactions Purity ISSCE DEMS based PAM 200 0.826 0.951 400 0.839 0.965 600 0.851 0.972 800 0.863 0.989 1000 0.875 0.997 Note: Table made from bar graph. Fig. 3: Separation Index Analysis of ISSCE and DEMS based PAM Schemes Separation Index Analysis of ISSCE and DEMS based PAM Schemes Transactions Separation Index ISSCE DEMS based PAM 200 2.806 4.705 400 2.716 4.365 600 3.796 6.728 800 6.484 9.213 1000 8.719 11.457 Note: Table made from bar graph.
|Printer friendly Cite/link Email Feedback|
|Author:||Suganya, M.; Ebenezer, D. Dennis|
|Publication:||Advances in Natural and Applied Sciences|
|Date:||Mar 1, 2017|
|Previous Article:||Interfacing of large scale specimen testing machines in big hospitals to HMS software using RS 232.|
|Next Article:||Malicious node identification in MANETs based on false information.|