Modified Rock (MROCK) algorithm for clustering categorical data.
Data mining is a technique for processing data from different perspectives and summarizing it into useful information. The goal of the data mining process is to extract information from a data set and transform it into an understandable format for further use. In data mining, two learning methodologies are used:
* Supervised Learning
* Unsupervised Learning
Supervised learning is the task of deriving a function from labeled training data. The training data is a set of training examples, each a pair consisting of an input object and an output value. A supervised learning algorithm takes the training data and produces an inferred function that can be used to map new examples. In an optimal scenario, the algorithm correctly determines the class labels for unseen instances. Classification and regression are the approaches under supervised learning.
In machine learning, unsupervised learning tries to find hidden structure in unlabeled data. The examples given to the learner are unlabeled, which is what distinguishes unsupervised learning from supervised learning. Unsupervised learning is related to the density estimation problem in statistics, but it also contains many other techniques that aim to summarize and explain key features of the data. Many unsupervised learning methods are based on data mining methods used to preprocess data. Clustering is one of the approaches under unsupervised learning.
1. Related Work:
Recently, clustering algorithms for mining large databases have been proposed (Raymond et al., 1994; Martin Ester et al., 1995; Tian Zhang et al., 1996; Sudipto Guha et al., 1998; Martin Ester et al., 1996). Most of these are variants of either partitional clustering (Raymond et al., 1994) or centroid-based hierarchical clustering (Tian Zhang et al., 1996; Sudipto Guha et al., 1998). These algorithms are designed to cluster numeric data and are not suitable for clustering categorical data.
For example, CLARANS (Raymond et al., 1994) incorporates a randomized search to find the k best cluster medoids. BIRCH, proposed in (Tian Zhang et al., 1996), finds the clusters in two steps: first the data is pre-clustered, and then a centroid-based hierarchical algorithm is used to cluster the partial clusters. The CURE algorithm, proposed in (Sudipto Guha et al., 1998), combines random sampling and partition clustering to handle large databases. This hierarchical clustering algorithm represents each cluster by a certain number of entities that are generated by selecting well-scattered entities and then shrinking them toward the cluster centroid by a specified fraction. DBSCAN (Density Based SCAN), proposed in (Martin Ester et al., 1996), grows clusters by including the dense neighborhoods of points already in the cluster; this approach may be prone to errors if clusters are not well separated. The ROCK clustering algorithm, proposed in (S. Guha et al., 1999), is designed for clustering categorical data. The classification of all the clustering algorithms is given in Fig 1.
2. Clustering:
Clustering is the process of grouping entities with similar properties. Any cluster should exhibit two main properties: low inter-class similarity and high intra-class similarity. Clustering is unsupervised learning, i.e., it learns by observation rather than by example; no predefined class labels exist for the data points. Cluster analysis is used in many applications such as data analysis and image processing. Clustering algorithms can be divided into many categories, which are illustrated in Fig 1. The ROCK algorithm comes under the partitional clustering algorithms, where the complete data set is partitioned into several subsets. Fig 2 shows the original entities of a data set and the entities after partition clustering is performed.
This section outlines the data mining learning approaches, different types of clustering algorithms and the algorithms that come under each type. Next section briefly explains the existing ROCK algorithm.
3. ROCK Clustering Algorithm:
The ROCK clustering algorithm is used for clustering categorical data. It is mainly based on the following:
* Links and
* Goodness measure
An entity's neighbors are those points that are considerably similar to it. The similarity is calculated using the Jaccard coefficient:
sim(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|   Eq(1)
where C_1 and C_2 are the two entities in the data.
The Jaccard coefficient depends on the attributes common to the two entities for which the neighbors are calculated. If the similarity value is greater than the fixed threshold, then one entity is considered a neighbor of the other.
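The neighbor test of Eq(1) can be sketched in Python; the attribute-set representation of an entity and the threshold value here are illustrative assumptions, not taken from the paper.

```python
# Sketch of the Jaccard coefficient of Eq(1) for two categorical
# entities represented as sets of attribute values (an illustrative
# representation; the threshold below is also an assumption).

def jaccard(c1: set, c2: set) -> float:
    """|c1 ∩ c2| / |c1 ∪ c2|; defined as 0.0 for two empty sets."""
    if not c1 and not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)

# Hypothetical entities sharing 2 of 4 distinct attribute values.
e1 = {"cap=bell", "odor=almond", "gill=free"}
e2 = {"cap=bell", "odor=none", "gill=free"}

theta = 0.5                  # illustrative threshold
s = jaccard(e1, e2)          # 2 common / 4 total = 0.5
is_neighbor = s >= theta     # neighbors at this threshold
```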
The link[i, j] is defined as the number of common attributes between entities i and j. If the value of link[i, j] is large, it is more probable that entities i and j belong to the same cluster.
The goodness measure is used to decide whether to merge two clusters. It is calculated based on the number of links that exist between two entities. The goodness measure used in ROCK is given below.
g(C_i, C_j) = link[C_i, C_j] / ((n_i + n_j)^(1+2f(θ)) - n_i^(1+2f(θ)) - n_j^(1+2f(θ)))   Eq(2)
where link[C_i, C_j] is the number of cross links, i.e., the number of common attributes between C_i and C_j; n_i and n_j are the sizes of the clusters C_i and C_j; and
f(θ) = (1 - θ) / (1 + θ)
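The goodness computation can be sketched as follows, assuming the standard ROCK goodness form g(C_i, C_j) = link[C_i, C_j] / ((n_i + n_j)^(1+2f(θ)) - n_i^(1+2f(θ)) - n_j^(1+2f(θ))) from (S. Guha et al., 1999); the cross-link count is passed in directly.

```python
# Sketch of the link-based goodness measure, assuming the standard
# ROCK form from (S. Guha et al., 1999); `links` is the cross-link
# count between the two clusters, supplied by the caller.

def f(theta: float) -> float:
    """f(θ) = (1 - θ) / (1 + θ)."""
    return (1 - theta) / (1 + theta)

def goodness(links: int, n_i: int, n_j: int, theta: float) -> float:
    """g = links / ((n_i + n_j)^e - n_i^e - n_j^e), with e = 1 + 2 f(θ)."""
    e = 1 + 2 * f(theta)
    return links / ((n_i + n_j) ** e - n_i ** e - n_j ** e)

# The pair of clusters with the highest goodness value is merged first.
g = goodness(links=6, n_i=3, n_j=2, theta=0.5)
```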
In this section, the existing ROCK clustering algorithm was explained in detail. In the next section, the proposed MROCK clustering algorithm and its comparison with the existing ROCK algorithm are discussed.
4. Proposed MROCK Clustering Algorithm:
The proposed MROCK clustering algorithm is based on the ROCK algorithm (S. Guha et al., 1999), which is widely used for clustering categorical data. The proposed algorithm proceeds in three steps as follows.
Step 1: Computation of the intra cluster similarity
In this step, the similarity between every pair of entities in the data set is calculated. The proposed similarity measures for calculating the similarity between two entities are given below, and the best among them is chosen for the MROCK algorithm.
Modified Sorensen Dice coefficient:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Eq(3)
Modified Tversky:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Eq(4)
Modified Second Kulczynski:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Eq(5)
where C_1 and C_2 are the two entities in the data.
For a sample data set consisting of 20 entities and 5 attributes, the similarity between every entity pair is calculated; these values are compared with the human judgment (HJ) values and are illustrated in Fig 3.
The correlation coefficient is calculated for the different similarity measures against the HJ values and is depicted in Fig 4. In comparison, the Modified Tversky model is found to correlate best with human judgments, with a correlation coefficient of 0.77.
Step 2: Computation of the neighbor list
The neighbor list of an entity consists of the entities related to that particular entity. After calculating the intra similarity, if the similarity is greater than or equal to the fixed threshold, the entity is considered a neighbor. Based on the experimental results, the threshold (θ) was fixed at 0.555.
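Step 2 can be sketched as follows. The entity representation and the placeholder similarity function are assumptions: the paper's Modified Tversky measure (Eq(4)) is not reproduced in the source, so a simple overlap ratio stands in for it here.

```python
# Sketch of Step 2 with the threshold θ = 0.555 reported above. The
# overlap ratio below is a placeholder similarity; the paper uses the
# Modified Tversky measure (Eq(4)), which is not reproduced here.

THETA = 0.555

def overlap(c1: set, c2: set) -> float:
    # placeholder similarity (Jaccard-style overlap ratio)
    return len(c1 & c2) / len(c1 | c2) if (c1 or c2) else 0.0

def neighbour_lists(entities: dict) -> dict:
    """Map each entity name to the names of its neighbours."""
    names = list(entities)
    nbrs = {n: [] for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlap(entities[a], entities[b]) >= THETA:
                nbrs[a].append(b)
                nbrs[b].append(a)
    return nbrs

# Hypothetical data: e1 and e2 share all attributes, e3 shares none.
nbrs = neighbour_lists({"e1": {"a", "b", "c"},
                        "e2": {"a", "b", "c"},
                        "e3": {"x"}})
```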
Step 3: Calculating the goodness measure
The goodness measure is calculated between two clusters to decide whether to merge them. Initially, each entity is taken as one cluster; subsequently, based on the calculated goodness measure, clusters are merged. The merging process continues until the links between the entities disappear or until a specified number k of clusters remains.
The proposed goodness measure for merging the two clusters is given below.
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] Eq(6)
where link[C_i, C_j] is the number of cross links between C_i and C_j; n_i and n_j are the sizes of the clusters C_i and C_j; and
f(θ) = (1 - θ) / (1 + θ)
The complete pseudo code for the proposed MROCK Clustering Algorithm is given in Fig 5.
Fig. 5: Pseudo code for the proposed clustering algorithm.

Algorithm MROCK (D, k, θ)
Input: Set of entities D, number k of clusters to be found, the similarity threshold θ
Output: Set of clusters
Begin
Step 1: Compute the similarity sim(i, j) for each i, j ∈ D using Eq(4)
Step 2: Compute the neighbor list for each i ∈ D
        if sim(i, j) ≥ θ then j is considered a neighbor of i
Step 3: Compute the links link[i, j] for each i, j ∈ D
        // The number of links between i and j is the number of common attributes
        // between i and j
Step 4: Compute the goodness measure g(C_i, C_j) using Eq(6)
        // Store the goodness measure values in descending order of magnitude and
        // merge the clusters with the highest goodness measure value
Step 5: Formation of the clusters
        For each C_i and C_j
        // Merge the clusters until the links disappear
        // or until k clusters remain
End
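The loop of Fig 5 can be given as a compact executable sketch, under two stated assumptions: links are counted as common attributes (per the comment in Step 3), and a ROCK-style goodness measure stands in for the paper's Eq(6), which is not reproducible from the source.

```python
# Executable sketch of the MROCK loop in Fig 5, under two stated
# assumptions: links are counted as common attributes (Step 3), and
# a ROCK-style goodness measure stands in for Eq(6).
import itertools

def mrock(entities: dict, k: int, theta: float) -> list:
    """Agglomerative merging until k clusters remain or links disappear."""
    e = 1 + 2 * (1 - theta) / (1 + theta)      # exponent 1 + 2 f(θ)

    def links(ci, cj):
        # number of common attributes summed over cross-cluster entity pairs
        return sum(len(entities[a] & entities[b]) for a in ci for b in cj)

    def goodness(ci, cj):
        denom = (len(ci) + len(cj)) ** e - len(ci) ** e - len(cj) ** e
        return links(ci, cj) / denom

    clusters = [{name} for name in entities]   # one cluster per entity
    while len(clusters) > k:
        ci, cj = max(itertools.combinations(clusters, 2),
                     key=lambda p: goodness(*p))
        if links(ci, cj) == 0:                 # stop when links disappear
            break
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)
    return clusters

# Hypothetical run: e1 and e2 share attributes and get merged.
clusters = mrock({"e1": {"a", "b"}, "e2": {"a", "b"}, "e3": {"c"}},
                 k=2, theta=0.5)
```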
In this section, the detailed description of the proposed MROCK clustering algorithm has been given. In the next section, the proposed MROCK clustering algorithm is compared with the existing ROCK algorithm.
4.1 Comparison of MROCK Algorithm with ROCK Algorithm:
The comparison between the proposed MROCK clustering algorithm and the existing ROCK clustering algorithm is done by taking entities from a sample data set. Fig 6 and Fig 7 depict the formation of the clusters using ROCK and MROCK, respectively.
In this section, the proposed MROCK clustering algorithm and its comparison with the existing ROCK algorithm were discussed. The next section concludes with the results obtained using the synthetic data set and a sample of the Mushroom data set.
The proposed Modified ROCK is compared with the existing ROCK clustering technique. For the synthetic dataset used in the experiments, ROCK forms 4 clusters while MROCK forms 2. For the sample Mushroom dataset, ROCK forms 5 clusters while MROCK forms 3. The contributions of this work are:
* A new intra similarity measure is proposed for the computation of the intra similarity between the entities in the categorical data.
* A new goodness measure is proposed to merge the clusters in an effective manner.
* Using the above intra similarity measure and goodness measure, the MROCK algorithm is presented, which performs better than ROCK.
Received 12 October 2014
Received in revised form 26 December 2014
Accepted 1 January 2015
Available online 25 February 2015
Alexander Hinneburg, Daniel A.Keim, 1998. An Efficient Approach to Clustering in Large Multimedia Databases with Noise, American Association for Artificial Intelligence.
George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, 1999. Chameleon: Hierarchical Clustering Using Dynamic Modeling, IEEE, August.
Martin Ester, Hans-Peter Kriegel and Xiaowei Xu, 1995. A database Interface for Clustering in Large Spatial Databases, in proceedings of the Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August.
Martin Ester, Hans-Peter Kriegel, Jorg Sander and Xiaowei Xu, 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proceedings of the Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, August.
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jorg Sander, 1999. OPTICS: Ordering Points to Identify the Clustering Structure, in the proceedings of ACM SIGMOD'99 Int.Conf.on Management Data, Philadelphia PA.
Raghunath Kar and Susant Kumar Dash, 2011. A Study On High Dimensional Clustering By Using Clique, International Journal of Computer Science & Informatics, Volume-I, Issue-II.
Raymond T. Ng and Jiawei Han, 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, in Proceedings of the 20th VLDB Conference, Santiago, Chile, pp: 144-155.
Guha, S., R. Rastogi and K. Shim, 1999. ROCK: a robust clustering algorithm for categorical attributes, in Proceedings of the 15th International Conference on Data Engineering, pp: 512-521.
Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, 1998. CURE: A Clustering Algorithm for Large Databases, in proceedings of the ACM SIGMOD Conference on Management of Data, May.
Tian Zhang, Raghu Ramakrishnan and Miron Livny, 1996. Birch: An Efficient Data Clustering method for Very Large Databases, in proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, pp: 103-114.
Wei Wang, Jiong Yang, Richard Muntz, 1997. STING: A Statistical Information Grid Approach to Spatial Data Mining, in the proceedings of the 23rd VLDB Conference, Athens, Greece.
(1) Dr. K. Saruladha and (2) P.V. Likhitha
(1) Assistant Professor, Pondicherry Engineering College, Puducherry, India.
(2) Post Graduate Student, Pondicherry Engineering College, Puducherry, India.
Corresponding Author: P.V. Likhitha, Post Graduate Student, Pondicherry Engineering College, Puducherry, India.
Publication: Advances in Natural and Applied Sciences, June 1, 2015.