# Research on cluster analysis of high dimensional space based on fuzzy extension.

1. Introduction

Since the beginning of the 21st century, all walks of life have been widely used a large number of spatial database along with the rapid development of the information age. As a result, the field of spatial data mining rise rapidly [1-2]. For Spatial clustering methods have great advantages in solving and analyzing complex problems lack of background information, the method of spatial clustering spatial data has been one of the most active and most widely used technology in mining field.

Now widely used clustering algorithm can be divided into the following types: Level-based approach, based on the partitioning method, grid-based methods, model-based approach and density-based method. However, for a particular clustering method, it is generally a variety of clustering integration together, and from the type Point of view, it can't simply be attributed to a single category. General, different clustering algorithms can be combined with each other, which will become more efficient and more reliable. In recent years, with more in-depth technology found on the clustering of knowledge, more and more new clustering algorithm emerges, which usually combine the idea of a variety of clustering methods.

Accompanied by the continuous expansion of application areas and more in-depth of clustering analysis, in recent years high-dimensional clustering problem has gradually become the focus of the study of cluster analysis. With the rapid development of a variety of detectors and sensor technology, the number of spatial data properties also increased by a few dozens or even hundreds for spatial data. To discover knowledge from dozens up to hundreds-dimensional spatial data clustering has become a challenge direction and difficult in the current course of the study. There are three types of algorithms for clustering of high dimensional data: Collaborative clustering, subspace clustering and attribute conversion. Collaborative clustering is widely used in Web and text mining, cluster analysis, however, Subspace clustering and attribute conversion method are mostly used in cluster analysis of high dimensional space. With the wide application of a variety of complex special database, the actual demand asks for newer and higher requirements on the spatial clustering task, which has brought new challenges for spatial clustering methods and technology.

2. High dimensional space clustering algorithm

2.1 Cluster analysis and high-dimensional data features

Clustring is an important data mining method. The so-called clustering is a data object to group making more than one cluster or class, After generated the cluster is a collection of a set of data objects In the same cluster objects with each other have high similarity, but objects in different clusters are different.

High dimensional spatial data are usually able to show the following characteristics [3]:

2.1.1 Data are sparse

The characteristics of high-dimensional data is different from the low-dimensional data, whose data distribution is relatively sparse. As a result, the distance between the scale and regional density in the data no longer has the intuitive meaning in high-dimensional space, Therefore, the detection and the discovery of outliers are different from the traditional one in high-dimensional space.

2.1.2 Dimension effect

Nowadays, the clustering algorithms are basically for low-dimensional space designed. For the characteristics of the effect of high-dimensional space, the data will show the dimension and sparse data characteristics. Most of the clustering algorithm fails because of the cluster centers can't be determined, the similarity function can't be calculated and other factors. It may not be accepted due to the algorithm time and space complexity increases exponentially. Thus, we need research and develop a high dimensional space clustering algorithm suitable according to the characteristics of high dimensional spatial data. The essential difference between spatial data and other types of data is its spatial properties. Space property contains the contents of the spatial distance, location, geometry and size. What's more, it can be looked as individual relationship between space, such as: Orientation relations, topological relations and metric relations, which make spatial data more complex than other types of data [4].

2.2 The difficulty of high dimensional data clustering

The difficulty is of high dimensional data clustering are as follows:

(1) Definition of the distance function is too difficult. Based on the similarity measure between the data objects for clustering operation, and ultimately the high similarity of the object is classified as a class. The equidistance function of the Euclidean distance is commonly used to measure the similarity in the low-dimensional space. In high-dimensional case, we need to re-design and similarity of guidelines and standards for the establishment of a new measure of data objects for distance function failure.

(2) In the case of the high-dimensional space, to the cluster mean will be very close calculated by distance formula and appear gap between the increasingly zero phenomenon. That is to say, Data dot pitch distance between the farthest and the nearest data points gradually tends to zero along with the increase in dimension. This will result in the clustering operation can't be done without a clear distinction between the center of the cluster.

(3) Dimension of the space is too high will lead to increased computational complexity of traditional clustering algorithms, which will make inefficient and unacceptable limiting to its range of applications.

Apparently, the category number that people are able to perceive is related to scale of the observational data: When the scale of observation is very rough, It can be said that the entire data set is a category; But when the scale of observation is in a very fine condition, all data points can be seen as the same category. This Also brings a new way to determine the most reasonable method of the class number [5]. That is to say, the number of category with the number of selected fixed number within a maximum range of the scale of change is the longest-scale survival of the categories. When the scale conditions are different, multidimensional spatial data often have different spatial forms of performance, but at the same time it also generates different clustering results. So scale problems has been gradually attracted more and more attention in many areas, especially in the field of space science. But for multiscale spatial, data mining research is still in its infancy on the current situation. Currently, Most of the spatial scale under the conditions of spatial clustering algorithm is focused on expansion in the scale based on the hierarchical clustering algorithm. Including by breaking the inconsistent side clustering or minimum spanning tree method to introduce the scale clustering [6-7]. However, these methods require human input parameters, such as the resulting class and other parameters, which has a great influence on the ultimate scale of clustering results. It can be said that these algorithms have severe dependence parameters.

2.3 The key issues of high dimensional spatial data clustering analysis

Obviously, the most critical issue of the clustering of high dimensional spatial data analysis is how to choose the high-dimensional into low-dimensional approach. And clustering results can also get feedback and clustering methods applied to the entire high-dimensional space. For an easy-to-use way to reduce the dimension, Although the method of attribute reduction and attribute conversion clustering analysis of high dimensional data. But through the dimension reduction the difference between the normal data and noise data will be reduced and the quality of clustering can't be guarantee.

In addition, although the application of dimensionality reduction techniques reduces the data dimension of the space, a lot of important clustering information get lost. Its comprehensibility and interpretability is poor and there exist certain degree of difficulty to clearly express the results. This shows there are significant limitations in dimensionality reduction based on attribute conversion of high-dimensional data processing applications, and it is not able to meet the development needs of the current applications of high-dimensional clustering.

The subject according to the characteristics of space noise data is large and complex and a high degree of noise under the space. Using the set of points D in the after-dimensional space is a class, and using high-dimensional space, subspace clustering method to Clustering. Through this method to find the largest collection of spatial data centrally connected dense units is the spatial data of the cluster. For the problems such as cluster boundary judgment clear and clustering is not smooth, finally proposed an extended high-dimensional space based on fuzzy clustering algorithm through an important role in determining the cluster boundary grid fuzzy extended.

3. Spatial data mining and spatial clustering

Spacial Data Mining belongs to a branch of Data Mining. Spatial data mining is different from the general data mining and the conventional transaction database. Spatial cluster analysis is not only an important part of spatial data mining, but also widely used in spatial data mining and in-depth study of one of the elements.

3.1 Spatial Data Mining

Compared with traditional data mining, the main different point performance of spatial data mining is in the following areas:

(1) In traditional data mining processing, the processing content is the numbers and types. But in spatial data handling, processing is more complex data types, such as: points, lines and polygons and other objects.

(2) Traditional data mining is usually explicit input, but it Is often implicit in the spatial data mining data input.

(3) Assumptions in the traditional data mining: The data sample is generated independently. However, this assumption is not established in the spatial data mining. In fact, there is a high degree of correlation between the spatial data, for example: People with similar occupational characteristics and background are usually easy to gather in the same region.

Spatial clustering analysis of the requirements of the clustering algorithm is more stringent than the traditional clustering, its clustering algorithm, typical requirements are as follows:

3.1.1 Scalability

Because of the spatial database contains extremely rich data, the structure is more complex, we must seek a more rapid and effective clustering algorithm. Its running time must be acceptable, predictable, so the time complexity of the algorithm of polynomial or exponential no practical value.

3.1.2 Ability to handle different data types

The spatial database contains data not only have complex spatial properties, but also contains the non-spatial attributes. Generally contain a variety of different data types, a good algorithm should be able to handle different data types.

3.1.3 Able to identify all type of spatial clustering of arbitrary shape

The specific features of spatial clustering is not known prior to analysis. Spatial clustering contains a variety of complex shapes, so a good algorithm should be able to identify clusters of arbitrary shape.

3.1.4 The ability to deal with noise

Often the database contains vacancies, outliers, bad data and unknown data in reality. If the clustering algorithm is sensitive to such data, it will lead to poor clustering results.

3.1.5 High dimensional spatial data

Spatial data will generally contain a higher dimension. How to clustering spatial data mining is special requirements on clustering in the high-dimensional space.

3.1.6 Usability and understandability

Results of clustering should be understandable and available for the user.

3.2 High dimensional space clustering method

Spatial Clustering is that the data set in a large multidimensional space, and using the distance metric in order to identify the clustering. This will make object has a high similarity in the same cluster, while the objects in different clusters are different from each other [8-9].

Because of the existence of Dimension effect and the data sparseness problem in high dimensional data, most of the existing clustering algorithm is in a state of failure at present. The characteristics of high dimensional spatial data in the high-dimensional clustering process are in the following areas:

(1) Difficult to define the distance function (similarity function).

Due to the similarity between the conditions low-dimensional Euclidean distance function equidistant measure data in high-dimensional space can't be passed. This lead to clustering operating on the basis of measurement can't be completed, and the new measure of high-dimensional spatial data similar to the standards and guidelines should be re-considered;

(2) Clustering operations generally need to calculate the cluster mean, or to find the value of the neighbors.

In high-dimensional case, the mean calculated by the distance function will be very close to each other. In many cases, a data point and it recently and the furthest data point is almost equidistant, which makes the cluster center can't determine the cause of the clustering operation that can't be;

(3) Because of the increased dimensions, the calculation of the clustering algorithm is to become a huge overhead, which result in reduced efficiency of the algorithm.

In recent years, a large number of scholars are committed to data clustering algorithm suitable for high-dimensional space. The findings can mainly be divided into two types, namely the subspace clustering method and attribute reduction method.

3.3 Attribute Reduction Method

Attribute Reduction Method refers to not lose the original data the main features they put a high-dimensional data to low-dimensional space and method of operation. Generally speaking, attribute reduction is usually by the characteristic transformation (Feaure Transformation) and feature selection (Feature Selection) two ways.

3.3.1 Feature Selection

From the original high-dimensional space, select the number of dimension to form a new coordinate system, and related clustering operation in the new coordinate system is called feature selection. Feature selection process to search, search strategy to search, combinations, the weighted method and random search to evaluate the dimensions. This dimension that has nothing to do with the desired characteristics can be removed in order to achieve the dimension reduction purpose. Feature selection method, however, ignored the relevance of the spatial data, and simply abandoned often lead to spatial clustering, independent of the dimension one-sided results.

3.3.2 Feaure Transformation

The original high-dimensional data set dimensions merged to construct low-dimensional mapping that could ensure that maintaining the characteristics of the same conditions to reduce the number of dimensions, and ultimately enable the algorithm to reflect the high-dimensional data set characteristics of the new low-dimensional space effective clustering method is called the feature transform.

The basic idea is as follows: Guarantee the basic loss of real data, information, and retain only a small part of the correlation dimension, but also so that the data is a considerable degree of approximation. Usually characterized by a transformation dimension reduction method is the use of statistics or linear algebra methods, this technology can also be extended in certain aspects of data performance analysis and has the ability to effectively remove the noise in the data. Feature transform techniques can be divided into two kinds of nonlinear and linear methods. Among them, the linear method of projection pursuit and principal component analysis applies to the handling of linear structure, which has feature of convenient calculation; Nonlinear includes neural network method, the main surface/master curve method to handle the highly nonlinear structure, whose computational is complexity.

3.4 Subspace clustering methods

The basic idea of subspace clustering algorithm is as follows: to find out clusters directly in high-dimensional space is more difficult, considering the original data space is divided into different sub-space from the corresponding sub-space to examine the presence of clustering. For subspace of the high-dimensional data space, it is difficult to understand the relatively abstract visual description high-dimensional space since the space.

Subspace clustering and feature selection is similar to the attempt to found clustering in the same data set to select a different subspace, expanding the feature selection task. Subspace clustering need to use appropriate search strategy and evaluation criteria to select the found subspace clustering search strategy and evaluation criteria, which has a significant impact on the clustering results.

At present, depending on the search direction the subspace clustering methods can be devided into two categories: from the bottom-up search and top-down search method [10].

3.4.1 Search method from the bottom

This approach narrows the search space by clustering the monotony of the dimension, or is the a priori nature of association rules: If a d-dimensional unit is dense, then its projection on the d-1 dimensional space should also be intensive, which is the space of closed; In contrast, d-1 dimensional space given data unit is not intensive, it is in an arbitrary d-dimensional space is not dense. Such algorithms are that each dimension is divided into a number of grids by using subspace clustering algorithm based on density clustering and grid-based method of combining cluster analysis. In this strategy the most prominent problem is easy to produce overlapping clusters, that is, certain points are more than one cluster, or do not belong to any one cluster. What's more based on the density and the density of the grid method the problem threshold step size and grid parameters dependent still exists.

3.4.2 Top-down search method

First the entire data set is divided into k parts, and the same set of weights given to each cluster; Then, you want to repeat some strategy for continuous improvement of these initial clusters, and to update the cluster weights. This strategy have been established for each part of the data clusters, and so to avoid the cluster due to repeated, and then to ensure that a data point exists and exists only in a cluster. However, for large databases, repeat the process of computing the required computational cost is quite high, so most of these algorithms use some strategies to improve the performance of the algorithm on a small part of the inspection data as a data sample. Similarly, in this search strategy the same or similar cluster size, cluster number of parameters will produce heavily dependent. The number of samples is also an important parameter, using the algorithm of the sampling strategy, the values of these parameters on the final clustering results have a significant impact.

4. High dimensional space clustering algorithm based on fuzzy extension

For the problem of non-axis direction of the grid density-based spatial clustering method prone to over-clustering, not smooth, fuzzy clustering and cluster boundary judgment, the authors propose a high-dimensional spatial data subspace clustering algorithm in the literature [11]. This algorithm by extending the adjacent cluster space, the traditional grid clustering method does not consider the data points within the adjacent grid study the influence of the grid caused problems to improve. However, this method does fuzzy extension for all grid cells, and if the number of spatial data dimension is higher, the efficiency of this algorithm is too low.

Through in-depth analysis of the problem, the traditional high-dimensional sub-space grid clustering algorithm showed the boundary is not clear and the problem of clustering is not smooth. The sample point distribution within the cluster boundary region of the grid need to focus on visits, and to them the exact division of clustering, a clustering within the data sample points with emphasis on the significance. Therefore, this paper extended high-dimensional space based on fuzzy clustering method. This method in high-dimensional spatial data subspace clustering analysis method to improve on the basis of the density--the grid method. Through the dense grid density threshold to determine the spatial clustering process, which must be included in the clustering results in the internal.

4.1 Fuzzy set theory

First, fuzzy set theory is proposed by Zadeh in 1965. The different of this theory with the traditional set theory is that in fuzzy set theory, each element in the collection are part of a collection to a certain extent, but to varying degrees included in a number of collections. With this feature, this fuzzy set theory is a lot of meaning to determine but contain inaccurate to provide a mathematical solution.

4.2 Grid fuzzy regions extension and the degree of membership to determine program

The clustering algorithm proposed here is based on the fuzzy extension of the high dimensional space-data space, which is actually a grid--the density-based spatial clustering method. This method inherits the advantages of the grid--density clustering algorithm to identify arbitrary shape clusters. Compared with traditional clustering methods, the biggest difference of this method is the ability to use fuzzy sets in the cluster at the same time also increased the inspection grid space by adjacent grid. Only records the number of data points within the data grid when the grid--the traditional density-based spatial clustering algorithm in Wang Geji number. However, the distribution of the data sample points with those of adjacent grid examined the impact of the data sample points within the grid is negligible, especially in multi-dimensional spatial data with each other a strong relationship between the spatial data correlation.

During the cluster analysis of spatial data, if the interaction between the data is negligible, the accuracy of the clustering results will have a very big impact. After the analysis of fuzzy set theory, fuzzy sets of the relevant grid fuzzy extension and inspection grid of sample points within the grid count is extended to the basic region and fuzzy sample point count range expansion within the region. This method is able to examine the grid to consider the impact of data points within the adjacent grid, to improve the clustering boundary judgment is unclear, and ultimately more accurate results of the subspace clustering of high dimensional spatial data is got.

Fuzzy extension of the grid is extended a certain length of each side of the grid region, and finally the expansion area of the grid is formed. Through repeated experiments, agreed the extended area of the grid for 1/2 grid. Grid on all sides were extended 1/2 side length. It is shown in Figure 1.

4.3 High dimensional space clustering algorithm based on fuzzy extension

High dimensional space clustering algorithm based on fuzzy extensions is described below:

Input: data sample space D, the density threshold value is MinPts;

Output: the center of the initial cluster set; Specific steps:

Step 1: The division of the space grid

In order to form a standard grid structure, the first spatial data sets D in accordance with the formal division. Assume that a given spatial data set D containing d attributes {[X.sub.1], [X.sub.2], [X.sub.3], ...... of [X.sub.d]}, and each attribute of the attribute values

are community. Without loss of generality, set the interval of any property is a semi-open, half closed interval [[m.sub.j], [n.sub.j]), j = 1, 2, 3 ... d, d-dimensional space can be expressed as:

Space X = [[m.sub.1], [n.sub.1]) x [[m.sub.2], [n.sub.2]) x ... x [[m.sub.d], [n.sub.d])

If each grid are an independent index ([i.sub.1], [i.sub.2] ... in.) grids. Since the data in each dimension has the same division of the scale, so you can assume that the number of sample points in each grid can represent the density of data points. In this way, through the formal division of K-order spatial data sets D division of the grid to the same scale.

Step 2: The space within the grid sample count

Space grid sample point count, generally in the grid-based equidistant division of the formation of the grid cell size the same premise, first determine the value of the mesh density, and then to determine the sparse grid and dense grid Classification of the process. During the phase space grid sample count, complete the following two parts:

(1) According to the previous definition, the basic grid area to calculate the number of data points in each grid, and determine the grid according to the density threshold for dense grids or sparse grid.

(2) Sparse grid obtained for all (1) fuzzy extended, taking into account and the grid region and expansion region. The grid sample of fuzzy extension is counted to arrive at the new grid density. Then make a judgment according to the density threshold. Upon completion of the above two steps, you can examine the adjacent grid of sample points within the grid of sample points on the cluster boundary grid. Strong support for the final clustering results clearly identified boundaries, the accuracy of the algorithm significantly improved; At the same time, only for the sparse grid to solve the problem in a most helpful way to examine that all grid fuzzy extension lead to high-dimensional data processing efficiency is not high.

Step 3: Clustering

According to the sparse grid and dense grid of results obtained in Step 2, the adjacent dense grid density connected to the operation. Dense grid merge the corresponding clusters, and the finally the clustering process is completed. In the traditional clustering process, the general shared a side of the grid as is the adjacent grid. Clustering is done only within the adjacent grid, that is clustering direction up, down, left and right. According to this definition of adjacent grid, over-clustering phenomenon of chunking will appear in the non-X-, Y-axis direction. This will allow the original class grid divided into some small cluster affect the clustering results. The algorithm redefine the grid adjacent to solve this problem. Under adjacent grid definition, the clustering process can accord to the nature of the connected to the grid density in the direction of tilt. To the cluster is divided into a number of small clusters is solved at the same time to make up for the shortcomings of traditional clustering methods.

Step 4: Outlier detection and clustering border data points judgement

The high dimensional spatial data is easy to generate a large number of isolated points due to the increased dimension. Therefore, in order to ensure more accurate and reliable clustering results after the end of the clustering process may choose to conduct the necessary outlier detection and cluster boundary point judgment steps. The process is not proposed in this paper the main process of the algorithm, specific practices can be found in the literature [12].

Step 5: The algorithm ends

Output data sample space D dimensional subspace clustering results.

5. Conclusion

Clustering results based on the grid--the density of spatial clustering algorithm do not consider the data points within the adjacent grid currently examining the grid, there is not smooth, the cluster boundary is not clear enough. Therefore, the high-dimensional subspace clustering algorithm based on fuzzy sets, the traditional grid-based density clustering algorithm has been improved: Expansion of the basic grid, the use of fuzzy set membership function on the grid regional and fuzzy extension of data points within the region, counting, taking into account the spatial data correlation, taking into account the influence of adjacent data points within the grid on the current grid, so clustering is not smooth, the boundary is not clear phenomenon is improved; At the same time, re-adjacent grid defined, to some extent, to avoid excessive clustering in the non-coordinate direction. The experimental results show that the method overcomes the deficiencies of the traditional clustering methods. This approach allows the clustering of high dimensional spatial data quality has been improved. Meanwhile, the algorithm execution time overhead than the other improvements in the algorithms have been improved.

Categories and Subject Descriptors:

H.2.8 [Database Applications]; Spatial Databases and GIS: I.5.3 [Clustering]; Algorithms

General Terms: Cluster Analysis, Fuzzy Analysis

Received: 9 May 2013, Revised 19 June 2013, Accepted 25 June 2013

References

[1] Mao Kebiao, Zhihao, Li Haitao, Zhou Ruohong. (2002). Spatial data warehouse-based spatial data mining. Remote sensing information (theoretical research). 1 (9) 26.

[2] Zhang Ruiju. (2002). CHINA. GIS and spatial data mining study of integration issues. Investigation Science and Technology. 2 (1) 21-24

[3] Beyer, K., Goldsteein, J., Ramakrishnan, R., Shaft, U. (1999). When is Nearest Neighbors Meaningful AI. In: Proc. the 7th Int. Conf. Database Theory. Jerusalem Israel, p. 217-235.

[4] Xiangqin Wang Caimei. (2010). Clustering high-dimensional space algorithm-based outlier mining technology research, Computer Technology and Development. (01) 126-131.

[5] Wang Min, Zhou, C., Pei Tao Luo Jiancheng. (2004). MSCMO: based on mathematical morphology operator scale spatial clustering algorithm. Journal of Remote Sensing. 8 (1) 45-50.

[6] Jiang, Hai-Zhou, Li Lingqin. (2004). Multi-scale spatial hierarchical clustering algorithm in land use and land cover study. Geographical Sciences. 24 (4) 477-483.

[7] Takuya Tsuji, Akihito Ito, Toshitsugu Tanaka. (2007). Multi-scale of clustering particles. Powder Technology. 179 (8) 115-125.

[8] Qian Wei-ning, Zhou Ao-ymg. (2002). Analyzing Popular Clustering Algorithms from diffremt Viewpoints. Journal of Software, 13 (8) 148-151.

[9] Martin Ester, Alexander Frornmelt, Hans Peter kriegel. (2000). Spatial Data Mining: Database Primitives. Algorithms and Efficient DBMS Support, Data Mining and Knowledge Discovery. 10 (4) 193-216.

[10] Chen Huiping, Wang Yu, Jiandong. (2007). New Progress in Research of the subspace clustering algorithm. Computer simulation. 24 (3): 6.10, 34

[11] Sheng-Sheng Wang, Da-You Liu, Cao Bin, Liu Jie. (2005). A high-dimensional space data subspace clustering algorithm. Computer Applications. 11 (25) 2615-2617.

[12] Xiekun Wu, Bi Xiaoling Ye Bin. (2007). Based on unit area of high dimensional data clustering algorithm. Computer research and development. 44 (9) 1618-1623.

Shan Donghong (1), Li Wei Yao (2)

(1) Software School of Ping Ding Shan University

(2) Computer Science and Technology School of Ping Ding Shan University

(1, 2) HeNan Ping Ding Shan, 467000. China

Since the beginning of the 21st century, all walks of life have been widely used a large number of spatial database along with the rapid development of the information age. As a result, the field of spatial data mining rise rapidly [1-2]. For Spatial clustering methods have great advantages in solving and analyzing complex problems lack of background information, the method of spatial clustering spatial data has been one of the most active and most widely used technology in mining field.

Now widely used clustering algorithm can be divided into the following types: Level-based approach, based on the partitioning method, grid-based methods, model-based approach and density-based method. However, for a particular clustering method, it is generally a variety of clustering integration together, and from the type Point of view, it can't simply be attributed to a single category. General, different clustering algorithms can be combined with each other, which will become more efficient and more reliable. In recent years, with more in-depth technology found on the clustering of knowledge, more and more new clustering algorithm emerges, which usually combine the idea of a variety of clustering methods.

Accompanied by the continuous expansion of application areas and more in-depth of clustering analysis, in recent years high-dimensional clustering problem has gradually become the focus of the study of cluster analysis. With the rapid development of a variety of detectors and sensor technology, the number of spatial data properties also increased by a few dozens or even hundreds for spatial data. To discover knowledge from dozens up to hundreds-dimensional spatial data clustering has become a challenge direction and difficult in the current course of the study. There are three types of algorithms for clustering of high dimensional data: Collaborative clustering, subspace clustering and attribute conversion. Collaborative clustering is widely used in Web and text mining, cluster analysis, however, Subspace clustering and attribute conversion method are mostly used in cluster analysis of high dimensional space. With the wide application of a variety of complex special database, the actual demand asks for newer and higher requirements on the spatial clustering task, which has brought new challenges for spatial clustering methods and technology.

2. High dimensional space clustering algorithm

2.1 Cluster analysis and high-dimensional data features

Clustring is an important data mining method. The so-called clustering is a data object to group making more than one cluster or class, After generated the cluster is a collection of a set of data objects In the same cluster objects with each other have high similarity, but objects in different clusters are different.

High dimensional spatial data are usually able to show the following characteristics [3]:

2.1.1 Data are sparse

The characteristics of high-dimensional data is different from the low-dimensional data, whose data distribution is relatively sparse. As a result, the distance between the scale and regional density in the data no longer has the intuitive meaning in high-dimensional space, Therefore, the detection and the discovery of outliers are different from the traditional one in high-dimensional space.

2.1.2 Dimension effect

Nowadays, the clustering algorithms are basically for low-dimensional space designed. For the characteristics of the effect of high-dimensional space, the data will show the dimension and sparse data characteristics. Most of the clustering algorithm fails because of the cluster centers can't be determined, the similarity function can't be calculated and other factors. It may not be accepted due to the algorithm time and space complexity increases exponentially. Thus, we need research and develop a high dimensional space clustering algorithm suitable according to the characteristics of high dimensional spatial data. The essential difference between spatial data and other types of data is its spatial properties. Space property contains the contents of the spatial distance, location, geometry and size. What's more, it can be looked as individual relationship between space, such as: Orientation relations, topological relations and metric relations, which make spatial data more complex than other types of data [4].

2.2 The difficulty of high dimensional data clustering

The difficulty is of high dimensional data clustering are as follows:

(1) Definition of the distance function is too difficult. Based on the similarity measure between the data objects for clustering operation, and ultimately the high similarity of the object is classified as a class. The equidistance function of the Euclidean distance is commonly used to measure the similarity in the low-dimensional space. In high-dimensional case, we need to re-design and similarity of guidelines and standards for the establishment of a new measure of data objects for distance function failure.

(2) In the case of the high-dimensional space, to the cluster mean will be very close calculated by distance formula and appear gap between the increasingly zero phenomenon. That is to say, Data dot pitch distance between the farthest and the nearest data points gradually tends to zero along with the increase in dimension. This will result in the clustering operation can't be done without a clear distinction between the center of the cluster.

(3) Dimension of the space is too high will lead to increased computational complexity of traditional clustering algorithms, which will make inefficient and unacceptable limiting to its range of applications.

Apparently, the category number that people are able to perceive is related to scale of the observational data: When the scale of observation is very rough, It can be said that the entire data set is a category; But when the scale of observation is in a very fine condition, all data points can be seen as the same category. This Also brings a new way to determine the most reasonable method of the class number [5]. That is to say, the number of category with the number of selected fixed number within a maximum range of the scale of change is the longest-scale survival of the categories. When the scale conditions are different, multidimensional spatial data often have different spatial forms of performance, but at the same time it also generates different clustering results. So scale problems has been gradually attracted more and more attention in many areas, especially in the field of space science. But for multiscale spatial, data mining research is still in its infancy on the current situation. Currently, Most of the spatial scale under the conditions of spatial clustering algorithm is focused on expansion in the scale based on the hierarchical clustering algorithm. Including by breaking the inconsistent side clustering or minimum spanning tree method to introduce the scale clustering [6-7]. However, these methods require human input parameters, such as the resulting class and other parameters, which has a great influence on the ultimate scale of clustering results. It can be said that these algorithms have severe dependence parameters.

2.3 The key issues of high dimensional spatial data clustering analysis

Obviously, the most critical issue of the clustering of high dimensional spatial data analysis is how to choose the high-dimensional into low-dimensional approach. And clustering results can also get feedback and clustering methods applied to the entire high-dimensional space. For an easy-to-use way to reduce the dimension, Although the method of attribute reduction and attribute conversion clustering analysis of high dimensional data. But through the dimension reduction the difference between the normal data and noise data will be reduced and the quality of clustering can't be guarantee.

In addition, although the application of dimensionality reduction techniques reduces the data dimension of the space, a lot of important clustering information get lost. Its comprehensibility and interpretability is poor and there exist certain degree of difficulty to clearly express the results. This shows there are significant limitations in dimensionality reduction based on attribute conversion of high-dimensional data processing applications, and it is not able to meet the development needs of the current applications of high-dimensional clustering.

The subject according to the characteristics of space noise data is large and complex and a high degree of noise under the space. Using the set of points D in the after-dimensional space is a class, and using high-dimensional space, subspace clustering method to Clustering. Through this method to find the largest collection of spatial data centrally connected dense units is the spatial data of the cluster. For the problems such as cluster boundary judgment clear and clustering is not smooth, finally proposed an extended high-dimensional space based on fuzzy clustering algorithm through an important role in determining the cluster boundary grid fuzzy extended.

3. Spatial data mining and spatial clustering

Spacial Data Mining belongs to a branch of Data Mining. Spatial data mining is different from the general data mining and the conventional transaction database. Spatial cluster analysis is not only an important part of spatial data mining, but also widely used in spatial data mining and in-depth study of one of the elements.

3.1 Spatial Data Mining

Compared with traditional data mining, the main different point performance of spatial data mining is in the following areas:

(1) In traditional data mining processing, the processing content is the numbers and types. But in spatial data handling, processing is more complex data types, such as: points, lines and polygons and other objects.

(2) Traditional data mining is usually explicit input, but it Is often implicit in the spatial data mining data input.

(3) Assumptions in the traditional data mining: The data sample is generated independently. However, this assumption is not established in the spatial data mining. In fact, there is a high degree of correlation between the spatial data, for example: People with similar occupational characteristics and background are usually easy to gather in the same region.

Spatial clustering analysis of the requirements of the clustering algorithm is more stringent than the traditional clustering, its clustering algorithm, typical requirements are as follows:

3.1.1 Scalability

Because of the spatial database contains extremely rich data, the structure is more complex, we must seek a more rapid and effective clustering algorithm. Its running time must be acceptable, predictable, so the time complexity of the algorithm of polynomial or exponential no practical value.

3.1.2 Ability to handle different data types

The spatial database contains data not only have complex spatial properties, but also contains the non-spatial attributes. Generally contain a variety of different data types, a good algorithm should be able to handle different data types.

3.1.3 Able to identify all type of spatial clustering of arbitrary shape

The specific features of spatial clustering is not known prior to analysis. Spatial clustering contains a variety of complex shapes, so a good algorithm should be able to identify clusters of arbitrary shape.

3.1.4 The ability to deal with noise

Often the database contains vacancies, outliers, bad data and unknown data in reality. If the clustering algorithm is sensitive to such data, it will lead to poor clustering results.

3.1.5 High dimensional spatial data

Spatial data will generally contain a higher dimension. How to clustering spatial data mining is special requirements on clustering in the high-dimensional space.

3.1.6 Usability and understandability

Results of clustering should be understandable and available for the user.

3.2 High dimensional space clustering method

Spatial Clustering is that the data set in a large multidimensional space, and using the distance metric in order to identify the clustering. This will make object has a high similarity in the same cluster, while the objects in different clusters are different from each other [8-9].

Because of the existence of Dimension effect and the data sparseness problem in high dimensional data, most of the existing clustering algorithm is in a state of failure at present. The characteristics of high dimensional spatial data in the high-dimensional clustering process are in the following areas:

(1) Difficult to define the distance function (similarity function).

Due to the similarity between the conditions low-dimensional Euclidean distance function equidistant measure data in high-dimensional space can't be passed. This lead to clustering operating on the basis of measurement can't be completed, and the new measure of high-dimensional spatial data similar to the standards and guidelines should be re-considered;

(2) Clustering operations generally need to calculate the cluster mean, or to find the value of the neighbors.

In high-dimensional case, the mean calculated by the distance function will be very close to each other. In many cases, a data point and it recently and the furthest data point is almost equidistant, which makes the cluster center can't determine the cause of the clustering operation that can't be;

(3) Because of the increased dimensions, the calculation of the clustering algorithm is to become a huge overhead, which result in reduced efficiency of the algorithm.

In recent years, a large number of scholars are committed to data clustering algorithm suitable for high-dimensional space. The findings can mainly be divided into two types, namely the subspace clustering method and attribute reduction method.

3.3 Attribute Reduction Method

Attribute Reduction Method refers to not lose the original data the main features they put a high-dimensional data to low-dimensional space and method of operation. Generally speaking, attribute reduction is usually by the characteristic transformation (Feaure Transformation) and feature selection (Feature Selection) two ways.

3.3.1 Feature Selection

From the original high-dimensional space, select the number of dimension to form a new coordinate system, and related clustering operation in the new coordinate system is called feature selection. Feature selection process to search, search strategy to search, combinations, the weighted method and random search to evaluate the dimensions. This dimension that has nothing to do with the desired characteristics can be removed in order to achieve the dimension reduction purpose. Feature selection method, however, ignored the relevance of the spatial data, and simply abandoned often lead to spatial clustering, independent of the dimension one-sided results.

3.3.2 Feaure Transformation

The original high-dimensional data set dimensions merged to construct low-dimensional mapping that could ensure that maintaining the characteristics of the same conditions to reduce the number of dimensions, and ultimately enable the algorithm to reflect the high-dimensional data set characteristics of the new low-dimensional space effective clustering method is called the feature transform.

The basic idea is as follows: Guarantee the basic loss of real data, information, and retain only a small part of the correlation dimension, but also so that the data is a considerable degree of approximation. Usually characterized by a transformation dimension reduction method is the use of statistics or linear algebra methods, this technology can also be extended in certain aspects of data performance analysis and has the ability to effectively remove the noise in the data. Feature transform techniques can be divided into two kinds of nonlinear and linear methods. Among them, the linear method of projection pursuit and principal component analysis applies to the handling of linear structure, which has feature of convenient calculation; Nonlinear includes neural network method, the main surface/master curve method to handle the highly nonlinear structure, whose computational is complexity.

3.4 Subspace clustering methods

The basic idea of subspace clustering algorithm is as follows: to find out clusters directly in high-dimensional space is more difficult, considering the original data space is divided into different sub-space from the corresponding sub-space to examine the presence of clustering. For subspace of the high-dimensional data space, it is difficult to understand the relatively abstract visual description high-dimensional space since the space.

Subspace clustering and feature selection is similar to the attempt to found clustering in the same data set to select a different subspace, expanding the feature selection task. Subspace clustering need to use appropriate search strategy and evaluation criteria to select the found subspace clustering search strategy and evaluation criteria, which has a significant impact on the clustering results.

At present, depending on the search direction the subspace clustering methods can be devided into two categories: from the bottom-up search and top-down search method [10].

3.4.1 Search method from the bottom

This approach narrows the search space by clustering the monotony of the dimension, or is the a priori nature of association rules: If a d-dimensional unit is dense, then its projection on the d-1 dimensional space should also be intensive, which is the space of closed; In contrast, d-1 dimensional space given data unit is not intensive, it is in an arbitrary d-dimensional space is not dense. Such algorithms are that each dimension is divided into a number of grids by using subspace clustering algorithm based on density clustering and grid-based method of combining cluster analysis. In this strategy the most prominent problem is easy to produce overlapping clusters, that is, certain points are more than one cluster, or do not belong to any one cluster. What's more based on the density and the density of the grid method the problem threshold step size and grid parameters dependent still exists.

3.4.2 Top-down search method

First the entire data set is divided into k parts, and the same set of weights given to each cluster; Then, you want to repeat some strategy for continuous improvement of these initial clusters, and to update the cluster weights. This strategy have been established for each part of the data clusters, and so to avoid the cluster due to repeated, and then to ensure that a data point exists and exists only in a cluster. However, for large databases, repeat the process of computing the required computational cost is quite high, so most of these algorithms use some strategies to improve the performance of the algorithm on a small part of the inspection data as a data sample. Similarly, in this search strategy the same or similar cluster size, cluster number of parameters will produce heavily dependent. The number of samples is also an important parameter, using the algorithm of the sampling strategy, the values of these parameters on the final clustering results have a significant impact.

4. High dimensional space clustering algorithm based on fuzzy extension

For the problem of non-axis direction of the grid density-based spatial clustering method prone to over-clustering, not smooth, fuzzy clustering and cluster boundary judgment, the authors propose a high-dimensional spatial data subspace clustering algorithm in the literature [11]. This algorithm by extending the adjacent cluster space, the traditional grid clustering method does not consider the data points within the adjacent grid study the influence of the grid caused problems to improve. However, this method does fuzzy extension for all grid cells, and if the number of spatial data dimension is higher, the efficiency of this algorithm is too low.

Through in-depth analysis of the problem, the traditional high-dimensional sub-space grid clustering algorithm showed the boundary is not clear and the problem of clustering is not smooth. The sample point distribution within the cluster boundary region of the grid need to focus on visits, and to them the exact division of clustering, a clustering within the data sample points with emphasis on the significance. Therefore, this paper extended high-dimensional space based on fuzzy clustering method. This method in high-dimensional spatial data subspace clustering analysis method to improve on the basis of the density--the grid method. Through the dense grid density threshold to determine the spatial clustering process, which must be included in the clustering results in the internal.

4.1 Fuzzy set theory

First, fuzzy set theory is proposed by Zadeh in 1965. The different of this theory with the traditional set theory is that in fuzzy set theory, each element in the collection are part of a collection to a certain extent, but to varying degrees included in a number of collections. With this feature, this fuzzy set theory is a lot of meaning to determine but contain inaccurate to provide a mathematical solution.

4.2 Grid fuzzy regions extension and the degree of membership to determine program

The clustering algorithm proposed here is based on the fuzzy extension of the high dimensional space-data space, which is actually a grid--the density-based spatial clustering method. This method inherits the advantages of the grid--density clustering algorithm to identify arbitrary shape clusters. Compared with traditional clustering methods, the biggest difference of this method is the ability to use fuzzy sets in the cluster at the same time also increased the inspection grid space by adjacent grid. Only records the number of data points within the data grid when the grid--the traditional density-based spatial clustering algorithm in Wang Geji number. However, the distribution of the data sample points with those of adjacent grid examined the impact of the data sample points within the grid is negligible, especially in multi-dimensional spatial data with each other a strong relationship between the spatial data correlation.

During the cluster analysis of spatial data, if the interaction between the data is negligible, the accuracy of the clustering results will have a very big impact. After the analysis of fuzzy set theory, fuzzy sets of the relevant grid fuzzy extension and inspection grid of sample points within the grid count is extended to the basic region and fuzzy sample point count range expansion within the region. This method is able to examine the grid to consider the impact of data points within the adjacent grid, to improve the clustering boundary judgment is unclear, and ultimately more accurate results of the subspace clustering of high dimensional spatial data is got.

Fuzzy extension of the grid is extended a certain length of each side of the grid region, and finally the expansion area of the grid is formed. Through repeated experiments, agreed the extended area of the grid for 1/2 grid. Grid on all sides were extended 1/2 side length. It is shown in Figure 1.

4.3 High dimensional space clustering algorithm based on fuzzy extension

High dimensional space clustering algorithm based on fuzzy extensions is described below:

Input: data sample space D, the density threshold value is MinPts;

Output: the center of the initial cluster set; Specific steps:

Step 1: The division of the space grid

In order to form a standard grid structure, the first spatial data sets D in accordance with the formal division. Assume that a given spatial data set D containing d attributes {[X.sub.1], [X.sub.2], [X.sub.3], ...... of [X.sub.d]}, and each attribute of the attribute values

are community. Without loss of generality, set the interval of any property is a semi-open, half closed interval [[m.sub.j], [n.sub.j]), j = 1, 2, 3 ... d, d-dimensional space can be expressed as:

Space X = [[m.sub.1], [n.sub.1]) x [[m.sub.2], [n.sub.2]) x ... x [[m.sub.d], [n.sub.d])

If each grid are an independent index ([i.sub.1], [i.sub.2] ... in.) grids. Since the data in each dimension has the same division of the scale, so you can assume that the number of sample points in each grid can represent the density of data points. In this way, through the formal division of K-order spatial data sets D division of the grid to the same scale.

Step 2: The space within the grid sample count

Space grid sample point count, generally in the grid-based equidistant division of the formation of the grid cell size the same premise, first determine the value of the mesh density, and then to determine the sparse grid and dense grid Classification of the process. During the phase space grid sample count, complete the following two parts:

(1) According to the previous definition, the basic grid area to calculate the number of data points in each grid, and determine the grid according to the density threshold for dense grids or sparse grid.

(2) Sparse grid obtained for all (1) fuzzy extended, taking into account and the grid region and expansion region. The grid sample of fuzzy extension is counted to arrive at the new grid density. Then make a judgment according to the density threshold. Upon completion of the above two steps, you can examine the adjacent grid of sample points within the grid of sample points on the cluster boundary grid. Strong support for the final clustering results clearly identified boundaries, the accuracy of the algorithm significantly improved; At the same time, only for the sparse grid to solve the problem in a most helpful way to examine that all grid fuzzy extension lead to high-dimensional data processing efficiency is not high.

Step 3: Clustering

According to the sparse grid and dense grid of results obtained in Step 2, the adjacent dense grid density connected to the operation. Dense grid merge the corresponding clusters, and the finally the clustering process is completed. In the traditional clustering process, the general shared a side of the grid as is the adjacent grid. Clustering is done only within the adjacent grid, that is clustering direction up, down, left and right. According to this definition of adjacent grid, over-clustering phenomenon of chunking will appear in the non-X-, Y-axis direction. This will allow the original class grid divided into some small cluster affect the clustering results. The algorithm redefine the grid adjacent to solve this problem. Under adjacent grid definition, the clustering process can accord to the nature of the connected to the grid density in the direction of tilt. To the cluster is divided into a number of small clusters is solved at the same time to make up for the shortcomings of traditional clustering methods.

Step 4: Outlier detection and clustering border data points judgement

The high dimensional spatial data is easy to generate a large number of isolated points due to the increased dimension. Therefore, in order to ensure more accurate and reliable clustering results after the end of the clustering process may choose to conduct the necessary outlier detection and cluster boundary point judgment steps. The process is not proposed in this paper the main process of the algorithm, specific practices can be found in the literature [12].

Step 5: The algorithm ends

Output data sample space D dimensional subspace clustering results.

5. Conclusion

Clustering results based on the grid--the density of spatial clustering algorithm do not consider the data points within the adjacent grid currently examining the grid, there is not smooth, the cluster boundary is not clear enough. Therefore, the high-dimensional subspace clustering algorithm based on fuzzy sets, the traditional grid-based density clustering algorithm has been improved: Expansion of the basic grid, the use of fuzzy set membership function on the grid regional and fuzzy extension of data points within the region, counting, taking into account the spatial data correlation, taking into account the influence of adjacent data points within the grid on the current grid, so clustering is not smooth, the boundary is not clear phenomenon is improved; At the same time, re-adjacent grid defined, to some extent, to avoid excessive clustering in the non-coordinate direction. The experimental results show that the method overcomes the deficiencies of the traditional clustering methods. This approach allows the clustering of high dimensional spatial data quality has been improved. Meanwhile, the algorithm execution time overhead than the other improvements in the algorithms have been improved.

Categories and Subject Descriptors:

H.2.8 [Database Applications]; Spatial Databases and GIS: I.5.3 [Clustering]; Algorithms

General Terms: Cluster Analysis, Fuzzy Analysis

Received: 9 May 2013, Revised 19 June 2013, Accepted 25 June 2013

References

[1] Mao Kebiao, Zhihao, Li Haitao, Zhou Ruohong. (2002). Spatial data warehouse-based spatial data mining. Remote sensing information (theoretical research). 1 (9) 26.

[2] Zhang Ruiju. (2002). CHINA. GIS and spatial data mining study of integration issues. Investigation Science and Technology. 2 (1) 21-24

[3] Beyer, K., Goldsteein, J., Ramakrishnan, R., Shaft, U. (1999). When is Nearest Neighbors Meaningful AI. In: Proc. the 7th Int. Conf. Database Theory. Jerusalem Israel, p. 217-235.

[4] Xiangqin Wang Caimei. (2010). Clustering high-dimensional space algorithm-based outlier mining technology research, Computer Technology and Development. (01) 126-131.

[5] Wang Min, Zhou, C., Pei Tao Luo Jiancheng. (2004). MSCMO: based on mathematical morphology operator scale spatial clustering algorithm. Journal of Remote Sensing. 8 (1) 45-50.

[6] Jiang, Hai-Zhou, Li Lingqin. (2004). Multi-scale spatial hierarchical clustering algorithm in land use and land cover study. Geographical Sciences. 24 (4) 477-483.

[7] Takuya Tsuji, Akihito Ito, Toshitsugu Tanaka. (2007). Multi-scale of clustering particles. Powder Technology. 179 (8) 115-125.

[8] Qian Wei-ning, Zhou Ao-ymg. (2002). Analyzing Popular Clustering Algorithms from diffremt Viewpoints. Journal of Software, 13 (8) 148-151.

[9] Martin Ester, Alexander Frornmelt, Hans Peter kriegel. (2000). Spatial Data Mining: Database Primitives. Algorithms and Efficient DBMS Support, Data Mining and Knowledge Discovery. 10 (4) 193-216.

[10] Chen Huiping, Wang Yu, Jiandong. (2007). New Progress in Research of the subspace clustering algorithm. Computer simulation. 24 (3): 6.10, 34

[11] Sheng-Sheng Wang, Da-You Liu, Cao Bin, Liu Jie. (2005). A high-dimensional space data subspace clustering algorithm. Computer Applications. 11 (25) 2615-2617.

[12] Xiekun Wu, Bi Xiaoling Ye Bin. (2007). Based on unit area of high dimensional data clustering algorithm. Computer research and development. 44 (9) 1618-1623.

Shan Donghong (1), Li Wei Yao (2)

(1) Software School of Ping Ding Shan University

(2) Computer Science and Technology School of Ping Ding Shan University

(1, 2) HeNan Ping Ding Shan, 467000. China

Printer friendly Cite/link Email Feedback | |

Author: | Donghong, Shan; Yao, Li Wei |
---|---|

Publication: | Journal of Digital Information Management |

Article Type: | Report |

Date: | Oct 1, 2013 |

Words: | 4833 |

Previous Article: | A new shuffled frog leaping algorithm with space zoomed factor and gravity attractor. |

Next Article: | Investigating a new design pattern for efficient implementation of prediction algorithms. |

Topics: |