
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.

1. Introduction

1.1. Background. Due to advances in digital sensors, computing, communication, and massive storage, tremendous amounts of data are produced constantly in the modern world. The continuously growing data carry great business value, but data are useless by themselves; analytical methods are required to extract meaningful insight from them so that effective decisions can be made. Clustering is an indispensable and fundamental data analysis method. The traditional clustering algorithm is executed in batch mode; namely, all data points must be loaded into the memory of the host machine, and every data point can be accessed an unlimited number of times during execution. However, a batch-mode clustering algorithm cannot adjust its clustering result in an evolving manner. For instance, evolving temporal data must be clustered incrementally so that the underlying structure can be detected [1]. In stream data mining, preprocessing tasks such as data reduction need the support of incremental clustering [2, 3]. Incremental clustering can also contribute significantly to massive data searching [4]. In summary, the evolving capability of incremental clustering is indispensable in certain scenarios, such as memory-limited applications, time-limited applications, and redundancy detection; a practical application may combine any of these three characteristics.

The incremental clustering algorithm proceeds in an evolving manner, processing the input data step by step. In each step, the algorithm receives a newly arrived subset of the input and obtains the new knowledge of this subset. Afterwards, the historic clusters of the previous step are updated with the new knowledge, and the updated clusters serve as input of the next step. In the first step there is no updating operation; the new knowledge obtained in the first step serves as input of the second step.

Application scenarios of incremental clustering generally place high demands on the computing capacity of the hardware platform. The General Purpose Graphics Processing Unit (GPGPU) is a promising parallel computing device. GPGPU has broad development prospects due to the following advantages: computing power that grows much faster than that of the CPU, a high efficiency-cost ratio, and usability.
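
As a concrete illustration of this evolving execution pattern, the following minimal Python sketch shows how a generic incremental clustering task consumes its input block by block. The callables cluster_block and merge_into_history are hypothetical placeholders for the two abstract operations described above, not a specific algorithm from this paper.

```python
# Minimal sketch of the generic incremental clustering loop described above.
# cluster_block() and merge_into_history() are hypothetical placeholders.

def incremental_clustering(blocks, cluster_block, merge_into_history):
    """Process the input one subset (block) at a time.

    blocks             -- iterable of data subsets X_1, X_2, X_3, ...
    cluster_block      -- clusters one subset and returns its new clusters
    merge_into_history -- updates the historic clusters with the new ones
    """
    historic_clusters = None
    for t, X_t in enumerate(blocks, start=1):
        new_clusters = cluster_block(X_t)            # new knowledge of step t
        if t == 1:
            historic_clusters = new_clusters          # first step: no update
        else:
            historic_clusters = merge_into_history(historic_clusters,
                                                    new_clusters)
    return historic_clusters
```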

1.2. Motivation and Related Works. Our previous work revealed that existing incremental clustering algorithms are confronted with an accuracy-parallelism dilemma [5, 6]. In this predicament, the governing factor is the evolving granularity of the incremental clustering algorithm. For instance, point-wise algorithms proceed at fine granularity. In each step, such an algorithm receives only a single data point (serving as the new knowledge); this point is either assigned to an existing historic cluster or induces a new, independent cluster among the historic clusters. Such algorithms generally achieve favorable clustering accuracy. However, they sacrifice parallelism due to strong data dependency: the next data point cannot be processed until the current one is completely processed. Modern GPGPUs can contain thousands of processing cores, whereas the number of historic clusters grows only progressively, even if it eventually reaches the order of thousands. A GPGPU can fully leverage its computing power only if it runs abundant threads in a SIMD (Single Instruction Multiple Data) manner (commonly twice the number of processing cores or even more). In addition, under the point-wise setting the work-depth is inevitably no less than the number of input data points. Consequently, the computing power is inevitably underutilized. Moreover, a larger work-depth requires more kernel launches (GPGPU code executions), and the time overhead of a kernel launch is generally high. Some representative point-wise incremental clustering algorithms were elaborated in [7-19].

Ganti et al. used a block-evolving pattern to detect changes in stream data [20]. Song and Wang adopted the block-wise pattern and proposed an incremental clustering algorithm for the GMM (Gaussian Mixture Model) [21]. The algorithm of [21] proceeds at coarse granularity. Each step contains three substeps: first, obtain the new knowledge by running the standard EM (Expectation Maximization) algorithm on a newly received data block; second, identify the statistically equivalent cluster pairs between the historic clusters and the new clusters; finally, merge the equivalent cluster pairs separately. The standard EM algorithm for GMM is inherently GPGPU-friendly [22], and the algorithm of [21] maintains this inherent parallelism. However, its clustering accuracy degrades by an order of magnitude compared to its batch-mode counterpart (the standard EM algorithm for GMM) [5]. Moreover, we qualitatively analyzed why the block-wise pattern tends to induce accuracy degradation in our previous work [6]. The algorithms of [23, 24] are also block-wise. D-Stream is ostensibly point-wise [25]; nevertheless, it is essentially block-wise, because mapping data points into grids can be parallelized in a SIMD manner.

As far as we know, most existing works focus on clustering accuracy. Existing algorithms, even the block-wise ones, do not explicitly consider algorithm parallelism on SIMD many-core processing devices such as GPGPU. Some recent works formally analyzed issues of clustering or machine learning algorithms. Ackerman and Dasgupta pointed out that incremental clustering cannot detect certain types of cluster structure [26]. They formally analyzed the cause of this limitation and proposed overcoming it by allowing extra clusters. Our work is similar to that of [26] in the sense that we also formally analyze why incremental clustering is inefficient under certain conditions; in contrast, we elaborate our work in the context of GPGPU acceleration. Ackerman and Moore formally analyzed the perturbation robustness of batch-mode clustering algorithms [27, 28], but those works concentrated only on classical batch-mode clustering algorithms. Gepperth and Hammer qualitatively analyzed the challenges that incremental learning faces [29]. They pointed out a dilemma between stability and plasticity, whereas we focus on the dilemma between accuracy and parallelism.
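
To make the block-wise pattern concrete, the sketch below outlines one step in the style of the three substeps of [21], using scikit-learn's GaussianMixture for the EM substep. The equivalence test (a Euclidean distance threshold between component means) and the merging rule are our own simplifications for illustration only, not the statistical test actually used in [21].

```python
# Hedged sketch of one block-wise step in the style of [21]:
# (1) run EM on the new block, (2) match approximately equivalent clusters,
# (3) merge matched pairs. The mean-distance threshold below is a
# simplification standing in for the statistical-equivalence test of [21].
import numpy as np
from sklearn.mixture import GaussianMixture

def block_wise_step(historic, X_t, n_components=5, eq_threshold=1.0):
    # Substep 1: new knowledge from the newly received block.
    gmm = GaussianMixture(n_components=n_components).fit(X_t)
    new = [{"mean": m, "weight": w}
           for m, w in zip(gmm.means_, gmm.weights_)]
    if historic is None:
        return new
    # Substep 2: identify (approximately) equivalent cluster pairs.
    merged = [dict(h) for h in historic]
    leftovers = []
    for comp in new:
        dists = [np.linalg.norm(comp["mean"] - h["mean"]) for h in merged]
        j = int(np.argmin(dists))
        if dists[j] < eq_threshold:
            # Substep 3: merge the pair (weighted average of the means).
            total = merged[j]["weight"] + comp["weight"]
            merged[j]["mean"] = (merged[j]["weight"] * merged[j]["mean"]
                                 + comp["weight"] * comp["mean"]) / total
            merged[j]["weight"] = total
        else:
            leftovers.append(comp)
    return merged + leftovers
```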

1.3. Main Contribution. In this paper, we extend our previous works [6, 30] in the following ways. First, several vital concepts (such as incremental clustering and evolving granularity) are formally defined. Second, we formally prove how evolving granularity influences the accuracy and parallelism of incremental clustering algorithms; in this way, the starting point of our previous works is formally validated. Finally, we demonstrate the theoretical conclusions through a demo algorithm. These conclusions will be the foundation of our future work.

2. Formal Definition of Terminologies

2.1. Terminologies on Incremental Clustering

Definition 1 (incremental clustering). $x_1, x_2, x_3, \ldots, x_T$ is a series of data points ($x_i \in \mathbb{R}^d$, $1 \le i \le T$). The data points are partitioned into $T_1$ sets $X_1, X_2, X_3, \ldots, X_{T_1}$. This partition satisfies the following conditions:

(1) $X_j \ne \emptyset$ ($j = 1, 2, 3, \ldots, T_1$).

(2) If $x_{i_1} \in X_{j_1}$ and $x_{i_2} \in X_{j_2}$ ($i_1 \ne i_2$, $j_1 \ne j_2$), then $j_2 > j_1 \Leftrightarrow i_2 > i_1$.

A data analysis task adopts a discrete time system whose time stamps are labeled 1, 2, 3, .... This task is incremental clustering if and only if the following applies:

(1) When $t = 1$, the task receives $X_1$. In the time interval $[1, 2]$, the task partitions $X_1$ into clusters, and $C_1$ is the set of these clusters. The entire input to the task is $X_1$.

(2) When $t \ge 2$, the task receives $X_t$. In the time interval $[t, t+1]$, the task resolves the set of clusters $C_t$ such that every $x \in \bigcup_{1 \le k \le t} X_k$ can find its affiliated cluster in $C_t$. The entire inputs to the task are $X_t$ and $C_{t-1}$.

Definition 2 (the $t$th step and historic clustering result of the $t$th step). An algorithm for incremental clustering is an incremental clustering algorithm. The time interval $[t, t+1]$ ($t = 1, 2, 3, \ldots$) is the $t$th step of an incremental clustering algorithm (or step $t$ of an incremental clustering algorithm). $C_t$ is the historic clustering result of the $t$th step.

Definition 3 (micro-cluster). Let batchAlgorithm represent a batch-mode clustering algorithm. $\vec{\xi} = (\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(p)}) \in (\mathbb{R}^{+})^{p}$ ($1 \le p \le d$) is the parameter of batchAlgorithm. batchAlgorithm can partition $X_t$ into subsets $Sub_k$ ($k = 1, 2, 3, \ldots, K_{t,new}$). $\vec{\xi}_0 = (\xi^{(1)}_0, \xi^{(2)}_0, \ldots, \xi^{(p)}_0) \in (\mathbb{R}^{+})^{p}$ is a constant vector. $Sub_k$ is a $\vec{\xi}_0$-micro-cluster produced by batchAlgorithm if and only if the following applies:

(1) $\bigcup_{1 \le k \le K_{t,new}} Sub_k = X_t$.

(2) $\forall k_1 \ne k_2$, $Sub_{k_1} \cap Sub_{k_2} = \emptyset$.

(3) $Sub_k$ forms a data cloud in $d$-dimensional space. The hypervolume of this data cloud is positively related to $\xi^{(u)}_0$ ($u = 1, 2, 3, \ldots, p$). $Sub_k$ contains one and only one data point in the limit $\vec{\xi}_0 \to \vec{0}$.

(4) $\xi^{(u)}_0$ ($u = 1, 2, 3, \ldots, p$) are all preset constant values.

Definition 4 (batch-mode part and incremental part of step $t$). Some incremental clustering algorithms divide step $t$ ($t = 1, 2, 3, \ldots$) into two parts [21, 23-25]. In the first part, $X_t$ is partitioned into new clusters (or new micro-clusters) pursuant to certain similarity metrics; $C_{t,new}$ is the set of these clusters (or micro-clusters). In the second part, $C_t$ is resolved based on $C_{t,new}$ and $C_{t-1}$ (if any). The number of clusters (or micro-clusters) in $C_{t,new}$ is denoted $K_{t,new}$. The first part can be accomplished by a batch-mode clustering algorithm; this part is the batch-mode part of step $t$. The second part is the incremental part of step $t$.

Definition 5 (benchmark algorithm, benchmark clustering result, and benchmark cluster). Denote $X_1, X_2, X_3, \ldots, X_{T_1}$ (pursuant to Definition 1) as StrParti. Incremental clustering is applied to StrParti, and $C_{T_0}$ is the historic clustering result of the $T_0$th step ($T_0 \le T_1$); let $X_B = \bigcup_{1 \le k \le T_0} X_k$. If $X_B$ could be loaded entirely into the memory of the host machine and processed by a certain batch-mode clustering algorithm, the resulting clusters would form a set denoted $C_{T_0, benchmark}$. This batch-mode algorithm is called the benchmark algorithm of the incremental clustering algorithm. $C_{T_0, benchmark}$ is the benchmark clustering result up to step $T_0$. An arbitrary cluster of $C_{T_0, benchmark}$ is a benchmark cluster.

Definition 6 (local benchmark cluster). $C_{T, benchmark} = \{Clu_k \mid k = 1, 2, 3, \ldots, K_{T, benchmark}\}$ is the benchmark clustering result up to the $T$th step; $Clu_k$ ($k = 1, 2, 3, \ldots, K_{T, benchmark}$) are the benchmark clusters in $C_{T, benchmark}$. $X_t$ is the newly received data set of step $t$. All data points of $X_t$ are labeled such that points with the same label are affiliated to the same benchmark cluster. Partition $X_t$ into nonempty subsets, denoted $SubClu_k$ ($k = 1, 2, 3, \ldots, K'_t$). These subsets satisfy the following conditions:

(1) $\bigcup_{1 \le k \le K'_t} SubClu_k = X_t$.

(2) If $x_i, x_j \in X_t$, then $x_i, x_j \in SubClu_u \Leftrightarrow x_i$ and $x_j$ possess the same label.

(3) $SubClu_u, SubClu_v \subset X_t$; if $x_i \in SubClu_u$ and $x_j \in SubClu_v$ ($u \ne v$), then $x_i$ and $x_j$ have different labels.

$SubClu_k$ ($k = 1, 2, 3, \ldots, K'_t$) are called the local benchmark clusters of step $t$, or local benchmark clusters for short, abbreviated LBC.

Definitions 1-4 provide the terminology to formally interpret the concept of incremental clustering as well as its execution mechanism. Definitions 5 and 6 furnish a benchmark against which the accuracy of an incremental clustering algorithm can be evaluated.

2.2. Terminologies on Evolving Granularity, Clustering Accuracy, and Parallelism

Definition 7 (containing hypersurface of a data set). Let $Clu_w \subset \mathbb{R}^d$ represent a set of data points. $HS$ is a hypersurface in the $d$-dimensional space. $HS$ is the containing hypersurface of $Clu_w$ if and only if $HS$ is a closed hypersurface and every point of $Clu_w$ lies within the interior of $HS$.

Definition 8 (envelope hypersurface, envelope body, and envelope hypervolume of a data set). Let $Clu_w \subset \mathbb{R}^d$ represent a set of data points. $S_{HS} = \{HS_u \mid u = 1, 2, 3, \ldots, U\}$ is the set of containing hypersurfaces of $Clu_w$. Let $V_u$ represent the hypervolume encapsulated by $HS_u$. A hypersurface $HS_v$ is the envelope hypersurface of $Clu_w$ if and only if $HS_v \in S_{HS}$ and $V_v = \min_{1 \le u \le U} V_u$. Let $EN_w$ represent the envelope hypersurface of $Clu_w$; the region encapsulated by $EN_w$ is the envelope body of $Clu_w$, and the hypervolume of this envelope body is the envelope hypervolume of $Clu_w$.

Definition 9 (core hypersphere, margin hypersphere, core hypervolume, and margin hypervolume of a data set). Let $Clu_w \subset \mathbb{R}^d$ be a data set. $EN_w$ is the envelope hypersurface of $Clu_w$, and $center_w$ represents the geometric center of the envelope body of $Clu_w$. $dist(x)$ represents the distance between $center_w$ and an arbitrary point $x$ on $EN_w$. Let $dist_{w,\min} = \min_{x \in EN_w} dist(x)$ and $dist_{w,\max} = \max_{x \in EN_w} dist(x)$.

A hypersphere is the core hypersphere of $Clu_w$, denoted $SPSur_{\min}$, if and only if it is centered at $center_w$ and its radius is $dist_{w,\min}$. The hypervolume encapsulated by $SPSur_{\min}$ is the core hypervolume of $Clu_w$, denoted $SP_{\min}(Clu_w)$.

A hypersphere is the margin hypersphere of $Clu_w$, denoted $SPSur_{\max}$, if and only if it is centered at $center_w$ and its radius is $dist_{w,\max}$. The hypervolume encapsulated by $SPSur_{\max}$ is the margin hypervolume of $Clu_w$, denoted $SP_{\max}(Clu_w)$.

Definition 10 (core evolving granularity, margin evolving granularity, and average evolving granularity). In the $t$th step, the incremental clustering algorithm receives the data set $X_t$. $X_t$ is partitioned into nonempty subsets $Sub_k$ ($k = 1, 2, 3, \ldots, K_{t,new}$) pursuant to certain metrics, such that $\bigcup_{1 \le k \le K_{t,new}} Sub_k = X_t$. Let $SP_{\min}(Sub_k)$ be the core hypervolume of $Sub_k$. Then in the $t$th step, the core evolving granularity of the algorithm is $SubGra_{\min,t} = \min_{1 \le k \le K_{t,new}} SP_{\min}(Sub_k)$.

Up to the $T$th step, the core evolving granularity of the algorithm is $Gra_{\min,T} = \min_{1 \le t \le T} SubGra_{\min,t}$.

Let $SP_{\max}(Sub_k)$ be the margin hypervolume of $Sub_k$. Then in the $t$th step, the margin evolving granularity of the algorithm is $SubGra_{\max,t} = \max_{1 \le k \le K_{t,new}} SP_{\max}(Sub_k)$.

Up to the $T$th step, the margin evolving granularity of the algorithm is $Gra_{\max,T} = \max_{1 \le t \le T} SubGra_{\max,t}$. Let $EN_k$ be the envelope hypersurface of $Sub_k$ and $SubSet_k = EN_k \cap Sub_k$; $|SubSet_k|$ is the number of data points within $SubSet_k$; $center_k$ represents the geometric center of the envelope body of $Sub_k$; and $dist_{k,u}$ represents the distance between $center_k$ and $x_u \in SubSet_k$. Let $ave\_dist_k = (1/|SubSet_k|) \sum_{x_u \in SubSet_k} dist_{k,u}$.

$SP_{ave}(Sub_k)$ is the hypervolume of the hypersphere centered at $center_k$ with radius $ave\_dist_k$. In the $t$th step, the average evolving granularity of the algorithm is $Gra_{ave,t} = (1/K_{t,new}) \sum_{k=1}^{K_{t,new}} SP_{ave}(Sub_k)$.

Up to the $T$th step, the average evolving granularity of the algorithm is $Gra_{ave,T} = \sum_{t=1}^{T} Gra_{ave,t} K_{t,new} / \sum_{t=1}^{T} K_{t,new}$.
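
The granularity measures of Definitions 9 and 10 can be approximated numerically. The sketch below is a minimal illustration in which the envelope hypersurface of a subset is approximated by its convex hull and the geometric center of the envelope body by the centroid of the hull vertices; these are our simplifications, not the exact constructions used in the definitions.

```python
# Hedged sketch of Definitions 9 and 10 under a convex-hull approximation
# of the envelope hypersurface.
import math
import numpy as np
from scipy.spatial import ConvexHull

def ball_volume(radius, d):
    # Hypervolume of a d-dimensional ball of the given radius.
    return (math.pi ** (d / 2) / math.gamma(d / 2 + 1)) * radius ** d

def step_granularities(subsets):
    """subsets: list of (n_k, d) arrays Sub_k produced in one step."""
    d = subsets[0].shape[1]
    core, margin, average = [], [], []
    for sub in subsets:
        boundary = sub[ConvexHull(sub).vertices]       # stand-in for EN_k
        center = boundary.mean(axis=0)                 # stand-in for center_k
        dists = np.linalg.norm(boundary - center, axis=1)
        core.append(ball_volume(dists.min(), d))       # SP_min(Sub_k)
        margin.append(ball_volume(dists.max(), d))     # SP_max(Sub_k)
        average.append(ball_volume(dists.mean(), d))   # SP_ave(Sub_k)
    # SubGra_min_t, SubGra_max_t, and Gra_ave_t for this step
    return min(core), max(margin), sum(average) / len(average)
```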

Definition 11 (different-to-same mis-affiliation). Different-to-same mis-affiliation is the phenomenon that, in step $t$, data points from different benchmark clusters are affiliated to the same cluster of $C_{t,new}$ or $C_t$.

Definition 12 (same-to-different mis-affiliation). Same-to-different mis-affiliation is the phenomenon that, in step $t$, data points from the same benchmark cluster are affiliated to different clusters of $C_{t,new}$ or $C_t$.

We adopt the Rand Index [31] to measure clustering accuracy; a larger Rand Index means higher clustering accuracy.

There are numerous criteria for measuring cluster separation in the existing literature. We select the Rand Index because it directly reflects our intent: measuring clustering accuracy by counting occurrences of data point mis-affiliations.
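
Since the Rand Index is the accuracy criterion used throughout the paper, a small self-contained implementation may help fix ideas. It simply counts the point pairs on which the evaluated partition and the benchmark partition agree; the tiny example labels are made up for illustration.

```python
# Plain Rand Index: fraction of point pairs on which the two partitions
# agree (both place the pair in the same cluster, or both separate it).
from itertools import combinations

def rand_index(labels_a, labels_b):
    n = len(labels_a)
    assert n == len(labels_b) and n >= 2
    agree = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
    return agree / (n * (n - 1) / 2)

# Example: a different-to-same mis-affiliation lowers the index.
benchmark = [0, 0, 1, 1]
clustered = [0, 0, 0, 1]   # one point of cluster 1 absorbed into cluster 0
print(rand_index(benchmark, clustered))  # 0.5
```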

Definition 13 (serial shrink rate (SSR)). Let incAlgorithm represent an incremental clustering algorithm. In step $t$ ($t = 1, 2, 3, \ldots$), the batch-mode part generates $micro_t$ micro-clusters. Suppose incAlgorithm has clustered $N$ data points in total up to step $T_0$. Up to step $T_0$, the serial shrink rate of incAlgorithm is

$$SSR = \frac{\sum_{t=1}^{T_0} micro_t}{N}. \qquad (1)$$

A lower SSR means that less computation must inevitably run in a non-GPGPU-friendly (sequential) manner; consequently, a smaller SSR means improved parallelism. The work-depth [32] of the algorithm shrinks as SSR decreases, and hence more computation can be parallelized.
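
The following tiny helper evaluates (1) directly from per-step counts; the step counts in the example are invented for illustration, and the point-wise case recovers SSR = 1 as discussed in Section 3.5.

```python
# SSR per Definition 13: number of micro-clusters left for the sequential
# incremental part divided by the number of data points a point-wise
# algorithm would process one by one.
def serial_shrink_rate(micro_per_step, points_per_step):
    return sum(micro_per_step) / sum(points_per_step)

# Made-up numbers: 5 steps of 10,000 points each.
coarse = serial_shrink_rate([40] * 5, [10_000] * 5)            # 0.004
point_wise = serial_shrink_rate([10_000] * 5, [10_000] * 5)    # 1.0
print(coarse, point_wise)
```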

3. Theorems of Evolving Granularity

3.1. Further Explanation of the Motivation. GPGPU-accelerated incremental clustering algorithms face a dilemma between clustering accuracy and parallelism. We endeavor to explain the cause of this dilemma through formal proofs; the formal proofs also reveal a possible solution for seeking a balance between accuracy and parallelism. The basic idea of this solution is discussed as follows.

In the batch-mode part of the $t$th step, data points from different local benchmark clusters may be mis-affiliated to the same micro-cluster of $C_{t,new}$. Theorem 14 points out that the upper and lower bounds of the mis-affiliation probability are positively related to the margin evolving granularity and the core evolving granularity, respectively. The proof of this theorem demonstrates that larger evolving granularity results in more occurrences of different-to-same mis-affiliation.

The batch-mode part should therefore evolve at fine granularity so as to produce as many homogeneous micro-clusters as possible. Only in this context are the operations of the incremental part sensible, because the incremental part cannot eliminate different-to-same mis-affiliations induced by the batch-mode part. The incremental part absorbs the advantages of point-wise algorithms by processing micro-clusters sequentially; this part should endeavor to avoid both same-to-different and different-to-same mis-affiliations at the micro-cluster level.

Nevertheless, Theorem 15 proves that parallelism is positively related to evolving granularity. These contrary relations cause the dilemma.

However, we can adopt a GPGPU-friendly batch-mode clustering algorithm in the batch-mode part. Moreover, the total number of micro-clusters up to a certain step is much smaller than the number of data points, so the work-depth can be dramatically smaller than that of a point-wise incremental clustering algorithm. For example, if $10^6$ points have been clustered and each micro-cluster contains about 50 points on average, the incremental part processes only about $2 \times 10^4$ micro-clusters sequentially instead of $10^6$ points.

3.2. Theorem of Different-to-Same Mis-Affiliation

Theorem 14. Let $P_{t,mis}$ represent the probability of different-to-same mis-affiliations induced by the batch-mode part of the $t$th step. $Gra_{\min,T}$ and $Gra_{\max,T}$ are the core evolving granularity and the margin evolving granularity up to the $T$th step, respectively. The upper bound of $P_{t,mis}$ is positively related to $Gra_{\max,T}$, and the lower bound of $P_{t,mis}$ is positively related to $Gra_{\min,T}$.

Suppose $X_t$ contains $K'_t$ local benchmark clusters (LBCs), and $LBCSet_t$ is the set of these LBCs. Between any two adjacent LBCs there exists a boundary curve segment. The boundary curve segment between LBC $SubClu_{k_1}$ and LBC $SubClu_{k_2}$ ($k_1, k_2 = 1, 2, 3, \ldots, K'_t$; $k_1 \ne k_2$) is denoted $Cur^{(k_1,k_2)}$. Obviously, $Cur^{(k_1,k_2)}$ and $Cur^{(k_2,k_1)}$ represent the same curve segment. We define a probability function to represent $P_{t,mis}$:

[mathematical expression not reproducible], (2)

where

[mathematical expression not reproducible], (3)

and $V(SubClu_k)$ is the hypervolume enclosed by the envelope surface of $SubClu_k$.

$P_{t,mis}(r)$ reaches the upper bound if $r$ equals the radius corresponding to the margin evolving granularity (Definition 10) in step $t$; $P_{t,mis}(r)$ reaches the lower bound if $r$ equals the radius corresponding to the core evolving granularity (Definition 10) in step $t$.

Proof. In order to interpret this theorem more intuitively, we discuss the upper and lower bounds in two-dimensional space. The following proof can be generalized to higher-dimensional space.

In two-dimensional space, the envelope hypersurface (Definition 8) degenerates to an envelope curve. The core hypersphere and the margin hypersphere (Definition 9) degenerate to a core circle and a margin circle, respectively. The envelope body degenerates to the region enclosed by the envelope curve, and the envelope hypervolume degenerates to the area enclosed by the envelope curve. Let incAlgorithm represent an incremental clustering algorithm. Each step of incAlgorithm includes a batch-mode part and an incremental part.

(1) Partition Data Points Pursuant to Local Benchmark Clusters. incAlgorithm receives $X_t$ in the $t$th step. Partition $X_t$ into local benchmark clusters (Definition 6, LBC for short) $SubClu_u$ ($u = 1, 2, 3, \ldots, K'_t$), so that $\bigcup_{1 \le u \le K'_t} SubClu_u = X_t$. $EN_u$ is the envelope curve of $SubClu_u$, and $V(SubClu_u)$ represents the area enclosed by $EN_u$. Assume that $SubClu_u$ ($u = 1, 2, 3, \ldots, K'_t$) are convex sets. (We can partition $SubClu_u$ into a set of convex sets if it is not a convex set.)

(2) Partition the Boundary Curve between Two Local Benchmark Clusters into Convex Curve Segments. Let $Cur$ be the set of boundary curves between any two adjacent LBCs: $Cur = \{Cur_k \mid k = 1, 2, 3, \ldots, K\}$, where $Cur_k$ is the boundary curve segment between two adjacent LBCs. Consider an arbitrary LBC $SubClu_u$ and suppose that there are in total $U$ LBCs adjacent to $SubClu_u$. The boundary curve segments are $Cur^{(1)}_u, Cur^{(2)}_u, \ldots, Cur^{(U)}_u$. These boundary curve segments can be consecutively connected to form a closed curve such that only data points of $SubClu_u$ are within the enclosed region. Further partition $Cur_k$ into a set of curve segments $Cur_{k,g}$ ($g = 1, 2, 3, \ldots, G_k$) that are all convex curve segments. $Cur_{k,g}$ can be viewed as a set of data points.

(3) Construct Auxiliary Curves. Figure 1(a) illustrates an example of adjacent LBCs and a boundary curve. The black, gray, and white small circles represent three distinct LBCs (for clarity of the figure, we use small circles to represent data points). The black bold curve is the boundary curve between the black and white LBCs. We cut out a convex curve segment from this boundary curve, denoted $Cur_{k,g}$. Figure 1(b) magnifies $Cur_{k,g}$. Assume that the analytic formula of $Cur_{k,g}$ is $y = t(x)$ ($x \in D_{k,g} = [x_1, x_2]$).

Let $Cir_E$ be a circle centered at point $E$ with radius $threshold$ ($threshold \in \mathbb{R}^+$). Place $Cir_E$ to the right of $Cur_{k,g}$ and roll $Cir_E$ along $Cur_{k,g}$ such that $Cir_E$ is always tangent to $Cur_{k,g}$. Let $Cir_I$ be a circle centered at point $I$, also with radius $threshold$. Place $Cir_I$ to the left of $Cur_{k,g}$ and roll $Cir_I$ along $Cur_{k,g}$ such that all points of $Cir_I$ are always to the left of $Cur_{k,g}$, except the points of tangency between $Cir_I$ and $Cur_{k,g}$. The trajectories of points $E$ and $I$ form two curves, $y = t_1(x)$ and $y = t_2(x)$, respectively. Adjust the starting and ending points of $t_1(x)$ and $t_2(x)$ such that the definition domains of both curves are $D_{k,g} = [x_1, x_2]$.

(4) Characteristics of New Clusters. In step $t$, $X_t$ is partitioned into new clusters (or new micro-clusters) (Definition 3) pursuant to certain methods. Let $GR$ be a set containing the data points of an arbitrary new cluster. Without loss of generality, let the envelope curve of $GR$ be a circle centered at $center_{GR}$, denoted $EN_{GR}$, with radius $radius$. We can view $center_{GR}$ as a random variable representing the possible coordinates of $GR$'s center. Let $D(SubClu_u)$ represent the set of all vectors enclosed by the envelope curve of $SubClu_u$ in $d$-dimensional ($d = 2$) space (including vectors on the envelope curve), and let $D(AR) = \bigcup_{1 \le u \le K'_t} D(SubClu_u)$. The statistical characteristics of $X_t$ are unknown before $X_t$ is processed. Consequently, it is reasonable to assume that $center_{GR}$ obeys the uniform distribution on $D(AR)$; namely, we assume that every point within $D(AR)$ can be $GR$'s center with equal probability.

(5) Criterion of Different-to-Same Mis-Affiliation. Assume that the distance between the boundary curve and the right side of the black LBC's envelope curve in Figure 1 is negligible, and similarly that the distance between the boundary curve and the left side of the white LBC's envelope curve is negligible. The criterion of different-to-same mis-affiliation is as follows:

If $EN_{GR} \cap Cur_{k,g} \ne \emptyset$, then $GR$ contains data points of at least two distinct local benchmark clusters.

A smaller distance between $center_{GR}$ and $Cur_{k,g}$ means a higher probability of different-to-same mis-affiliation induced by the batch-mode part; likewise, the larger $radius$ is, the higher the probability is.

In Figure 1(b), the two lines ($x = x_1$, $x = x_2$) and the two curves ($y = t_1(x)$, $y = t_2(x)$) form an open domain, denoted $D_{threshold} \subset \mathbb{R}^2$. Let $Set$ represent the set of $threshold$ values that make the following causal relationship hold:

$center_{GR} \in D_{threshold}$ and $threshold \le radius$ $\Rightarrow$ $EN_{GR} \cap Cur_{k,g} \ne \emptyset$.

Part (3) of this proof explained the meaning of $threshold$ ($threshold \in \mathbb{R}^+$). $GR$'s threshold with regard to $Cur_{k,g}$ is $threshold_{\max} = \max_{threshold \in Set} threshold$. Let $D_{critical}$ denote the domain $D_{threshold}$ constructed with $threshold = threshold_{\max}$. Different-to-same mis-affiliation can still occur as long as $radius$ is sufficiently large, even if $center_{GR} \notin D_{critical}$.

(6) Three Typical Situations of Different-to-Same Mis-Affiliation Induced by the Batch-Mode Part. The value of $threshold_{\max}$ is strongly influenced by the following factors: first, the shape of the LBC's envelope curve and, second, the relative positions of $Cur_{k,g}$ and $GR$. The shape of an LBC's envelope curve is generally irregular; we simplify the proof without loss of generality. As illustrated by Figure 2, assume that three sectors are consecutively connected to form the envelope curve and that the three sectors' centers coincide at point $O$. The radii of sectors 1, 2, and 3 are $r_1$, $r_2$, and $r_3$, respectively. $r_1$ equals the radius of the margin envelope circle, $r_3$ equals the radius of the core envelope circle, and $r_2$ lies between $r_1$ and $r_3$. Generally, the distances between point $O$ and points on the envelope curve lie between $r_1$ and $r_3$; sector 2 represents these ordinary points.

Let $GR$ rotate around $O$. Figures 2(a), 2(b), and 2(c) illustrate three characteristic positions of $GR$ during the rotation. In Figure 2(a), $threshold$ is $r_1$ and $D_{critical}$ covers the largest area. In Figure 2(c), $threshold$ is $r_3$ and $D_{critical}$ covers the smallest area.

(7) Probability of Different-to-Same Mis-Affiliation Induced by the Batch-Mode Part: Lower Bound. As aforementioned, a boundary curve between two LBCs can be partitioned into curve segments. Data points on both sides of a curve segment are affiliated to the same cluster if different-to-same mis-affiliation occurs (this mis-affiliation occurs in the batch-mode part of a certain step). Let $Cur_{k,g}$ represent a curve segment from a boundary curve, and let $P_{k,g}$ be the probability that data points on both sides of $Cur_{k,g}$ are affiliated to the same cluster. Considering all boundary curves in $X_t$, $P_{t,mis}$ represents the total probability of different-to-same mis-affiliation in the batch-mode part of step $t$.

Let $r_{\min,T}$ be the radius of the hypersphere corresponding to the core evolving granularity up to step $T$. Assume that the auxiliary curves $y = t_1(x)$ and $y = t_2(x)$ are constructed under $threshold = r_{\min,T}$. On the basis of the previous parts of the proof, we can draw the following inequalities:

[mathematical expression not reproducible]. (4)

(8) Probability of Different-to-Same Mis-Affiliation Induced by the Batch-Mode Part: Upper Bound. Let $r_{\max,T}$ be the radius of the hypersphere corresponding to the margin evolving granularity. Assume that the auxiliary curves $y = t_1(x)$ and $y = t_2(x)$ are constructed under $threshold = r_{\max,T}$. On the basis of the previous parts of the proof, we can draw the following inequalities:

[mathematical expression not reproducible]. (5)

3.3. Discussion of Theorem 14. (1) Theorem 14 focuses on different-to-same mis-affiliations rather than same-to-different mis-affiliations. Hence we assume that the distance between $Cur_{k,g}$ and the envelope curve of the corresponding LBC is negligible (part (5) of the proof).

The probability of different-to-same mis-affiliation will be lower if this distance is not negligible. Consequently, inequalities (5) still hold, and in this case inequalities (4) give the lower bound under the worst situation.

(2) For simplicity, we assume the envelope curve of $GR$ is a circle (part (4) of the proof). Theorem 14 still holds even if the envelope curve has an arbitrary shape. The core evolving granularity is determined by $dist_{w,\min}$ (Definition 9); a larger $dist_{w,\min}$ always means a higher $P_{t,mis}$ regardless of the envelope curve's shape, and vice versa.

(3) Based on Theorem 14, we can state that $P_{t,mis}$ is positively related to the hypervolume of $GR$'s envelope body. In the batch-mode part of step $t$, we can therefore partition $X_t$ into more clusters (or micro-clusters) such that the data points of each cluster (or micro-cluster) scatter within a smaller range of the $d$-dimensional space.

(4) The batch-mode part of an ordinary block-wise algorithm clusters $X_t$ without a global view of $\bigcup_{t'=1}^{t} X_{t'}$. Consequently, $C_{t,new}$ faces a high probability of different-to-same mis-affiliations, which means that numerous new clusters (or micro-clusters) in $C_{t,new}$ contain heterogeneous data points. No matter what method the algorithm uses to identify and merge homogeneous clusters (or micro-clusters) between $C_{t-1}$ and $C_{t,new}$ (i.e., to incorporate $C_{t,new}$ into $C_{t-1}$ to obtain $C_t$), the clusters in $C_t$ will differ dramatically from the benchmark clusters up to step $t$.

3.4. Theorem of Parallelism. Let incAlgorithm represent an incremental clustering algorithm. Step $t$ of incAlgorithm includes two parts: a batch-mode part and an incremental part. The batch-mode part uses a GPGPU-friendly batch-mode clustering algorithm to partition $X_t$ into micro-clusters. The incremental part merges these micro-clusters into $C_{t-1}$ sequentially and obtains $C_t$ if $t \ge 2$; it clusters the micro-clusters sequentially to obtain $C_1$ if $t = 1$.

Theorem 15. Let $Gra_{ave,T}$ represent the average evolving granularity of incAlgorithm up to step $T$. The serial shrink rate of incAlgorithm is negatively related to $Gra_{ave,T}$.

Proof. Suppose incAlgorithm has clustered $N$ data points in total up to step $T$. Let $mClu_i \in \bigcup_{t=1}^{T} C_{t,new}$ ($i = 1, 2, 3, \ldots, microN_T$) be a micro-cluster, where $microN_T = \sum_{t=1}^{T} micro_t$ and $micro_t$ is the number of micro-clusters obtained by the batch-mode part of step $t$. A larger $Gra_{ave,T}$ means that each cluster in $\bigcup_{t=1}^{T} C_{t,new}$ tends to contain more data points, so $\sum_{t=1}^{T} micro_t$ decreases. Pursuant to the above analysis and Definition 13, SSR declines when $Gra_{ave,T}$ increases.

Pursuant to Theorem 15 and Definition 13, the parallelism of incAlgorithm is positively related to $Gra_{ave,T}$ under the measurement of work-depth.

3.5. Cause of the Accuracy-Parallelism Dilemma. incAlgorithm degrades to a point-wise incremental clustering algorithm if every micro-cluster produced by the batch-mode part of step $t$ ($t = 1, 2, 3, \ldots$) contains one and only one data point. In this case, $Gra_{\min,T}$, $Gra_{\max,T}$, and $Gra_{ave,T}$ all reach their lower bounds. No different-to-same mis-affiliation occurs in the batch-mode part, and the clustering accuracy tends to rise. However, $SSR = 1$ and the parallelism of incAlgorithm reaches its lower bound under the measurement of work-depth.

The parallelism of incAlgorithm rises with the growth of $Gra_{ave,T}$. Nevertheless, different-to-same mis-affiliation then inevitably occurs. Whatever method the incremental part of step $t$ ($t = 1, 2, 3, \ldots$) adopts to execute micro-cluster-level clustering, incAlgorithm cannot eliminate such mis-affiliations. Moreover, such mis-affiliations exert a negative influence on the operation of identifying homogeneous micro-clusters. Consequently, clustering accuracy drops while SSR decreases and parallelism rises.

4. Experiments

In this section, we validate Theorems 14 and 15 through a demo algorithm; details of this algorithm can be found in our previous work [6]. Let demoAlgorithm represent the demo algorithm. The batch-mode part of demoAlgorithm uses the mean-shift algorithm to generate micro-clusters. The incremental part of demoAlgorithm extends the Self-Organizing Incremental Neural Network (SOINN) algorithm [7] to execute micro-cluster-level clustering. We validate the variation trends of accuracy and parallelism with respect to evolving granularity. Issues such as improving the clustering accuracy of the demo algorithm and adaptively seeking a balance between accuracy and parallelism are left as future work.

The benchmark algorithm of demoAlgorithm is the batch-mode mean-shift algorithm. Accuracy metrics include the final cluster number, Peak Signal-to-Noise Ratio (PSNR), and Rand Index. In addition to the Serial Shrink Rate (SSR), we define the parallel-friendly rate (PFR) to measure the parallelism of demoAlgorithm.

In order to show the clustering results intuitively, images are used as input data sets; the clustering output is the segmented image. Evolving granularity is measured by the bandwidth parameters of the mean-shift algorithm. Bandwidth parameters are noted in the format (feature bandwidth, spatial bandwidth).
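
As a rough illustration of how the bandwidth steers granularity, the snippet below clusters one synthetic block with scikit-learn's MeanShift at two bandwidths. Note that scikit-learn uses a single scalar bandwidth, whereas the paper's image-domain mean-shift uses a (feature bandwidth, spatial bandwidth) pair, so this is only an analogy and not the implementation used in the experiments.

```python
# Illustration only: a larger mean-shift bandwidth yields fewer, larger
# micro-clusters, i.e., coarser evolving granularity.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
X_t = rng.normal(size=(500, 3))          # one synthetic incoming data block

for bandwidth in (0.5, 2.0):
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X_t)
    print(bandwidth, "->", len(np.unique(labels)), "micro-clusters")
```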

4.1. Additional Performance Metrics

(1) Accuracy Metrics. Up to step $T_0$, the final cluster number of demoAlgorithm is the number of clusters in $C_{T_0}$. Homogeneous data points tend to fall into different clusters in the final result if the incremental clustering algorithm produces excessively more clusters than the benchmark algorithm, and vice versa. Consequently, demoAlgorithm tends to achieve higher accuracy if its final cluster number is closer to that of the benchmark algorithm.

PSNR values of both the incremental clustering result and the benchmark result are calculated with respect to the original image.

(2) Parallelism Metrics. Assume that demoAlgorithm runs on a single-core processor. In step $t$ ($t = 1, 2, 3, \ldots$), the batch-mode part consumes time $T_{1,t}$ and generates $micro_t$ micro-clusters; the incremental part consumes time $T_{2,t}$. Suppose demoAlgorithm has clustered $N$ data points in total up to step $T_0$. Up to step $T_0$, the parallel-friendly rate (PFR) of demoAlgorithm is $PFR = \sum_{t=1}^{T_0} T_{1,t} / \sum_{t=1}^{T_0} (T_{1,t} + T_{2,t})$. A larger PFR means superior parallelism.
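
The PFR can be evaluated directly from per-step timings; the numbers in the short sketch below are made up for illustration.

```python
# PFR from per-step timings: T1[t] is the time of the batch-mode part and
# T2[t] the time of the incremental part of step t, both measured under
# single-core execution as assumed in the definition above.
def parallel_friendly_rate(T1, T2):
    return sum(T1) / (sum(T1) + sum(T2))

T1 = [1.2, 1.1, 1.3]   # seconds spent in the GPGPU-friendly batch-mode part
T2 = [0.2, 0.3, 0.2]   # seconds spent in the sequential incremental part
print(parallel_friendly_rate(T1, T2))   # ~0.84, i.e., mostly parallel-friendly
```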

4.2. Hardware and Software Environment. All experiments are executed on a platform equipped with an Intel Core 2 E7500 CPU and an NVIDIA GTX 660 GPU. The host main memory is 4 GB. Linux with the 2.6.18-194.el5 kernel is used, together with gcc 4.1.2 and CUDA 5.0.

The CUDA grid size is always set to 45, and the CUDA block size is set to 64. Double-precision floating point is used on both the CPU and the GPGPU. Batch-mode mean-shift is used as the benchmark algorithm. The uniform kernel is used with the mean-shift algorithm [33] for both the batch-mode part of demoAlgorithm and the benchmark algorithm.

4.3. Input Data Sets. We use grayscale images as input. One pixel corresponds to one input data point. The dimension of a data point (vector) is $d = 3$; the components of the 3-dimensional vector are the X-coordinate of the pixel, the Y-coordinate of the pixel, and the grayscale value of the pixel. We selected ten natural scenery images (data set 1) and twenty aerial geographical images (data set 2) from the USC SIPI data set [34]. Each image is evenly divided into data blocks of 128 x 128 pixels, and these blocks are successively input to demoAlgorithm. Every image of data set 1 is incrementally clustered in CPU-only mode and in GPGPU-powered mode, separately. Incrementally clustering an image of data set 2 in CPU-only mode is excessively time-consuming due to the large data volume; consequently, these images are clustered only in GPGPU-powered mode.
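
The mapping from a grayscale image to 3-dimensional input vectors and blocks can be sketched as follows. Blocks of 128 x 128 pixels are assumed as described above, and the image sides are assumed to be multiples of 128; NumPy slicing is merely one convenient way to realize the mapping.

```python
# Turn a grayscale image into (x, y, gray) vectors and yield it block by
# block, as described in the text. Blocks of 128 x 128 pixels are assumed.
import numpy as np

def image_blocks(gray_image, block=128):
    h, w = gray_image.shape
    for top in range(0, h, block):
        for left in range(0, w, block):
            tile = gray_image[top:top + block, left:left + block]
            ys, xs = np.mgrid[top:top + tile.shape[0],
                              left:left + tile.shape[1]]
            # Each pixel becomes a 3-d point: X-coordinate, Y-coordinate, gray value.
            yield np.stack([xs.ravel(), ys.ravel(), tile.ravel()], axis=1)

# Example with a synthetic 256 x 256 image.
img = np.random.default_rng(1).integers(0, 256, size=(256, 256))
blocks = list(image_blocks(img))
print(len(blocks), blocks[0].shape)   # 4 blocks, each of shape (16384, 3)
```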

4.4. Experiment Results and Discussion. Bandwidth parameters (8, 6) and (10, 8) are excessively large for the mean-shift algorithm (in both the batch-mode part of demoAlgorithm and the benchmark algorithm). Under these parameter settings, the accuracy of the benchmark clustering results is excessively inferior, so it is not meaningful to compare accuracy between the incremental clustering algorithm and the benchmark. Thus, we omit the experiment results under granularities (8, 6) and (10, 8) in Tables 1 and 2 and Figures 4, 5, and 6.

(1) Data Set 1: Natural Scenery Images. Table 1 shows the final cluster numbers resolved by the demo algorithm and the benchmark algorithm. Suppose we single out the cluster number values from column 2 (or 3, or 4) and compute the average of that column; let $aveN_{inc}$ represent this average. Similarly, suppose we single out the cluster number values from column 5 (or 6, or 7) and compute the average of that column; let $aveN_{bench}$ represent this average. For example, $aveN_{inc}$ is (2372 + 2348 + 1388 + 2466 + 1715 + 2208 + 2558 + 2374 + 1631 + 2607)/10 = 2166.7 for column 2, and $aveN_{bench}$ is (2517 + 2481 + 1469 + 2540 + 1838 + 2326 + 2656 + 2425 + 1700 + 2766)/10 = 2271.8 for column 5.

Under the three ascending granularities, the values of $aveN_{inc}/aveN_{bench}$ are 0.9537, 0.9578, and 0.9726, respectively. The final cluster number of the demo algorithm is in acceptable agreement with that of the benchmark algorithm.

Figure 3 shows the PFR and SSR of the first five inputs from data set 1. The figure reflects the positive relation between evolving granularity and parallelism. We omit the other images because they show analogous trends. Figure 4 illustrates the negative relation between accuracy and evolving granularity.

We select granularity (4, 2) as a balance point between accuracy and parallelism based on experience. truck has the lowest Rand Index value at this balance point.

Figure 5 shows the original image and the experiment results on truck. We can draw a conclusion from Figure 5 similar to that of Figures 3 and 4: speedup rises and accuracy degrades when evolving granularity increases. Moreover, Figure 5 shows the final cluster number values and PSNR values of both the incremental clustering result and the benchmark result, as well as the speedup. In addition to validating Theorems 14 and 15, our demo algorithm can produce acceptable-accuracy results on certain inputs such as truck.

(2) Data Set 2: Aerial Geographical Images. We continue to use the symbols explained in part one of this subsection. The values of $aveN_{inc}/aveN_{bench}$ under the three ascending granularities are 0.9954, 1.1063, and 1.1345, respectively. Overall, the final cluster number of our demo algorithm is close to that of the benchmark algorithm.

In order to avoid unnecessary details, we only list the maximum and minimum Rand Index values (and the corresponding image names) under each evolving granularity in Table 2. Similarly, we only list the maximum and minimum SSR values in Table 3. These experiment results still validate Theorems 14 and 15. We again choose granularity (4, 2) as a balance point between accuracy and parallelism based on experience. usc2.2.02 has the lowest Rand Index value under granularity (4, 2).

Figure 6 shows the final cluster number values and PSNR values of both the incremental clustering result and the benchmark result. In addition to validating Theorems 14 and 15, our demo algorithm can produce acceptable-accuracy results on certain inputs such as usc2.2.02.

5. Conclusion

In this paper we theoretically analyzed the cause of the accuracy-parallelism dilemma faced by GPGPU-powered incremental clustering algorithms. The theoretical conclusions were validated by a demo algorithm.

Our future work will focus on identifying the suitable granularity for a given incremental clustering task and on decreasing mis-affiliations through a variable data-block size.

https://doi.org/10.1155/2017/2519782

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors' sincere thanks go to Dr. Bo Hong (AT&T Corporation; School of Electrical and Computer Engineering, Georgia Institute of Technology, USA). This work is supported by the following foundations: Science and Technology Development Program of Weifang (2015GX008, 2014GX028), Doctoral Program of Weifang University (2016BS03), Fundamental Research Funds for the Central Universities (Project no. 3102016JKBJJGZ07), Colleges and Universities of Shandong Province Science and Technology Plan Projects (J13LN82), Natural Science Foundation of China (no. 61672433), and the Ph.D. Programs Foundation of Ministry of Education of China (no. 20126102110036).

References

[1] P. Wang, P. Zhang, C. Zhou, Z. Li, and H. Yang, "Hierarchical evolving Dirichlet processes for modeling nonlinear evolutionary traces in temporal data," Data Mining and Knowledge Discovery, vol. 31, no. 1, pp. 32-64, 2017.

[2] S. Ramirez-Gallego, B. Krawczyk, S. Garcia, M. Woźniak, and F. Herrera, "A survey on data preprocessing for data stream mining: current status and future directions," Neurocomputing, vol. 239, pp. 39-57, 2017.

[3] S. Garcia, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining, Springer, 2015.

[4] A. Ordonez, H. Ordonez, J. C. Corrales, C. Cobos, L. K. Wives, and L. H. Thom, "Grouping of business processes models based on an incremental clustering algorithm using fuzzy similarity and multimodal search," Expert Systems with Applications, vol. 67, pp. 163-177, 2017.

[5] C. Chen, D. Mu, H. Zhang, and B. Hong, "A GPU-accelerated approximate algorithm for incremental learning of Gaussian mixture model," in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops (IPDPSW '12), pp. 1937-1943, Shanghai, China, May 2012.

[6] C. Chen, D. Mu, H. Zhang, and W. Hu, "Towards a moderate-granularity incremental clustering algorithm for GPU," in Proceedings of the 2013 5th International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC '13), pp. 194-201, Beijing, China, October 2013.

[7] S. Furao and O. Hasegawa, "An incremental network for online unsupervised classification and topology learning," Neural Networks, vol. 19, no. 1, pp. 90-106, 2006.

[8] S. Furao, T. Ogura, and O. Hasegawa, "An enhanced self-organizing incremental neural network for online unsupervised learning," Neural Networks, vol. 20, no. 8, pp. 893-903, 2007.

[9] J. Zheng, F. Shen, H. Fan, and J. Zhao, "An online incremental learning support vector machine for large-scale data," Neural Computing and Applications, vol. 22, no. 5, pp. 1023-1035, 2013.

[10] A. Zhou, F. Cao, W. Qian, and C. Jin, "Tracking clusters in evolving data streams over sliding windows," Knowledge and Information Systems, vol. 15, no. 2, pp. 181-214, 2008.

[11] X. Zhang, C. Furtlehner, and M. Sebag, "Data streaming with affinity propagation," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5212, no. 2, pp. 628-643, 2008.

[12] X. Zhang, C. Furtlehner, J. Perez, C. Germain-Renaud, and M. Sebag, "Toward autonomic grids: analyzing the job flow with affinity streaming," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD '09), pp. 987-995, July 2009.

[13] X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag, "Data stream clustering with affinity propagation," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1644-1656, 2014.

[14] S. Luhr and M. Lazarescu, "Incremental clustering of dynamic data streams using connectivity based representative points," Data and Knowledge Engineering, vol. 68, no. 1, pp. 1-27, 2009.

[15] D. Yang, E. A. Rundensteiner, and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '09), pp. 529-540, March 2009.

[16] E. Lughofer, "A dynamic split-and-merge approach for evolving cluster models," Evolving Systems, vol. 3, no. 3, pp. 135-151, 2012.

[17] H. Yang and S. Fong, "Incrementally optimized decision tree for noisy big data," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 36-44, Beijing, China, August 2012.

[18] X.-T. Yuan, B.-G. Hu, and R. He, "Agglomerative mean-shift clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 2, pp. 209-219, 2012.

[19] L.-Y. Wei and W.-C. Peng, "An incremental algorithm for clustering spatial data streams: exploring temporal locality," Knowledge and Information Systems, vol. 37, pp. 453-483, 2013.

[20] V. Ganti, J. Gehrke, and R. Ramakrishnan, "Mining data streams under block evolution," ACM SIGKDD Explorations Newsletter, vol. 3, no. 2, pp. 1-10, 2002.

[21] M. Song and H. Wang, "Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering," in Proceedings of the Defense and Security (SPIE '05), vol. 5803, p. 174, Orlando, Florida, USA, 2005.

[22] N. S. L. P. Kumar, S. Satoor, and I. Buck, "Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA," in Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications (HPCC '09), pp. 103-109, June 2009.

[23] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proceedings of the 29th international conference on Very large data bases (VLDB '03), 2003.

[24] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams: Theory and practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, 2003.

[25] L. Tu and Y. Chen, "Stream data clustering based on grid density and attraction," ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 3, article 12, 2009.

[26] M. Ackerman and S. Dasgupta, "Incremental clustering: the case for extra clusters," Advances in Neural Information Processing Systems, vol. 1, pp. 307-315, 2014.

[27] M. Ackerman and J. Moore, "When is Clustering Perturbation Robust?" Computing Research Repository, 2016.

[28] M. Ackerman and J. Moore, "Clustering faulty data: a formal analysis of perturbation robustness," 2016, https://arxiv.org/abs/1601.05900.

[29] A. Gepperth and B. Hammer, "Incremental learning algorithms and applications," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN '16), pp. 357-368, April 2016.

[30] C. Chen, D. Mu, H. Zhang, and W. Hu, "Nonparametric incremental clustering: a moderate-grained algorithm," Journal of Computational Information Systems, vol. 10, no. 3, pp. 1183-1193, 2014.

[31] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.

[32] Y. Shiloach and U. Vishkin, "An O(n² log n) parallel max-flow algorithm," Journal of Algorithms, vol. 3, no. 2, pp. 128-146, 1982.

[33] M. Huang, L. Men, and C. Lai, "Accelerating mean shift segmentation algorithm on hybrid CPU/GPU platforms," in Proceedings of the International Workshop on Modern Accelerator Technologies for GIScience (MAT4GIScience '12), 2012.

[34] "USC-SIPI[DB/OL]," 2013, http://sipi.usc.edu/database/.

Chunlei Chen, (1) Li He, (2) Huixiang Zhang, (3) Hao Zheng, (4) and Lei Wang (1)

(1) School of Computer Engineering, Weifang University, Weifang Shandong 261061, China

(2) School of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006, China

(3) School of Automation, Northwestern Polytechnical University, Xi'an, China

(4) School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

Correspondence should be addressed to Chunlei Chen; chunlei.chen@wfu.edu.cn

Received 29 January 2017; Revised 17 July 2017; Accepted 31 July 2017; Published 11 October 2017

Academic Editor: Amparo Alonso-Betanzos

Caption: Figure 1: Relation between evolving granularity and different-to-same mis-affiliation induced by the batch-mode part.

Caption: Figure 2: Typical examples of different-to-same mis-affiliations induced by batch-mode part.

Caption: Figure 3: Data set 1: variation trends of PFR and SSR.

Caption: Figure 4: Data set 1: variation trends of Rand Index with respect to evolving granularity.

Caption: Figure 5: truck: original image and incremental clustering results under three ascending granularity values.

Caption: Figure 6: usc2.2.02: original image and incremental clustering results under three ascending granularity values.
Table 1: Comparison of final cluster number.

                Demo algorithm          Benchmark algorithm

           (2,1)    (4,2)    (6,4)    (2,1)    (4,2)    (6,4)

Boat        2372     1444     589      2517     1536     632
Cars        2348     1151     261      2481     1133     270
F16         1388     749      406      1469     827      356
Hill        2466     1552     525      2540     1623     540
Peppers     1715     845      385      1838     943      413
Sailboat    2208     1762     1051     2326     1817     1106
Stream      2558     2312     1524     2656     2437     1574
Tank        2374     1137     172      2425     1146     178
Truck       1631     986      339      1700     996      338
Trucks      2607     2256     681      2766     2360     693

Table 2: Dataset 2: max and min Rand Index under ascending
granularities.

          Granularity (measured by bandwidth)

         (2,1)         (4,2)         (6,4)

Max     0.9997        0.9983        0.9850
      (usc2.2.05)   (usc2.2.17)   (usc2.2.17)
Min     0.8369        0.5025        0.1501
      (usc2.2.02)   (usc2.2.02)   (usc2.2.07)

Table 3: Dataset 2: max and min SSR values under ascending
granularities.

              Granularity (measured by bandwidth)

            (2,1)                 (4,2)                 (6,4)                 (8,6)                 (10,8)

Min   3.7E-03 (usc2.2.02)   2.1E-03 (usc2.2.02)   6.3E-04 (usc2.2.03)   6.1E-04 (usc2.2.03)   1.2E-04 (usc2.2.03)
Max   1.0E-02 (usc2.2.17)   9.5E-03 (usc2.2.17)   6.2E-03 (usc2.2.17)   3.3E-03 (usc2.2.01)   2.1E-03 (usc2.2.08)