# Separation in data mining based on fractal nature of data.

1 INTRODUCTIONOne task is that we have a large amount of data after extensive search, but we are interested only in their small portion and we wish to separate them from the rest with a minimal error. These tasks are also rather common in particle physics where, during measurements, an extensive amount of data ("events") is registered but only their small portion contains information which has not been previously known. This separation task can be generalized in such a way that there are several classes of data in which we need to sort them.

Often, in classification problems, the only known fact is the learning set, i.e. the set of points each of known class. The problem is how to estimate the probability to which class a query point x of the data space belongs. The different approaches to the classification can be divided into parametric and nonparametric methods. Parametric methods include neural networks of different kinds [6], decision trees or forests [2], also known as the CART method, the support vector machine--SVM [1] and many more. Nonparametric methods are mostly based on the Bayesian approach [6], [10], and the k nearest neighbors (k-NN) method [5], [6], [24].

Comparing different classification methods applied to different data sets, one can find a very interesting fact that sometimes a simple approach outperforms the other methods including those which are very sophisticated. This is the case of the 1-NN method and the k-NN method, for example. These methods are thoroughly elaborated since Cover and Hart [5], and many variants have appeared up to now, see e.g. [8], [24]. Simple methods like 1-NN can be improved by some preprocessing, in fact, by some modification of the learning set. Best known are bagging and bootstrap approaches. Other approaches set some weights to learning set samples or to individual features (coordinates), eventually modify features of individual samples ("move" samples) to get better results [8], [24]. Here we describe new algorithms in a "pure" form without any preprocessing.

Classifiers directly or indirectly take the distribution of data points around a given query point into account. To express the distribution of points from the viewpoint of distances from a given point, the probability distribution mapping function was introduced [16]. The approximation of this function in the form of a suitable power of the distance is presented. How to state this power--the distribution mapping exponent--is described. This exponent is used for probability density estimation in high-dimensional spaces and for classification. It will be seen that a core notion in the transformation mentioned above is a slightly redefined singularity or scaling exponent to fit the notion of distance between points. The scaling considered here is related to distances between pairs of points in a multivariate space. Thus, it is closer to the correlation dimension by Grassberger and Procaccia [11] than to box-counting or other fractal or multifractal dimension definitions [19], [26].

2 MULTIDIMENSIONAL DATA

2.1 Data Space

Let the learning set U of total N samples be given. Each sample [x.sub.t] = {[x.sub.t1], [x.sub.t2],... [x.sub.tn]}; t = 1, 2, ... N, [x.sub.tk] [member of] R; k = 1, 2, ..., n corresponds to a point in n-dimensional metric space [M.sub.n], where n is the sample space dimension. For each [x.sub.t] [member of] U, class function T: [R.sup.n] [right arrow] {1, 2, ... C}: T([x.sub.t]) = c is introduced. With the class function the learning set U is decomposed into disjoint classes [U.sub.c] = {[x.sub.t] [member of] U | T([x.sub.t]) = c}; c [member of] {1, 2, ..., C}, [U.sup.C.sub.c=1] [U.sub.c], [U.sub.c] [intersection] [U.sub.d] = [empty set]; c, d [member of] {1, 2, ..., C}; c [not equal to] d. Let the cardinality of set [U.sub.c] be [N.sub.c] [[summation].sup.C.sub.c=1] [N.sub.c] = N.

2.2 Indexing Data

As we need to express which sample is closer or further from some given point x (the query point), we can rank points of the learning set according to distance [r.sub.i] of point [x.sub.i] from point x. Therefore, let points of U be indexed (ranked) so that for any two points [x.sub.i], [x.sub.j] [member of] U there is i < j if [r.sub.i] < [r.sub.j]; i, j = 1, 2, ... N, and class [U.sub.c] = {[x.sub.i] [memebr of] U | T([x.sub.i]) = c}. Of course, ranking depends on point x and eventually metrics of [M.sub.n]. We use Euclidean ([L.sub.2]) and absolute (Manhattan, [L.sub.1]) metrics here. In the following, indexing by i means the ranking just introduced.

3 BASIC TYPES OF CLASSIFIERS

Classifiers are based on several principles:

* A cut of distributions of individual features (variables) of a pattern; this is rather close to the Bayes classifier, used in physics.

* Bayes classifier. For each variable empirical distribution densities (histograms) are estimated for each class. For a query pattern with unknown class, the corresponding density is estimated from empirical or empiric-based histogram for each feature. Then, for each class a product of these densities is computed and the largest product corresponds to the class to which the pattern belongs.

* Distance based classifiers. These classifiers need the pattern space be a metric space. Therefore, some transformation is often used and metrics introduced. The simplest is the well-known 1-NN (nearest neighbor) classifier that associates the class of its nearest neighbor to the query pattern. Several (k) nearest neighbors (k-NN) can be used; sophisticated variants use a special, even adaptive metrics.

* Cluster-based classifiers try finding clusters according to given criteria, and in a cluster, a threshold is used for classification as in Random Forests.

* Kernel-based classifiers first identify parameters of kernels and then estimate a class according to common densities.

* GMDH classifiers and approximators that are special forms of neural networks where neurons have two inputs only and each neuron's response is a full quadratic polynomial of two input signals [22].

Our class of classifiers is:

A distance-based kind of classifiers (a metrics--distance--is introduced) that can also be considered as kernel classifiers;

Use two observations:

1. Multidimensional data are fractal, even multifractal (nonfractal data are totally random data in the data space or in its subspace).

2. Distance-based fractal measure is the correlation dimension.

4 CORRELATION DIMENSION

The correlation integral, in fact, a distribution function of distances between all pairs of points in a set of points in a space with a distance, was introduced in [11]. Correlation integral C(r) is defined by

[C.sub.1] (r) = [lim.sub.N [right arrow] [infinity]] [1/[N.sup.2]] x {number of pairs (i, j): [paralell][X.sub.i] - [X.sub.j][parallel] < r}

In a more comprehensive form one can write

[C.sub.I] (r) = Pr([parallel][X.sub.i] - [X.sub.j][parallel] < r).

Grassberger and Procaccia [11] have shown that for small r the [C.sub.I](r) grows like power [C.sub.I] (r) ~ [r.sup.v] and that "correlation exponent" v can be taken as the most useful measure of the local structure of the strange attractor. This measure allows distinguishing between deterministic chaos and random noise. The correlation integral can be rewritten in the form

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (1)

where h(.) is Heaviside step function, and considering all N(N-1)/2 pairs of points of set of N points. From it

v = [lim.sub.r [right arrow] [infinity]][1n[C.sub.I](r)/1n r] (r).

Thus, if the correlation integral as a function of r is depicted in log--log coordinates, there appears a nearly straight line with a slope equal to correlation dimension v. This was shown in [11] and the method based on this fact is called Grassberger-Procaccia's estimator of v. There are other methods for estimation of correlation dimension v. One of the most cited is Takens's (viz lit. odkaz 27) estimator [27].

5 DISTRIBUTION MAPPING FUNCTION

Here we single out three notions introduced or used in [14]-[15]. The probability distribution mapping function (DMF) is a mapping of the probability distribution of points in n-dimensional space to the distribution of points in one-dimensional space of the distances. The distribution density mapping function (DDMF) is a one-dimensional analogy to the probability density function. The power approximation of the probability distribution mapping function in the form of [(distance).sup.q] is introduced, where we call the exponent q the distribution mapping exponent (DME).

These notions are local, i.e. are related to a particular (query) point. We show that the distribution mapping exponent q is something like a local value of the correlation dimension according to Grassberger and Procaccia, [11]. It can be also viewed as the local dimension of the attractor by Froehling [9] or singularity eventually scaling exponent ("exponent") in the sense of Stanley and Melkin, [26].

To be more exact, let us introduce two definitions to study a probability distribution of points (patterns) in the neighborhood of a query point x in n-dimensional Euclidean space [E.sub.n].

Definition 1. Let the probability distribution mapping function D(x, r) of the query point x in [E.sub.n] be function [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where p(z) is the probability density of the points at z; r is the distance from the query point x and B(x, r) is the ball with center x and radius r.

Definition 2. Let the distribution density mapping function d(x, r) of the query point x in En be function d(x, r) = [[partial derivative]/[partial derivative]r] D(x, r), where D(x, r) is a probability distribution mapping function of the query point x with radius r.

Let us try to transform the true distribution of points so that the distribution density mapping function is constant, at least in the neighborhood of the query point.

Definition 3. The power approximation of the probability distribution mapping function D(x, r) is the function [r.sup.q] such that D(x,r)/[r.sup.q] [right arrow] const for r [right arrow] 0+. The exponent q is the distribution mapping exponent.

These definitions were published and used elsewhere [14]-[15].

Intuitively, Fig. 1 gives illustration of mapping functions; "Pure" means theoretical dependence, "True" means real data. From top to bottom:

In Fig. 1a the dependence of order of near point on its distance from the query point is shown. Note that the (first) nearest neighbor is rather far from the query point and that distances of farther points from the query point differ little. Positions of furthest several points are influenced by boundary effect.

In Fig. 1b the same dependence is shown in logarithmic scale; linear dependence appears here. Note the slope that is q = 2.5.

In Fig. 1c the dependence of percentage of points (instead of count) on distance to the q-th power is shown. In fact, this graph is the distribution function of variable z = [r.sup.q]. Again, linear dependence appears. It means a uniform distribution (with the exception of several farthest points). It also appears that radii [r.sub.i], i = 1, 2, ... grow proportionally to the q-th root of index i [r.sub.i] [approximately equal to] [q root of i].

In Fig. 1d the density (histogram) of variable z = [r.sup.q] confirms its uniform distribution.

Exponent q is called the distribution mapping exponent according to Definition 3 or--referring to fractal nature of data--the scaling exponent as it scales distances according to scaling law z = [r.sup.q].

We use this exponent in a classification method.

6 DECOMPOSITION OF THE CORRELATION INTEGRAL TO LOCAL FUNCTIONS

The distribution mapping exponent (DME) reminds one of the so-called correlation dimension by Grassberger and Procaccia [11], and corresponds to the generally used definitions of power scaling laws especially to singularity exponent. It can be seen that the correlation integral is a distribution function of distances between all pairs of points of the data points given. The probability distribution mapping function is a distribution function of the distances from one fixed point x. In the case of finite number of points N, there are N(N - 1)/2 distances between pairs of points and from them one can construct an empirical correlation integral. Similarly, for each point there are N - 1 distances and from these N - 1 distances one can construct an empirical probability distribution mapping function. There are exactly N such functions and the mean of these functions gives the correlation integral. This is also valid for N going to infinity. Let us return to Eq. (1). In this section we show that the correlation integral is the mean of distribution mapping functions.

Theorem. [15] Let the correlation integral, i.e. the probability distribution of distances [l.sub.ij] between all pairs of points from the learning set, be [C.sub.I](r) and let D([x.sub.i], r), where [l.sub.ik] is the distance of k-th neighbor from point [x.sub.i], be the distribution mapping function corresponding to point [x.sub.i]. Then, [C.sub.I](r) is a mean value of D([x.sub.i], r):

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (2)

Proof.

Let h(x) be Heaviside step function. Then, the correlation integral is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

and also

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Comparing the last formula with (1), we get directly (2).

Note the inner sum. It is the distribution mapping function corresponding to the ith point.

7 THE EXPONENT IN A CLASSIFICATION METHOD

Informally, consider partial influences of individual points to the probability that point x is of class c. Each point of class c in the neighborhood of point x adds a little to the probability that point x is of class c, where c = {0, 1} is the class mark. This influence is the larger the closer to point x the point considered is, and vice versa.

From another point of view, let Pr(T(x) = c |T([x.sub.t]) = c) be the probability that the given point x is of class c if neighbor point number i is of the same class as point x. Note that points of learning set U are indexed so that for any two points [x.sub.i], [x.sub.j] [member of] U there is i < j if [r.sub.i] < [r.sub.j]; i, j = 1, 2, ... N as mentioned in Chap. 2.2. In the following, Z is a constant that is used to normalize the probability that point x belongs to a class to 1:

For the first (nearest) point i = 1

Pr(T(x) = c | T([x.sub.1] = c) = K [1/[r.sup.q.sub.1]],

for the second point i = 2

Pr(T(x) = c | T([x.sub.2] = c) = K [1/[r.sup.q.sub.2]],

and so on, generally for point No. i

Pr(T(x) = c | T([x.sub.i] = c) = K [1/[r.sup.q.sub.i]].

A question arises--where do these considerations come from? The answer is simple:

Let us consider a (fixed) query point x. Construct a ball B(x, r) with center x and radius r. Let the volume of this ball be V(x, r) and it holds V(x, r) = [S.sub.n][r.sup.n]. [S.sub.n] is a constant dependent on space dimensionality n. Let there be points in B(x, [r.sub.i]). We are interested in the density of points in the neighborhood of point x.

It is p(x, [r.sub.i]) = 1/V(x,[r.sub.i]) = i/K[r.sup.n.sub.i].

Considering this density constant (in the sense that we get the same density for different values of, i.e. for different radii [r.sub.i]), radii [r.sub.i] would grow proportionally to the n-th root of i, [r.sub.i] [approximately equal to] [nth root of (i)]. But--as shown above--the space can have (locally or globally) effective dimensionality q lesser than n. Thus, we should compute density of points using the volume of q-dimensional ball using formula p (x, r) = [i/[V.sub.q](x,[r.sub.i]) = i/[S.sub.q][r.sup.q.sub.i], where [S.sub.q] is a constant. Taking one isolated point only in distance [r.sub.i], i = 1, 2, ..., we get formulae for the first, second, ... point as above.

Individual points are independent; and then we can sum up probabilities above. Thus, we add the partial influences of k individual points together by summing up

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (3)

Note that the sum goes over indices i for which the corresponding samples of the learning set are of class c, c = 1, 2, ... C, where C is the number of classes. It can be seen that any change of distance [r.sub.i] of any point i from point x will influence the probability that point x is of class c. It can also be considered as a kernel method with kernels K (xt, [r.sub.i], c) = 1/[r.sup.q.sub.i] located at all points [x.sub.i], i = 1, 2,.. in distances [r.sub.i].

We can rewrite (1) in a more suitable form for practical computation.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (4)

The upper sum goes over indices for which the corresponding samples of the learning set are of class c. At the same time all N points of the learning set are used instead of number k. Moreover, we often do not use the nearest point (i = 1). It can be found that its influence is more negative than positive on the probability estimate.

8 A THEOREM

Theorem 1

Let the task of classification into two classes be a given. Let the size of the learning set be N and let both classes have the same number of samples. Let v, 1 < v < n be the correlation dimension, let i be the index of the i-th nearest neighbor of point x (without respect to class), and [r.sub.i] > 0 its distance from point x. Then,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (4)

is probability that point x belongs to class c.

Proof [18]:

For each query point x one can state the probability distribution mapping function D(x, [r.sub.i], c). We set this function so that it holds (C is a constant)

D(x, [r.sup.q.sub.i], c) = C[r.sup.q.sub.i]

in the neighborhood of point x. Using derivation, according to variable z = [r.sup.q.sub.i], we get d ([r.sup.q.sub.i], c) = C . By the use of z = [r.sup.q.sub.i], the space is mapped ("distorted") so that the distribution density mapping function is constant in the neighborhood of point x for any particular distribution. The particular distribution is characterized by the particular value of the distribution mapping exponent q at point x. In this mapping, the distribution of the points of class c is uniform.

Let us consider sum [E.summation over (i=1)] d(x, [r.sup.q.sub.i], c)/[r.sup.q.sub.i]. For this sum we have

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (6)

because d(x, [r.sup.q.sub.i], c) = d(x, z, c) = p(c|x) for all i (uniform distribution has a constant density).

Given the learning set, we have the space around point x "sampled" by the individual points of the learning set. Let pc(ri) be an a-posteriori probability that point i at distance ri from the query point x is of the class c. Then, pc(ri) is equal to 1 if point xi is of class c and pc(ri) is equal to zero, if not, i.e. if the point is of the other class. Then, the particular realization of p(c | x)[N.summation over (i=1)]1/[r.sup.q.sub.i] is sum [N.summation over ([x.sub.i] [member of] [U.sub.c])] 1/[r.sup.q.sub.i]. Using this sum, we can rewrite (6) into the form [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Dividing this equation by the limit of the sum on the left hand side, we get

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

and due to the same limit transition in the numerator and in the denominator we can rewrite it in form (5). []

Note that the convergence of [S.sub.c] = [N.sumamtion over([x.sub.i][member of][U.sub.c])] /1[r.sup.q.sub.i] is faster the larger DME q is. Usually, for multivariate real-life data the DME is also large (and the correlation dimension as well).

9 THREE METHODS

Modifying considerations above, one can find that there are three methods according to what kind of data we consider and the way we deal with ratio [r.sup.q].

1. If data are multifractal, then variable q differs for different places of data space: It is necessary to compute q for each query point extra. The apparent advantage is a more exact estimate than in the "global" approach below that uses, in fact, the mean of all qs over the whole learning set.

2. For monofractal data, there is q = v; q has a constant value in the data space and equals to the correlation dimension. The advantage is that we need to compute correlation dimension v only once. There are many methods leading to a rather exact estimate of the correlation dimension.

3. IINC (Inverted Indices of Neighbors Classifier) is based on the observation explained below. The method needs no q or v estimation, on the other hand, it uses sorting.

9.1 Multifractal Data Classifier

This classifier was described in Chap. 7. Its advantage that more exact estimate of the distribution mapping exponent q may appear as an disadvantage not only for repeated computation of q for each query point separately, but also for large error in its estimate. It is given by a relatively small amount of data especially in cases of small learning data sets like in e.g. Iris data [21] where there is the learning set of total 150 samples. Fortunately, it can be shown that the method is not too sensitive to exactness of stating the q. In any case one must check the estimated value of q for cases where q is too close to data space dimensionality n or even larger than n, or too close to zero. In these cases it is advisable to use q estimated for another point already computed and near to the query point or to compute q for several near points, exclude cases now discussed and take a mean. Many similar approaches can be designed including estimate of correlation dimension for small subset of the learning set consisting of points near to the query point.

9.2 Monofractal Data Classifier

Beside the advantage that we need to compute correlation dimension v only once, there are two problems related to it. First, the correlation dimension estimation can be time-consuming for large learning data sets because one must compute all N(N-1)/2 distances between points of all pairs of points of the learning set. Second, do we know if our data are monofractal? To answer these questions we show here the sensitivity of classification error to error in correlation dimension estimation, and the fact that the spread of the distribution mapping exponent is relatively small.

9.2.1 Sensitivity of classification error to error in CD estimation

For error sensitivity to the value of correlation dimension no particular threshold [theta] was used. Instead, we use a more general classification quality measure here, the size of area under the ROC (Receiver operating characteristics [7]) curve (the AUC) of dependence of "sensitivity", i.e. the acceptance of class 1 samples (often called signal) on "specificity", i.e. on the suppression of class 0 samples (often called background, i.e. background error). It holds that the larger the area under the curve (AUC, classification efficiency) the better classification in a general sense. The ideal case is unit area, i.e. ROC curve going through point (0, 1), which means 100 % sensitivity (signal efficiency) and 100 % specificity, i.e. zero background error.

In Fig. 2 the classification efficiency as the function of the value q of estimated correlation dimension v for data "Higgs" is shown. These data have estimated correlation dimension v = 6.5. For this value of q, the minimal error is 0.472, the value of error is 0.481 for q = n = 23. This is rather a small difference, only 1.93 %, showing that error in correlation dimension estimate need not be critical.

9.2.2 The correlation dimension and spread of the DME

In Table 1 and Fig. 4 features of six different data sets and corresponding distribution-mapping exponent are summarized. Data sets originate from UCI Machine Learning Repository [21]. Note that mean distribution-mapping exponent is, in fact, the correlation dimension. It can be seen that

--Mean DME (in fact, an estimate of correlation dimension) is much smaller than dimension for all data varying from a little more than 6.2 % (data Ionosphere) to nearly 49.5 % (data RKB).

--DME of a data set lies in a rather narrow band; normalized mean squared variation, sigma/mean, [sigma]/[micro] varies from 7.357 % to less than 19 %.

--Note that lines for Heart, German, and Higgs data look suspiciously similar but these data come from very different independent sources.

Here we can conclude that data are generally multifractals as the scaling exponent, DME varies from point to point of the set. On the other hand, these variations usually lie in a rather narrow band and thus mean DME, i.e. the correlation dimension, may suffice for characterization of the fractal nature of data.

9.3 IINC--Inverted Indices of Neighbors Classifier

This is a modification of the first (multifractal) method above; based on observation that the distribution mapping function grows linearly in log(count)log(distance) coordinates with slope q and thus linearly in coordinates count vs. [(distance).sup.q], as shown in Fig. 1c. So, there the index of the neighbor i is proportional to [r.sup.q.sub.i], i ~ [r.sup.q.sub.i]. Then, 1/[r.sup.q.sub.i] ~ 1/i From this follows that we can use the neighbor's index i instead of its distance to the q-th (or v-th) power, [r.sup.q.sub.i]. Rewriting (3), we get

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This approach is very simple and very effective; no estimate of the singularity (fractal) exponent is needed.

9.3.1 Kernel method

All three approaches, and especially this last one, are suggestive of kernel methods but with a rather strange kernel. A kernel [6], [13] is a non-negative real-valued integrable function K satisfying

[[integral].sup.[infinity].sub.-[infinity]] K(u)du = 1 and K(-u) = K(u) for all u. (7)

If [x.sub.1], [x.sub.2], ... [x.sub.N] are independent and identically distributed samples of a random variable with probability density function f, then kernel density approximation of f is

[[??].sub.h](x) = [1/Nh]]N.summation over (i=1)]K[x-[x.sub.i]/h],

where K is some kernel function (often standard Gaussian) and h is the bandwidth or smoothing parameter. A kernel function is also the inner product k(x, x')=([PHI](x) x [PHI](x')), where data x are implicitly mapped into the feature space F: x [right arrow] [PHI](x) [13].

The use of kernels for classification and regression is beyond any discussion; support vector machines are subject of many papers and books. Here we point out several basic facts. First, there is a kernel trick that allows computation of (relatively simple) kernel function instead of search for representing features, computation a nonlinear transform [PHI] from samples (pattern) space into feature (kernel) space, and then a scalar product evaluation. As a basis of the support vector machine, the learning error is minimized via quadratic programming problem. To get a unique solution, there is a condition of positively definite Gram matrix, and then kernel should be positively definite. This is a nearly standard demand. Generally, the kernel need not be necessary positively definite. Second, the kernel need not be finite, as according to (7). In [25] it is shown nicely that perceptron is also a kernel machine and sigmoid kernel is nowhere positively definite. Moreover, integral (7) is infinite in this case. In other words, we want to say that in the context of the method we present here these usually very basic and prohibitive conditions need not be studied.

9.3.2 Zipfian distribution (Zipfs law)

The Zipfian distribution (Zipfs law) [20], [28] predicts that out of a population of N elements, the frequency of elements of rank i, f(i;s,N), is given by probability mass function

f(i;s,N) = [[1/[i.sup.s]]/[N.summation over (t=1)]1/[t.sup.s]], (8)

where N is the number of elements, i is their rank, s is the value of the exponent characterizing the distribution. The law may also be written as:

f(i;s,N) = [1/[i.sup.s][H.sub.N,s]],

where [H.sub.N,s] is the Nth generalized harmonic number.

The simplest case of Zipfs law is a "1/f function" arising when s = 1. Given a set of Zipfian distributed frequencies of occurrence of some objects, sorted from the most common to the least common, the second most common frequency will occur 1/2 as often as the first. The third most common frequency will occur 1/3 as often as the first, and so on. Over fairly wide ranges, and to a fairly good approximation, many natural phenomena obey Zipfs law. Note that in the case of a "1/f function", i.e. s = 1, N must be finite and its denominator is equal to [H.sub.N], the socalled harmonic number, i.e. the sum of truncated harmonic series; otherwise the denominator is a sum of harmonic series, which is divergent. This is not true if exponent s exceeds 1, s > 1, then the series is convergent,

[zeta](s) = [[infinity].summation over (t=1)][1/[t.sub.s]] < [infinity],

where Z is Riemann's zeta function.

10 GENERALIZATION

Formulas (4), and (8) need some generalization since there are several classes C and a different number of samples [N.sub.c], c = 1, 2, C of each class in the learning set. General formulas are as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(Eventually q = v),

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

We can see the introduction of relative representation of different numbers of samples [N.sub.c] of each class, i.e. introducing a priori probabilities.

11 DISTRIBUTION MAPPING EXPONENT ESTIMATION

In this section, we suggest a procedure how to determine the distribution mapping exponent for a classifier, which classifies into two classes. The extension for many classes will be straightforward. As before (Chap. 2), let the learning set U of total N samples be given. Let U be composed of points (patterns, samples) [x.sub.cs] where c = {0, 1} is the class mark and s = 1, 2, ..., [N.sub.c] is the index of the point within class c; [N.sub.c] is the number of points in class c and let N = [N.sub.0] + [N.sub.1] be the learning set size. Points [x.sub.cs] of one class are ordered so that index i = 1 corresponds to the nearest neighbor, index i = 2 to the second nearest neighbor, etc. In Euclidean metrics, [r.sub.i] = [parallel]x - x[c.sub.i][parallel] is the distance of the i-th nearest neighbor of class c from point x. [x.sub.i] is the i-th nearest neighbor of point x.

To estimate the distribution mapping exponent q we use a similar approach to the approach of Grassberger and Procaccia [11], for the correlation dimension estimation.

We look for exponent q so that [r.sup.q.sub.i] is proportional to index i, i.e.

[r.sup.q.sub.i] = ki, i = 1, 2, ... [N.sub.c], (9) c = 0 or 1,

where k is a proportionality constant, which will be eliminated later, so we need not bother with it. Using a logarithm, we get

q ln([r.sub.i]) = ln(k) + ln(i), i = 1, 2, ..., [N.sub.c] (10)

This is a task of estimating the slope of a straight line linearly approximating the graph of the dependence of the neighbor's index s as a function of distance in loglog scale. This is the same problem as in the correlation dimension estimation where equations of the same form as (9) and (10) arise. Grassberger and Procaccia, proposed a solution by linear regression. Other authors proposed different modifications and heuristics later. Many of these approaches can be used for the distribution mapping exponent estimation, e.g. the use of [N.sub.v] < [N.sub.c] nearest neighbors instead of [N.sub.c] eliminates the influence of a limited number of the points of the learning set. [N.sub.v] may be equal e.g. to one half or the square root of [N.sub.c]. The accuracy of the distribution mapping exponent estimation is the same problem as the accuracy of the correlation dimension estimation. On the other hand, one can find that a small change of q does not essentially influence the classification results, as discussed already (Chap. 9.2.1).

We solve the system of Nv equations (10) with respect to an unknown q by the use of standard linear regression for both classes. Thus, for two classes we get two values of q, [q.sub.0] and [q.sub.1]. To get a single value of q we use the arithmetic mean, q = ([q.sub.0] + [q.sub.1])/2. For more classes, the arithmetic mean of the qs for the individual classes is used.

12 EXPERIMENTS AND RESULTS

12.1 Synthetic Data

Synthetic data [24] are two-dimensional and consist of three two-dimensional normal distributions with identical a priori probabilities. If [mu] denotes the vector of the means and [C.sub.m] is the covariance matrix, there is

Class A: [mu] = [(2, 0.5).sup.t], [C.sub.m] = (1, 0; 0, 1) (identity matrix)

Class B: [mu] = [(0, 2).sup.t], ([C.sub.m] = (1, 0.5; 0.5, 1)

Class C: [mu] = [(0, -1).sup.t], [C.sub.m] = (1, -0.5; -0.5, 1).

In this experiment, we employed a simple strategy of using a half of the total number of samples in the learning set, which were nearest to the query point. Fig. 4 shows the results obtained by different methods for different learning sets sizes from 8 to 256 samples and a testing set of 5000 samples, all from the same distributions and mutually independent. Each point was obtained by averaging over 100 different runs.

In our method "QCregre", we employed a simple strategy of using a half of the total number of samples in the learning set, which were nearest to the query point in this experiment. For other methods, i.e. 1-NN method with [L.sub.2] metrics and variants of the LWM method [24], the values were estimated from the literature cited.

In Fig. 4, it is seen that the use of class probability estimation with the method presented in this synthetic experiment outperforms all other methods shown in Fig. 4; and for a large number of samples, it quickly approaches the Bayes limit.

12.2 Data from the Machine Learning Repository

Data sets intended just for running with a classifier were prepared by Paredes and Vidal and are available on the Internet. We used all data sets of this corpus. Each task consists of 50 pairs of training and testing sets corresponding to 50fold cross validation. For DNA data, Letter data (Letter recognition), and Satimage (Statlog Landsat Satellite) [21], single partitioning into a training and testing set according to the specification in [21] was used. We also added the popular Iris data set with tenfold cross validation. The results obtained by the QCregre approach, in comparison with data published in [24], are summarized in Table 2. Each row of the table corresponds to one task from [21]. For tasks where the data are not available, only the results for 1-NN method with [L.sub.2] metrics were amended. In the QCregre method, we used a rather complex strategy of robust modification of linear regression, as described above. The interesting point is the experiment with the simplest strategy of using a half of the samples nearest to the query point. For some tasks we obtained very good results.

In Table 3, the results for tasks "Heart", "Ionosphere", and "Iris" are shown together with the results for other methods published in [24]. We can briefly characterize these data sets as follows: The task "Heart" indicates the absence or presence of heart disease for a patient. For the task "Ionosphere", the targets were free electrons in the ionosphere. Good radar returns are those showing evidence of some type of structure in the ionosphere. Bad returns are those that do not return anything; their signals pass through the ionosphere.

The task "Iris" is to determine whether an iris flower is of class Versicolor or Virginica. The third class, Setoza is deleted, as it is linearly separable from the other two. 100 samples, four parameters and tenfold cross validation were used. We do not describe these tasks in detail here as all of them can be found in the descriptions of the individual tasks of the Repository; and also the same approach to testing and evaluation was used. Especially, we used splitting the data set into two disjoint subsets, the learning set and the testing set, and we used cross validation. We also checked some standard

methods for comparison:

* 1-NN--standard nearest neighbor method

* Sqrt-NN--the k-NN method with k equal to the square root of the number of samples of the learning set

* Bayl--the naive Bayes method using ten bins histograms

* LWM1--the learning weighted metrics by Paredes and Vidal [24].

For k-NN, Bayes, LWM and our method, the discriminant thresholds [theta] were tuned accordingly. All procedures are deterministic (even Bayes algorithm) and thus no repeated runs were needed.

13 CONCLUSION

Three methods were presented here:

* The "global" method that supposes mono-fractal data; it needs one estimate of correlation dimension v for all queries.

* The "local" method that supposes multifractal data; it needs an estimate of scaling exponent q for each query a new. Exactness depends on the quality of estimate of q (Grassberger-Procaccia's algorithm for a neighborhood, Takens's method, estimating the slope of DMF or other methods of correlation dimension estimation can be adopted [3], [4], [12], [27].)

* The "IINC" method, in spite of its simplicity, is based on the assumption of multifractal data, as shown above; the estimate of q is, in fact, inherent.

The first two methods need no sorting (perhaps in v or q estimation only), the IINC needs sorting according to the distance of points from the query point. In any case, complexity is proportional to the data dimension, the size of the learning set; in the "IINC" method times log(size of the learning set) counts more. There is no exponential growth of complexity with problem dimension. (We project a multidimensional space into one-dimensional space of distances.) Software for the methods described here is available on the net, see [14].

14 ACKNOWLEDGEMENT

The work was supported by the Technology Agency CR under project of series ALFA No. TA01010490. The project was also supported by the Faculty of Biomedical Engineering of the Czech Technical University in Prague, RVO: 68407700, Czech Republic.

15 REFERENCES

[1.] Abe, S., Support Vector Machines for Pattern Classification (Advances in Pattern Recognition). 2005: Springer-Verlag New York, Inc.

[2.] Breiman, L.: Random Forests, Machine Learning Vol. 45, No. 1, pp. 5-32. (2001)

[3.] Camastra, F.: Data dimensionality estimation methods: a survey. Pattern Recognition 6, 2945-2954 (2003)

[4.] Camastra, P., Vinciarelli, A.: Intrinsic Dimension Estimation of Data: An Approach based on Grassberger-Procaccia's Algorithm. Neural Processing Letters 14(1), 27 34 (2001)

[5.] Cover, T.M., Hart, P.E.: Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory IT-13(1), 21-27 (1967)

[6.] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification, 2nd edn. John Wiley and Sons, Inc., New York (2000)

[7.] Fawcett, T.: ROC Graphs: Notes and Practical Considerations for Researchers. Technical report, Palo Alto, USA: HP Laboratories, 38pp. (2004).

[8.] Friedmann, J.H.: Flexible Metric Nearest Neighbor Classification. Technical Report, Dept. of Statistics, Stanford University (1994)

[9.] Froehling, H. et al.: On determining the dimension of chaotic flow. Physica 3 D (1981), pp. 605-617.

[10.] Gama, J.: Iterative Bayes. Theoretical Computer Science 292, 417-430 (2003)

[11.] Grassberger, P., Procaccia, I.: Measuring the strangeness of strange attractors. Physica 9D, 189-208 (1983)

[12.] Guerrero, A., Smith, L.A.: Towards coherent estimation of correlation dimension. Physics letters A 318, 373-379 (2003)

[13.] Herbrich, R.: Learning Kernel Classifiers. Theory and Algorithms. The MIT Press, Cambridge, Mass., London, England, 2002.

[14.] Jifina, M. and Jifina, Jr., M.: IINC Classifier Based on Inverted Indexes of Neighbors--Software. On-line http://www.marceljirina .cz/files/IINC.zip (2013)

[15.] Jifina, M. and Jifina, Jr., M.: Classification by the Use of Decomposition of Correlation Integral. /Foundations of Computational Intelligence/. Berlin : Springer, 2009 (Abraham, A.; Hassanien, A.; SnaSel, V.) S. 39-55. ISBN 978-3-642-01535-9.--(Studies in Computational Intelligence. 205)

[16.] Jifina, M. and Jifina, Jr., M.: Classifiers Based on Inverted Distances. In: K. Funatsu, K. Hasegawa: New Fundametal Technologies in Data Mining. (Chap. 19) INTECHWEB.org., 2011. Also available at: . http://www.intechopen.com/books/show/title/ new-fundamental-technologies-in-datamining (2011)

[17.] Jifina, M. and Jifina, Jr., M.: Using Singularity Exponent in Distance Based Classifier. Proceedings of the 10th International Conference on Intelligent Systems Design and Applications (ISDA2010) November 29-December 1, 2010 Cairo Egypt, paper No. 1011, pp. 220-224, ISBN: 978-1-4244-8135-4.

[18.] Jifina, M. and Jifina, Jr., M.: Utilization of Singularity Exponent in Nearest Neighbor Based Classifier. Journal of Classification (Springer) 2012, in print

[19.] Mandelbrot, B.B.: The Fractal Theory of Nature. W. H. Freeman and Co., New York, 1982.

[20.] Maslov, V.P.: On a General Theorem of Set Theory Leading to the Gibbs, Bose-Einstein, and Pareto Distributions as well as to the Zipf-Mandelbrot Law for the Stock Market. Mathematical Notes, vol. 78, no. 6, 2005, pp. 807-813.

[21.] Merz, C.J., Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, Univ. of California, Irvine (1997), http://www.ics.uci.edu/~mlearn/MLSummary. html

[22.] Oh, S.K., Pedrycz, W.: The Design of Self organizing Polynomial Neural Networks. Information Sciences (Elsevier), Vol. 141, Apr. 2002, No. 3-4, pp 237-258.

[23.] Osborne, A.R., Provenzale, A.: Finite correlation dimension for stochastic systems with power-law spectra. Physica D 35, 357 381 (1989)

[24.] Paredes, R., Vidal, E.: Learning Weighted Metrics to Minimize Nearest Neighbor Classification Error. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(7), 1100-1110 (2006)

[25.] Scholkopf, B., Smola, A.J.: Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, Mass., London, England, 2002.

[26.] Stanley, H.E., Melkin, P.: Multifractal phenomena in physics and chemistry. (Review) Nature Vol. 335, 29 Sept. 1988, pp. 405-409.

[27.] Takens, F.: On the Numerical Determination of the Dimension of the Attractor. In: Dynamical Systems and Bifurcations. Lecture Notes in Mathematics, vol. 1125, pp. 99 106. Springer, Berlin (1985)

[28.] Zipf, G.K.: The Psycho-Biology of Language. An Introduction to Dynamic Philology. The MIT Press, 1968. (Eventually: http://en.wikipedia.org/wiki/Zipfs_law)

Marcel Jirina (1) and Marcel Jirina, Jr. (2)

(1) Institute of Computer Science AS CR Pod Vodarenskou vezf 2, Prague 8, 18207 Czech Republic marcel@cs.cas.cz

(2) Faculty of Biomedical Engineering, Czech Technical University in Prague Nam. Sftna 3105, 272 01, Kladno, Czech Republic jirina@fbmi.cvut.cz

Table 1. Parameters of DME distribution for different data sets and color notation for Fig. 3. Data sets are from UCI MLR. mean Dimen mean sigma/ DME/ Data Color sion DME Mean dim. Higgs Red 23 1.75 0.098 0.076 German Aquam. 20 2.71 0.079 0.136 Heart Violet 13 2.42 0.074 0.186 Adult Green 14 5.28 0.158 0.377 RKB Blue 10 4.94 0.185 0.495 Ionosphere Coral 33 2.06 0.139 0.062 Table 2. Classification error rates for different data sets and different NN-based approaches by [18] and LWM1. Empty cells denote data are not available. Dataset L2 CDM CW PW CW QcregreL2 Australian 34.37 18.19 17.37 16.95 16.83 22.72 balance 25.26 35.15 17.98 13.44 17.6 41.32 cancer 4.75 8.76 3.69 3.32 3.53 4.08 diabetes 32.25 32.47 30.23 27.39 27.33 33.54 DNA 23.4 15 4.72 6.49 4.21 46.63 German 33.85 32.15 27.99 28.32 27.29 42.49 glass 27.23 32.9 28.52 26.28 27.48 49.46 heart 42.18 22.55 22.34 18.94 19.82 22.67 ionosphere 19.03 12.87 iris 6.91 5.00 ledl7 20.5 21.84 letter 4.35 6.3 3.15 4.6 4.2 44.20 liver 37.7 39.32 40.22 36.22 36.95 43.33 monkey 1 2.01 10.91 phoneme 18.01 23.03 Satimage 10.6 14.7 11.7 8.8 9.05 28.95 segmen 11.81 15.87 sonar 31.4 40.01 vehicle 35.52 32.11 29.38 29.31 28.09 45.96 vote 8.79 6.97 6.61 5.51 5.26 8.79 vowel 1.52 1.67 1.36 1.68 1.24 16.81 waveform21 24.1 0 0 0 0 52.97 waveform 40 31.66 0 0 0 0 58.12 wine 24.14 2.6 1.44 1.35 1.24 8.68 Table 3. Classification errors for three different tasks shown for the different methods presented in the Machine Learning Repository. The note [fri] means the results according to the report by Friedman [8]. The results computed by authors are shown in bold. Heart Ionosphere Iris Algorithm Test Algorithm Error Algorithm Test QCregrel 0.178 QCregrel 0.02013 scythe [fri] 0.03 LWM1 0.189 Bayl 0.02013 QCregrel 0.04878 Bayes 0.374 LWM1 0.0265 sqrt-NN 0.04879 Discrim 0.393 IB3 0.033 mach:ln [fri] 0.05 (Aha & Kibler, IJCAI- 1989) LogDisc 0.396 backprop 0.04 mach-bth [fri] 0.05 an aver- age of over Alloc80 0.407 sqrt-NN 0.0537 CART 0.06 QuaDisc 0.422 Ross 0.06 mach [fri] 0.06 Quinlan's C4 algorithm Castle 0.441 nearest 0.079 mach:ds [fri] 0.06 neighbor Cal5 0.444 "non- 0.08 1-NN 0.0609 linear" per- ceptron Cart 0.452 "linear" 0.093 LWM1 0.0686 perceptron Cascade 0.467 Bayl 0.0854 KNN 0.478 CART 0.11 Smart 0.478 k-NN 0.8

Printer friendly Cite/link Email Feedback | |

Author: | Jirina, Marcel; Jirina, Marcel, Jr. |
---|---|

Publication: | International Journal of Digital Information and Wireless Communications |

Article Type: | Report |

Date: | Jan 1, 2013 |

Words: | 7927 |

Previous Article: | Adaptivity of a coloring algorithm to unreliable communications for data gathering in wireless sensor networks. |

Next Article: | Collaborative problem solving using public social network media: analyzing student interaction and its impact to learning process. |

Topics: |