# A Cost Model for Query Processing in High Dimensional Data Spaces

1. INTRODUCTION

1.1 Motivation

Indexing high-dimensional data spaces is an emerging research domain, gaining increasing importance due to the need to support modern applications with powerful search tools. In so-called nonstandard applications of database systems such as multimedia [Faloutsos et al. 1994a; Shawney and Hafner 1994; Seidl and Kriegel 1997], CAD [Berchtold 1997; Berchtold and Kriegel 1997; Berchtold et al. 1997c; Jagadish 1991; Gary and Mehrotra 1993; Mehrotra and Gary 1993; Mehrotra and Gary 1995], molecular biology [Altschul et al. 1990; Kastenmuller et al. 1998; Kriegel et al. 1997; Kriegel and Seidl 1998; Seidl 1997], medical imaging [Keim 1997; Korn et al. 1996], time series analysis [Agrawal et al. 1995; 1993; Faloutsos et al. 1994b], and many others, similarity search in large data sets is required as a basic functionality.

Feature transformation is a widely applied technique for similarity search: important properties of the objects in the database are mapped into points of a multidimensional vector space, the so-called feature vectors. Similarity queries are thus naturally translated into neighborhood queries in the feature space.

In order to achieve high query processing performance, multidimensional index structures [Gaede and Gunther 1998] are applied to manage feature vectors. Unfortunately, the performance of multidimensional index structures deteriorates as the dimension of the data space increases, since they are primarily designed for the low dimensional data spaces (2D and 3D) prevalent in spatial database systems, whereas feature vectors are usually high dimensional. Hence, a number of specialized index structures for high dimensional data spaces have been proposed. Most high dimensional index structures are variants of the R-tree family [Guttman 1984; Beckmann et al. 1990; Sellis et al. 1987], using either rectangular or spherical page regions. Minimum bounding rectangles are used by the X-tree [Berchtold et al. 1996]. The SS-tree [White and Jain 1996] and the TV-tree [Lin et al. 1995] use spheres to describe the regions of the space covered by data pages and directory pages. The SR-tree [Katayama and Satoh 1997] uses a combination of a sphere and an MBR, the intersection solid, as a page region. Other high-dimensional index structures that are not directly related to R-trees are the [LSD.sup.h]-tree [Henrich 1994] and the Pyramid technique [Berchtold 1998]. The Hybrid tree [Chakrabarti and Mehrotra 1999] combines the properties of data partitioning (e.g., R-tree) and space partitioning (e.g., k-d-B-tree) index structures. An index structure applicable on a parallel computer is the parallel X-tree [Berchtold et al. 1997a].

In spite of these efforts, there are still high dimensional indexing problems under which even specialized index structures deteriorate in performance. Therefore, various optimization techniques for index structures have been proposed. The most common optimization parameter in index structures is the logical page size, i.e., the basic unit of transfer between main memory and secondary storage. Bohm and Kriegel [2000] propose a dynamic and independent page size optimization. It turns out that large page sizes are optimal in high dimensional data spaces, while in medium dimensional cases smaller page sizes are optimal. The optimum depends not only on the dimension but also on the number of objects currently stored in the database and on the underlying data distribution. Weber et al. [1998] give evidence that a dimension high enough to make any thinkable index structure inefficient always exists, and propose the VA-file, a method based on the sequential scan of the data set (combined with a data compression method). Bohm [1998] considers the sequential scan as a special case of an optimized index that has an infinite page size.

Data space quantization is a second optimization technique; it is essentially a data compression technique. It is applied in the VA-file [Weber et al. 1998] and in the IQ-tree [Berchtold et al. 2000a]. The basic idea is to reduce I/O time by representing each attribute with only 4 to 8 bits rather than with its full 32-bit value. Since this representation of the vectors is inexact, a refinement step is required, which causes additional I/O cost. A further price is greater CPU overhead for decompression. If query processing is I/O-bound, the cost balance of data space quantization can be positive.
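The basic idea can be sketched as follows (a minimal illustration of scalar quantization with a refinement filter, not the actual VA-file layout; all names are hypothetical):

```python
def quantize(vector, bits=8):
    """Map each attribute in [0, 1) to a small integer cell index."""
    cells = 1 << bits                        # e.g., 256 cells for 8 bits
    return bytes(min(int(x * cells), cells - 1) for x in vector)

def cell_bounds(code, bits=8):
    """Lower and upper bounds of each cell, used for conservative filtering."""
    cells = 1 << bits
    return [(c / cells, (c + 1) / cells) for c in code]
```

A query first scans only the compact codes, prunes every vector whose cell bounds cannot satisfy the query predicate, and reads the full 32-bit vectors only for the remaining candidates; this refinement step causes the additional I/O cost mentioned above.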

Dimension reduction is a third optimization technique for indexes. The TV-tree [Lin et al. 1995] uses dimension reduction in the directory. Tree striping [Berchtold et al. 2000b] is based on the idea of decomposing the feature vectors and storing the subvectors in separate indexes with moderate dimensionality. The results of query processing in the separate indexes must be joined. Seidl [1997] and Faloutsos et al. [1994a] apply dimension reduction in the filter step of a multistep query processing architecture.

For the choice of a suitable index structure and query processing technique, as well as for determining the optimal parameters in optimization techniques, it is crucial to provide accurate estimates of the cost of query processing. In this paper we present a methodology for modeling the costs of high dimensional index structures.

1.2 Factors Influencing Query Processing

There are various factors that influence the performance of index-based query processing. First of all, there is the data set itself. The efficiency of index-based query processing depends on the dimension of the data space, the number of points in the database, and the data distribution from which the points are taken. The correlation of dimensions is especially important for efficiency. Correlation means that the attributes of some dimensions are not statistically independent of each other: the value of one attribute is more or less determined by the values of one or more other attributes. In most cases, this dependency is not strict, but is observable by statistical means. From a geometric point of view, correlation means that the data points are not spread over the complete data space. Instead, they are located on a lower dimensional subset of the data space, which is not necessarily a single linear subspace of the data space. For most index structures, the performance of query processing improves if this intrinsic dimension of the data set is lower than the dimension of the data space. Our cost model takes these properties of the data set into account. The impact of the dimension and the number of data points is considered separately for low dimensional data spaces (Sections 3-4) and for high dimensional data spaces (Section 5). Section 5 also develops a criterion for distinguishing between the low dimensional and high dimensional cases. Correlation effects are handled by the concept of the fractal dimension (Section 6).

The metric for measuring the distance between two data points (Euclidean metric, Manhattan metric, maximum metric) has an important influence on query performance, too. Hence, cost formulas for the two most important metrics--the Euclidean and the maximum metric--are presented throughout this paper. While the Manhattan metric is relatively straightforward, modeling more complex kinds of metrics such as quadratic form distance functions [Seidl 1997; Seidl and Kriegel 1997] is for future work.

A second set of influences is connected with the index structure, the most important being the shape of the page regions. The shape can be a rectangle, a sphere, or a composed page region. If it is a rectangle, it can be a minimum bounding rectangle, or it can be part of a complete decomposition of the data space, as in the k-d-B-tree or in the [LSD.sup.h]-tree. The most difficult aspects to capture in a model are the various heuristics that are applied during insert processing and index construction. The impact of these heuristics on the volume and extension of page regions is very hard to quantify. Another influential factor is the layout of pages on secondary storage. If the layout is clustered, i.e., pairs of adjacent pages are likely to be near each other on disk, the performance can be improved if the query processing algorithm is conscious of this clustering effect. Our cost estimates are presented for rectangular page regions. Considering spherical page regions and cylinders is straightforward, whereas more complex shapes such as composed and intersected solids are difficult. The impact of insertion heuristics is considered neither in this paper nor in any cost model for multidimensional query processing known to the author. In fact, these cost models assume idealized index structures that do not suffer from deterioration such as heavy overlap among page regions. Such parameters could eventually be measured for a specific index and integrated into the formulas, but this is left for future work.

A third set of influential factors is due to the choice of a query processing algorithm. The HS algorithm proposed by Hjaltason and Samet [1995] yields better performance in terms of page accesses than the RKV algorithm of Roussopoulos et al. [1995]. Disk clustering effects can be exploited by taking the relative position of pages on the background storage into account. We assume a depth-first search strategy for range queries and the HS algorithm for nearest-neighbor queries.

1.3 Outline

After reviewing some related work on cost models, we start by introducing the modeling of range queries, assuming an independent and uniform distribution of the data points. Moreover, at the beginning of this paper we assume that queries do not touch the boundary of the data space. Range queries are transformed into equivalent point queries by adapting the page regions accordingly. The central concept for this compensating adaptation is called the Minkowski sum or the Minkowski enlargement. From the Minkowski sum we determine the access probability of pages and use this to develop an expectation for the number of page accesses. Nearest-neighbor queries evaluated by the HS algorithm are modeled in the next section. Conceptually, this is done with a reduction step that transforms the nearest-neighbor query into an equivalent range query. The corresponding range r can be estimated with a probability density function p(r), using r as a stochastic variable.

The simplifying assumptions of a uniform and independent data distribution and the ignorance of data space boundaries will be dropped step by step in subsequent sections. First, the so-called boundary effects are introduced and are shown to be important in high dimensional data spaces. Our model is modified to take boundary effects into account. We then consider nonuniform data distributions that are independent in all dimensions. Last, we formalize correlations by means of the fractal dimension, and integrate this concept into our cost models for range queries and nearest-neighbor queries. Experimental evaluations that show the practical applicability of the cost models and their superiority over related approaches are provided in each section separately.

1.4 Terminology

The large variety of cost formulas developed in this paper requires a complex system of notation and identifiers. There are a few general identifiers for basic cost measures, which will be used throughout this paper:

V for some volume;

R for the expected distance between a query point and its nearest neighbor;

P for some probability distribution function;

X for the access probability of an individual page;

A for the expected number of page accesses.

The boundary conditions for the corresponding cost measure, such as the distance metric and basic assumptions like uniformity and independence in the distribution of data points, are denoted by subscripted indices of the general identifiers. For instance, [X.sub.nn, em, ld, ui] means the access probability for a nearest-neighbor query (nn) using the Euclidean metric (em) on a low dimensional data space (ld) under uniformity and independence assumption (ui). We distinguish:

--the query type: range query (r), nearest neighbor (nn), and k-nearest-neighbor (knn);

--applied metric: Euclidean metric (em) and maximum metric (mm);

--assumed dimensionality: low dimensional (ld) and high-dimensional (hd); and

--data distribution assumption: uniform/independent (ui), correlated (cor).

In particular, the metric identifier (em or mm) is sometimes omitted if an equation is valid for all applied metrics. In this case, it is understood that all terms lacking the metric specification are to be substituted with, of course, the same metric. The volume V is specified by a few indices indicating the geometric shape of the volume. Basic shapes are the hypersphere (S), the hypercube (C), and the hyperrectangle (R). The Minkowski sum (cf., Section 3) of two solids [o.sub.1] and [o.sub.2] is marked by a plus symbol ([o.sub.1] [direct sum] [o.sub.2]). The clipping operation (cf., Section 5) is marked by the intersection symbol ([o.sub.1] [intersection] [o.sub.2]). Indices often denote only the type of some geometrical object: [V.sub.R[direct sum]S] stands for the volume of a Minkowski sum of a rectangle and a sphere.

We use the following notational conventions throughout this paper: Our database consists of a set of N points in a d-dimensional data space DS [subset or equal to] [R.sup.d]. We will usually restrict ourselves to data spaces normalized to the unit hypercube [[0..1].sup.d]. It is unclear to what extent normalizing the data space affects the correctness of the result set; this depends on the mapping between the object space and the feature space in the feature transformation (a topic beyond the scope of this paper). The multistep architecture of query processing generally requires that more objects be retrieved in the feature space than the user requests; the ratio is called the filter selectivity. The mapping into a normalized data space may affect filter selectivity positively or negatively. The problem of modeling filter selectivity for various mappings and similarity measures is a subject for future work. We will use the Euclidean metric [[Delta].sub.em] and the maximum metric [[Delta].sub.mm] for distance calculations between two points P and Q:

\[ \Delta_{em}(P, Q) = \sqrt{\sum_{i=0}^{d-1} (p_i - q_i)^2}, \qquad \Delta_{mm}(P, Q) = \max_{0 \le i < d} |p_i - q_i| \]
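For reference, both metrics can be computed directly (a straightforward sketch):

```python
import math

def delta_em(p, q):
    """Euclidean metric: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def delta_mm(p, q):
    """Maximum metric: the largest absolute coordinate difference."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))
```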

The average number of points stored in a data page will be denoted as the effective data page capacity [C.sub.eff, data]. In our cost estimations, we do not consider directory page accesses for two reasons: First, directory pages are often assumed to be resident in the database cache after their first access [Berchtold et al. 1996]. But even if the cache is too small to store the complete directory, the number of accessed data pages is often even larger than the number of available directory pages. Even if the ratio of accessed pages decreases when moving from the root level to the leaf level, data page accesses are the predominant factor in the costs of query processing. For concreteness, we restrict ourselves to point access methods with rectilinear page regions minimizing the overlap in index construction. The model can be extended for structures with spherical regions in a straightforward way.

1.5 Algorithms for Query Processing in High Dimensional Spaces

In this paper we consider two kinds of queries: range queries and nearest-neighbor queries. A range query is defined as follows: We are given a database DB of N points in a d-dimensional data space, a query point Q, a radius r called a query range and a distance metric [Delta]. Retrieve all points in the database that have a [Delta]-distance from Q not greater than r:

RangeQuery(DB, Q, r, [Delta]) = {P [element of] DB|[Delta](P, Q) [is less than or equal to] r}

Q and r define a high dimensional hypersphere, the query sphere. An index-based algorithm for range queries traverses the tree in a depth-first search and accesses all pages that intersect the query sphere.
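A sequential-scan implementation follows directly from the set definition above (shown only for illustration; an index-based algorithm avoids examining every point):

```python
import math

def euclid(p, q):
    """Euclidean distance, used here as a default for the metric Delta."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(db, q, r, delta=euclid):
    """RangeQuery(DB, Q, r, Delta): all points within delta-distance r of q."""
    return [p for p in db if delta(p, q) <= r]
```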

In a k-nearest-neighbor search, the region of the data space in which data points are searched is not known in advance. Given are the database DB, the query point Q, the distance metric [Delta], and a number k of objects to be retrieved from the database. The task is to select the k points from the database with minimum distance from Q:

kNNQuery(DB, Q, k, [Delta]) = {[P.sub.0], ..., [P.sub.k-1] [element of] DB |

¬[exists]P' [element of] DB \ {[P.sub.0], ..., [P.sub.k-1]} [conjunction] ¬[exists]i, 0 [is less than or equal to] i [is less than] k: [Delta](P', Q) [is less than] [Delta]([P.sub.i], Q)}

This can be determined in a depth-first search, as in Roussopoulos et al.'s [1995] RKV algorithm. This algorithm keeps a list of the k closest points, which is updated whenever a new, closer point is found. The distance of the last point in the list is used for pruning branches of the tree from being processed. Although other data is used for pruning as well, the algorithm cannot guarantee a minimum number of page accesses. The HS algorithm proposed by Hjaltason and Samet [1995], and in a similar form earlier by Henrich [1994], performs neither a depth-first search nor a breadth-first search, but accesses pages in order of increasing distance from the query point. This can be implemented using a priority queue. The algorithm stops when all pages with a distance less than the current pruning distance are processed. The details of this algorithm can be taken from the original literature [Henrich 1994; Hjaltason and Samet 1995], or one of the secondary publications [Berchtold et al. 1997b; Bohm 1998]. In contrast to the RKV algorithm, the optimality of the HS algorithm is provable (cf., [Berchtold et al. 1997b]) because it accesses exactly the pages intersecting the knn-sphere, i.e., a sphere around the query point that touches the k-th nearest neighbor. In contrast, RKV accesses a number of additional pages that depends on the effectiveness of the applied heuristics. Hence, the performance of the RKV algorithm is difficult to estimate, and we restrict ourselves to the HS algorithm in our cost model.
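The priority-queue strategy of the HS algorithm can be sketched on a hypothetical two-level directory (a simplified illustration under assumed node layout, not the authors' implementation):

```python
import heapq
import math

def mindist(q, lo, hi):
    """Smallest Euclidean distance from query point q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, x - h, 0.0) ** 2
                         for x, l, h in zip(q, lo, hi)))

def hs_knn(root, q, k):
    """Best-first search in the style of the HS algorithm: pages and points are
    popped from a priority queue in order of increasing distance from q; the
    search stops once the k-th data point has been popped."""
    counter = 0                      # tie-breaker so heapq never compares nodes
    heap = [(0.0, counter, root)]
    result = []
    while heap and len(result) < k:
        dist, _, entry = heapq.heappop(heap)
        if isinstance(entry, dict):          # a page: enqueue its children
            for child in entry['children']:
                counter += 1
                if isinstance(child, dict):
                    d = mindist(q, *child['mbr'])
                else:                        # a data point
                    d = math.dist(q, child)
                heapq.heappush(heap, (d, counter, child))
        else:                                # a data point: report it
            result.append(entry)
    return result
```

Pages are represented as dictionaries with an `mbr` and a `children` list, data points as tuples; both assumptions are purely for illustration.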

2. REVIEW OF RELATED COST MODELS

Due to the high practical relevance of multidimensional indexing, cost models for estimating the number of necessary page accesses were proposed several years ago. The first approach is the well-known cost model proposed by Friedman et al. [1977] for nearest-neighbor query processing using a maximum metric. The original model estimates leaf accesses in a kd-tree, but can easily be extended to estimate data page accesses of R-trees and related index structures. This extension was published by Faloutsos [1987], and with slightly different details by Aref and Samet [1991]; Pagel et al. [1993]; and Theodoridis and Sellis [1996]. The expected number of data page accesses in an R-tree is

[A.sub.nn, mm, FBF] = [([sup.d] [square root of 1/[C.sub.eff]] + 1).sup.d].

This formula is motivated as follows: The query evaluation algorithm is assumed to access an area of the data space which is a hypercube of the volume [V.sub.1] = 1/N, where N is the number of objects stored in the database. Analogously, the page region is approximated by a hypercube with the volume [V.sub.2] = [C.sub.eff]/N. In each dimension, the chance that the projections of [V.sub.1] and [V.sub.2] intersect each other corresponds to [sup.d][square root of [V.sub.1]] + [sup.d][square root of [V.sub.2]] if N [right arrow] [infinity]. To obtain the probability that [V.sub.1] and [V.sub.2] intersect in all dimensions, this term must be taken to the power of d. Multiplying this result by the number of data pages N/[C.sub.eff] yields the expected number of page accesses, [A.sub.nn, mm, FBF]. For several reasons, however, the assumptions of the model are unrealistic for nearest-neighbor queries on high dimensional data. First, the number N of objects in the database is assumed to approach infinity. Second, effects of high dimensional data spaces and correlations are not considered by the model. Cleary [1979] extends the model of Friedman et al. [1977] by allowing nonrectangular page regions, but still does not consider boundary effects and correlations. Eastman [1981] uses the existing models for optimizing the bucket size of the kd-tree. Sproull [1991] shows that the number of data points must be exponential in the number of dimensions for the models to provide accurate estimates. According to Sproull, boundary effects significantly contribute to the costs unless the following condition holds:

\[ N \gg C_{\mathit{eff}} \cdot \left( \sqrt[d]{\frac{1}{C_{\mathit{eff}} \cdot V_S(1/2)}} + 1 \right)^d \]

where [V.sub.s](r) is the volume of a hypersphere with radius r, which can be computed as

\[ V_S(r) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d \]

with the gamma function [Gamma](x), which is the extension of the factorial operator x! = [Gamma](x + 1) into the domain of real numbers:

[Gamma](x + 1) = x [multiplied by] [Gamma](x), [Gamma](1) = 1,

and [Gamma](1/2) = [square root of [Pi]].

For example, in a 20-dimensional data space with [C.sub.eff] = 20, Sproull's formula evaluates to N [is greater than] 1.1 [multiplied by] [10.sup.11]. We will see later how bad the cost estimates of the FBF model are if substantially fewer than a hundred billion points are stored in the database. Unfortunately, for his analysis, Sproull still assumes uniformity and independence in the distribution of data points and queries, i.e., both data points and the center points of the queries are chosen from a uniform data distribution, whereas the selectivity of the queries (1/N) is considered fixed. The above formulas are also generalized to k-nearest-neighbor queries, where k is also a user-given parameter.
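The quantities of this section are easy to check numerically. The following sketch computes the FBF estimate (both as the closed form and step by step from its derivation), the hypersphere volume, and the right-hand side of Sproull's condition; the form of the condition is reconstructed here so that it reproduces the quoted example N > 1.1 x 10^11 for d = 20 and C_eff = 20:

```python
import math

def fbf_accesses(d, c_eff):
    """Closed form of the FBF estimate: ((1/C_eff)^(1/d) + 1)^d."""
    return ((1.0 / c_eff) ** (1.0 / d) + 1.0) ** d

def fbf_from_derivation(d, c_eff, n):
    """The same quantity derived step by step: per-dimension intersection
    chance taken to the power d, times the number of data pages N/C_eff."""
    v1, v2 = 1.0 / n, c_eff / n              # query cube and page cube volumes
    p_intersect = (v1 ** (1.0 / d) + v2 ** (1.0 / d)) ** d
    return p_intersect * n / c_eff

def sphere_volume(d, r):
    """V_S(r) = sqrt(pi^d) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def sproull_bound(d, c_eff):
    """Right-hand side of Sproull's condition (reconstructed form)."""
    return c_eff * ((1.0 / (c_eff * sphere_volume(d, 0.5))) ** (1.0 / d) + 1.0) ** d
```

Note that the two FBF computations agree exactly, since the closed form is just the simplified derivation.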

The assumptions in the existing models do not hold in the high dimensional case. The main reason for the problems of the existing models is that they do not consider boundary effects, i.e., the exceptional performance behavior that arises when the query approaches the boundary of the data space. As we show later, boundary effects occur frequently in high dimensional data spaces and lead to the pruning of major amounts of empty search space, which is not accounted for by the existing models. To examine these effects, we performed experiments comparing the necessary page accesses with the model estimates. Figure 1 shows the actual page accesses for uniformly distributed point data versus the estimates of the Friedman, Bentley, and Finkel model. For high dimensional data, the model completely fails to estimate the number of page accesses.

[Figure 1 ILLUSTRATION OMITTED]

The Friedman, Bentley, and Finkel basic model has been extended in two different directions. The first is to take correlation effects into account by using the concept of the fractal dimension [Mandelbrot 1977; Schroder 1991]. There are various definitions of the fractal dimension, which all capture the relevant aspect (the correlation), but differ in the details of how the correlation is measured. We will not distinguish between these approaches in our subsequent work.

Faloutsos and Kamel [1994] use the box-counting fractal dimension (also known as the Hausdorff fractal dimension) for modeling the performance of R-trees when processing range queries using a maximum metric. In their model they assume a correlation in the points stored in the database, but still assume a uniform and independent distribution for queries. The analysis does not take into account the effects of high dimensional spaces, and the evaluation is limited to data spaces with dimensions less than or equal to 3. In a subsequent paper, Belussi and Faloutsos [1995] use the fractal dimension with a different definition (the correlation fractal dimension) for the selectivity estimation of spatial queries. Range queries in low-dimensional data spaces using the Manhattan metric, the Euclidean metric, and the maximum metric are modeled in that paper. Unfortunately, the model only allows an estimate of selectivities; it is not possible to extend it in a straightforward way to determine expected page accesses.

In a recent paper, Papadopoulos and Manolopoulos [1997] used the results of Faloutsos and Kamel and those of Belussi and Faloutsos for a new model. This model is capable of estimating data page accesses of R-trees when processing nearest-neighbor queries in a Euclidean space. It estimates the distance of the nearest neighbor by using the selectivity estimates of Belussi and Faloutsos [1995] in the reverse way. We point out in Section 4 that this approach is problematic from a statistical point of view. Because it is difficult to determine accesses to pages with rectangular regions for spherical queries, the model approximates query spheres by minimum bounding and maximum enclosed cubes and in this way determines upper and lower bounds on the number of page accesses. This approach makes the model inoperative for high dimensional data spaces because the approximation error grows exponentially with the increasing dimension. Note that, in a 20-dimensional data space, the volume of the minimum bounding cube of a sphere is larger by a factor of 1/[V.sub.S](1/2) = 4.1 [multiplied by] [10.sup.7] than the volume of the sphere. The sphere volume, in turn, is [V.sub.S]([square root of d]/2) = 27,000 times larger than the greatest enclosed cube. The Papadopoulos and Manolopoulos model has the advantage that queries are no longer assumed to be taken from a uniform and independent distribution; instead, the query distribution is assumed to follow the data distribution.

The concept of fractal dimension is also widely used in the domain of spatial databases, where the complexity of stored polygons is modeled [Gaede 1995; Faloutsos and Gaede 1996]. (But these approaches are of minor importance for point databases.)

The second direction in which the Friedman, Bentley, and Finkel basic model needs extension is the consideration of the boundary effects that occur when indexing data spaces of higher dimensionality.

Arya et al. [1995] and Arya [1995] presented a new cost model for processing nearest-neighbor queries in the context of a vector quantization application domain. Arya, Mount, and Narayan restricted their model to the maximum metric and neglected correlation effects. Unfortunately, they still assumed that the number of points is exponential with the dimension of the data space. This assumption is justified in their application domain, but it is unrealistic for database applications.

In the BBKK model, Berchtold et al. [1997b] presented a cost model for query processing in high dimensional data spaces. The model provides accurate estimates for nearest-neighbor queries and range queries using the Euclidean metric and considers boundary effects. To cope with correlation, the authors propose the fractal dimension, without presenting the details. The main limitations of the model are (1) that no estimates for the maximum metric are presented; (2) that the number of data pages is assumed to be a power of two; and (3) that a complete, overlap-free coverage of the data space with data pages is assumed. Weber et al. [1998] use the BBKK cost model, without the extension for correlated data, to show the superiority of the sequential scan in sufficiently high dimensions. They present the VA-file, an improvement over the sequential scan. Ciaccia et al. [1998] adapt the BBKK cost model to estimate the page accesses of the M-tree, an index structure for data spaces which are metric spaces but not vector spaces (i.e., only the distances between the objects are known, but no explicit positions). Papadopoulos and Manolopoulos [1998] apply the cost model for declustering data in a disk array. Two papers [Riedel et al. 1998; Agrawal et al. 1998] present applications of the BBKK model in the data mining domain.

This paper is based on the BBKK cost model, which is presented here in a comprehensive way and extended in many aspects. In contrast to the BBKK cost model, the current paper additionally includes all cost estimates for the maximum metric. The restriction of the BBKK model to numbers of data pages that are a power of two is overcome (cf., Sections 5.2 and 5.3). A further extension of the model concerns k-nearest-neighbor queries; the corresponding cost formulas are presented in Section 4.4. The numerical methods for integral approximation and for estimating the boundary effects are entirely beyond the scope of the paper by Berchtold et al. [1997b]. Improved evaluation methods not contained in the BBKK model are described in Sections 4.3, 5.2, and 5.3. These evaluation methods, based on precomputed volume functions for intersection solids, are of high practical importance because costly numerical methods such as Monte Carlo integration can be performed completely at compile time. Finally, the concept of the fractal dimension, which is used in the BBKK model only in a simplified way (the data space dimension is simply replaced by the fractal dimension), is in this paper well founded by the consistent application of the fractal power laws.

3. RANGE QUERY

In this section we assume uniformity and independence in the distribution of both data and query points. Moreover, we ignore the existence of a boundary of the data space, or at least assume that page regions and queries are distant enough from the space boundary that the boundary is not touched. We start with a given page region and a given query range r and determine the probability of the page being accessed under the uniformity and independence assumption for the query point. The uniformity and independence assumption means that the query point is a d-dimensional stochastic variable distributed such that each position of the data space has the same probability.

3.1 The Minkowski Sum

The corresponding page is accessed whenever the query sphere intersects with the page region (see Figure 2). In all figures in this paper we symbolize queries by spheres and page regions by rectangles, and our terminology (e.g., query sphere) will often reflect this symbolic representation. We note that queries using the maximum metric correspond to hypercubes rather than hyperspheres. We further note that not all known index structures use hyperrectangles as page regions. Our concepts are also applicable if the shape of the query is a cube or the shape of the page region is a sphere.

[Figure 2 ILLUSTRATION OMITTED]

We conceptually make the following transformation to estimate the access probability of a page: We enlarge the page region so that if the original page touches any point of the query sphere, then the enlarged page touches the center point of the query. It is obvious from Figure 2 that the page region becomes enlarged by a sphere of the same radius r whose center point is drawn over the surface of the page region. All positions inside the grey region of Figure 2 are positions of the query point for which the page is accessed; all positions outside the grey region are positions for which the page is not accessed. The marked volume divided by the data space volume therefore directly corresponds to the access probability of the page. As we assume, for simplicity, that the data space is the unit hypercube [[0..1].sup.d], the data space volume corresponds to 1. The concept of enlarging a given geometrical object by another object in this way is called the Minkowski sum [Kaul et al. 1991]. The Minkowski sum of two objects A and B is defined as the vector sum of all points of the objects:

A [direct sum] B = {a + b|a [element of] A, b [element of] B}
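For finite point sets, this definition can be evaluated literally (a small illustration; real page regions and query spheres are of course continuous solids):

```python
def minkowski_sum(A, B):
    """Vector sum of all points of two finite point sets A and B."""
    return {tuple(a + b for a, b in zip(pa, pb)) for pa in A for pb in B}
```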

The main application of the Minkowski sum is robot motion planning. In the context of cost modeling, only the volume [V.sub.A[direct sum]B] of the Minkowski sum is needed. Therefore, we will not strictly distinguish between the Minkowski sum as a spatial object (solid) and its volume. For determining the volume, we have to consider various cases. The simplest case is that both solids are hyperrectangles with side lengths [a.sub.i] and [b.sub.i], for 0 [is less than or equal to] i [is less than] d. In this case, the Minkowski sum of the two rectangles is a rectangle with side lengths [c.sub.i], where each [c.sub.i] corresponds to the sum of [a.sub.i] and [b.sub.i], and its volume is [V.sub.R[direct sum]R]:

V_{R \oplus R} = \prod_{i=0}^{d-1} (a_i + b_i) = \prod_{i=0}^{d-1} c_i
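To make the rectangle-rectangle case concrete, here is a minimal sketch in Python (the function name is ours, not from the paper):

```python
from math import prod

def minkowski_rect_rect_volume(a, b):
    """Volume of the Minkowski sum of two axis-parallel hyperrectangles
    with side lengths a[i] and b[i]: the product of (a[i] + b[i])."""
    assert len(a) == len(b)
    return prod(ai + bi for ai, bi in zip(a, b))
```

For side lengths (1, 2) and (3, 4) this yields (1+3)(2+4) = 24.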

If both solids, the query sphere and the page region, are hyperspheres with radii [r.sub.q] and [r.sub.p], the Minkowski enlargement corresponds to a hypersphere with radius [r.sub.q] + [r.sub.p]. The corresponding volume of the hypersphere can be evaluated by the following formula:

V_S(d, r_q + r_p) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot (r_q + r_p)^d

Evaluating the volume becomes more complex if query and page region are shaped differently. This is illustrated in a 3-dimensional example in Figure 3, which depicts some of the solids by which the page region (gray) becomes enlarged: At each corner of the gray solid, we have a part (1/8) of a sphere (only one part [s.sub.1] is drawn). The composed parts form one complete sphere of radius r (the query radius). At each edge of the region we have a part (1/4) of a cylinder. The parts form three complete cylinders where the radius is the query radius. The length of [c.sub.1], for instance, is the width of the page region. Finally, at each surface of the region we enlarge the region by a cuboid where two side lengths are defined by the surface of the region and the remaining side length again is r. In the general d-dimensional case, our page region not only has corners, edges, and 2-dimensional surfaces, but surface segments with dimensions up to d - 1. Each surface segment with dimension k is enlarged by a part of a hypercylinder that is spherical in d - k dimensions. In the remaining k dimensions, the hypercylinder has the shape of the surface segment to which it is connected.

[Figure 3 ILLUSTRATION OMITTED]

Before we determine the volume of the Minkowski enlargement in the most complex case of a hypersphere and a hyperrectangle, let us solve the simpler case where the page region is a hypercube with side length a. In this case, all k-dimensional surface segments have the volume [a.sup.k]. It remains open how many such surface segments exist and what the volume of the assigned part of the hypercylinder is. The number of surface segments can be determined by a combinatorial consideration. All surface segments (including the corners and the page region itself) can be represented by a d-dimensional vector over the symbols L, U, and *. Here, the symbol L stands for the lower boundary, U for the upper boundary, and the star stands for the complete interval between the lower and upper boundaries. Each position of symbols L, U, or * corresponds to one dimension of the original data space. Using this notation, the corners have no star, the edges have one star, the 2-dimensional surfaces have two stars in the vector, and so on. The hyperrectangle itself has d stars, no L, and no U in its description vector. In our example in Figure 3, [s.sub.1] is assigned to the vector (LUL), and [c.sub.1] is assigned to the vector (*UL) because it covers the entire range in the x-direction, and is fixed at the upper end of the y-direction and at the lower end of the z-direction. The number of k-dimensional surface segments corresponds to the number of different vectors having k stars. The positions of the stars can be arbitrarily selected from d positions in the vector, yielding a binomial number of possibilities. The remaining d - k positions are filled with the symbols L and U. Thus, the number of surface segments SSEGM(k) of dimension k is equal to

SSEGM(k) = \binom{d}{k} \cdot 2^{d-k}

The fraction of the hypercylinder at each surface segment is 1/[2.sup.d-k] because the cylinder is halved once for each of the d - k spherical dimensions. Therefore, we get the following formula for the Minkowski sum of a hypersphere with radius r and a hypercube with side length a:

V_{S \oplus C}(r, a) = \sum_{k=0}^{d} \binom{d}{k} \cdot a^k \cdot V_S(d - k, r), \qquad V_S(d, r) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d
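The cube-sphere case can be sketched likewise; `sphere_volume` below implements the standard hypersphere volume formula, and the convention V_S(0, r) = 1 covers the corner term (function names are ours):

```python
from math import gamma, pi, comb, sqrt

def sphere_volume(d, r):
    """Volume of a d-dimensional hypersphere with radius r:
    sqrt(pi^d) / Gamma(d/2 + 1) * r^d.  For d = 0 this yields 1."""
    return sqrt(pi ** d) / gamma(d / 2 + 1) * r ** d

def minkowski_cube_sphere_volume(d, a, r):
    """Volume of the Minkowski sum of a hypercube (side a) and a
    hypersphere (radius r): sum over all k-dimensional surface segments,
    each contributing a^k times a (d-k)-dimensional sphere volume."""
    return sum(comb(d, k) * a ** k * sphere_volume(d - k, r)
               for k in range(d + 1))
```

For d = 2, a = 1, r = 1 the result is 5 + π: the square itself, four 1 × 1 side rectangles, and four quarter circles.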

In the most complex case of noncubical hyperrectangles, the k-dimensional surface segments must be summed up explicitly, which is a costly operation. Instead of the binomial coefficient multiplied by [a.sup.k], we have to sum over all k-combinations of the side lengths of the hyperrectangle, i.e., over the power set [2.sup.{0 ... d-1}]:

V_{S \oplus R}(r, \vec{a}) = \sum_{k=0}^{d} \; \sum_{\{i_1, \ldots, i_k\} \subseteq \{0, \ldots, d-1\}} a_{i_1} \cdot \ldots \cdot a_{i_k} \cdot V_S(d - k, r)

We note that in most cases determining the Minkowski sum of a hypersphere and a hyperrectangle is too costly, since it involves a number of basic operations (multiplications) exponential in dimension. The explicit determination of the Minkowski sum of a real hyperrectangle and a hypersphere of high dimensionality must be avoided, even if exactness is sacrificed.

The usual workaround is to transform the page region into a hypercube with equivalent volume. However, exactness is sacrificed in this approach because the Minkowski sum of a noncubical hyperrectangle is larger than the Minkowski sum of a volume-equivalent hypercube. Our experiments later will show that this approximation does not cause serious errors in the cost estimations for the X-tree, which strives for square-like page regions. For other index structures, this effect could play a more important role. Estimating access probabilities in such cases is subject to future work.
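The workaround can be illustrated by comparing the exact power-set sum against the volume-equivalent cube approximation (a sketch under the stated assumptions; names are ours):

```python
from math import gamma, pi, comb, sqrt, prod
from itertools import combinations

def sphere_volume(d, r):
    """sqrt(pi^d) / Gamma(d/2 + 1) * r^d; yields 1 for d = 0."""
    return sqrt(pi ** d) / gamma(d / 2 + 1) * r ** d

def minkowski_rect_sphere_exact(sides, r):
    """Exact Minkowski sum volume: sum over all k-combinations of the
    rectangle's side lengths (exponential in the dimension)."""
    d = len(sides)
    return sum(prod(subset) * sphere_volume(d - k, r)
               for k in range(d + 1)
               for subset in combinations(sides, k))

def minkowski_rect_sphere_cube_approx(sides, r):
    """Approximation: replace the rectangle by a volume-equivalent cube
    with side a = (prod of sides)^(1/d), then use the cube formula."""
    d = len(sides)
    a = prod(sides) ** (1 / d)
    return sum(comb(d, k) * a ** k * sphere_volume(d - k, r)
               for k in range(d + 1))
```

For a noncubical rectangle the exact value is strictly larger than the cube approximation, as stated above; for a cube the two coincide.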

3.2 Estimating Rectangular Page Regions

To make modeling manageable, we assume page regions that have the form of a hypercube. Observe that structures like the X-tree or the R*-tree try to produce such page regions, and that our prediction matches the experimental observation. We exploit the fact that the number of points enclosed in an arbitrary hypercube of side length [a.sub.nonbound] is proportional to the volume of this hypercube. By the index nonbound we indicate that this hypercube is not the minimum bounding rectangle of the enclosed points, i.e., [a.sub.nonbound] is the expected side length of the page regions before taking the MBR (minimum bounding rectangle) effect into account. Similarly, [V.sub.R, nonbound] is the volume of the page region without considering the MBR effect. Due to proportionality, the number [C.sub.eff] of points stored in a page divided by the number N of all points in the database is equal to the volume of the page divided by the volume [V.sub.DS] of the data space:

\frac{C_{eff}}{N} = \frac{V_{R,nonbound}}{V_{DS}} = a_{nonbound}^d \quad \Rightarrow \quad a_{nonbound} = \sqrt[d]{\frac{C_{eff}}{N}}

In this formula for the side length of a typical page region, we assume complete coverage of the data space with page regions. This assumption is not meaningful for minimum bounding rectangles. There is usually a small gap between two neighboring page regions. The expected width of this gap under the uniformity and independence assumption can be determined by projecting all points of a page onto the coordinate axis perpendicular to the gap. The average distance between two neighboring projections is 1/[C.sub.eff] times the side length of the region because the projections of the points are uniformly distributed. This is also the expected value for the width of the gap by which the side length of the page region is decreased compared to [a.sub.nonbound]. Therefore, the side length of a minimum bounding rectangle can be estimated as

a = \left(1 - \frac{1}{C_{eff}}\right) \cdot \sqrt[d]{\frac{C_{eff}}{N}}

Consideration of gaps between page regions is particularly important if the effective page capacity is low. Figure 4 shows the compensation factor for varying page capacities (left side) and for varying dimensions (right side). It shows the factor that decreases the volume of a page region when gaps are taken into account. If the compensation factor is 1, no compensation is needed. The smaller the compensation factor, the higher the impact of the MBR effect. The left diagram shows the compensation factor for a fixed dimension d = 16 with varying capacity [C.sub.eff]. The strongest MBR effect occurs for low capacities. For a typical capacity between 20 and 40 points per data page (which is the most frequently applied capacity in our experience), the compensation factor ranges between 40% and 70%. The right diagram shows the compensation factor for a fixed effective page capacity (30 points) and varying dimension. Most compensation is necessary for large dimensions.

[Figure 4 ILLUSTRATION OMITTED]

3.3 Expected Number of Page Accesses

By inserting the expected side length a into the formulas for the Minkowski enlargement, it is possible to determine the access probabilities of typical data pages under the uniformity and independence assumption. This is for the maximum metric:

[X.sub.r, mm, ui](r) = [(2r + a).sup.d] = [(2r + (1 - 1/[C.sub.eff]) [multiplied by] [sup.d][square root of [C.sub.eff]/N]).sup.d].

Note again that the volume corresponds to the access probability as the data space is normalized to the unit hypercube. For the Euclidean metric, the access probability for range queries with radius r evaluates to

X_{r,em,ui}(r) = \sum_{k=0}^{d} \binom{d}{k} \cdot a^k \cdot V_S(d - k, r), \qquad a = \left(1 - \frac{1}{C_{eff}}\right) \cdot \sqrt[d]{\frac{C_{eff}}{N}}

From these access probabilities, the expected number of page accesses can be determined by multiplying the access probability with the number N/[C.sub.eff] of data pages:

[A.sub.r, mm, ui](r) = [(2r [multiplied by] [sup.d][square root of N/[C.sub.eff]] + 1 - 1/[C.sub.eff]).sup.d].

For the Euclidean metric, the corresponding result is

A_{r,em,ui}(r) = \frac{N}{C_{eff}} \cdot X_{r,em,ui}(r)
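The maximum-metric page-access formula above can be evaluated directly (a sketch; the function name is ours):

```python
def expected_accesses_range_mm(r, d, N, c_eff):
    """Expected number of data page accesses for a range query with
    radius r under the maximum metric (uniformity/independence, no
    boundary effects): A = (2r * (N/c_eff)^(1/d) + 1 - 1/c_eff)^d."""
    return (2 * r * (N / c_eff) ** (1 / d) + 1 - 1 / c_eff) ** d
```

For r = 0 this degenerates to (1 - 1/[C.sub.eff])^d, the expected number of pages whose region contains the query point itself.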

4. NEAREST-NEIGHBOR QUERY

The optimality of the HS algorithm [Hjaltason and Samet 1995] for nearest-neighbor search was proven in Berchtold et al. [1997b]. The HS algorithm yields exactly the same page accesses as an equivalent range query, i.e., a range query using the distance to the nearest neighbor as query range. This enables us to reduce the problem of modeling nearest-neighbor queries to the problem of modeling range queries, which is solved in Section 3. Thus we have to estimate the nearest-neighbor distance and apply it as the radius in the range query model.

As in Section 3, we start with the assumption of an independent, uniform data distribution, and we ignore boundary effects. These effects will be investigated in depth in Sections 5 and 6.

4.1 Coarse Estimates of the Nearest-Neighbor Distance

A simple way to estimate the nearest-neighbor distance is to choose a sphere in the data space such that an expected value of one data point is contained in it, according to the current point density, and to use the radius of this sphere as an approximation of the actual nearest-neighbor distance. In the case of the maximum metric, we get the following formula:

\frac{1}{N} = V_q = (2r)^d \quad \Rightarrow \quad r = \frac{1}{2} \cdot \sqrt[d]{\frac{1}{N}}

For the Euclidean metric, the corresponding formula is

\frac{1}{N} = V_S(d, r) \quad \Rightarrow \quad r = \frac{1}{\sqrt{\pi}} \cdot \sqrt[d]{\frac{\Gamma(d/2 + 1)}{N}}

Unfortunately, this approach is not correct from the stochastic point of view, since the operation of building an expectation is not invertible, i.e., the expectation of the radius cannot be determined from the expectation of the number of points in the corresponding sphere. The approximation determined by this formula is rather coarse, and can only be used when a fast and simple evaluation is of higher importance than the accuracy of the model. The general problem is that, even under the uniformity and independence assumption, the nearest-neighbor distance exhibits a certain variance when several nearest-neighbor queries are executed.
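Despite these caveats, the coarse estimates are trivial to compute; the following sketch (names are ours) follows the two formulas above:

```python
from math import gamma, pi, sqrt

def coarse_nn_distance_mm(d, N):
    """Radius of a maximum-metric sphere expected to contain one point:
    (2r)^d = 1/N  =>  r = 0.5 * (1/N)^(1/d)."""
    return 0.5 * (1 / N) ** (1 / d)

def coarse_nn_distance_em(d, N):
    """Euclidean analogue: V_S(d, r) = 1/N  =>
    r = (Gamma(d/2 + 1) / N)^(1/d) / sqrt(pi)."""
    return (gamma(d / 2 + 1) / N) ** (1 / d) / sqrt(pi)
```

For d = 2 the Euclidean estimate reduces to the familiar r = 1/sqrt(pi * N).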

4.2 Exact Estimate of the Nearest-Neighbor Distance

A stochastically correct approach is to determine a distribution function for the nearest-neighbor distance and to derive an expectation of the nearest-neighbor distance from the corresponding probability density function. From this probability density function, the expectation for the number of page accesses can also be derived.

According to the rules defining a distribution function P(r) for the nearest-neighbor distance, we must determine the probability of the actual nearest-neighbor distance being smaller than the stochastic variable r. The nearest-neighbor distance is larger than r if and only if no data point is contained in the sphere with radius r. We need the opposite event. Hence, P(r) is as follows:

P(r) = 1 - [(1 - V(r)).sup.N].

Here, 1 - V(r) models the probability that a point selected from a uniform distribution is not contained in the volume. Therefore, [(1 - V(r)).sup.N] is the probability that all points are outside the volume. Subtracting this from 1 is the probability of at least one point being inside. Due to the convergence of the limit

\lim_{N \to \infty} \left(1 - \frac{x}{N}\right)^N = e^{-x}

the distribution function can be approximated for a large number of objects N by the following formula:

P(r) [approximately equals] 1 - [e.sup.-N [multiplied by] V(r)].

This approximation yields negligible relative errors for a large N (starting from 100), and will facilitate the evaluation later in this paper.
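The quality of the exponential approximation against the exact distribution function is easy to check numerically (function names are ours):

```python
from math import exp

def nn_dist_cdf_exact(V, N):
    """P(r) = 1 - (1 - V(r))^N: probability that the nearest-neighbor
    distance is at most r, where V is the query sphere volume."""
    return 1.0 - (1.0 - V) ** N

def nn_dist_cdf_approx(V, N):
    """Large-N approximation P(r) ~ 1 - exp(-N * V(r))."""
    return 1.0 - exp(-N * V)
```

Already for N = 100 and V = 0.01 the two values differ by less than 0.01.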

For the maximum metric and Euclidean metric, the probability distribution function P(r) can be evaluated in the following way:

P_{mm}(r) = 1 - e^{-N \cdot (2r)^d} \qquad P_{em}(r) = 1 - e^{-N \cdot \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d}

From the distribution function P(r), a probability density function p(r) can be derived by differentiation:

p(r) = [differential]P(r)/[differential]r.

For the maximum and Euclidean metric, this evaluates to

p_{mm}(r) = N \cdot d \cdot 2^d \cdot r^{d-1} \cdot e^{-N \cdot (2r)^d} \qquad p_{em}(r) = N \cdot d \cdot \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^{d-1} \cdot e^{-N \cdot \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d}

Figure 5 shows some probability density functions for 100,000 uniformly distributed data points in a two-dimensional and an eight-dimensional data space, respectively.

[Figure 5 ILLUSTRATION OMITTED]

To determine the expected value of the nearest-neighbor distance, the probability density function multiplied by r must be integrated from 0 to infinity:

R = \int_0^{\infty} r \cdot p(r) \, \partial r

The integration variable is denoted by [differential] instead of the more usual d, to avoid confusion with the identifier d which stands for the dimension of the data space. The integral is not easy to evaluate analytically.

4.3 Numerical Evaluation

We present two numerical methods to evaluate the integrals presented at the end of Section 4.2. The first method is based on the binomial theorem. Basically, the probability density function p(r) is a polynomial of degree d [multiplied by] N. It can be transformed into the coefficient form

p(r) = [a.sub.0][r.sup.d[multiplied by]N] + [a.sub.1][r.sup.d[multiplied by]N - 1] + ...

using the binomial theorem. We demonstrate this for the maximum metric:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This alternating series can be approximated with low error bounds by the first few summands (0 [is less than or equal to] i [is less than or equal to] [i.sub.max]) if the absolute value of the summands is monotonically decreasing. This is the case if the power of r decreases more steeply with increasing i than the binomial coefficient grows, which is guaranteed if the following condition holds:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Therefore, we approximate our formula for the expected nearest-neighbor distance in the following way:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The same simplification can be applied for the Euclidean metric. However, an alternative way, based on a histogram approximation of the probability density function, yields lower approximation errors and results in even less effort in the evaluation.

To facilitate numerical integration methods, such as the middlebox approximation, the trapezoid approximation, or the combined method according to Simpson's rule [Press et al. 1988], we must determine suitable boundaries between which the probability density function has values substantially greater than 0. If, for example, we consider Figure 5, we observe that for d = 8, [p.sub.mm](r) is very close to 0 in the two ranges 0 [is less than or equal to] r [is less than or equal to] 0.05 and 0.16 [is less than or equal to] r [is less than] [infinity]. Only the range between the lower bound [r.sub.lb] = 0.05 and the upper bound [r.sub.ub] = 0.16 contributes significantly. The criterion for a sensible choice of the bounds is based on the distribution function, which corresponds to the area below the density function. We choose the lower bound [r.sub.lb] such that the area in the ignored range [0 .. [r.sub.lb]] corresponds to a probability less than 0.1%, and do the same for the upper bound [r.sub.ub]. We get the following two conditions, resulting from the approximation of the distribution function:

1 - e^{-N \cdot V(r_{lb})} \le 0.001 \qquad e^{-N \cdot V(r_{ub})} \le 0.001

Integration can therefore be limited to the interval from [r.sub.lb] to [r.sub.ub]. The integral can be approximated by a sum of trapezoids or by a sum of rectangles:

R \approx \frac{r_{ub} - r_{lb}}{i_{max}} \cdot \sum_{i=0}^{i_{max}-1} r_i \cdot p(r_i), \qquad r_i = r_{lb} + \left(i + \frac{1}{2}\right) \cdot \frac{r_{ub} - r_{lb}}{i_{max}}

Since we limit the integration to a small interval, a small number of rectangles or trapezoids is sufficient for a high accuracy. To achieve a relative error less than 1.0%, an approximation by [i.sub.max] = 5 rectangles was sufficient in our experiments.
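The complete middlebox scheme for the maximum metric can be sketched as follows; the closed-form expressions for [r.sub.lb] and [r.sub.ub] follow from inverting the approximated distribution function P(r) = 1 - exp(-N (2r)^d), and all names are ours:

```python
from math import exp, log

def nn_distance_mm_numeric(d, N, boxes=5):
    """Middlebox (midpoint-rectangle) approximation of the expected
    nearest-neighbor distance under the maximum metric.  Integration is
    restricted to [r_lb, r_ub], chosen so that each ignored tail of
    P(r) = 1 - exp(-N * (2r)^d) carries less than 0.1% probability."""
    # Invert P(r) = eps:  (2r)^d = -ln(1 - eps) / N
    r_lb = 0.5 * (-log(1 - 0.001) / N) ** (1 / d)
    r_ub = 0.5 * (-log(0.001) / N) ** (1 / d)

    def density(r):
        # p(r) = dP/dr = N * d * 2^d * r^(d-1) * exp(-N * (2r)^d)
        return N * d * 2 ** d * r ** (d - 1) * exp(-N * (2 * r) ** d)

    width = (r_ub - r_lb) / boxes
    return sum(width * r * density(r)
               for r in (r_lb + (i + 0.5) * width for i in range(boxes)))
```

With d = 8 and N = 100,000 the estimate falls into the significant range of Figure 5, and five rectangles already come close to a much finer discretization.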

4.4 K-Nearest-Neighbor Query

The cost model developed in the previous sections can also be extended for estimating k-nearest-neighbor queries. For the coarse model this is straightforward, since the volume must be chosen such that k objects are contained rather than one. Hence the term 1/N must be replaced by k/N. For the maximum metric, the estimated distance is

r = \frac{1}{2} \cdot \sqrt[d]{\frac{k}{N}}

For the Euclidean metric, the result is analogous:

r = \frac{1}{\sqrt{\pi}} \cdot \sqrt[d]{\frac{\Gamma(d/2 + 1) \cdot k}{N}}

For the exact model, the probability distribution must be modeled as a summation of Bernoulli chains with lengths ranging from 1 to k. The probability that at least k points are inside the volume V(r) corresponds to the following formula:

P_k(r) = 1 - \sum_{i=0}^{k-1} \binom{N}{i} \cdot V(r)^i \cdot (1 - V(r))^{N-i}

For k = 1, the formula corresponds to the distribution function P(r) for nearest-neighbor queries. The probability density function and the expectation of the nearest-neighbor distance are determined in the same way as in Section 4.2.

We note that the peak in the probability density function of a k-nearest-neighbor query becomes steeper with increasing k (decreasing variance). Therefore, the approximation by the coarse model, which is poor for high variances, becomes better with increasing k. For sufficiently large k (k [is greater than] 10), the coarse model and the exact model yield comparable accuracy.
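The k-nearest-neighbor distribution function can be sketched directly from this description (the function name is ours):

```python
from math import comb

def knn_dist_cdf(V, N, k):
    """P_k(r) = 1 - sum_{i=0}^{k-1} C(N, i) * V^i * (1-V)^(N-i):
    probability that at least k of N uniformly distributed points lie
    inside the query sphere of volume V(r)."""
    return 1.0 - sum(comb(N, i) * V ** i * (1.0 - V) ** (N - i)
                     for i in range(k))
```

For k = 1 this reduces to the nearest-neighbor distribution 1 - (1 - V)^N, as stated above.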

4.5 Expectation of the Number of Page Accesses

As initially mentioned, the number of page accesses of a nearest-neighbor query is equivalent to the number of page accesses of a range query that uses the nearest-neighbor distance as query range. An obvious approach is therefore to insert the expectation of the nearest-neighbor distance as the radius into the formula for the expected number of page accesses of a range query:

[A.sub.nn] [approximately equals] [A.sub.r](R).

This approach, however, reveals similar statistical problems and leads to similar inaccuracies as does the coarse estimation approach in Section 4.1. The problem is that the number of page accesses is not linear in the query range. Once again, the approach can be taken if high accuracy is not required or if the variance of the nearest neighbor distance is low.

Instead, we have once again to apply the distribution function P(r) to determine an expectation of the number of page accesses by integration, as follows:

A_{nn} = \int_0^{\infty} p(r) \cdot A_r(r) \, \partial r

For maximum and Euclidean metrics, this formula evaluates to

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This result can be simplified by a similar technique to the one in Section 4.3.

5. EFFECTS IN HIGH DIMENSIONAL DATA SPACES

In this section, we describe some effects occurring in high-dimensional data space which are not sufficiently considered in our models of the previous sections. We still assume a uniform and independent distribution of data and query points in this section. The models developed in the previous sections will be modified to take the described effects into account.

5.1 Problems Specific to High Dimensional Data Spaces

The first effect that occurs especially in high dimensional data spaces is that all data and query points are likely to be near the boundary of the data space. The probability that a point randomly taken from a uniform and independent distribution in a d-dimensional data space has a distance of r or less to the space boundary can be determined by the following formula:

[P.sub.surface](r) = 1 - [(1 - 2 [multiplied by] r).sup.d].

As Figure 6 shows, the probability that a point is inside a 10% border of the data space boundary increases rapidly with increasing dimension. It reaches 97% for a 16-dimensional data space.

[Figure 6 ILLUSTRATION OMITTED]
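The boundary probability is easy to verify numerically (the function name is ours):

```python
def p_surface(r, d):
    """Probability that a uniform point in [0,1]^d lies within distance
    r of the data space boundary: 1 - (1 - 2r)^d."""
    return 1.0 - (1.0 - 2.0 * r) ** d
```

For a 10% border (r = 0.1) in a 16-dimensional space this yields 1 - 0.8^16, i.e., about 97%, matching the value quoted above.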

A second effect, which is even more important, is the large extension of query regions. If we use our model for determining an expected value of the nearest-neighbor distance, we observe that the expectation quickly approaches surprisingly high values. Figure 7 shows the expected values for the nearest-neighbor distance with varying dimensions for the maximum and the Euclidean metrics for several databases containing between 10,000 and 10,000,000 points. In particular, when applying the Euclidean metric in a data space of a dimension between 13 and 19, the nearest-neighbor distance reaches a value of 0.5, i.e., the nearest-neighbor sphere has the same diameter as the complete data space. The size of the database has a minor influence on this effect.

[Figure 7 ILLUSTRATION OMITTED]

The combination of the two effects described above leads us to the observation that large parts of a typical nearest-neighbor sphere must lie outside the boundary of the data space (cf., Figure 8). The consequences arising from this fact are commonly referred to as boundary effects. As we will investigate in depth in the subsequent sections, the most important consequence is that all volume determinations in our models must consider clipping at the boundary of the data space. On the one hand, the expectation of the nearest-neighbor distance increases due to boundary effects; on the other hand, access probabilities of data pages decrease because large parts of the Minkowski sum are clipped away.

[Figure 8 ILLUSTRATION OMITTED]

If the dimension further increases, the typical nearest-neighbor distance grows to values much greater than 1/2. In this case, it becomes very likely that the nearest-neighbor sphere exceeds most of the data space boundary areas.

A similar effect is observable for the page regions. If we assume, following our initial model, hypercube-shaped page regions, the side lengths of such regions quickly exceed 0.5. However, it is impossible for the data space to be covered with pages having side lengths between 0.5 and 1 only. Pagination arises from a recursive decomposition of the data space into parts of approximately the same volume (by the uniformity and independence assumption). Therefore, each page is split several times in each dimension. This means that (approximately) only the side lengths 1, 1/2, 1/4, 1/8, ... can occur. In sufficiently high dimensions, the page regions do not have side lengths of 1/2 or smaller in all dimensions, because if every data page is split at least once in each dimension, we need a total number of at least [2.sup.d] data pages to cover the complete data space. For example, in a 30-dimensional data space, we would need one billion data pages to reach such a pagination, resulting in a database size of 4,000 GB (assuming a page size of 4 KB).

Hence we modify our cost models such that this effect is considered for database sizes N with fewer than [N.sub.singlesplit] objects, i.e., for

N [is less than or equal to] [N.sub.singlesplit] = [C.sub.eff] [multiplied by] [2.sup.d].

5.2 Range Query

We still assume uniformity and independence in the distribution of data and query points. For a sufficiently high dimension d (such that the inequality above holds), we observe that the data space is split in only d' [is less than] d dimensions. The maximum split dimension d' can be determined by the following formula:

d' = \left\lceil \log_2 \frac{N}{C_{eff}} \right\rceil

Data pages have an average extension [a.sub.split] with

[a.sub.split] = 0.5 [multiplied by] (1 - 1/[C.sub.eff])

in the d' split dimensions and an average extension [a.sub.unsplit], with

[a.sub.unsplit] = 1 - 1/[C.sub.eff]

in the remaining d - d' dimensions. Figure 10 clarifies the position of two typical page regions in the data space for split (y-axis) and unsplit (x-axis) dimensions. The projection on an axis of a split dimension shows two page regions. Between these two regions, there is a gap of average width 0.5/[C.sub.eff], caused by the property of the page region to be a minimum bounding rectangle (called the MBR property, cf., Section 3.2). The distance of 0.25/[C.sub.eff] from the data space boundary is also due to the MBR property. In contrast, the projection on an axis of an unsplit dimension shows only one page region with a distance of 0.5/[C.sub.eff] from the lower and upper space boundaries, respectively.
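Assuming the maximum split dimension d' = ⌈log2(N/[C.sub.eff])⌉, the quantities of this section can be computed as follows (a sketch; names are ours):

```python
from math import ceil, log2

def split_geometry(d, N, c_eff):
    """Number of split dimensions d' = ceil(log2(N / c_eff)) and the
    average page extensions in split and unsplit dimensions, valid while
    N <= N_singlesplit = c_eff * 2^d."""
    assert N <= c_eff * 2 ** d
    d_split = ceil(log2(N / c_eff))
    a_split = 0.5 * (1 - 1 / c_eff)    # split dimensions
    a_unsplit = 1 - 1 / c_eff          # unsplit dimensions
    return d_split, a_split, a_unsplit
```

For N = 100,000, [C.sub.eff] = 30, and d = 16, this yields d' = 12 split dimensions, so four dimensions remain unsplit.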

[Figure 10 ILLUSTRATION OMITTED]

We now mark the Minkowski sum of the lower page region (cf., Figure 11(a)). We observe that large parts of the Minkowski sum are located outside the data space. The Minkowski sum is the volume from which a query point has to be taken such that the corresponding page is accessed. However, we assume that only points inside the data space boundary are used as query points. Hence the Minkowski sum has to be clipped at the data space boundary in order to determine the probability that a randomly selected query point accesses the page (cf., Figure 11(b)). We can express the clipping at the boundary with the following integral formula, which integrates over all points v in the data space (i.e., all possible positions of query points) with a distance less than or equal to the query range r from the page region R:

X(r) = \int_{v \in [0..1]^d \,\wedge\, \delta_M(v, R) \le r} \partial v

Here the distance function [[Delta].sub.M] (v, R) denotes the smallest distance between the point v and the page region R according to the metric M (Euclidean or maximum metric). Unfortunately, this integral is difficult to evaluate and has to be simplified. The first useful observation for this purpose is that the distance between the data space boundary and the page region (0.5/[C.sub.eff] for unsplit dimensions and 0.25/[C.sub.eff] for split dimensions) is small compared to a typical radius r (a radius is typical unless it is so small that there is almost no chance to retrieve any database points). Hence the corresponding gap is completely filled by the Minkowski enlargement (the projection of the clipped Minkowski sum to an unsplit dimension is 1). Unsplit dimensions can be ignored for determining the access probability (cf., Figure 11(c)).

[Figure 11 ILLUSTRATION OMITTED]

For the maximum metric, the clipped Minkowski sum can be determined in the following way (cf., Figure 12(a)): We take the side length of the page region in all halved dimensions. The gap between the region and the data space boundary need not be considered, because it is filled by the Minkowski sum. We add the query radius only once instead of twice (as we did in our initial model), because only the Minkowski enlargement towards the middle of the data space applies (the rest is clipped). Here, we also have to consider the MBR effect. The result is raised to the power of the number of split dimensions:

X_{r,mm,hd,ui}(r) = \left( \min \left( r + \frac{1}{2} - \frac{1}{4 \cdot C_{eff}}, \; 1 \right) \right)^{d'}

For a radius greater than 0.5 + 0.25/ [C.sub.eff], the Minkowski enlargement also reaches the data space boundary on the opposite side. This is taken into account by the application of the minimum function in the equation above. In this case, the page has an access probability of 100%.

[Figure 12 ILLUSTRATION OMITTED]
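A minimal sketch of this clipped access probability, reading the per-split-dimension extension as min(r + 1/2 - 1/(4[C.sub.eff]), 1) (our reading of the description above; the function name is ours):

```python
def access_prob_range_mm_hd(r, d_split, c_eff):
    """Access probability of a data page for a maximum-metric range
    query in the high-dimensional model: per split dimension the clipped
    Minkowski extension is min(r + 1/2 - 1/(4*c_eff), 1); unsplit
    dimensions contribute a factor of 1 and are ignored."""
    return min(r + 0.5 - 1 / (4 * c_eff), 1.0) ** d_split
```

As soon as r exceeds 0.5 + 0.25/[C.sub.eff], the minimum saturates and the access probability becomes 100%.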

If we apply the Euclidean metric, an additional complication arises as depicted in Figure 12(b). Since the radius r is typically much greater than 0.5 (cf., Section 5.1 and Figure 7), the sphere itself must be clipped. The volume of a clipped hypersphere cannot be determined analytically. However, we show how this volume can be simplified such that a precomputed volume function can be applied to improve the efficiency of the evaluation. The volume function can be precomputed numerically.

The volume of the clipped sphere depends on the dimension, the radius, and the size of the clipping window. This means that we would have to store the precomputed values in a 3-dimensional array. The dependency on the size of the clipping window can be eliminated by suitable scaling operations. We scale our clipping region (and also r) such that it is mapped to the unit hypercube. We then determine the corresponding volume by a lookup in the table of precomputed volumes, and finally apply inverse scaling to the volume found in the table. Since our clipping window is now the unit hypercube, we need only a 2-dimensional array for storing the precomputed values.

By [V.sub.csi] (d, r), we define the volume of the intersection of a sphere with radius r and the unit hypercube in d-dimensional space. Figure 13 depicts the intersection volume [V.sub.csi] (2, r) in 2-dimensional data space. We let the origin be the center of the sphere. Obviously, [V.sub.csi] (d, 0) = 0 and [V.sub.csi] (d, [square root of] d) = 1. Between these points, [V.sub.csi] is monotonically increasing.

[Figure 13 ILLUSTRATION OMITTED]

Definition 1. Cube-Sphere Intersection. [V.sub.csi] (d, r) denotes the intersection volume between the unit hypercube [[0..1].sup.d] and a d-dimensional sphere with radius r around the origin:

V_{csi}(d, r) = \int_{v \in [0..1]^d \,\wedge\, \|v\| \le r} \partial v

[V.sub.csi](d, r) can be materialized into an array for all relevant dimensions d (e.g., ranging from 1 to 100) and for a discretization of the relevant r between 0 and [square root of d] in a sufficiently high number of steps (e.g., 10,000). The discretization of [V.sub.csi](d, r) can be computed by Monte Carlo integration [Kalos and Whitlock 1986] using the integral formula of Definition 1. Sufficient accuracy is achievable with 100,000 points.
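The Monte Carlo evaluation of a single table entry can be sketched as follows (the function name is ours; a production table would precompute this for all d and a grid of r values):

```python
import random
from math import sqrt, pi

def v_csi_monte_carlo(d, r, n_samples=100000, seed=0):
    """Monte Carlo estimate of the cube-sphere intersection volume
    V_csi(d, r): fraction of uniform points in [0,1]^d whose Euclidean
    norm is at most r (sphere centered at the origin)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        if sqrt(sum(rng.random() ** 2 for _ in range(d))) <= r:
            hits += 1
    return hits / n_samples
```

Sanity checks: V_csi(d, 0) = 0, V_csi(d, sqrt(d)) = 1, and V_csi(2, 1) is a quarter circle, pi/4.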

Figure 14 depicts [V.sub.csi] (d, r) for various dimensions d.

[Figure 14 ILLUSTRATION OMITTED]

[V.sub.csi] (d, r) is used to determine the access probability of a page when a range query using the Euclidean metric is performed. As we pointed out in the previous discussion, the range query behaves like a range query in d'-dimensional space because all dimensions, where the page region has full extension, can be projected out without affecting the volume of the Minkowski enlargement.

The query sphere is not clipped by the unit hypercube, but rather by the hypercube representing the empty space between the page region and the opposite boundary of the data space. The side length of this cube [a.sub.empty] is (cf., Figures 11-12):

[a.sub.empty] = 1/2 + 1/(4 [multiplied by] [C.sub.eff])

To determine the clipping on such a hypercube, we have to scale the radius accordingly before applying [V.sub.csi](d, r):

[V.sub.csi](d, r, a) = [a.sup.d] [multiplied by] [V.sub.csi](d, r/a).

The resulting formula for access probability using Euclidean range queries is

X_{r,em,hd,ui}(r) = \sum_{k=0}^{d'} \binom{d'}{k} \cdot \left( \frac{1}{2} - \frac{1}{4 \cdot C_{eff}} \right)^{d'-k} \cdot \left( \frac{1}{2} + \frac{1}{4 \cdot C_{eff}} \right)^{k} \cdot V_{csi}\left( k, \; r \Big/ \left( \frac{1}{2} + \frac{1}{4 \cdot C_{eff}} \right) \right)

Here, we again sum over all hypercylinders with k round dimensions and d' - k rectangularly-shaped dimensions. The rectangular part has a volume of [(1/2 - 1/(4[C.sub.eff])).sup.d'-k]. The volume of the round part is determined by the volume function [V.sub.csi](d, r). The radius must be scaled before applying [V.sub.csi] to take the normalized clipping window into account. Finally, the volume obtained must be brought back to the original scale by multiplying it by [(1/2 + 1/(4[C.sub.eff])).sup.k].

To show the impact of the effects in high dimensional spaces on estimates of access probability, Figure 15 compares our new function [X.sub.r, em, hd, ui](r) to the low-dimensional estimate [X.sub.r, em, ld, ui](r) and to an intermediate form where the volume function [V.sub.csi] is replaced by the unclipped sphere volume [V.sub.s]. In the intermediate form only hypersphere segments lying completely outside the data space are clipped. The database in this experiment contains 280,000 points in a 16-dimensional data space. Whereas the cost model for low-dimensional query processing quickly yields unrealistic access probabilities that are larger than 100%, the intermediate model is accurate for ranges less than r = 0.6. The intermediate model avoids the precomputed discretization of the volume function [V.sub.csi].

[Figure 15 ILLUSTRATION OMITTED]

In Berchtold et al. [1997b], we have given the cost formula for the special case where the number of data pages N/[C.sub.eff] is a power of two. In this case, the number of split dimensions is equal for all data pages (although the dimensions on which the data pages are actually split may vary). If the number of data pages in the index is not a power of two, some of the data pages must have been split one time more than the rest of the data pages. Since N/[C.sub.eff] is the number of all data pages in the index, the number of data pages [n.sub.d'] that have been split one time more is equal to

n_{d'} = 2 \cdot \Bigl(\frac{N}{C_{eff}} - 2^{\lfloor \log_2 (N / C_{eff}) \rfloor}\Bigr)

and the number of data pages split one fewer times, [n.sub.d'-1], is equal to

[n.sub.d'-1] = N/[C.sub.eff] - [n.sub.d'].
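Assuming the number of data pages N/[C.sub.eff] is an integer, the bookkeeping above can be sketched as follows; the expression for [n.sub.d'] is derived from the splitting argument (each additional split turns one page into two), not copied from the paper:

```python
import math

def split_counts(n_pages):
    """Given n_pages = N / C_eff data pages (an integer here), return the
    maximum split dimension d' and the numbers of pages split d' and
    d' - 1 times.  Splits start from a single page; if n_pages is not a
    power of two, each extra split turns one page into two, so
    n_d' = 2 * (n_pages - 2^(d' - 1))."""
    d_split = math.ceil(math.log2(n_pages))
    n_more = 2 * (n_pages - 2 ** (d_split - 1))  # pages split d' times
    n_less = n_pages - n_more                    # pages split d' - 1 times
    return d_split, n_more, n_less

print(split_counts(12), split_counts(16))
```

For a power of two, n_less is 0 and all pages have been split d' times, matching the special case of the earlier formula.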

Then the expected number of page accesses is equal to

A = n_{d'} \cdot X_{d'} + n_{d'-1} \cdot X_{d'-1},

where X_{i} denotes the access probability of a page that has been split i times.

This equation holds for the maximum metric as well as the Euclidean metric, and for range queries as well as nearest-neighbor queries.

The accuracy of the low and the high dimensional cost models for range query processing was compared by using a database of 100,000 points taken from a uniform independent data distribution in the 16-dimensional data space. The query range was varied from 0.1 to 0.5 using the maximum metric, yielding selectivities between 6.5 [multiplied by] [10.sup.-12] and 11.8%. The results are depicted in Figure 16. As expected, the high dimensional model yields a reasonable accuracy, whereas the low dimensional model completely fails in this case.

[Figure 16 ILLUSTRATION OMITTED]

5.3 Nearest-Neighbor Query

Query spheres typically exceed data space boundaries in high dimensional query processing. For range queries the consequence is that the result set is actually smaller than we would expect when neglecting the boundary effect, since only the part of the sphere inside the data space contributes to the result set. In contrast, nearest-neighbor queries have a fixed size result set (one point for a one nearest-neighbor query). The consequence here is that a greater radius is needed to achieve the same size result set in the presence of boundary effects.

We first develop an expectation for [V.sub.csi, a](d, r), the intersection volume of the unit hypercube and a sphere with radius r whose center is arbitrarily chosen in the unit hypercube. We note that this task is similar to determining the intersection volume [V.sub.csi](d, r) in the previous section; however, the center of the sphere is now arbitrarily chosen rather than fixed at the origin. [V.sub.csi, a](d, r) corresponds to the probability that two points arbitrarily chosen from the unit hypercube have a distance less than or equal to r from each other.

When the maximum metric is used for the query, the expectation for the intersection volume, which is an intersection of two hypercubes, can be determined analytically. Figure 17 depicts three different query positions in the data space. First, we consider only the projection on the x-axis. The center point of [q.sub.1] lies exactly on the lower space boundary. Thus only half of the side length (r) is inside the data space. The center point of [q.sub.2] has a distance greater than r from the data space boundary. Therefore, the complete side length of the cube (2r) is inside the data space. Query [q.sub.3] intersects the right space boundary, but more than half of the side length is inside the data space. The right diagram in Figure 17 depicts the progression of the part of the side length that is inside the data space with a varying position of the query point. The side length is r at the points 0 and 1 and 2r between the positions r and 1 - r. Between 0 and r, the intersection increases linearly. The average of the intersection over all positions is

[V.sub.cci, a](1, r) = 2r - [r.sup.2].

We can extend this result to the d-dimensional case by simply taking the power of d:

[V.sub.cci, a] (d, r) = [V.sub.cci, a][(1, r).sup.d] = [(2r - [r.sup.2]).sup.d].

This result can be used to determine the expected distance of the nearest neighbor. A completely analytical solution is possible if we apply our coarse estimate and equate [V.sub.cci, a](d, r) with 1/N:

(2r - r^2)^d = \frac{1}{N} \quad \Rightarrow \quad r = 1 - \sqrt{1 - \Bigl(\frac{1}{N}\Bigr)^{1/d}}

The impact of boundary effects on the nearest neighbor distance is shown in Figure 18. As expected, boundary effects do not play any role in low dimensions up to 10. With increasing dimension, the effect becomes slightly more important. Neglecting boundary effects, we underestimate the nearest-neighbor distance by 10% in 30-dimensional space. This error, however, is almost negligible. Later in this section we will see that it is not the estimate of the nearest-neighbor distance that causes problems, but the estimate of the Minkowski sum.
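The closed form for [V.sub.cci, a] and the two coarse estimates can be checked with a short sketch; the boundary-corrected radius is obtained by solving (2r - r^2)^d = 1/N, i.e., r = 1 - sqrt(1 - (1/N)^(1/d)). This is an illustration under the stated assumptions, not the paper's code:

```python
import math
import random

def v_cci_a(d, r):
    """Closed form: expected intersection of the unit cube and a max-metric
    query cube of side 2r with arbitrarily chosen center."""
    return (2.0 * r - r * r) ** d

def mc_pair_probability(d, r, samples=100_000, seed=0):
    """Fraction of random point pairs in [0,1]^d whose maximum-metric
    distance is at most r; should agree with v_cci_a(d, r)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(samples)
        if all(abs(rng.random() - rng.random()) <= r for _ in range(d))
    )
    return hits / samples

def nn_dist_no_boundary(n, d):
    """Coarse nearest-neighbor distance neglecting boundaries: (2r)^d = 1/N."""
    return 0.5 * (1.0 / n) ** (1.0 / d)

def nn_dist_boundary(n, d):
    """Coarse estimate with boundary effects, from (2r - r^2)^d = 1/N."""
    return 1.0 - math.sqrt(1.0 - (1.0 / n) ** (1.0 / d))

print(v_cci_a(3, 0.3), mc_pair_probability(3, 0.3))
for dim in (2, 10, 30):
    print(dim, nn_dist_no_boundary(280_000, dim), nn_dist_boundary(280_000, dim))
```

For low dimensions the two estimates nearly coincide, while the gap opens with increasing dimension, mirroring the trend in Figure 18.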

[Figures 17-18 ILLUSTRATION OMITTED]

The new volume determination, [V.sub.cci, a](d, r), can also be applied in our exact model for estimating the nearest-neighbor distance. The corresponding probability distribution in this case is

[P.sub.mm, hd](r) = 1 - [(1 - [V.sub.cci, a](d, r)).sup.N] = 1 - [(1 - [(2 [multiplied by] r - [r.sup.2]).sup.d]).sup.N].

The probability density [p.sub.mm, hd](r), the expected nearest-neighbor distance [R.sub.mm, hd], and the expected number of page accesses [A.sub.nn, mm, hd] can be derived from the probability distribution, as described in Section 4.

When the Euclidean metric is applied, the same problem arises as in Section 5.2. It is difficult to determine the intersection volume between the unit hypercube and a sphere with an arbitrary center in an analytical way. To cope with this problem, a similar precomputation of the volume may be used. Again, we define the cube-sphere intersection with arbitrary center, [V.sub.csi, a](d, r), by a multidimensional integral, which can be evaluated using Monte Carlo integration [Kalos and Whitlock 1986]. The result can be stored in an array for use by the model.

Definition 2. Cube-Sphere Intersection with Arbitrary Center. [V.sub.csi, a](d, r) denotes the intersection volume between the unit hypercube [[0..1].sup.d] and a d-dimensional sphere with radius r around a point arbitrarily chosen from the unit hypercube:

V_{csi,a}(d, r) = \int_{[0,1]^d} \int_{[0,1]^d} \mathbf{1}\{\, \|p - q\| \le r \,\} \; dq \; dp

Figure 19 shows [V.sub.csi, a](d, r) for the dimensions 2, 4, 8, 16, and 32. The intersection volume was determined for all dimensions between 1 and 100 for each radius between 0 and [square root of d] in 10,000 intervals using 100,000 Monte Carlo integration steps [Kalos and Whitlock 1986]. The 10,000 intervals can be used for an efficient numerical evaluation of the expected number of page accesses. Taking boundary effects into consideration, the probability distribution of the nearest-neighbor distance r for the Euclidean metric is

[P.sub.em, hd] (r) = 1 - [(1 - [V.sub.csi, a](d, r)).sup.N].

The corresponding density function is

[p.sub.em, hd](r) = [differential][P.sub.em, hd](r)/[differential]r = N [multiplied by] [(1 - [V.sub.csi, a](d, r)).sup.N-1] [multiplied by] [differential][V.sub.csi, a](d, r)/[differential]r.

Assuming that [V.sub.csi, a][d, i] is an array with the range [1..[d.sub.max], 0..[i.sub.max]] that contains the precomputed values of [V.sub.csi, a](d, r) for r ranging from 0 to [square root of d] with

V_{csi,a}[d, i] = V_{csi,a}\Bigl(d, \frac{i \cdot \sqrt{d}}{i_{max}}\Bigr), \qquad 0 \le i \le i_{max},

we are able to replace integral formulas such as the expected value of the nearest-neighbor distance by a finite summation:

R_{em,hd} \approx \sum_{i=0}^{i_{max}-1} \frac{i \cdot \sqrt{d}}{i_{max}} \cdot N \cdot \bigl(1 - V_{csi,a}[d, i]\bigr)^{N-1} \cdot \bigl(V_{csi,a}[d, i+1] - V_{csi,a}[d, i]\bigr)

The infinite upper bound of the integral can be replaced by [square root of d] because the derivative of [V.sub.csi, a](d, r) is constantly 0 for r larger than [square root of d], while all other terms have finite values. The derivative is replaced by the local difference. The expected value of the number of page accesses can be determined in the same way:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
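As an illustration of this precompute-then-sum scheme, the following sketch builds a Monte Carlo discretization of [V.sub.csi, a](d, r) and evaluates the expected nearest-neighbor distance by a finite summation. The resolution and sample counts are deliberately much smaller than the 10,000 intervals and 100,000 integration steps used in the paper:

```python
import bisect
import math
import random

def nn_distance_expectation(n_points, d, i_max=400, samples=40_000, seed=0):
    """Expected nearest-neighbor distance by finite summation:
    sum_i r_i * N * (1 - V[i])^(N-1) * (V[i] - V[i-1]),
    where V[i] is a Monte Carlo discretization of V_csi,a(d, r)
    on r in [0, sqrt(d)]."""
    rng = random.Random(seed)
    # Sorted Euclidean distances of random point pairs in the unit cube.
    dists = sorted(
        math.sqrt(sum((rng.random() - rng.random()) ** 2 for _ in range(d)))
        for _ in range(samples)
    )
    r_max = math.sqrt(d)
    expectation, v_prev = 0.0, 0.0
    for i in range(1, i_max + 1):
        r = i * r_max / i_max
        v = bisect.bisect_right(dists, r) / samples  # V_csi,a(d, r)
        expectation += r * n_points * (1.0 - v) ** (n_points - 1) * (v - v_prev)
        v_prev = v
    return expectation

print(nn_distance_expectation(100, 2))
```

The local difference V[i] - V[i-1] replaces the derivative, and the summation stops at sqrt(d) because the intersection volume is constant beyond that radius.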

[Figure 19 ILLUSTRATION OMITTED]

The evaluation of these formulas is not costly because the required volume functions [V.sub.csi, a](d, r) and [V.sub.csi](d, r) are independent of any database-specific setting such as the number of points in the database, the point density, or the effective page capacity [C.sub.eff]. The predetermined discretization of these functions requires a few megabytes of storage and can be statically linked with the programs for evaluating cost models. Costly Monte Carlo integration processes are run only at compile time, not at runtime. Further improvement is achievable if we consider that the probability density only contributes in the interval between [r.sub.lb] and [r.sub.ub] (cf., Section 4.3). Integration and summation can be bounded to this area:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

To evaluate the cost formula for query processing using nearest-neighbor queries, we constructed indexes with varying data space dimensions. All databases contained 100,000 points taken from a uniform and independent distribution. The effective capacity of the data pages was 48.8 in all experiments (the block-size was chosen correspondingly). The dimension varied from 4 to 20. We performed nearest-neighbor queries using maximum and Euclidean metrics on all these indexes and compared the observed page accesses with the predictions for the low dimensional and high dimensional models developed in this paper. The results are depicted in Figure 20. The diagram on the left side shows the results for the maximum metric, the right diagram shows the results for the Euclidean metric. Whereas the cost model for high dimensional query processing provides accurate estimates over all dimensions, the low dimensional model is only accurate in the low dimensional area up to d = 6. Beyond this area, the low dimensional model completely fails to predict the number of page accesses. Not even the order of magnitude is correctly revealed by the low dimensional model. We note that the low dimensional model is mainly related to the original Friedman et al. [1977] model and the extension of the Cleary [1979] model.

[Figure 20 ILLUSTRATION OMITTED]

6. DATA SETS FROM REAL-WORLD APPLICATIONS

It has been pointed out that data sets from real applications consistently violate the assumptions of uniformity and independence [Faloutsos and Kamel 1994; Belussi and Faloutsos 1995]. In this section, we describe the effects and adapt our models to take nonuniformity and correlation into account.

6.1 Independent Nonuniformity

It was already proven in the well-known Friedman et al. [1977] cost model that nonuniformity has no influence on the cost of nearest-neighbor query processing if no correlation occurs and if the data distribution is smooth. In this context, smoothness means that the point density does not vary severely inside the Minkowski enlargement of a page region. The intuitive reason is as follows: Query points are assumed to be taken from the same distribution as data points. For the access probability of a page, we have to determine the fraction of query points inside the Minkowski enlargement of the page. If the point density is above the average in some region (say by a factor of c) due to nonuniformity, then both the average volume of the page regions and the average volume of the query regions are scaled by the factor 1/c. This means that the Minkowski sum is scaled by 1/c. However, the number of points inside a given volume is higher by a factor of c than in the uniform case. So the number of points in the Minkowski enlargement is the same as in the uniform case. For smooth distributions for which the independence assumption holds, the performance of nearest-neighbor query evaluation is therefore the same as for uniform distributions. In contrast, range queries are difficult to model in the case of nonuniformity because, in the same way as the point density changes with varying location, the size of the result set and the number of page accesses will change. (Range queries will be the subject of future work, as will cost modeling where the data distribution is not smooth.)

6.2 Correlation

For real data the assumption of independent nonuniform data distribution is as unrealistic as the assumption of independent uniform distribution. One of the most important properties of real data is correlation.

If two dimensions are correlated, the value of one attribute functionally depends on the value of the other attribute. Either it can be determined directly from the other attribute, or there is only a small number of possible values (or a small interval) that the dependent attribute can take on. In contrast to the stochastic definition of correlation, which takes only linear dependencies into account, we also consider nonlinear correlations. The geometrical meaning of a correlation is the following: The d-dimensional space is not completely covered with data points. Instead, all points are clustered on a lower dimensional area embedded in the data space. An example is shown in Figure 21, where all data points are located on a 1-dimensional line embedded in the 2-dimensional data space. As depicted, the line is not necessarily a straight line. It is also possible that there are several lines that are not connected, or that the data points are located in a cloud around the line.

[Figure 21 ILLUSTRATION OMITTED]

Singular value decomposition (SVD) [Duda and Hart 1973; Fukunaga 1990; Golub and van Loan 1989] or principal component analysis (PCA) [Faloutsos and Lin 1995; Press et al. 1988; Strang 1980] is often used to handle correlation. These techniques transform the data points directly into a lower dimensional data space by rotation operations and eliminate the correlation in this way. The point set is indexed in the lower dimensional space, and query processing can be modeled using our techniques from Sections 3-6.

However, SVD and PCA can only detect and eliminate linear correlations. In our example, a linear correlation means a single straight line. If there are, for example, two warped lines on which data points are located, SVD and PCA will fail completely. We will show later that query processing improves, even without applying explicit transformations to lower dimensional data spaces, if the intrinsic dimension of the point set is lower than the dimension of the data space.

The general problem of correlations can also be observed in Figure 21. If we consider circles of varying size around some data point, we observe that the number of enclosed points is not proportional to the area of the circle as we would expect. Because the intrinsic dimension of the data set is 1, the number of points enclosed in a circle with radius r is proportional to the radius r. The same observation is valid if we consider cubes or some other d-dimensional objects that are scaled uniformly.

This provides us with a means to define the intrinsic dimension of the data set. Under uniformity and independence assumptions the number of points enclosed in a hypercube with side length s is proportional to the volume of the hypercube

[n.sub.encl] = [Rho] [multiplied by] [s.sup.d] = [Rho] [multiplied by] V,

where [Rho] = N/[V.sub.DS] is called the point density. Real data sets obey a similar power law using the fractal dimension [D.sub.F] of the data set

n_{encl} = \rho_F \cdot s^{D_F} = \rho_F \cdot V^{D_F / d},

where [[Rho].sub.F] is the fractal analogue of the point density [Rho]. The power law was used by Faloutsos and Kamel [1994] for the "box counting" fractal dimension. We use this formula directly for the definition of the fractal dimension:

Definition 3. Fractal Dimension. The fractal dimension [D.sub.F] of a point set is the exponent for which the following power law holds, provided that the center of the volume is located in the populated part of the data space:

n_{encl}(V) = \rho_F \cdot V^{D_F / d}

The fractal dimension is often not constant over all scales. It is possible that the fractal dimension changes, depending on the size of the volume V. In practice, the fractal dimension is often constant over the wide range of the relevant scales. It is also possible that the fractal dimension is not location-invariant, i.e., a subset of the data set exhibits a different fractal dimension than the rest of the data set. Intuitively, one reason for this behavior can be that the database contains different kinds of objects (e.g., oil paintings and photos in an image database). The restriction to the populated part of the data space in Definition 3 forces us to assume that the distribution of the query points follows the distribution of the data points. In many applications, this assumption seems intuitively reasonable, for instance if, in an image database, images of the same type are the query objects. If this assumption does not hold, query processing is affected in a way that our current model is not able to predict. (This is a subject for future work, however.)
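The power law also suggests a simple way to estimate the fractal dimension of a data set: fit the slope of log(average neighbor count within r) against log(r). The following sketch is an illustration of this idea, not the "box counting" procedure of Faloutsos and Kamel [1994]; it recovers an intrinsic dimension of about 1 for points on a curved line embedded in 2D, as in Figure 21:

```python
import bisect
import math
import random

def correlation_dimension(points, radii):
    """Fit the slope of log(average neighbor count within r) vs. log(r),
    following the power law n_encl ~ rho_F * r^D_F."""
    n = len(points)
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(n) for j in range(i + 1, n)
    )
    xs, ys = [], []
    for r in radii:
        pairs = bisect.bisect_right(dists, r)
        avg_neighbors = 2.0 * pairs / n
        if avg_neighbors > 0:
            xs.append(math.log(r))
            ys.append(math.log(avg_neighbors))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

# Points on a curved 1-dimensional line embedded in 2D (cf. Figure 21).
rng = random.Random(0)
pts = [(t, 0.5 + 0.3 * math.sin(2 * math.pi * t))
       for t in (rng.random() for _ in range(400))]
print(correlation_dimension(pts, [0.02, 0.04, 0.08, 0.16]))
```

For uniformly distributed 2D points the same estimator yields a slope close to 2, slightly reduced by boundary effects.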

6.3 Model Dependence on the Fractal Dimension

The first consequence of the existence of the fractal dimension [D.sub.F] is that the choice of the model to use (high dimensional or low dimensional) is dependent on [D.sub.F] rather than on the embedding dimension d. If [D.sub.F] is small, then most data points and most queries are far from the data space boundary. Therefore, we need not consider clipping in our model. For this case, we must adapt the cost model for low dimensional data spaces (cf., Sections 3-4). In contrast, if [D.sub.F] is large, the effects of high dimensional data spaces occur. Thus, the model for high dimensional data spaces must be adapted for this case (cf., Section 5). For moderate [D.sub.F], both basic models can be applied. As a practical guide, we assume boundary effects if the fractal dimension [D.sub.F] is greater than or equal to the maximum split dimension d':

D_F \ge d' = \lceil \log_2 (N / C_{eff}) \rceil.

6.4 Range Query

We first want to determine how the access probability of a page changes in the presence of a correlation described by the fractal dimension [D.sub.F]. Let us assume that the fractal point density [[Rho].sub.F] is constant throughout the populated part of the page region and its Minkowski enlargement. This assumption is necessary for range queries, as we pointed out at the beginning of Section 6. For nearest-neighbor queries, a relaxed restriction to smoothly distributed points in the subspace is sufficient. In the case of low [D.sub.F], we can estimate the side length a of a page region according to the power law:

\rho_F \cdot a^{D_F} = C_{eff} \quad \Rightarrow \quad a = \Bigl(\frac{C_{eff}}{\rho_F}\Bigr)^{1 / D_F}

In the high dimensional case, d' splits are applied explicitly to achieve data pages with a suitable number of points ([C.sub.eff]). However, we must take into account that a split in some dimension automatically leads to a reduced extension in some correlated dimension. We assume the extension

[a.sub.split] = 0.5 [multiplied by] (1 - 1/[C.sub.eff])

(cf., Section 5.2) in a number d" of dimensions with

d" = d' * d/[D.sub.F]

and full extension (up to MBR effects, cf., Section 5.2)

[a.sub.unsplit] = 1 - 1/[C.sub.eff]

in the remaining d - d" dimensions.
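The quantities above can be collected in a small sketch (parameter values are illustrative; the high dimensional case D_F ≥ d' is assumed so that d" ≤ d):

```python
import math

def fractal_page_extents(n, c_eff, d, d_fractal):
    """Page-region extents under correlation with fractal dimension D_F
    (high dimensional case, D_F >= d', so that d'' <= d)."""
    d_split = math.ceil(math.log2(n / c_eff))  # d': number of explicit splits
    d_corr = d_split * d / d_fractal           # d'': dims with reduced extension
    a_split = 0.5 * (1.0 - 1.0 / c_eff)        # extension in the d'' dimensions
    a_unsplit = 1.0 - 1.0 / c_eff              # extension in the remaining d - d'' dims
    return d_split, d_corr, a_split, a_unsplit

print(fractal_page_extents(100_000, 48.8, 16, 14.0))
```

Note that d" is treated as a real number here, as in the text; a full model would distribute the fractional split among the dimensions.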

We must now make an assumption for the distribution of query points. First, we assume that they are taken from a uniform and independent data distribution. Later, we will assume that data points and query points are selected from the same distribution. For uniformly distributed query points, the Minkowski sum of the page region and a query range r corresponds to the access probability of the page. Following our discussion in Sections 3.3 and 5.2, we get the following access probabilities for the Euclidean and maximum metrics and for the high and low dimensional cases, respectively:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The expected value of the number of page accesses can be easily determined by multiplication with the number of data pages N/[C.sub.eff].

For real applications, uniform distribution of the query points is not a realistic assumption. A better alternative is to assume that data points and query points are taken from the same distribution and yield the same fractal dimension [D.sub.F]. Instead of taking the volume of the Minkowski enlargement for access probability, we should rather determine the percentage of the query points lying in the Minkowski enlargement. The power law can be used for this purpose, yielding

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The other equations can be modified in the same way.

6.5 Nearest-Neighbor Query

Following our coarse model for estimating the nearest-neighbor distance (cf., Section 4.1), we can easily determine a volume with an expectation of 1 enclosed point. As in the preceding section, we assume that the distribution of the query points follows the distribution of the data points. The volume can then be estimated by using the power law:

R_{mm} \approx \frac{1}{2} \cdot \Bigl(\frac{1}{\rho_F}\Bigr)^{1/D_F}

for the maximum metric and

R_{em} \approx \frac{\Gamma(d/2 + 1)^{1/d}}{\sqrt{\pi}} \cdot \Bigl(\frac{1}{\rho_F}\Bigr)^{1/D_F}

for the Euclidean metric. If [D.sub.F] is sufficiently large (according to Section 6.4), boundary effects must be considered. For the maximum metric, we get the following formula:

R_{mm,hd} \approx 1 - \sqrt{1 - \Bigl(\frac{1}{\rho_F}\Bigr)^{1/D_F}}

For the Euclidean metric, we need the inverse function [V.sup.-1.sub.csi, a](d, V) of the cube-sphere intersection with arbitrary center (cf., Section 5.3). The discretization of the inverse function can be derived from the discretization of [V.sub.csi, a](d, r) in a single pass. The estimate of the nearest-neighbor distance is

R_{em,hd} \approx V_{csi,a}^{-1}\Bigl(d, \Bigl(\frac{1}{\rho_F}\Bigr)^{d/D_F}\Bigr)

For our exact model, we have to adapt our distribution function in a suitable way. Again, we have to apply the power law:

P(r) = 1 - \Bigl(1 - \frac{\rho_F}{N} \cdot V(r)^{D_F / d}\Bigr)^{N},

where V(r) is the volume of the d-dimensional hypersphere with radius r in the case of the Euclidean metric and the volume of the d-dimensional hypercube with side length 2r in the case of the maximum metric. We have to make a suitable distinction between the low and high dimensional cases when choosing V(r). The rest is straightforward and can be handled as in Sections 4-5: An expectation for the nearest-neighbor distance can again be gained by integrating r multiplied by the derivative of P(r). The new distribution function must be multiplied by the Minkowski sum as in Sections 4-5. For the maximum metric, we get the following formulas for the low and high dimensional cases, respectively:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

For the Euclidean metric, the corresponding result is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Techniques facilitating the evaluation of these formulas are presented in Sections 4-5.
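One such evaluation can be sketched numerically. The sketch below computes the expected nearest-neighbor distance from the corrected distribution P(r) = 1 - (1 - ([Rho].sub.F]/N) · V(r)^(D_F/d))^N by integrating the survival function 1 - P(r), which is equivalent to integrating r times the derivative of P(r); with D_F = d and [Rho].sub.F] = N, the model reduces to the uniform case, which provides a sanity check. The function names and parameter values are illustrative:

```python
def nn_distance_from_power_law(n, d, d_fractal, rho_f, v_of_r, r_max, steps=20_000):
    """E[R] for the corrected distribution
    P(r) = 1 - (1 - rho_F/N * V(r)^(D_F/d))^N,
    computed as the integral of the survival function 1 - P(r)."""
    h = r_max / steps
    total = 0.0
    for i in range(steps + 1):
        r = i * h
        frac = min(1.0, rho_f / n * v_of_r(r) ** (d_fractal / d))
        survival = (1.0 - frac) ** n              # 1 - P(r)
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoidal rule
        total += weight * survival * h
    return total

# Uniform, uncorrelated sanity check: D_F = d and rho_F = N reduce the model
# to the standard one; maximum metric, V(r) = (2r)^d.
n, d = 10_000, 2
est = nn_distance_from_power_law(n, d, d_fractal=d, rho_f=n,
                                 v_of_r=lambda r: (2 * r) ** d, r_max=0.5)
print(est)
```

In the uniform 2D setting the expectation evaluates to roughly sqrt(pi)/(4 · sqrt(N)), which the numeric integration reproduces closely.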

To evaluate our model extension, indexes on several data sets from real-world applications were constructed. Our first application is a similarity search system for CAD drawings provided by a subcontractor of the automobile industry [Berchtold and Kriegel 1997]. The drawings were transformed into 16-dimensional Fourier vectors. Our second application is a content-based image retrieval system using color histograms with 16 histogram bins [Seidl 1997]. Both databases contained 50,000 objects. Our third application contained 9-dimensional vectors from weather observations. The fractal dimensions of our data sets are 7.0 (CAD), 8.5 (color histograms), and 7.3 (clouds). We performed nearest-neighbor queries using Euclidean and maximum metrics and compared the results to the predictions of the following three cost models:

* the original models by Friedman et al. [1977] and Cleary [1979], cf., Section 4;

* our extension to high-dimensional query processing (cf., Section 5);

* our extension to nonuniformity and correlation.

The results are depicted in Figure 22. In contrast to the low and the high-dimensional models, the new correlation model yields sufficient accuracy in all experiments performed.

[Figure 22 ILLUSTRATION OMITTED]

7. CONCLUSIONS

7.1 Summary

In this paper we have proposed a comprehensive cost model for query processing in high dimensional data spaces. It is based on the concepts of our previous cost model [Berchtold et al. 1997b], which we have extended in many aspects. The BBKK model introduced two techniques, the Minkowski sum and data space clipping, to estimate the number of page accesses when performing range queries and nearest-neighbor queries in a high dimensional Euclidean space. These techniques are explained in detail and extended to take the maximum metric into account. Our new model drops the restriction of the BBKK model to numbers of data pages that are a power of two. Support for k-nearest-neighbor queries is a further extension of this paper. We have described in detail and further developed numerical methods for evaluating the cost model. Computationally expensive steps in estimating costs are moved to compile time by applying precomputed volume functions. Finally, correlated data were taken into account using the fractal dimension. Experimental evaluations showed the practical applicability of the cost estimates and their superiority over competing techniques in terms of accuracy.

7.2 Future Work

In the course of this paper, we have stated several limitations of the current approach, which are subjects for future work. Open issues are summarized here.

We noted in the introduction that query processing in high dimensional data spaces is often the result of a feature transformation that maps objects of an arbitrary application to points in a high dimensional vector space. In order to guarantee correctness, selectivity in feature space is worse than selectivity in object space, so more feature vectors than objects must be retrieved. How many more must be retrieved depends on the feature transformation, which subsumes tasks such as dimension reduction and scaling (normalizing) of the data space. An estimate of these effects is beyond the scope of this paper. However, the selectivity of the feature transformation is important for our estimates of index costs. Hence, modeling feature transformations, including effects such as dimension reduction and scaling, is a subject for future work.

In addition, our model is limited by the assumption of a smooth data distribution, and it does not yet handle general nonuniformity for range queries. As described in Section 6.1, smoothness means that there are no very sharp changes in point density between regions that are covered by points and regions that are not. Violations of the smoothness assumption that are not due to correlation are not considered by our model. In the one-dimensional case, such distributions can be handled by models like the Zipf distribution, by histogram techniques, and by kernel estimators. We will investigate equivalent multidimensional techniques in the future. Similar approaches are needed for modeling range queries in the presence of nonuniformity; these will also be an issue for further research.

Finally, our model does not take into account the impact of various heuristics in the algorithms for index construction and query processing. The insert and split strategies are assumed to produce page regions that are cube-like in the dimensions affected by the splits. In reality, it is not always possible to obtain balanced (i.e., cube-like) side lengths. How much the violation of this assumption affects query processing is an important question; the next obvious question is how to determine the asymmetry, given the construction heuristics. For the algorithms performing range and nearest-neighbor queries, we have assumed optimality, which is met by the depth-first search for range queries and by the HS algorithm for nearest-neighbor queries. The RKV algorithm is also well-known for nearest-neighbor search (cf., Section 1.5). As we have pointed out, the RKV algorithm is difficult to predict (and since it is known to yield worse performance than HS, there is no reason to apply it). More interesting is the development of fast index scans [Berchtold et al. 2000a] that, when planning disk accesses, consider not only the position of a page region in the data space but also the position of the page on disk. Modeling such algorithms is also a subject for future work.

[Figure 9 ILLUSTRATION OMITTED]

ACKNOWLEDGMENTS

I'd like to express particular thanks to the reviewers who read the paper extraordinarily carefully and gave valuable hints for substantial improvements.

REFERENCES

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, Seattle, WA, June). ACM Press, New York, NY, 94-105.

AGRAWAL, R., LIN, K., SHAWNEY, H., AND SHIM, K. 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Databases (VLDB'95, Sept.). 490-501.

AGRAWAL, R., FALOUTSOS, C., AND SWAMI, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the Fourth International Conference on The Foundations of Data Organization and Algorithms (FODO). 69-84.

ALTSCHUL, S., GISH, W., MILLER, W., MYERS, E. W., AND LIPMAN, D. J. 1990. A basic local alignment search tool. J. Molecular Biology 215, 3, 403-410.

AREF, W. G. AND SAMET, H. 1991. Optimization strategies for spatial query processing. In Proceedings of the 17th Conference on Very Large Data Bases (Barcelona, Spain, Sept.). VLDB Endowment, Berkeley, CA, 81-90.

ARYA, S., MOUNT, D. M., AND NARAYAN, O. 1995. Accounting for boundary effects in nearest neighbor searching. In Proceedings of the 11th Annual Symposium on Computational Geometry (Vancouver, B.C., Canada, June 5-12), J. Snoeyink, Chair. ACM Press, New York, NY, 336-344.

ARYA, S. 1995. Nearest neighbor searching and applications. Ph.D. Dissertation. University of Maryland at College Park, College Park, MD.

BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD '90, Atlantic City, NJ, May 23-25), H. Garcia-Molina, Chair. ACM Press, New York, NY, 322-331.

BELUSSI, A. AND FALOUTSOS, C. 1995. Estimating the selectivity of spatial queries using the correlation fractal dimension. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95, Zurich, Switzerland, Sept.). 299-310.

BERCHTOLD, S., BOHM, C., JAGADISH, H. V., KRIEGEL, H. -P., AND SANDER, J. 2000a. Independent quantization: An index compression technique for high-dimensional data spaces. In Proceedings of the 16th International Conference on Data Engineering (ICDE, San Diego, CA, Feb/Mar).

BERCHTOLD, S., BOHM, C., KEIM, D., KRIEGEL, H. -P., AND XU, X. 2000b. Optimal multidimensional query processing using tree striping. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery.

BERCHTOLD, S., BOHM, C., AND KRIEGEL, H. -P. 1998. The pyramid-technique: Towards indexing beyond the curse of dimensionality. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, Seattle, WA, June). ACM Press, New York, NY, 142-153.

BERCHTOLD, S., BOHM, C., BRAUNMULLER, B., KEIM, D. A., AND KRIEGEL, H.-P. 1997a. Fast parallel similarity search in multimedia databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD, Tucson, AZ, May). ACM Press, New York, NY, 1-12.

BERCHTOLD, S., BOHM, C., KEIM, D. A., AND KRIEGEL, H.-P. 1997b. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '97, Tucson, AZ, May 12-14), A. Mendelzon and Z. M. Ozsoyoglu, Chairs. ACM Press, New York, NY, 78-86.

BERCHTOLD, S., KEIM, D., AND KRIEGEL, H.-P. 1997c. Using extended feature objects for partial similarity retrieval. VLDB J. 6, 4, 333-348.

BERCHTOLD, S. AND KRIEGEL, H.-P. 1997. S3: Similarity search in CAD database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD, Tucson, AZ, May). ACM Press, New York, NY, 564-567.

BERCHTOLD, S. 1997. Geometry based search of similar parts. Ph.D. Dissertation. Institute for Design Automation, Technical University of Munich, Munich.

BERCHTOLD, S., KEIM, D. A., AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Bombay, India, Sept.). 28-39.

BOHM, C. 1998. Efficiently indexing high-dimensional data spaces. Ph.D. Dissertation. Utz Verlag, Munich, Germany.

BOHM, C. AND KRIEGEL, H.-P. 2000. Dynamically optimizing high-dimensional index structures. In Proceedings of the 7th International Conference on Extending Database Technology (Konstanz, Germany).

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98, Seattle, WA, June 1-3), A. Mendelzon and J. Paredaens, Chairs. ACM Press, New York, NY, 59-68.

CLEARY, J. G. 1979. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Trans. Math. Softw. 5, 2, 183-192.

DUDA, R. O. AND HART, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley and Sons, Inc., New York, NY.

EASTMAN, C. 1981. Optimal bucket size for nearest neighbor searching in k-d trees. Inf. Process. Lett. 4.

FALOUTSOS, C. AND GAEDE, V. 1996. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Bombay, India, Sept.). 40-50.

FALOUTSOS, C. AND LIN, K.-I. 1994. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Univ. of Maryland Institute for Advanced Computer Studies Report No. UMIACS-TR-94-132. University of Maryland at College Park, College Park, MD.

FALOUTSOS, C., BARBER, R., FLICKNER, M., HAFNER, J., NIBLACK, W., PETKOVIC, D., AND EQUITZ, W. 1994a. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4 (July), 231-262.

FALOUTSOS, C., RANGANATHAN, M., AND MANOLOPOULOS, Y. 1994b. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD '94, Minneapolis, MN, May 24-27), R. T. Snodgrass and M. Winslett, Eds. ACM Press, New York, NY, 419-429.

FALOUTSOS, C. AND KAMEL, I. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proceedings of the 13th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '94, Minneapolis, MN, May 24-26), V. Vianu, Chair. ACM Press, New York, NY, 4-13.

FALOUTSOS, C., SELLIS, T., AND ROUSSOPOULOS, N. 1987. Analysis of object oriented spatial access methods. In Proceedings of the ACM SIGMOD Annual Conference on Management of Data (SIGMOD '87, San Francisco, CA, May 27-29), U. Dayal, Ed. ACM Press, New York, NY, 426-439.

FRIEDMAN, J. H., BENTLEY, J. L., AND FINKEL, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3, 3 (Sept.), 209-226.

FUKUNAGA, K. 1990. Introduction to Statistical Pattern Recognition. 2nd ed. Academic Press Prof., Inc., San Diego, CA.

GAEDE, V. AND GUNTHER, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170-231.

GAEDE, V. 1995. Optimal redundancy in spatial database systems. In Proceedings of the Fourth International Symposium on Advances in Spatial Databases (SSD'95, Portland, ME, Aug.), M. J. Egenhofer and J. R. Herring, Eds. Springer-Verlag, New York, NY, 96-116.

GARY, J. E. AND MEHROTRA, R. 1993. Similar shape retrieval using a structural feature index. Inf. Syst. 18, 7 (Oct.), 525-537.

GOLUB, G. AND VAN LOAN, C. F. 1989. Matrix Computations. 2nd ed. Johns Hopkins University Press, Baltimore, MD.

GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference on Management of Data. ACM Press, New York, NY, 47-57.

HENRICH, A. 1994. A distance-scan algorithm for spatial access structures. In Proceedings of the Second ACM Workshop on Geographic Information Systems (Gaithersburg, MD, Dec.). ACM Press, New York, NY, 136-143.

HJALTASON, G. R. AND SAMET, H. 1995. Ranking in spatial databases. In Proceedings of the Fourth International Symposium on Advances in Spatial Databases (SSD'95, Portland, ME, Aug.), M. J. Egenhofer and J. R. Herring, Eds. Springer-Verlag, New York, NY, 83-95.

JAGADISH, H. V. 1991. A retrieval technique for similar shapes. SIGMOD Rec. 20, 2 (June), 208-217.

KALOS, M. H. AND WHITLOCK, P. A. 1986. Monte Carlo Methods. Vol. 1: Basics. Wiley-Interscience, New York, NY.

KASTENMULLER, G. AND KRIEGEL, H. -P. 1998. Similarity search in 3d protein databases. In Proceedings of the German Conference on Bioinformatics (Cologne, Germany).

KATAYAMA, N. AND SATOH, S. 1997. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceedings of the International ACM Conference on Management of Data (SIGMOD '97, Tucson, AZ, May). ACM, New York, NY, 369-380.

KAUL, A., O'CONNOR, M. A., AND SRINIVASAN, V. 1991. Computing Minkowski sums of regular polygons. In Proceedings of the 3rd Canadian Conference on Computational Geometry. 74-77.

KEIM, D.A. 1997. Efficient similarity search in spatial database systems. Habilitation thesis. University of Munich, Munich, Germany.

KORN, F., SIDIROPOULOS, N., FALOUTSOS, C., SIEGEL, E., AND PROTOPAPAS, Z. 1996. Fast nearest neighbor search in medical image databases. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96, Bombay, India, Sept.). 215-226.

KRIEGEL, H. -P. AND SEIDL, T. 1998. Approximation-based similarity search for 3d surface segments. GeoInformatica.

KRIEGEL, H.-P., SCHMIDT, T., AND SEIDL, T. 1997. 3-D similarity search by shape approximation. In Proceedings of the Fifth International Symposium on Advances in Spatial Databases (SSD'97, Berlin, July), M. Scholl and A. Voisard, Eds. Springer-Verlag, New York, NY, 11-28.

LIN, K.-I., JAGADISH, H., AND FALOUTSOS, C. 1994. The TV-tree: An index structure for high dimensional data. VLDB J. 3 (Oct.), 517-542.

MANDELBROT, B. 1977. Fractal Geometry of Nature. W. H. Freeman and Co., New York, NY.

CHAKRABARTI, K. AND MEHROTRA, S. 1999. The hybrid tree: An index structure for high dimensional feature spaces. In Proceedings of the 15th International IEEE Conference on Data Engineering (Sydney, Australia, Mar.). IEEE Press, Piscataway, NJ, 440-447.

MEHROTRA, R. AND GARY, J. 1995. Feature-index-based similar shape retrieval. In Proceedings of the Third IFIP WG2.6 Working Conference on Visual Database Systems 3 (VDB-3), S. Spaccapietra and R. Jain, Eds. Chapman and Hall, Ltd., London, UK, 46-65.

MEHROTRA, R. AND GARY, J. 1993. Feature-based retrieval of similar shapes. In Proceedings of the 9th International Conference on Data Engineering (Vienna, Austria, Apr.). IEEE Computer Society, Washington, DC.

PAGEL, B.-U., SIX, H.-W., TOBEN, H., AND WIDMAYER, P. 1993. Towards an analysis of range query performance in spatial data structures. In Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS, Washington, DC, May 25-28), C. Beeri, Chair. ACM Press, New York, NY, 214-221.

PAPADOPOULOS, A. N. AND MANOLOPOULOS, Y. 1998. Similarity query processing using disk arrays. SIGMOD Rec. 27, 2, 225-236.

PAPADOPOULOS, A. AND MANOLOPOULOS, Y. 1997. Performance of nearest neighbor queries in R-trees. In Proceedings of the 6th International Conference on Database Theory (ICDT '97, Delphi, Greece, Jan. 9-10). Springer-Verlag, Berlin, Germany, 394-408.

PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T., AND FLANNERY, B. P. 1992. Numerical Recipes in C: The Art of Scientific Computing. 2nd ed. Cambridge University Press, New York, NY.

RIEDEL, E., GIBSON, G. A., AND FALOUTSOS, C. 1998. Active storage for large-scale data mining and multimedia. In Proceedings of the 24th International Conference on Very Large Data Bases. 62-73.

ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95, San Jose, CA, May 23-25), M. Carey and D. Schneider, Eds. ACM Press, New York, NY, 71-79.

SCHROEDER, M. 1991. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman and Co., New York, NY.

SEIDL, T. 1997. Adaptable similarity search in 3-D spatial database systems. Ph.D. Dissertation. University of Munich, Munich, Germany.

SEIDL, T. AND KRIEGEL, H. -P. 1997. Efficient user-adaptable similarity search in large multimedia databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97, Athens, Greece, Aug.). 506-515.

SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th Conference on Very Large Data Bases (Brighton, England, Sept.). VLDB Endowment, Berkeley, CA, 507-518.

SAWHNEY, H. AND HAFNER, J. 1994. Efficient color histogram indexing. In Proceedings of the International Conference on Image Processing. 66-70.

SPROULL, R. F. 1991. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6, 4, 579-589.

STRANG, G. 1980. Linear Algebra and its Applications. 2nd ed. Academic Press, Inc., New York, NY.

THEODORIDIS, Y. AND SELLIS, T. 1996. A model for the prediction of R-tree performance. In Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '96, Montreal, Que., Canada, June 3-5), R. Hull, Chair. ACM Press, New York, NY, 161-171.

WEBER, R., SCHEK, H. -J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases.

WHITE, D. A. AND JAIN, R. 1996. Similarity indexing with the SS-tree. In Proceedings of the 12th IEEE International Conference on Data Engineering (ICDE '96, New Orleans, LA, Feb.). IEEE Press, Piscataway, NJ, 516-523.

Received: December 1998; revised: July 1999; accepted: November 1999

Author's address: Institute for Computer Science, University of Munich, Oettingenstr. 67, Munich, 80538, Germany; e-mail: boehm@informatik.uni-muenchen.de

Author: Christian Böhm

Publication: ACM Transactions on Database Systems, June 2000