Impact of key locality on database performance.CopperEye LtdCopperEye indexing is a general-purpose data indexing technology that claims to out-perform conventional indexing techniques. This paper discusses the reasons behind the superior performance of this technology over alternative methods. Indexes are used in database and data storage/retrieval applications to enable fast access to stored data. Without indexes, applications are forced to scan entire data sets to find the required data. But while indexes speed up data retrieval, they incur a significant overhead during data set updates, and a compromise must often be struck between data retrieval performance and data loading Copying data from one electronic file or database into another. Data loading implies converting from one format into another; for example, from one type of production database into a decision support database from a different vendor. See data entry. performance. Indexes can be categorised Adj. 1. categorised - arranged into categories categorized classified - arranged into classes according to according to prep. 1. As stated or indicated by; on the authority of: according to historians. 2. In keeping with: according to instructions. 3. the data types they handle. For example, an index may be used for accessing general-purpose scalar scalar, quantity or number possessing only sign and magnitude, e.g., the real numbers (see number), in contrast to vectors and tensors; scalars obey the rules of elementary algebra. Many physical quantities have scalar values, e.g. data, such as text, number and date information. Alternatively, an index may be designed for handling structured data formats such as spatial or media data. However, the vast majority of data held in commercial database systems is of a scalar type and scalar indexes are predominant across all industries. Indexes can also be categorised according to the resources for which they are optimised--memory or disk. For example, an index may reside entirely in memory and/or may access entirely memory resident data. Such an index aims to minimise its memory footprint Memory footprint refers to the amount of main memory that a program uses or references while running. This includes all sorts of active memory regions like code, static data sections (both initialized and uninitialized), heap, as well as all the stacks, plus memory required to hold and CPU CPU in full central processing unit Principal component of a digital computer, composed of a control unit, an instruction-decoding unit, and an arithmetic-logic unit. usage, whereas a disk-resident index focuses on reducing its disk access, since disks offer a slow medium for data access. Memory based Programs that hold all data in memory for processing. Almost all spreadsheets are memory based so that a change in data at one end of the spreadsheet can be instantly reflected at the other end. indexes restrict both database growth and tolerance of system failure. Therefore, most commercial databases use disk resident indexes to provide sufficient database size and security. The CopperEye technology is a general-purpose scalar disk-optimised index suited to the majority of data stored in commercial systems. For both brevity Brevity Adonis’ garden of short life. [Br. Lit.: I Henry IV] bubbles symbolic of transitoriness of life. [Art: Hall, 54] cherry fair cherry orchards where fruit was briefly sold; symbolic of transience. and clarity, specialist multi-dimensional data indexes and memory-optimised indexes will not be considered any further in this paper. Throughout this paper, reference is often made to relational databases relational database Database in which all data are represented in tabular form. The description of a particular entity is provided by the set of its attribute values, stored as one row or record of the table, called a tuple. because the reader is likely to be familiar with them However, the CopperEye technology is not specific to relational databases and can apply to any data storage paradigm, such as object-oriented databases See object database. (database) object-oriented database - (OODB) A system offering DBMS facilities in an object-oriented programming environment. Data is stored as objects and can be interpreted only using the methods specified by its class. , XML databases A database that stores XML documents. There are two types. The first is the "XML-enabled database," which is a relational or object-oriented database that has been extended to hold XML data. or flat file systems. Industry Trends Few radical innovations in disk-optimised scalar indexing have occurred over recent decades. Commercial database vendors have concentrated on improving performance through hardware scalability and operational parallelism An overlapping of processing, input/output (I/O) or both. 1. parallelism - parallel processing. 2. (parallel) parallelism - The maximum number of independent subtasks in a given task at a given point in its execution. E.g. , which provide performance at a hardware cost; but even then, not all database applications are open to divide-and-conquer approaches. The technology innovations that do address scalar data tend to optimise query performance at great expense to loading performance and concurrency Operations that are performed simultaneously within the computer. For example, dual-core CPUs provide complete overlapping of two independent processes. See dual core, hyperthreading, multiprocessing, multitasking, multithreading, SMP and MPP. concurrency - multitasking . Meanwhile, commercial database sizes are burgeoning and enterprises expect to exploit their databases more effectively. This exerts pressure on database performance and particularly the index structures used to navigate through the sea of data. The CopperEye technology claims to alleviate these performance issues. Key Localisation (programming) localisation - (l10n) Adapting a product to meet the language, cultural and other requirements of a specific target market "locale". Localisation includes the translation of the user interface, on-line help and documentation, and ensuring the images and Key localisation is the measure of how and where key instances are placed into a storage structure (such as a table or index) in relation to their key values. Key localisation can be categorised by two dimensions--precision and clustering. * Precision dictates the probability of knowing exactly where a key instance is located in the structure given the key value. The greater the precision, the more precisely the location is known, and the more efficiently a search can return an instance of a specific search key. * Clustering dictates the probability of finding multiple instances of similar keys within close proximity of each other The greater the clustering, the less scanning is required and the more efficiently multiple instances can be returned for a range of keys. A heap-organised table, wherein new rows are simply appended at the end of the structure, has the least precise key localisation and no clustering because the key value has no influence on its row locality. In contrast, a hash-organised table, where rows are placed at specific addresses derived from a hash algorithm applied to the associated key, has high precision but no clustering--instances of similar (but different) keys are unlikely to be in close proximity to each other. Whereas a B-tree, with a hierarchy of blocks grouping keys together in key order, exhibits both high precision and clustering--instances of similar key values are likely to be found in the same index block. These different mixtures of key localisation clearly afford different search efficiency. The heaped table requires a full table scan to either locate an instance of a single key or to locate multiple instances for a range of keys; while the hashed table can efficiently locate an instance of a specific key, but still requires a full table scan to locate a range of keys; whereas a B-tree can perform both searches efficiently. Performance The most costly operations for a disk-resident storage structure involve write and read access to and from the disk; where clearly the fewer disk accesses yield better performance. The pattern of disk access is largely determined by the precision of key localisation within the structure; less precision yields easier structure maintenance but affords less efficient key searching and visa versa. The structural features of common indexes, such as the B-tree or hashed table enforce highly precise localisation to achieve the best possible search efficiency. However, the extreme precision of these structures makes each of them expensive to maintain. Contrast the maintenance cost of a heaped table with a hashed table; multiple rows with randomly distributed keys can potentially be appended to a heaped table with a single disk write at the table high watermark High Watermark The highest peak in value that an investment fund/account has reached. This term is often used in the context of fund manager compensation, which is performance based. (Fig. 1); whereas a hashed organised table can demand up to two disk operations per row inserted (one disk read to fetch the target table block encompassing the row address followed by a disk write to persist the updated block) as shown in Fig. 2. These are dramatically different maintenance cost profiles and not surprisingly they are inversely proportional See See also: Inversely to their search efficiencies. The B-tree similarly exhibits high precision localisation and can require two disk operations (or more) for each row inserted (Fig. 3). Perversely, the most common arrangement of a heaped table with a B-tree index on top, gives the worst possible performance all round. The B-tree stifles row inserts, deletes and key updates; while the heaped table dampens range query speed by forcing a disk read for each row returned (Fig. 4). This is partially overcome by B-tree organised tables (whereby the entire table rows are stored within the B-tree structure itself) but only insofar in·so·far adv. To such an extent. Adv. 1. insofar - to the degree or extent that; "insofar as it can be ascertained, the horse lung is comparable to that of man"; "so far as it is reasonably practical he should practice as queries are serviced quickly; meanwhile, the overhead of row inserts, deletes and key updates remains high--indeed it actually worsens because fewer keys can be accommodated in each index block, requiring more frequent internal reorganisation Noun 1. reorganisation - the imposition of a new organization; organizing differently (often involving extensive and drastic changes); "a committee was appointed to oversee the reorganization of the curriculum"; "top officials were forced out in the cabinet activity. [FIGURES 1-4 OMITTED] Moreover, the B-tree organised table can only be built around one key (typically the primary key), requiting secondary indexes to be built on top to support access by secondary keys. While this arrangement affords fast primary key searches, it degrades access by secondary key still further because of the increased complexity involved in navigating the B-tree organised table over the heap organised table. Flexibility The localisation precision is the prime determinant of the structure maintenance overhead, whilst clustering mainly affects the efficiency of key range searches. As we have already seen, the precision of B-trees and hashed tables are similar and exhibit similar maintenance costs despite having very different clustering characteristics. Therefore, high clustering is always desirable if it does not enforce highly precise localisation too. Clustering only influences maintenance performance when pre-ordered data is being inserted into the structure. In these circumstances, high clustering is desirable and yields good insert performance; hence when building B-trees on pre- populated pop·u·late tr.v. pop·u·lat·ed, pop·u·lat·ing, pop·u·lates 1. To supply with inhabitants, as by colonization; people. 2. tables, the keys are usually sorted first. The CopperEye index deploys a structure that can vary the localisation precision at will and maintain a commensurate key clustering Please help recruit one or [ improve this article] yourself. See the talk page for details. . The reduced precision allows unrelated key instances to share disk write operations where scopes of locality overlap (Fig. 5). When built on top of a heap-organised table, the localisation precision of the index can be relaxed to avoid excessive maintenance overhead for row inserts, deletes and key updates while providing query performance suited to the underlying table. Moreover, localisation precision can be adjusted to reflect the access profile of the applications using the index. For example, an application that is predominantly insert-intensive with occasional queries (such as a customer transaction fraud monitoring application) can use indexes with a relaxed precision to ensure that the customer transactions are loaded and indexed as quickly as possible. This is in contrast to a B-tree or hashed table, where the fixed precision dictates that the only route to better throughput is through hardware expansion and/or a parallel architecture. [FIGURE 5 OMITTED] A CopperEye index can also store the table rows internally, providing an alternative to the B-tree organised table. This arrangement can provide much faster range query access than a conventional B-tree and heaped table combination, while still affording very fast row insert, delete and key update performance. Naturally, additional CopperEye indexes can be built on top of a CopperEye table to support secondary key access. Implications Any index that proves burdensome to maintain will be used sparingly spar·ing adj. 1. Given to or marked by prudence and restraint in the use of material resources. 2. Deficient or limited in quantity, fullness, or extent. 3. Forbearing; lenient. and limit the queries it can effectively service. Database design often has to compromise query performance against overall performance--so few database schemas The definition of a database. It defines the structure and content in each data element within the structure. Schemas are often designed with visual modeling tools that automatically create the SQL code necessary to define the table structures. See subschema and XML schema. can implement a comprehensive arrangement of indexes to satisfy virtually any query. An important implication of the more efficient index maintenance possible with CopperEye technology is that it affords index proliferation proliferation /pro·lif·er·a·tion/ (pro-lif?er-a´shun) the reproduction or multiplication of similar forms, especially of cells.prolif´erativeprolif´erous pro·lif·er·a·tion n. , which in turn promotes query flexibility. Furthermore, poor maintenance performance often forces undesirable architectural decisions to be made such as query lockout lockout, intentional closing up of a company, factory, or shop by an employer to prevent employees from working during a strike or labor dispute. The term lockout during load activities and out-of-hours batch loading windows instead of continuous incremental Additional or increased growth, bulk, quantity, number, or value; enlarged. Incremental cost is additional or increased cost of an item or service apart from its actual cost. streaming. www.coppereye.com |
|
||||||||||||||||||

Printer friendly
Cite/link
Email
Feedback
Reader Opinion