Database Overload Overwhelms Database Administrators.
As databases, data marts, and data warehouses grow in size and complexity, it takes longer and longer to get information out of them. The problem arises from a fundamental aspect of relational database technology: the schema of rows and columns in tables is designed for efficient storage of data, not necessarily for efficient access. Given a properly designed database, it is possible to extract data in almost any conceivable combination, but some queries take only seconds while others can take hours or even days, and potentially bring the system to its knees. Database Administrators (DBAs) have devised many tactics to overcome the limitations of databases, but they can't cover all the bases.
Tables And Columns And Rows, Oh My!
Consider a typical sales database. It will consist of a number of tables. For instance, the customer table will consist of many rows of data, where each row is a unique customer. The first field in each row might be the customer ID number, followed by fields identifying first name, last name, address (that's actually several fields), phone number, and so forth. Another table might identify products with a unique product ID and fields identifying price, brand, inventory level, etc.
Now, imagine a simple query: how many customers are there in California? The simplest way to answer it is called a table scan: merely read every row sequentially and count the number of times you find a STATE field containing CA. In a large database, of course, this will take a very long time. A simple index, in this case on the customer table's STATE field, will greatly speed data access. Of course, this means that a Database Administrator (DBA) has to create the index and keep it up to date.
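The difference between the two strategies can be sketched in a few lines of Python. This is purely illustrative (an in-memory list of dictionaries standing in for the customer table, with made-up data), not how an RDBMS stores anything:

```python
# Toy "customer table": each dict is one row (illustrative data only).
customers = [
    {"id": 1, "name": "Ada", "state": "CA"},
    {"id": 2, "name": "Bob", "state": "NY"},
    {"id": 3, "name": "Cyd", "state": "CA"},
]

# Table scan: read every row and test the STATE field.
scan_count = sum(1 for row in customers if row["state"] == "CA")

# Index: build a state -> row-ids mapping once, then answer from it
# without touching the table at all.
state_index = {}
for row in customers:
    state_index.setdefault(row["state"], []).append(row["id"])

indexed_count = len(state_index.get("CA", []))

print(scan_count, indexed_count)  # 2 2 -- same answer, very different cost
```

On three rows the difference is invisible; on ten million rows, the scan reads every row on every query while the index answers from a small, presorted structure. The DBA's burden is the second loop: it must be rerun, incrementally, on every insert and update.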
Next, consider a more complex query: how many customers in California bought something in December? Now the database must perform what is called a table join: it must look at both the Orders table, qualifying the ORDER DATE field in terms of a date range, and the Customer table, looking at the STATE field. Table joins are also very slow. In this case, a DBA might tune the database by creating a summary table containing customer orders by month and, again, the DBA would have to keep it up to date.
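Both the join and the summary-table workaround can be sketched the same way. Table and field names here are hypothetical stand-ins for the Customer and Orders tables described above:

```python
# Hypothetical Customer and Orders tables as lists of dicts.
customers = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": "CA"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "month": "1999-12"},
    {"order_id": 11, "customer_id": 2, "month": "1999-12"},
    {"order_id": 12, "customer_id": 3, "month": "1999-11"},
]

# Join: for each order in the date range, look up the customer's state.
by_id = {c["id"]: c for c in customers}
ca_december = {
    o["customer_id"]
    for o in orders
    if o["month"] == "1999-12" and by_id[o["customer_id"]]["state"] == "CA"
}
print(len(ca_december))  # 1

# Summary table a DBA might maintain: (state, month) -> distinct buyers.
# Built once, updated as orders arrive; queries never touch the big tables.
summary = {}
for o in orders:
    key = (by_id[o["customer_id"]]["state"], o["month"])
    summary.setdefault(key, set()).add(o["customer_id"])
print(len(summary[("CA", "1999-12")]))  # 1
```

The summary table answers this one query instantly, but only this one: a question it wasn't built for (say, buyers by product) sends the database right back to the join.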
An even more complex query would ask for the records of customers in California who bought a certain product in December. This is a multidimensional query, requiring a multi-part key for yet another index. Or the query might call for an aggregated result, for instance: what was the average sale per customer in California in December? That requires all of the above operations plus the mathematical averaging of the $AMOUNT fields from all the qualifying Orders. Again, a DBA can create and update a summary table to handle this kind of query.
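The aggregated version of the query stacks the averaging step on top of the join. A sketch, reusing the same hypothetical tables (the $AMOUNT field becomes a plain float here):

```python
# Hypothetical tables: customer id -> state, and orders with amounts.
customers = {1: "CA", 2: "NY", 3: "CA"}
orders = [
    {"customer_id": 1, "month": "1999-12", "amount": 40.0},
    {"customer_id": 3, "month": "1999-12", "amount": 60.0},
    {"customer_id": 2, "month": "1999-12", "amount": 25.0},
]

# Qualify rows exactly as the join did...
qualifying = [
    o for o in orders
    if o["month"] == "1999-12" and customers[o["customer_id"]] == "CA"
]

# ...then do the arithmetic over every qualifying $AMOUNT.
average = sum(o["amount"] for o in qualifying) / len(qualifying)
print(average)  # 50.0
```

Note that the aggregate cannot be computed from counts alone: every qualifying amount must be visited, which is why DBAs precompute such results in summary tables rather than letting them run against the raw data.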
Data marts and data warehouses face similar problems, made even more urgent by the wider range of queries users are likely to want to make against them. One approach is to create multiple, specialized "single-topic" data marts out of the main data warehouse; but this merely multiplies the problem. Multidimensional databases optimized for complex queries are another approach, but they are limited in the size of the data sets with which they can deal.
Ironically, the booming market in OLAP and other analytical tools, which are deceptively easy to use, has ended up increasing the magnitude of this database overload problem, for they make it easy for a user to submit a query that ends up selecting most of an entire table or calls for joining two immense tables. Imagine the network and processing impact of trying to deliver ten million rows of data to a desktop PC, or the impact on performance for all other users when the database tries to join two multimillion-row tables. For this reason, many DBAs try to limit the queries users are allowed to submit, but this is likely to eliminate many unforeseen queries that could actually deliver useful business information.
Victor Hugo Was Right
So the problem is one of knowledge and time. Although DBAs can overcome any given set of query problems by tuning, creating indexes, summary tables, and multiple data marts, or forbidding certain kinds of queries, they must know in advance what queries users want to make and would be useful, which requires domain-specific knowledge they often don't have. While 80% of database queries are repetitive and can be optimized in the ways discussed above, 80% of the ROI from database information comes from the 20% of queries that are not repetitive. The result is a loss of business or competitive advantage because of an inability to access the data in corporate databases in a timely fashion.
Throwing more hardware at the problem doesn't really help, either. Big iron and multi-processing servers can certainly accelerate the CPU-intensive parts of the process, but the bottom line of database access is disk access, so the process is I/O bound and I/O doesn't scale as fast as CPU power. You can get around this by putting the entire database in main memory, but the cost of RAM for a multigigabyte database is likely to be a showstopper.
Victor Hugo said that there is nothing so powerful as an idea whose time has come. I was reminded of that saying during a recent briefing by Dynamic Information Systems Corporation (DISC), whose 15-year-old Omnidex search engine technology overcomes many of the limitations that plague large databases. (Actually, although the basic idea behind Omnidex is that old, the product has been continually improved and extended as database technology evolved.)
Omnidex is basically a middleware search engine that sits between desktop clients (such as OLAP and other analytical applications) and a back-end database or data mart/data warehouse. It indexes the database using a combination of bit map and inverse list indexes and creates both multidimensional and aggregate indexes. Then, when the user submits a SQL query from the client, Omnidex intercepts it and searches against its indexes, which are far smaller than the main database and are organized in a far more efficient fashion than is possible using ordinary relational databases or data warehouses. The result is, the company claims, about a 50-times speed-up in query processing.
The inverse list indexes used by Omnidex have many advantages that obviate the problems discussed above. Inverse lists reverse the usual structure of data: values from the database become the index's keys, permitting rapid searching of the data content, while pointers back to the database records become the index's data, permitting quick retrieval of the matching records. For instance, one inverted list index generated from a customer table might include all the states (AK, AL, AR, etc.) as keys, each with pointers back to the customer records containing that state as a field value.
Inverse lists can do many things that the B-tree indexes commonly used by a relational database cannot. They permit keyword searches that are case and position insensitive (an RDBMS can only do this via a table scan; its indexes cannot). Inverse list indexes can also return an up-front count of records that qualify for the current search query, permitting the user to back out of a query that would return ten million rows of data. (Even though Omnidex is only searching an index and the result of this search would not be catastrophic, it obviously wouldn't be very useful, either.) Inverse lists also support "anything by anything" queries against multiple columns (fields) without limitation as to size. And, unlike database summary tables and aggregates, because these inverse list indexes store pointers to the original data, you can drill down from the query results to the original records to find out what's behind the results of your query.
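A minimal inverted-list index in the spirit described above can be sketched in Python. This illustrates the general technique only, not DISC's implementation; field names and record IDs are made up:

```python
# Toy record store: record id -> row (illustrative data only).
records = {
    101: {"state": "CA", "product": "widget"},
    102: {"state": "NY", "product": "widget"},
    103: {"state": "CA", "product": "gadget"},
}

# Inverted list: (field, value) becomes the key, record ids become the data.
index = {}
for rid, rec in records.items():
    for field, value in rec.items():
        index.setdefault((field, value), []).append(rid)

# Up-front count: how many records qualify, before fetching any of them.
ca_ids = set(index[("state", "CA")])
widget_ids = set(index[("product", "widget")])
hits = ca_ids & widget_ids   # "anything by anything": intersect any columns
print(len(hits))             # 1 -- the user can back out if this is huge

# Drill-down: the stored pointers lead straight back to the original rows.
print([records[rid] for rid in hits])
```

The two properties the article highlights fall out naturally: the count is known from the index alone, and because the index stores pointers rather than precomputed answers, the original records remain one lookup away.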
For data warehousing applications, the inverse list possesses similar advantages. Unlimited multidimensional analysis is made possible by the ability to index any number of columns from different tables and combine them in the selection criteria. This greatly enhances the usefulness of OLAP tools and other analytical applications and reduces the need to build multiple data marts. In addition, inverse list indexes can be easily updated, allowing their use with real time data.
Omnidex also uses bit-mapped indexes for very high-speed instant counts, keyword searches and similar types of queries. A bit-mapped index looks like a spreadsheet with the possible values as column headings and record numbers as row headings. If a possible value exists within a given record, the intersection cell contains a 1; otherwise, it contains a 0. Such a bitmap can be searched with blinding speed, as long as it represents "low cardinality" data; that is, data with only a few possible values. This kind of index wouldn't work very well for the state example above: a ten million record customer database would result in a 500MB bitmap! In addition, bit-mapped indexes are not easily updated.
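The bitmap idea can be sketched with Python integers standing in for bit vectors, where bit n represents record n. Again, this is a hedged illustration of the general technique with an invented low-cardinality column, not Omnidex's internals:

```python
# One low-cardinality column across 5 records (illustrative data).
sizes = ["S", "L", "S", "S", "L"]

# One bitmap per possible value; bit n is set if record n has that value.
bitmaps = {}
for n, value in enumerate(sizes):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << n)

# Instant count: a population count over one bitmap, no rows touched.
print(bin(bitmaps["S"]).count("1"))  # 3

# Combining predicates is a single bitwise operation per bitmap,
# which is what makes these searches so fast.
s_or_l = bitmaps["S"] | bitmaps["L"]
```

With only a handful of possible values, each bitmap is tiny and CPU-friendly. But every distinct value needs its own full-length bit vector, which is exactly why the scheme balloons for high-cardinality data like the state field, and why inserting a record means touching every bitmap.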
Omnidex is not an inexpensive solution--it starts at about $70,000--so it only makes sense for very large databases. However, with the growth of e-commerce, especially for business-to-business transactions, the number of large databases out there is likely to grow very rapidly, and there appear to be few solutions to match it. So, while Omnidex is not likely to appeal to any but a fraction of CTR's readers, for those who need it, it is very much an idea whose time has come.
Publication: Computer Technology Review, Jun 1, 2000