Defying conventional wisdom: an alternative perspective on data warehousing
The reality, however, is quite different: the justification for creating an entirely separate warehouse database from the operational system is purely technical. In principle, combining the two is entirely practical and highly desirable, as -- regardless of any technical limitations -- it makes sound business sense, resulting in a less complicated solution with fewer operational overheads and a real-time view of the enterprise data. So for a business not wishing, or not able, to implement a traditional data warehouse solution, what are the options?
The business case
A better understanding of the enterprise and of its business processes are two of the drivers that justify a data warehouse. More effective marketing and cross-selling to the existing customer base, informed by an in-depth understanding of business potential and customer activity, increase revenues. Identifying and understanding inefficiency in the enterprise, through an accurate picture of the business processes involved, also reduces costs. Increased revenues combined with reduced costs yield greater profits, and the resulting return on investment is used to underwrite the data warehouse.
The function of the data warehouse is to provide a picture of the enterprise that is both timely and complete, even though these two aspects generate conflicting technical challenges: the more complete the view of the enterprise, the longer it takes to construct, and the less timely it will be. Often these objectives are compromised by the data warehouse architecture, so the solution is less than optimal.
In short, how many of these business drivers demand the separation of the data warehouse solution from the operational system? And are we being driven to this separation not by the feasibility and practicality we might find in a business case, but by limitations in technology and an unquestioning dedication to a concept based on assumptions prevailing during the 80s and 90s?
The technical case
In looking at the technical reasons for implementing a data warehouse separated from the operational system, there are considerations stemming from a range of conflicts between the performance, operational and user requirements of the respective systems.
Typically, an operational system is developed and tuned to best cope with a workload based on transaction processing, involving multiple concurrent short-lived transactions and mixing queries with updates. The data warehouse, on the other hand, needs to support longer-lived queries from a smaller user population: quite different profiles, and quite different demands on hardware resources. A data warehouse typically benefits from a disk configuration optimised for high transfer rates, for example, whereas the operational system hardware needs to support a higher volume of individual random disk operations. Hardware and operating systems are very capable of handling mixed workloads; in fact, this is usually a desirable state of affairs that leads to effective resource utilisation. The performance problems that arise with a mixed workload tend to emanate from the hosted applications or the database itself, and it is usually an issue of contention with, or excessive consumption of, resources such as rollback segments, which can become large and demand management by DBAs.
Users, schemas and joins
There are different optimal schemas for operational systems and data warehouse systems too. An operational system targets performance while maintaining constant transactional integrity, so its schema is designed to achieve exactly that. This schema is likely to be counter-intuitive to users of the system, who will be barely aware of the schema involved. Users in the data warehouse environment, however, require greater visibility of the schema, and it needs to be intuitive if the potential of the data warehouse is to be fully exploited. The warehouse schema must provide scope for flexible queries without resorting to multi-way joins: the more tables involved in a join, the more difficult it is to formulate, and the longer it takes to get the result. Users often find joins difficult to formulate accurately, and complex joins are usually a performance disaster. A de-normalised star schema, built from a central fact table surrounded by dimension tables, is often the schema of choice here.
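As a rough sketch of the star schema idea (using SQLite from Python purely as a stand-in for a warehouse database; the table and column names here are invented for illustration, not drawn from any real system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: small, descriptive, and intuitive to browse.
cur.execute("CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, "
            "day TEXT, month TEXT, year INTEGER)")
cur.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, "
            "name TEXT, category TEXT)")

# Central fact table: narrow rows of dimension keys plus measures.
cur.execute("""CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim,
    product_key INTEGER REFERENCES product_dim,
    quantity INTEGER, revenue REAL)""")

cur.execute("INSERT INTO date_dim VALUES (1, '01', 'Apr', 2002)")
cur.execute("INSERT INTO product_dim VALUES (10, 'Widget', 'Hardware')")
cur.execute("INSERT INTO sales_fact VALUES (1, 10, 5, 49.95)")

# A typical star query: one join per dimension, never a long chain of
# table-to-table joins, so it stays easy to formulate and to optimise.
cur.execute("""SELECT d.year, p.category, SUM(f.revenue)
               FROM sales_fact f
               JOIN date_dim d ON f.date_key = d.date_key
               JOIN product_dim p ON f.product_key = p.product_key
               GROUP BY d.year, p.category""")
result = cur.fetchall()
print(result)  # [(2002, 'Hardware', 49.95)]
```

Each dimension joins directly to the fact table, which is what makes the shape a "star" and keeps queries intuitive for end users.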
Indexes -- the crux
Different indexes are also needed for operational and data warehouse systems. In a data warehouse, a bit map index is useful: it supports query flexibility while allowing index scans to be efficiently combined to provide good selectivity from a combination of predicates in a where clause. But bit maps are unsuitable for an operational environment because of their poor concurrency: multiple sessions attempting to update and query the same bit map index are very likely to suffer contention, and performance degrades accordingly. Hash clusters are only appropriate for static data and should be avoided where full table scans are likely, such as in a data warehouse, because a hash-organised table is spread over more blocks than an equivalent heap-organised table. B-trees, on the other hand, can be appropriate to both applications, offering both good concurrency and flexible queries; but they must be used with care on heavily updated tables of significant size, since they can rapidly become disk I/O bound and can cause excessive database checkpoint activity.
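The efficiency of combining bit map index scans is easy to see in a toy model: keep one bitmap (one bit per row) for each distinct column value, and a multi-predicate where clause collapses to bitwise arithmetic. This is an illustrative sketch of the principle, not how any particular database implements it:

```python
# Toy bitmap index: one bit per row, one bitmap per distinct value.
rows = [
    {"region": "north", "status": "open"},
    {"region": "south", "status": "open"},
    {"region": "north", "status": "closed"},
    {"region": "north", "status": "open"},
]

def build_bitmaps(rows, column):
    """Map each distinct value of `column` to an integer bitmap."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

region = build_bitmaps(rows, "region")
status = build_bitmaps(rows, "status")

# WHERE region = 'north' AND status = 'open' becomes a single bitwise AND,
# which is why combined predicates give such good selectivity.
hits = region["north"] & status["open"]
matching_rows = [i for i in range(len(rows)) if hits & (1 << i)]
print(matching_rows)  # [0, 3]
```

The same model also hints at the concurrency weakness: updating one row means rewriting a whole shared bitmap, which is exactly where contending sessions collide.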
What may make the separation of the two systems a desirable objective are differences in scheduling. Stringent availability requirements may be a characteristic of an operational system, whereas it is quite likely that the data warehouse will not be so restrictive. Separating the systems allows the platforms to be administered and upgraded on different cycles.
The successful implementation of a data warehouse project means overcoming a number of obstacles. One major area of technical difficulty is how best to integrate the operational systems and the data warehouse: data in the warehouse will be loaded from one or more operational systems, and integration is required at a number of levels, including the fundamental levels of network, hardware and software, and the application-specific areas of data representation, data semantics and data schema. No matter how or where the data warehouse is implemented, the application-specific integration issues need to be addressed; the network, hardware and software issues, meanwhile, generally arise from choosing to separate the data warehouse from the operational system.
For a truly effective system, data warehouse loading must be timely, limiting the size of the loading window and placing constraints on the extent and depth of the data loaded. The conflict between completeness and timeliness is sometimes addressed by an intermediate system that sits between the operational system and the data warehouse to provide a volatile, intermediate summary of business activities. This intermediate system services a restricted set of queries that demand up-to-the-minute accuracy, targeting the current short-term position of the business, while the data warehouse addresses the more wide-ranging queries aimed at understanding long-term trends. Obviously this approach introduces further complications to the technical solution, with increased implementation and operational costs. In addition to the technical issues, it is necessary to tackle a variety of traditional commercial and human issues typically associated with a substantial development project: funding, for example, or resource and skills gathering, as well as all of the usual political battles associated with establishing a new facility.
To achieve clarity for users and optimal performance for queries, the data warehouse employs a schema different from that of the operational system. Converting from the operational schema to the warehouse schema involves a mixture of de-normalisation and over-normalisation. De-normalisation combines columns and/or tables to avoid join operations at query time; it significantly enhances performance and is usually more intuitive for end users. Over-normalisation partitions tables horizontally and/or vertically to improve performance. This partitioning opens up the possibility of parallel data loading and parallel query execution. It also provides an effective method for rolling in and rolling out historical data, an aspect that can be central to achieving timely updates to the data warehouse.
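Both transformations can be sketched in a few lines (again using SQLite from Python as a stand-in; the sales tables and monthly partition names are hypothetical). De-normalisation folds the dimension attributes into each fact row so queries need no joins, while horizontal partitioning, here approximated as one table per month, makes rolling out old history a single cheap drop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Horizontal partitions: one hypothetical table per month of history.
for month in ("2002_03", "2002_04"):
    cur.execute(f"""CREATE TABLE sales_{month} (
        product_name TEXT,      -- de-normalised from a product dimension
        product_category TEXT,  -- de-normalised from a product dimension
        quantity INTEGER, revenue REAL)""")

cur.execute("INSERT INTO sales_2002_04 VALUES ('Widget', 'Hardware', 5, 49.95)")

# Join-free query against the de-normalised partition.
cur.execute("SELECT product_category, SUM(revenue) "
            "FROM sales_2002_04 GROUP BY product_category")
result = cur.fetchall()
print(result)  # [('Hardware', 49.95)]

# Rolling out history: dropping the oldest partition is one operation,
# with none of the row-by-row deletes a single monolithic table would need.
cur.execute("DROP TABLE sales_2002_03")
```

Real databases provide native partitioning rather than per-month tables, but the load, query and roll-out benefits follow the same logic.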
Data aggregation is another approach used to achieve better query performance. It involves pre-computing sums and counts of data across various dimensions as the data is loaded, so that queries do not have to do this work at evaluation time. The impact on query performance can be dramatic: a query may only need to fetch a handful of rows from an aggregation table rather than many thousands of rows from an underlying fact table to achieve the same result. In essence, the cost of performing such calculations is moved from query time to load time. Aggregation can therefore become an issue in its own right if load windows are tight.
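A minimal sketch of the trade-off, with invented table names and SQLite standing in for the warehouse: the aggregate is computed once, at load time, and the query then touches two summary rows instead of 1,500 fact rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales_fact (product TEXT, quantity INTEGER, revenue REAL)")

# 1,500 fact rows loaded into the hypothetical fact table.
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [("Widget", 1, 9.99)] * 1000 + [("Gadget", 2, 19.98)] * 500)

# Load-time step: pre-compute sums and counts into an aggregation table.
cur.execute("""CREATE TABLE sales_agg AS
               SELECT product, SUM(quantity) AS qty, COUNT(*) AS n
               FROM sales_fact GROUP BY product""")

# Query-time step: read 2 aggregate rows instead of scanning 1,500.
cur.execute("SELECT product, qty, n FROM sales_agg ORDER BY product")
result = cur.fetchall()
print(result)  # [('Gadget', 1000, 500), ('Widget', 1000, 1000)]
```

The cost has not disappeared; it has moved into the load window, which is exactly why aggregation becomes a problem when that window is tight.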
Advances and another approach
The underlying architecture for data warehouses was established many years ago. Since then, better techniques have become available to increase the performance of separated warehouses, but while these techniques have evolved, the fundamental principles and assumptions have remained much the same. Database technology, however, has not remained static, and the situation today is not as straightforward as it used to be, a fact that the industry needs to recognise. Due to technological advances and new database features, it is now quite feasible to use a single system for both operational and decision support functions.
A complete methodology for implementing such disparate applications on a single database does not currently exist. However, some approaches can still be applied and are worth considering where a full-blown data warehouse is not suitable or worthwhile, and where the availability schedule for the operational system allows. For example, using data replication, the dual-purpose single-database system essentially separates the two applications. The operational system uses its conventional transaction-optimised schema, and the data from this schema is replicated and mapped to a warehouse-optimised schema on the same database. Using continual incremental synchronisation, data is replicated from the operational schema into the warehouse schema. Replication functions, materialised views, triggers and integration features such as message queues are used to achieve this synchronisation between the two applications.
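A trigger is the simplest of the mechanisms listed, and the co-resident pattern can be sketched with one. This is an illustrative toy (SQLite from Python, invented `orders` and `wh_product_totals` tables), not a description of Oracle's replication or materialised-view machinery: both schemas live in the same database, and every operational insert incrementally refreshes a warehouse-optimised summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Operational, transaction-optimised table.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "product TEXT, amount REAL)")

# Warehouse-optimised running summary in the same database.
cur.execute("CREATE TABLE wh_product_totals (product TEXT PRIMARY KEY, "
            "total REAL)")

# Incremental synchronisation: each operational insert updates the summary
# within the same transaction, so the warehouse view is never stale.
cur.execute("""CREATE TRIGGER sync_orders AFTER INSERT ON orders BEGIN
    INSERT OR REPLACE INTO wh_product_totals VALUES (
        NEW.product,
        COALESCE((SELECT total FROM wh_product_totals
                  WHERE product = NEW.product), 0) + NEW.amount);
END""")

cur.execute("INSERT INTO orders VALUES (1, 'Widget', 10.0)")
cur.execute("INSERT INTO orders VALUES (2, 'Widget', 5.0)")
cur.execute("SELECT total FROM wh_product_totals WHERE product = 'Widget'")
total = cur.fetchone()[0]
print(total)  # 15.0
```

In a production Oracle deployment the same effect would more likely come from materialised views with incremental refresh or from message queues, which decouple the synchronisation work from the operational transaction.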
Indexing -- the key
Incremental synchronisation requires incremental updating of the indexes on the warehouse schema, which limits the choice of indexes feasible for that schema. For large tables with dynamic data, such as a fact table, B-tree indexes and hash clusters are unsuitable: both become very disk I/O bound during index updates on sizable tables and are likely to exert an unsustainable impact on the synchronisation process.
For small to medium cardinality columns -- assuming that multiple synchronisation processes are not updating the same index -- bit map indexes may be acceptable, but some care still has to be exercised to avoid contention between the synchronisation process and user queries.
Today, however, alternative third-party indexes exist, such as CopperEye's. These are capable of integration with Oracle and, in theory, any other database, and can offer the generic functionality of B-trees without suffering the same performance impact during index updates. Fast to update and fast to query, such indexes can readily be used for fact tables that are under constant maintenance from the synchronisation process while queries are performed simultaneously. They provide a powerful alternative to conventional B-tree indexes, hash clusters and bit maps, and are vital to achieving good performance with incremental updates.
There is a key benefit to the business from these indexes, in terms of up-to-date and accurate data. In short, the payoff of incremental synchronisation is that the data in the warehouse is more contemporary than is possible with a traditional warehouse deployment, which more often than not requires a bulk transfer of data through file systems and staged rebuilding of indexes at infrequent intervals.
Through the use of separate schemas, the data warehouse tables and indexes can be placed on separate disks from the operational data. This is important to avoid likely disk I/O contention between the two systems. In addition, allocating and tuning separate dedicated rollback segments enables the system to meet the different demands of the small transactions prevalent against the operational schema and the larger transactions that dominate activity across the warehouse schema. It is also worth considering separate buffer pools, to guarantee that warehouse queries do not flush out blocks cached for the operational system.

The block size chosen for the database should be appropriate for the operational system; that system probably already exists, and changing its block size is not an easy option. The warehouse queries, however, will require a larger transfer size for full table scans, and this is readily achieved by setting an appropriate multi-block scan size. Oracle 9i, which allows mixed block sizes in the same database instance, provides additional scope for block size tuning.

All of the decisions about de-normalising and over-normalising a separate warehouse schema, taken to optimise query performance and allow an efficient roll-out of historical data, are likely to remain valid for a co-resident scheme. Some additional changes will be required to implement the incremental synchronisation mechanism effectively. In particular, the synchronisation process must pay particular regard to transactional consistency, as queries are active while the incremental update is in progress. This is not usually a concern in a traditional warehouse system, which is bulk loaded without the complication of concurrent queries.
Some of the partitioning choices made in the belt-and-braces approaches to parallel loading may therefore need to be revised or extended to account for the mixed table updates involved in an incremental transaction.
To ensure that decision support queries run against a schema optimised for them, and to avoid performance-killing join operations or unnecessary contention with the operational application, decision support queries should be confined to the warehouse schema and kept away from the operational schema. The operational schema should be secured away from warehouse users through appropriate roles and privilege barriers.
A compelling case to consider
In implementing a co-resident schema with incremental updates -- rather than deploying an entirely separate bulk-loaded system -- two major benefits become immediately apparent. First, the timeliness of the updates makes the warehouse data more relevant and useful. Second, using a single database, without the complications of bulk data transfers between database instances or even an additional hardware platform, is likely to save on operating costs.
Even recognising that more disks, more memory and maybe more CPUs will be required, a single system is likely to give more equitable sharing of the total resources than separated hardware platforms. Whether the operational platform can be upgraded sufficiently is another matter, and a driving factor in the decision, along with the number, location and availability schedule of the operational systems involved. Nevertheless, controversy and conventional wisdom aside, co-hosting a data warehouse with an operational system is a compelling proposition worthy of consideration and maybe pursuit, and should not be disregarded merely on performance grounds.
Duncan Pauly, CopperEye Ltd
Publication: Database and Network Journal
Date: Apr 1, 2002