The data warehouse-achieving operational harmony.
Ever since the building of data warehouses began and the term was first coined by industry heavyweight Bill Inmon, the perceived wisdom has been to maintain a separation between the data warehouse and the company's operational systems. Today, while this 'separatist' thinking still dominates many data warehousing professionals, the approach is being questioned. A school of thought is emerging that challenges the division of the data warehouse from operational systems, citing the fact that the justification for separation is purely technical--as opposed to being a divide founded on the drivers of the business. The evolving wisdom, particularly acknowledging the continuous march of technological progress, is to consider alternatives to the implementation of a traditional warehouse solution, where operational systems are capable of living in harmony with the data warehouse.
Why accept a compromise?
Businesses want data warehouses to provide a complete and immediate understanding of the enterprise, offering the capability to react quickly to the marketing place and to out-manoeuvre their competitors. Business drivers include, for example, increased revenues through more effective marketing and cross selling to the existing customer base, again based on better understanding of customer activity and profiles. Similarly, by identifying inefficiencies and areas of strength, cost reductions can be achieved, while revenues can be increased. The benefits, in short, more than pay for the implementation of the data warehouse.
There are also the technical considerations that translate to business benefits for the organisation. Today, technology is such that separate systems--with the concomitant investment and ongoing operational costs--could be unnecessary. The same is true of compromise approaches, or "halfway houses", as discussed later, where an intermediate database is used to attempt to furnish intelligent data rapidly. Not only is this approach very much a poor cousin to the data warehouse, yielding inferior data results, but it also exposes the business to greater costs in terms of setting the solution up, maintaining it and then replacing it in the future.
Disharmony and differing demands
The need to cater for disparate demands is why the data warehouse has traditionally been implemented separately from operational systems: they each have different profiles and make different demands on hardware and applications. Technically, therefore, the IT department has faced a range of conflicts between performance, and the operational and user requirements of the respective systems.
For example, because an operational system is usually built for a transaction processing workload, it needs to cater for multiple concurrent short-lived transactions, mixing queries with updates. The data warehouse, in contrast, supports a smaller user base and longer-lived queries. In detail, for instance, while the data warehouse typically benefits from a disk configuration optimised for high transfer rates, the operational system hardware needs to support a higher volume of individual random disk operations. While it is often a good thing that hardware and operating systems handle mixed workloads, using resources effectively, performance issues arise not with the hardware but from the hosted applications or the database. These are often in the form of contention with resources, or excessive consumption of resources, such as rollback images.
The invisible and the visible
Operational systems and data warehouse systems have different optimal schemas. The former aims to achieve performance and the maintenance of constant transactional integrity, so the schema is designed that way. Users won't be aware of the schema--it will be 'hidden', whereas the schema in a data warehouse is likely to be more visible to users. Why? Because for a data warehouse--in order to fully exploit its potential--the schema needs to be intuitive to users, offering the ability to undertake flexible queries without resorting to multi-way joins. Complex joins are difficult for users and often incur a big performance hit. In this instance, the usual is a de-normalised star schema built from a central fact table surrounded by dimension tables.
Choosing and tuning the indexes
There are issues with indexes too, as the systems require different ones. Query flexibility and the ability to perform index scans efficiently, as well as good selectivity from a combination of predicates in a where clause, means that a bit map index is useful in a data warehouse. But for an operational system, poor concurrency of bit maps makes it unsuitable as contention and performance degradation is highly likely to occur, resulting from multiple sessions attempting to update and query the same bit map index. As for hash clusters, they should not be used for full table scans, as often occurs with a data warehouse, because a hashed organised table is spread over more blocks than an equivalent heap organised table. Hash organised tables are, in short, for static data, while B-trees might be used more appropriately for both systems as they provide good concurrency and flexible queries. However, since they can become rapidly disk I/0 bound and can cause excessive database checkpoint activity, they must be used with care on heavily updated tables of significant size.
Another justification cited for separate systems is scheduling differences. The platform for the two systems is likely to be handled differently in terms of administration and upgrading. The availability requirements of an operational system may be strict, while those for the data warehouse are less so, thus giving different scheduling cycles.
Implementation and potential pitfalls
Implementation of a data warehouse means facing several potential pitfalls. Aside from the technical issues, there are many commercial and people-related issues that typically arise with a major development project. From budget for the project to getting resource and skills, there are hurdles at every corner, including overcoming the politics associated with setting up a new area.
Technically too, there are many potential pitfalls. Consider integration with operational systems--a major issue to be addressed. Operational systems will load the data warehouse, necessitating integration at several levels. This is from the fundamental levels (such as network, hardware and software) to application areas--data representation, data semantics and data schema. Integration challenges must be faced up to with the application areas regardless of where the data warehouse is, but issues faced in relation to the network, hardware and software are actually highly likely to result from having chosen to separate the data warehouse from operational systems.
A costly 'half-way house'?
Then there is the issue of the playoff between the much desired timeliness and completeness. For any data warehouse to be of benefit to the enterprise, it needs to satisfy both wide-ranging queries aimed at understanding long-term trends, and queries of up-to-the-minute data which target the current short-term position of the business. It is difficult, if not impossible, to resolve these conflicting requirements within the constraints of the loading window. This conflict has been difficult to resolve, and in order to do so, many organisations have turned to the typical solution of "the halfway house", as referenced earlier. This is the implementation of an intermediate database that provides the business with its volatile and intermediate summary business information but very much on a restricted query basis. Needless to say, to introduce such a solution introduces more complications to the technical solution--not to mention greater implementation and operational costs. If an operational system and a data warehousing system living in harmony on one server can avoid this one issue alone, apart from all the rest, surely it is worthy of consideration?
Achieving the best query performance
To achieve optimal performance for queries, the schema used by a data warehouse will be different to that used in operational systems, and to convert from an operational to a warehouse schema means employing a combination of de-normalisation and over-normalisation.
De-normalisation combines columns and or tables to avoid join operations at query time. This is usually more intuitive for end users and significantly enhances performance. Over-normalisation partitions tables horizontally and/or vertically to improve performance. Partitioning opens up the possibility of parallel data loading and parallel query execution and also provides an effective method for rolling in and rolling out historical data. This latter aspect can be central to achieving timely updates to the data warehouse. Data aggregation is another possibility. Using this approach to optimise query performance involves pre-computing sums and counts of data across various dimensions as the data is loaded, meaning queries not having to do this work at query evaluation time. This can substantially improve query performance as a query may only need to fetch a handful of rows from an aggregation table--and the cost of performing such calculations is moved from query time to load time. If load windows are tight, aggregation can become an issue in its own right.
The march of progress: database technology
Database technology has moved on since the underlying architecture for data warehouses was first established. While performance improvements have been achieved for the separated warehouse, they have been based on unchanged fundamental principles and assumptions, and have largely ignored the march of technological progress in database technology. These database enhancements 'moved the goal posts', to the degree that new technologies and database features make it quite reasonable to unite operational and decision support functions--so that they can at last live in harmony with one another.
We are not suggesting that there is a nirvana where a complete methodology for implementing disparate application on a single database exists. That needs time to evolve. However, there are some avenues worth exploring should a full data warehouse not be suitable or worthwhile or where the availability schedule for the operational system is flexible.
Data replication plays a key role here. One database serving both operational and data warehouse systems needs to use data replication to separate the two applications, with the operational system using its conventional transaction optimised schema which is then replicated and mapped to a data warehouse optimised schema on the same database. Using continual incremental synchronisation data is replicated onto the warehousing system. The big question here is, what indexes are now suited to the warehouse schema? Big tables with dynamic data, such as a fact table, rule out B-Tree indexes, as well as hashed clusters. Big tables result in both becoming disk I/0 bound during index updates, which is likely to seriously impact the synchronisation process. For small or medium cardinality columns (assuming that multiple synchronisation processes are not updating the same index) bit map indexes may be acceptable. But the avoidance of contention between the synchronisation process and user queries must still be considered.
Third-party indexes: a step toward harmony?
Today, there are alternatives. Third-party indexes have been developed to integrate with the database and they can provide the generic functionality of B-trees without suffering the same performance impact during index updates. In addition to being fast to update and fast to query, they can be used for fact tables that need to be continuously synchronised as queries are simultaneously performed. In looking to bring the data warehouse into a harmonious relationship with the operational system, these indexes provide a compelling alternative to conventional B-tree indexes, hash indexes and bit maps. They are key to attaining good performance with incremental updates. New indexing technologies, such as Adaptive Addressing, provide a more efficient and flexible alternative to conventional indexing techniques, reducing the overheads and constraints traditionally associated with indexing. There are also data aggregation techniques, which can further eliminate some of the indexing requirements and provide fast and flexible access to data analysis.
So the approach is worth considering, as by achieving effective incremental synchronisation, data in the warehouse is far more up-to-date than would be possible with a traditional warehouse, which often involves a bulk transfer of data through file systems and staged rebuilding of indexes at infrequent intervals.
Separate schemas mean that data warehouse tables and indexes can reside on different disks from the operational data, which avoids disk 10 contention. Also separate dedicated rollback resources can be allocated and tuned to meet the different demands of the small transactions for operational schema and the large transactions for the warehouse schema. It is also worth considering separate buffer pools to guarantee that warehouse queries do not flush out any blocks cached for the operational system. The block size should suit the operational system, which is fortuitous since the operational system is likely to exist and changing the block size is not an easy option. Setting an appropriate multi-block scan size will be necessary, however, as the warehouse queries will require a larger transfer size for full table scans.
What are likely to stay the same for a co-resident scheme are the decisions made about de-normalising and over-normalising in a separate warehouse schema, which are taken to optimise query performance and allow an efficient roll out of historical data. Optimum implementation of the incremental synchronisation mechanism will demand some additional changes, and attention must be paid to the synchronisation process for transactional consistency, as queries are active while the incremental update is in progress. To make sure that the decision support queries are against a schema optimised for them, any decision support queries should, of course, be confined to the warehouse schema and kept away from the operational schema.
One for all?
Ultimately, it must always be remembered that any organisation needs to objectively assess the strengths and weaknesses of all approaches based on what the business actually needs. But what must be borne in mind is that new database technology is playing a more significant role than before, and that there are ever more compelling arguments for achieving a state of harmony when implementing a co-resident schema with incremental updates.
For one thing, warehouse data is made more relevant and useful due to the timeliness of the update. Additionally, there is likely to be a reduction of operating costs, as deployment of one database negates the need for another hardware platform, of course, as well as bulk data transfers between database instances and their probable resulting complications.
In essence, organisations need to cut their own way through the dense undergrowth of data today and the data warehouse issue. One system will most likely give more equitable sharing of the total resources than separated hardware platforms, but whichever route is taken more disks, more memory and more CPUs is probable. An important point to remember is that the "halfway house" route may well represent a risky option and is highly likely to result in significantly greater costs to the business in the long term. It is doubtful whether organisations adopting this approach will achieve the true business benefits of data warehousing--such as the rapid access to the type of high quality, reliable and intelligent data that can be used to drive the business forward.
Another key point to keep high on the agenda is the impact of advancements. Today, both traditional and accepted wisdoMs are constantly under scrutiny--they are being tested fiercely by evolving technologies and thinking. It is precisely for this reason that co-hosting a data warehouse with an operational system must be a serious consideration for any forward-thinking business.
|Printer friendly Cite/link Email Feedback|
|Publication:||Database and Network Journal|
|Date:||Feb 1, 2004|
|Previous Article:||Big Book of Internet Host Standards.|
|Next Article:||Web services--the new hope for EAI.|