The data warehouse-achieving operational harmony.
Burgeoning data volumes continue to place organisations' IT departments under greater strain than ever as they look to harness data effectively while overcoming a minefield of performance issues. The importance of the data warehouse in attaining the maximum benefit from data has grown exponentially ex·po·nen·tial
1. Of or relating to an exponent.
a. Containing, involving, or expressed as an exponent.
b. , with its promise to provide an enterprise-wide view of business activities and increase the company's profitability through intelligent data handling. Increased sales, more effective and intelligent marketing, enhanced customer services, and streamlined business processes--the data warehouse is regarded in high esteem by organisations as being capable of paving the way towards the attainment of these business benefits.
Ever since the building of data warehouses began and the term was first coined by industry heavyweight heavyweight - High-overhead; baroque; code-intensive; featureful, but costly. Especially used of communication protocols, language designs, and any sort of implementation in which maximum generality and/or ease of implementation has been pushed at the expense of mundane Bill Inmon Bill Inmon (William Harvey Inmon) (b. July 20, 1945, in San Diego, California) is recognized by many as the "father of data warehousing" and the co-creator of the Corporate Information Factory, and the Government Information Factory, as well as DW 2. , the perceived wisdom has been to maintain a separation between the data warehouse and the company's operational systems. Today, while this 'separatist' thinking still dominates many data warehousing See data warehouse.
data warehousing - data warehouse professionals, the approach is being questioned. A school of thought is emerging that challenges the division of the data warehouse from operational systems, citing the fact that the justification for separation is purely technical--as opposed to being a divide founded on the drivers of the business. The evolving wisdom, particularly acknowledging the continuous march of technological progress, is to consider alternatives to the implementation of a traditional warehouse solution, where operational systems are capable of living in harmony "Living in Harmony" is an episode of the 1967-68 television series The Prisoner. It differs from most other episodes of the series in that it does not begin with the show's standard opening credits sequence. with the data warehouse.
Why accept a compromise?
Businesses want data warehouses to provide a complete and immediate understanding of the enterprise, offering the capability to react quickly to the marketing place and to out-manoeuvre their competitors. Business drivers include, for example, increased revenues through more effective marketing and cross selling to the existing customer base, again based on better understanding of customer activity and profiles. Similarly, by identifying inefficiencies and areas of strength, cost reductions can be achieved, while revenues can be increased. The benefits, in short, more than pay for the implementation of the data warehouse.
There are also the technical considerations that translate to business benefits for the organisation. Today, technology is such that separate systems--with the concomitant concomitant /con·com·i·tant/ (kon-kom´i-tant) accompanying; accessory; joined with another.
concomitant adjective Accompanying, accessory, joined with another investment and ongoing operational costs--could be unnecessary. The same is true of compromise approaches, or "halfway houses halfway house /half·way house/ (haf´wa hous) a residence for patients (e.g., mental patients, drug addicts, alcoholics) who do not require hospitalization but who need an intermediate degree of care until they can return to the community. ", as discussed later, where an intermediate database is used to attempt to furnish fur·nish
tr.v. fur·nished, fur·nish·ing, fur·nish·es
1. To equip with what is needed, especially to provide furniture for.
2. intelligent data rapidly. Not only is this approach very much a poor cousin to the data warehouse, yielding inferior data results, but it also exposes the business to greater costs in terms of setting the solution up, maintaining it and then replacing it in the future.
1. Lack of harmony; discord.
2. Something not in accord; a conflict: "the disharmonies that assail the most fortunate of mortals" Peter Gay. and differing demands
The need to cater for disparate demands is why the data warehouse has traditionally been implemented separately from operational systems: they each have different profiles and make different demands on hardware and applications. Technically, therefore, the IT department has faced a range of conflicts between performance, and the operational and user requirements of the respective systems.
For example, because an operational system is usually built for a transaction processing Updating the appropriate database records as soon as a transaction (order, payment, etc.) is entered into the computer. It may also imply that confirmations are sent at the same time.
Transaction processing systems are the backbone of an organization because they update constantly. workload The term workload can refer to a number of different yet related entities. An amount of labor
While a precise definition of a workload is elusive, a commonly accepted definition is the hypothetical relationship between a group or individual human operator and task demands. , it needs to cater for multiple concurrent short-lived transactions, mixing queries with updates. The data warehouse, in contrast, supports a smaller user base and longer-lived queries. In detail, for instance, while the data warehouse typically benefits from a disk configuration optimised for high transfer rates, the operational system hardware needs to support a higher volume of individual random disk operations. While it is often a good thing that hardware and operating systems Operating systems can be categorized by technology, ownership, licensing, working state, usage, and by many other characteristics. In practice, many of these groupings may overlap. handle mixed workloads, using resources effectively, performance issues arise not with the hardware but from the hosted applications or the database. These are often in the form of contention with resources, or excessive consumption of resources, such as rollback A DBMS feature that reverses the current transaction out of the database, returning the data to its former state. A rollback is performed when processing a transaction fails at some point, and it is necessary to start over. See two-phase commit. images.
The invisible and the visible
Operational systems and data warehouse systems have different optimal schemas Schemas
Fundamental core beliefs or assumptions that are part of the perceptual filter people use to view the world. Cognitive-behavioral therapy seeks to change maladaptive schemas. . The former aims to achieve performance and the maintenance of constant transactional integrity, so the schema is designed that way. Users won't be aware of the schema--it will be 'hidden', whereas the schema in a data warehouse is likely to be more visible to users. Why? Because for a data warehouse--in order to fully exploit its potential--the schema needs to be intuitive to users, offering the ability to undertake flexible queries without resorting to multi-way joins. Complex joins are difficult for users and often incur a big performance hit. In this instance, the usual is a de-normalised star schema A data warehouse design that enhances the performance of multidimensional queries on traditional relational databases. One fact table is surrounded by a series of related tables. Data is joined from one of the points to the center, providing a so-called "star query." See OLAP. built from a central fact table surrounded by dimension tables.
Choosing and tuning the indexes
There are issues with indexes too, as the systems require different ones. Query flexibility and the ability to perform index scans efficiently, as well as good selectivity selectivity /se·lec·tiv·i·ty/ (se-lek-tiv´i-te) in pharmacology, the degree to which a dose of a drug produces the desired effect in relation to adverse effects.
1. from a combination of predicates in a where clause, means that a bit map index is useful in a data warehouse. But for an operational system, poor concurrency Operations that are performed simultaneously within the computer. For example, dual-core CPUs provide complete overlapping of two independent processes. See dual core, hyperthreading, multiprocessing, multitasking, multithreading, SMP and MPP.
concurrency - multitasking of bit maps makes it unsuitable as contention and performance degradation DEGRADATION, punishment, ecclesiastical law. A censure by which a clergy man is deprived of his holy orders, which he had as a priest or deacon. is highly likely to occur, resulting from multiple sessions attempting to update and query the same bit map index. As for hash clusters, they should not be used for full table scans, as often occurs with a data warehouse, because a hashed organised table is spread over more blocks than an equivalent heap organised table. Hash organised tables are, in short, for static data, while B-trees might be used more appropriately for both systems as they provide good concurrency and flexible queries. However, since they can become rapidly disk I/0 bound and can cause excessive database checkpoint (programming) checkpoint - Saving the current state of a program and its data, including intermediate results, to disk or other non-volatile storage, so that if interrupted the program could be restarted at the point at which the last checkpoint occurred. activity, they must be used with care on heavily updated tables of significant size.
Another justification cited for separate systems is scheduling differences. The platform for the two systems is likely to be handled differently in terms of administration and upgrading. The availability requirements of an operational system may be strict, while those for the data warehouse are less so, thus giving different scheduling cycles.
Implementation and potential pitfalls
Implementation of a data warehouse means facing several potential pitfalls. Aside from the technical issues, there are many commercial and people-related issues that typically arise with a major development project. From budget for the project to getting resource and skills, there are hurdles at every corner, including overcoming the politics associated with setting up a new area.
Technically too, there are many potential pitfalls. Consider integration with operational systems--a major issue to be addressed. Operational systems will load the data warehouse, necessitating integration at several levels. This is from the fundamental levels (such as network, hardware and software) to application areas--data representation, data semantics semantics [Gr.,=significant] in general, the study of the relationship between words and meanings. The empirical study of word meanings and sentence meanings in existing languages is a branch of linguistics; the abstract study of meaning in relation to language or and data schema. Integration challenges must be faced up to with the application areas regardless of where the data warehouse is, but issues faced in relation to the network, hardware and software are actually highly likely to result from having chosen to separate the data warehouse from operational systems.
A costly 'half-way house'?
Then there is the issue of the playoff play·off also play-off
1. A final game or series of games played to break a tie.
2. A series of games played to determine a championship.
Noun 1. between the much desired timeliness and completeness. For any data warehouse to be of benefit to the enterprise, it needs to satisfy both wide-ranging queries aimed at understanding long-term trends, and queries of up-to-the-minute data which target the current short-term position of the business. It is difficult, if not impossible, to resolve these conflicting requirements within the constraints CONSTRAINTS - A language for solving constraints using value inference.
["CONSTRAINTS: A Language for Expressing Almost-Hierarchical Descriptions", G.J. Sussman et al, Artif Intell 14(1):1-39 (Aug 1980)]. of the loading window. This conflict has been difficult to resolve, and in order to do so, many organisations have turned to the typical solution of "the halfway house", as referenced earlier. This is the implementation of an intermediate database that provides the business with its volatile and intermediate summary business information but very much on a restricted query basis. Needless to say, to introduce such a solution introduces more complications to the technical solution--not to mention greater implementation and operational costs. If an operational system and a data warehousing system living in harmony on one server can avoid this one issue alone, apart from all the rest, surely it is worthy of consideration?
Achieving the best query performance
To achieve optimal performance for queries, the schema used by a data warehouse will be different to that used in operational systems, and to convert from an operational to a warehouse schema means employing a combination of de-normalisation and over-normalisation.
De-normalisation combines columns and or tables to avoid join operations at query time. This is usually more intuitive for end users and significantly enhances performance. Over-normalisation partitions tables horizontally and/or vertically to improve performance. Partitioning To divide a resource or application into smaller pieces. See partition, application partitioning and PDQ. opens up the possibility of parallel data loading Copying data from one electronic file or database into another. Data loading implies converting from one format into another; for example, from one type of production database into a decision support database from a different vendor. See data entry. and parallel query execution and also provides an effective method for rolling in and rolling out historical data. This latter aspect can be central to achieving timely updates to the data warehouse. Data aggregation is another possibility. Using this approach to optimise optimise - To perform optimisation. query performance involves pre-computing sums and counts of data across various dimensions as the data is loaded, meaning queries not having to do this work at query evaluation time. This can substantially improve query performance as a query may only need to fetch a handful of rows from an aggregation table--and the cost of performing such calculations is moved from query time to load time. If load windows are tight, aggregation can become an issue in its own right.
The march of progress: database technology
Database technology has moved on since the underlying architecture for data warehouses was first established. While performance improvements have been achieved for the separated warehouse, they have been based on unchanged fundamental principles and assumptions, and have largely ignored the march of technological progress in database technology. These database enhancements 'moved the goal posts', to the degree that new technologies and database features make it quite reasonable to unite operational and decision support functions--so that they can at last live in harmony with one another.
We are not suggesting that there is a nirvana nirvana (nērvä`nə), in Buddhism, Jainism, and Hinduism, a state of supreme liberation and bliss, contrasted to samsara or bondage in the repeating cycle of death and rebirth. where a complete methodology for implementing disparate application on a single database exists. That needs time to evolve. However, there are some avenues worth exploring should a full data warehouse not be suitable or worthwhile or where the availability schedule for the operational system is flexible.
Data replication In database management, the ability to keep distributed databases synchronized by routinely copying the entire database or subsets of the database to other servers in the network.
There are various replication methods. plays a key role here. One database serving both operational and data warehouse systems needs to use data replication to separate the two applications, with the operational system using its conventional transaction optimised schema which is then replicated and mapped to a data warehouse optimised schema on the same database. Using continual incremental Additional or increased growth, bulk, quantity, number, or value; enlarged.
Incremental cost is additional or increased cost of an item or service apart from its actual cost. synchronisation Noun 1. synchronisation - the relation that exists when things occur at the same time; "the drug produces an increased synchrony of the brain waves"
synchroneity, synchronicity, synchronism, synchronization, synchronizing, synchrony data is replicated onto the warehousing system an arrangement for lodging imported articles in the customhouse stores, without payment of duties until they are taken out for home consumption. If reëxported, they are not charged with a duty. See
See also: Warehousing . The big question here is, what indexes are now suited to the warehouse schema? Big tables with dynamic data, such as a fact table, rule out B-Tree indexes, as well as hashed clusters. Big tables result in both becoming disk I/0 bound during index updates, which is likely to seriously impact the synchronisation process. For small or medium cardinality A quantity relationship between elements. For example, one-to-one, one-to-many and many-to-one express cardinality. See cardinal number.
(mathematics) cardinality - The number of elements in a set. If two sets have the same number of elements (i.e. columns (assuming that multiple synchronisation processes are not updating the same index) bit map indexes may be acceptable. But the avoidance of contention between the synchronisation process and user queries must still be considered.
Third-party indexes: a step toward harmony?
Today, there are alternatives. Third-party indexes have been developed to integrate with the database and they can provide the generic functionality of B-trees without suffering the same performance impact during index updates. In addition to being fast to update and fast to query, they can be used for fact tables that need to be continuously synchronised Adj. 1. synchronised - operating in unison; "the synchronized flapping of a bird's wings"
synchronal, synchronic, synchronous - occurring or existing at the same time or having the same period or phase; "recovery was synchronous with therapy"- as queries are simultaneously performed. In looking to bring the data warehouse into a harmonious relationship with the operational system, these indexes provide a compelling alternative to conventional B-tree indexes, hash indexes and bit maps. They are key to attaining good performance with incremental updates. New indexing technologies, such as Adaptive Addressing, provide a more efficient and flexible alternative to conventional indexing techniques, reducing the overheads and constraints traditionally associated with indexing. There are also data aggregation techniques, which can further eliminate some of the indexing requirements and provide fast and flexible access to data analysis.
So the approach is worth considering, as by achieving effective incremental synchronisation, data in the warehouse is far more up-to-date than would be possible with a traditional warehouse, which often involves a bulk transfer of data through file systems and staged rebuilding of indexes at infrequent in·fre·quent
1. Not occurring regularly; occasional or rare: an infrequent guest.
Separate schemas mean that data warehouse tables and indexes can reside on different disks from the operational data, which avoids disk 10 contention. Also separate dedicated rollback resources can be allocated and tuned to meet the different demands of the small transactions for operational schema and the large transactions for the warehouse schema. It is also worth considering separate buffer pools An area of memory reserved for buffers. to guarantee that warehouse queries do not flush To empty the contents of a memory buffer. See buffer.
Elizabeth Barrett Browning’s spaniel, subject of a biography. [Br. Lit.: Woolf Flush in Barnhart, 446]
See : Dogs
(data) flush out any blocks cached for the operational system. The block size should suit the operational system, which is fortuitous since the operational system is likely to exist and changing the block size is not an easy option. Setting an appropriate multi-block scan size will be necessary, however, as the warehouse queries will require a larger transfer size for full table scans.
What are likely to stay the same for a co-resident scheme are the decisions made about de-normalising and over-normalising in a separate warehouse schema, which are taken to optimise query performance and allow an efficient roll out of historical data. Optimum implementation of the incremental synchronisation mechanism will demand some additional changes, and attention must be paid to the synchronisation process for transactional consistency, as queries are active while the incremental update is in progress. To make sure that the decision support queries are against a schema optimised for them, any decision support queries should, of course, be confined con·fine
v. con·fined, con·fin·ing, con·fines
1. To keep within bounds; restrict: Please confine your remarks to the issues at hand. See Synonyms at limit. to the warehouse schema and kept away from the operational schema.
One for all?
Ultimately, it must always be remembered that any organisation needs to objectively assess the strengths and weaknesses of all approaches based on what the business actually needs. But what must be borne in mind is that new database technology is playing a more significant role than before, and that there are ever more compelling arguments for achieving a state of harmony when implementing a co-resident schema with incremental updates.
For one thing, warehouse data is made more relevant and useful due to the timeliness of the update. Additionally, there is likely to be a reduction of operating costs operating costs npl → gastos mpl operacionales , as deployment of one database negates the need for another hardware platform, of course, as well as bulk data transfers between database instances and their probable resulting complications.
In essence, organisations need to cut their own way through the dense undergrowth of data today and the data warehouse issue. One system will most likely give more equitable sharing of the total resources than separated hardware platforms Each hardware platform, or CPU family, has a unique machine language. All software presented to the computer for execution must be in the binary coded machine language of that CPU. Following is a list of the major hardware platforms in existence today. See platform. , but whichever route is taken more disks, more memory and more CPUs is probable. An important point to remember is that the "halfway house" route may well represent a risky option and is highly likely to result in significantly greater costs to the business in the long term. It is doubtful whether organisations adopting this approach will achieve the true business benefits of data warehousing--such as the rapid access to the type of high quality, reliable and intelligent data that can be used to drive the business forward.
Another key point to keep high on the agenda is the impact of advancements. Today, both traditional and accepted wisdoMs are constantly under scrutiny--they are being tested fiercely by evolving technologies and thinking. It is precisely for this reason that co-hosting a data warehouse with an operational system must be a serious consideration for any forward-thinking business.