
An overview of data warehousing and OLAP technology.

INTRODUCTION

The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse" for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs it incurred. In the absence of a data warehouse architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently; though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, such as long-existing operational systems, was typically replicated in part for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

Need for Data Warehousing:

The data needed to provide reports, analytic applications and ad hoc queries already exists within the set of production applications that supports the organization, yet there are many reasons why querying those applications directly never works, and no reason to add the organization's name to the long list of failures. New releases of application software frequently introduce changes that make it necessary to rewrite and test reports, and these changes make it difficult to create and maintain reports that summarize data originating in more than one release. Field names are often hard to decipher or are meaningless strings of characters, and application data is often stored in odd formats such as century-Julian dates and numbers without decimal points. Tables are structured to optimize data entry and validation performance, making them hard to use for retrieval and analysis. There is no good way to incorporate worthwhile data from other sources into the database of a particular application. Without a data warehouse, there is no obvious place to develop and store metadata. Many data fields that users are accustomed to seeing on display screens, such as rolled-up general ledger balances, are not present within the database. Finally, priority is given to transaction processing: reporting and analysis functions tend to perform poorly when run on the hardware that handles transactions.

Differences between Operational Database Systems and Data Warehouses:

IT systems can be divided into transactional (OLTP) and analytical (OLAP) systems. In general, it is assumed that OLTP systems provide the source data for data warehouses (Thomsen, 1997), whereas OLAP systems help to analyze it; the differences between them are given in Table 1 and Fig. 1.

Relational OLAP (ROLAP):

Relational OLAP servers use a relational or extended-relational DBMS to store and maintain warehouse data, with OLAP middleware to support the missing pieces; they include optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services, and they offer greater scalability. Multidimensional OLAP (MOLAP) servers use an array-based multidimensional storage engine with fast indexing to pre-computed summarized data. Hybrid OLAP (HOLAP) combines ROLAP and MOLAP technology, giving the user flexibility, for example relational storage at the low level and arrays at the high level. Specialized SQL servers (Gupta and Mumick, 1999) provide specialized support for SQL queries over star and snowflake schemas.

Indexing OLAP Data: Bitmap Index:

A bitmap index is built on a particular column: each value in the column has a bit vector, and bit operations on these vectors are fast. The length of each bit vector equals the number of records in the base table, and the i-th bit is set if the i-th row of the base table has that value for the indexed column. Bitmap indexes are not suitable for high-cardinality domains (many distinct values); see Fig. 2.
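To make the idea concrete, the following is a minimal Python sketch of a bitmap index; the table contents and column name are invented for illustration, not taken from the source:

```python
# Bitmap-index sketch: one bit vector per distinct value of the indexed
# column; the i-th bit is set if row i holds that value.

def build_bitmap_index(rows, column):
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

base_table = [{"city": "Chennai"}, {"city": "Delhi"}, {"city": "Chennai"}]
idx = build_bitmap_index(base_table, "city")
print(idx["Chennai"])  # [1, 0, 1] -> rows 0 and 2 match

# Bit operations are fast: AND/OR vectors to combine predicates.
either = [a | b for a, b in zip(idx["Chennai"], idx["Delhi"])]
```

Note that the index holds one full-length vector per distinct value, which is exactly why it degrades on high-cardinality columns.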

Join indices work as follows: a join index JI(R-id, S-id) relates relations R(R-id, ...) and S(S-id, ...). Whereas traditional indices map values to a list of record ids, a join index materializes a relational join in the JI file and thus speeds up the relational join, a rather costly operation in data warehouses. A join index relates the values of the dimensions of a star schema to rows in the fact table; this can be illustrated with a fact table Sales and two dimensions, city and product. A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city (and likewise for a join index on item); join indices can span multiple dimensions. Efficient processing of OLAP queries means determining which operations should be performed on the available cuboids, transforming drill, roll, and similar operations into corresponding SQL and/or OLAP operations (for example, dice = selection + projection), determining to which materialized cuboid(s) the relevant operations should be applied, and exploring indexing structures and compressed versus dense array structures in MOLAP.
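A hedged Python sketch of a join index on the city dimension of such a star schema (the fact rows are invented for illustration):

```python
# Join-index sketch: map each distinct city to the R-IDs (row positions)
# of the fact-table tuples recording sales in that city, so a star join
# can be answered without scanning the whole fact table.

fact_sales = [  # R-id is the implicit list position
    {"city": "Chennai", "item": "TV",    "amount": 300},
    {"city": "Delhi",   "item": "Radio", "amount": 120},
    {"city": "Chennai", "item": "Radio", "amount": 80},
]

join_index_city = {}
for rid, row in enumerate(fact_sales):
    join_index_city.setdefault(row["city"], []).append(rid)

# All sales in Chennai, fetched via the join index:
chennai_sales = [fact_sales[rid] for rid in join_index_city["Chennai"]]
print(chennai_sales)
```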

On-Line Analytical Processing to On-Line Analytical Mining (OLAM):

The advantages of on-line analytical mining include the high quality of data in data warehouses, which contain integrated, consistent, cleaned data; the available information-processing infrastructure surrounding data warehouses (ODBC, OLE DB, web accessing, service facilities, reporting and OLAP tools); OLAP-based exploratory data analysis, that is, mining with drilling, dicing and pivoting; and on-line selection of data mining functions (Srivastava et al., 1996), with integration and swapping of multiple mining functions and tasks, as presented in Fig. 3.

Data Warehouse Architecture:

The architecture in Fig. 4 is quite common; an organization might customize the warehouse architecture for different groups by adding data marts, which are systems designed for a particular line of business. Fig. 4 illustrates an example in which purchasing, sales, and inventories are separated, and a financial analyst studies the historical data for purchases and sales.

Data Warehouse Development and Recommended Approach:

An enterprise warehouse is a collection of all the information about subjects spanning the entire organization (Fig. 5). A data mart is a subset of corporate-wide data that is of value to specific groups of users; its scope is confined to specific, selected groups, such as a marketing data mart. A virtual warehouse is a set of views over operational databases, some of which may be materialized in future.

Conceptual Modeling of Data Warehouses:

The star schema has a fact table in the middle connected to a set of dimension tables: it contains a large central table (the fact table, Fig. 6) and a set of smaller attendant tables (dimension tables), one for each dimension.
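As an illustration (the sales fact and dimension contents below are hypothetical, not from the source), a star schema can be pictured as one central fact table holding foreign keys into each dimension table:

```python
# Star-schema sketch: a central fact table keyed into one dimension table
# per dimension; a star join resolves the foreign keys.
dim_item = {1: {"name": "TV", "brand": "Sony"},
            2: {"name": "Radio", "brand": "Philips"}}
dim_city = {10: {"city": "Chennai", "state": "Tamil Nadu"}}

fact_sales = [
    {"item_key": 1, "city_key": 10, "dollars_sold": 300},
    {"item_key": 2, "city_key": 10, "dollars_sold": 120},
]

for f in fact_sales:  # resolve each fact row against its dimensions
    print(dim_item[f["item_key"]]["name"],
          dim_city[f["city_key"]]["city"], f["dollars_sold"])
```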

The snowflake schema is a refinement of the star schema in which some dimensional hierarchies are further split (normalized) into a set of smaller dimension tables (Fig. 7), forming a shape similar to a snowflake. However, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed.

Multiple fact tables can share dimension tables; viewed as a collection of stars, this is called a galaxy schema or fact constellation (Fig. 8).

Concept hierarchies are defined by grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.
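For example, a set-grouping hierarchy on a location dimension might group city values under states; the values below are illustrative:

```python
# Set-grouping hierarchy: each higher-level value names the set of
# lower-level values it groups, e.g. city -> state for a location dimension.
location_hierarchy = {
    "Tamil Nadu": {"Chennai", "Madurai"},
    "Delhi NCR":  {"Delhi", "Noida"},
}

def roll_up(city):
    """Map a city to its state by searching the grouping sets."""
    for state, cities in location_hierarchy.items():
        if city in cities:
            return state
    return None

print(roll_up("Madurai"))  # Tamil Nadu
```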

Data Cube:

A data cube, such as a sales cube, allows data to be modeled and viewed in multiple dimensions.

Design of Data Warehouse:

Four views must be considered regarding the design of a data warehouse. The top-down view allows selection of the relevant information necessary for the data warehouse. The data source view exposes the information being captured, stored, and managed by operational systems. The data warehouse view consists of fact tables and dimension tables. The business query view sees the perspectives of data in the warehouse from the viewpoint of the end user.

Three Data Warehouse Models:

The enterprise warehouse collects all of the information about subjects spanning the entire organization. A data mart is a subset of corporate-wide data that is of value to specific groups of users; its scope is confined to specific, selected groups, such as a marketing data mart. A virtual warehouse is a set of views over operational databases, of which only some of the possible summary views may be materialized.

Metadata Repository:

Metadata is the data defining warehouse objects. It includes the following kinds: a description of the structure of the warehouse (schema, views, dimensions, hierarchies, derived-data definitions, and data mart locations and contents); operational metadata, such as data lineage (the history of migrated data and its transformation paths), the currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, audit trails); the algorithms used for summarization; the mapping from the operational environment to the data warehouse; data related to system performance; and business metadata, including business terms and definitions, ownership of data, and charging policies.

3D Data cube Example:

Efficient Data Cube Computation:

A data cube can be viewed as a lattice of cuboids: the bottom-most cuboid is the base cuboid, and the top-most cuboid (the apex) contains only one cell. Materialization (pre-computation) of the data cube may cover every cuboid (full materialization), none (no materialization), or some (partial materialization); the selection of cuboids to materialize is based on size, sharing, access frequency, etc. Cube definition and computation in DMQL:

define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales

This is transformed into a SQL-like language (with a new operator, CUBE BY, introduced by Gray et al., 1997):

SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year

This requires computing the following group-bys: (year, item, city), (year, item), (year, city), (item, city), (year), (item), (city), and (). Efficient cube computation methods include ROLAP-based cubing algorithms (Agarwal et al., 1996), the array-based cubing algorithm (Zhao et al., 1997), and the bottom-up computation method (Beyer and Ramakrishnan, 1999).
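The CUBE BY computation above can be sketched in Python: for n dimensions it evaluates all 2^n group-bys, from the base cuboid down to the apex. The data values are made up for illustration:

```python
from itertools import combinations
from collections import defaultdict

sales = [  # illustrative (item, city, year, amount) rows
    ("TV", "Chennai", 2014, 300),
    ("TV", "Delhi", 2014, 200),
    ("Radio", "Chennai", 2015, 120),
]
dims = ("item", "city", "year")

cube = {}
for k in range(len(dims) + 1):
    for group in combinations(range(len(dims)), k):  # all 2^n group-bys
        agg = defaultdict(int)
        for *vals, amount in sales:
            agg[tuple(vals[i] for i in group)] += amount  # project and sum
        cube[tuple(dims[i] for i in group)] = dict(agg)

print(cube[("item",)])  # {('TV',): 500, ('Radio',): 120}
print(cube[()])         # {(): 620} -- the apex cuboid
```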

ROLAP-based cubing algorithms apply sorting, hashing, and grouping operations to the dimension attributes in order to reorder and cluster related tuples; grouping is performed on some sub-aggregates as a "partial grouping step", and aggregates may be computed from previously computed aggregates rather than from the base fact table. The hash/sort-based methods (Agarwal et al., 1996; Berson and Smith, 1997) use several optimizations: smallest-parent (compute a cuboid from the smallest previously computed cuboid), cache-results (cache the results of a cuboid from which other cuboids are computed, to reduce disk I/O), amortize-scans (compute as many cuboids as possible at the same time, to amortize disk reads), share-sorts (share sorting costs across multiple cuboids when a sort-based method is used), and share-partitions (share the partitioning cost across multiple cuboids when hash-based algorithms are used).
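A small sketch of the smallest-parent optimization: here the (item) cuboid is aggregated from an already-computed (item, city) cuboid instead of rescanning the base fact table. The cuboid contents are illustrative:

```python
from collections import defaultdict

def aggregate_from_parent(parent_cuboid, parent_dims, child_dims):
    """Compute a child cuboid from a previously computed (smaller) parent
    cuboid rather than from the base fact table."""
    keep = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, measure in parent_cuboid.items():
        child[tuple(key[i] for i in keep)] += measure
    return dict(child)

# (item,) computed from (item, city), its smallest materialized parent:
parent = {("TV", "Chennai"): 300, ("TV", "Delhi"): 200,
          ("Radio", "Chennai"): 120}
print(aggregate_from_parent(parent, ("item", "city"), ("item",)))
# {('TV',): 500, ('Radio',): 120}
```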

Multi-way Array Aggregation for Cube Computation:

Partition the array into chunks (small sub-cubes that fit in memory), and use compressed sparse array addressing: (chunk_id, offset). Compute aggregates in a "multi-way" manner by visiting cube cells in an order that minimizes the number of times each cell must be visited, thereby reducing memory access and storage costs. The planes should be sorted and computed according to their size in ascending order: keep the smallest plane in main memory, and fetch and compute only one chunk at a time for the largest plane. The limitation of this method is that it computes well only for a small number of dimensions; if there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods (Beyer and Ramakrishnan, 1999; Donjerkovic and Ramakrishnan, 1999) can be explored.
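A hedged NumPy sketch of chunked multi-way aggregation (the cube size and chunk size are illustrative): each chunk is brought into memory once, and all three 2-D plane aggregates are updated from it, so no cell is visited twice.

```python
import numpy as np

# Dense 3-D cube over dimensions (A, B, C); in MOLAP it would be stored
# as compressed sparse chunks addressed by (chunk_id, offset).
cube = np.arange(4 * 4 * 4, dtype=np.int64).reshape(4, 4, 4)
chunk = 2  # 2x2x2 chunks that fit in memory

bc = np.zeros((4, 4), dtype=np.int64)  # plane with A aggregated out
ac = np.zeros((4, 4), dtype=np.int64)  # plane with B aggregated out
ab = np.zeros((4, 4), dtype=np.int64)  # plane with C aggregated out

for i in range(0, 4, chunk):
    for j in range(0, 4, chunk):
        for k in range(0, 4, chunk):
            c = cube[i:i+chunk, j:j+chunk, k:k+chunk]  # one chunk in memory
            bc[j:j+chunk, k:k+chunk] += c.sum(axis=0)  # update all three
            ac[i:i+chunk, k:k+chunk] += c.sum(axis=1)  # planes from the
            ab[i:i+chunk, j:j+chunk] += c.sum(axis=2)  # same chunk visit

assert (ab == cube.sum(axis=2)).all()  # matches a full per-plane scan
```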

Discovery-Driven Exploration of Data Cubes:

Hypothesis-driven exploration is performed by the user and faces a huge search space. Discovery-driven exploration (Sarawagi et al., 1998) precomputes measures indicating exceptions and guides the user in the data analysis at all levels of aggregation. An exception is a cell value significantly different from the value anticipated, based on a statistical model; visual cues such as background color are used to reflect the degree of exception of each cell, and the computation of exception indicators can be overlapped with cube construction. SelfExp indicates the degree of surprise of the cell value relative to other cells at the same level of aggregation; InExp indicates the degree of surprise somewhere beneath the cell, if one were to drill down from it; and PathExp indicates the degree of surprise for each drill-down path from the cell.
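As a simplified stand-in for the statistical model in Sarawagi et al. (the published method fits a model of anticipated values; the plain mean and standard deviation below are an assumption made for brevity), the following sketch scores cells by how far they deviate from the other cells at their level of aggregation:

```python
import statistics

def self_exp(cells):
    """Crude SelfExp-style score: standardized deviation of each cell
    from the cells at the same level of aggregation. (Not the actual
    model-based indicator of the published method.)"""
    mean = statistics.mean(cells.values())
    sd = statistics.pstdev(cells.values()) or 1.0
    return {k: abs(v - mean) / sd for k, v in cells.items()}

monthly_sales = {"Jan": 100, "Feb": 105, "Mar": 98, "Apr": 240}
scores = self_exp(monthly_sales)
# 'Apr' gets the largest score and would be rendered with the strongest
# background color as a visual cue.
print(max(scores, key=scores.get))  # Apr
```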

Data Warehouse Back-End Tools and Utilities:

Data extraction gets data from multiple, heterogeneous, and external sources. Data cleaning detects errors in the data and rectifies them when possible. Data transformation converts data from the legacy or host format to the warehouse format. The load utility sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions. Refresh propagates the updates from the data sources to the warehouse.
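A minimal sketch of this extract-clean-transform-load pipeline; the source records, field names, and formats (a CYYDDD century-Julian date and an amount stored without a decimal point, as mentioned earlier in this article) are invented for illustration:

```python
from datetime import date, timedelta

def extract():
    """Extract: pull records from a (hard-coded, illustrative) source."""
    return [{"cust": " Asha ", "cjjj": "115074", "amt": "12345"},
            {"cust": "Ravi",   "cjjj": "115075", "amt": None}]

def clean(rec):
    """Clean: detect errors and rectify them when possible."""
    rec["cust"] = rec["cust"].strip()
    rec["amt"] = rec["amt"] or "0"  # repair a missing amount
    return rec

def transform(rec):
    """Transform: century-Julian date CYYDDD -> calendar date, and an
    amount without a decimal point -> dollars."""
    c, yy, ddd = (int(rec["cjjj"][0]), int(rec["cjjj"][1:3]),
                  int(rec["cjjj"][3:]))
    rec["date"] = date(1900 + 100 * c + yy, 1, 1) + timedelta(days=ddd - 1)
    rec["dollars"] = int(rec["amt"]) / 100
    return rec

warehouse = []  # Load: append transformed rows, ready for indexing.
for r in extract():
    warehouse.append(transform(clean(r)))
print(warehouse[0]["date"], warehouse[0]["dollars"])  # 2015-03-15 123.45
```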

Data Warehouse Usage:

There are three kinds of data warehouse applications. Information processing supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts and graphs. Analytical processing supports multidimensional analysis of data warehouse data and basic OLAP operations such as slice-and-dice, drilling and pivoting. Data mining supports knowledge discovery from hidden patterns, including finding associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. These characteristics distinguish the three tasks.

Conclusion:

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. The multi-dimensional model of a data warehouse (star schema, snowflake schema, or fact constellation) and its data cube consist of dimensions and measures. OLAP operations include drilling, rolling, slicing, dicing and pivoting, and OLAP servers include ROLAP, MOLAP and HOLAP. Efficient computation of data cubes covers partial versus full versus no materialization, multi-way array aggregation, and bitmap index and join index implementations.

ARTICLE INFO

Article history:

Received 12 October 2014

Received in revised form 26 December 2014

Accepted 1 January 2015

Available online 17 February 2015

REFERENCES

Berson, A. and S.J. Smith, 1997. Data Warehousing, Data Mining, and OLAP. McGraw-Hill, New York.

Codd, E.F., S.B. Codd and C.T. Salley, 1993. Beyond decision support. Computerworld.

Imhoff, C., N. Galemmo and J.G. Geiger, 2003. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley & Sons, New York.

Donjerkovic, D. and R. Ramakrishnan, 1999. Probabilistic optimization of top N queries. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB'99), pages 411-422, Edinburgh, UK, Sept.

Srivastava, D., S. Dar, H.V. Jagadish and A.Y. Levy, 1996. Answering queries with aggregation using views. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 318-329, Bombay, India, Sept.

Thomsen, E., 1997. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons.

Gupta, A. and I.S. Mumick, 1999. Materialized Views: Techniques, Implementations, and Applications. MIT Press.

Gray, J., S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow and H. Pirahesh, 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1: 29-54.

Hellerstein, J., P. Haas and H. Wang, 1997. Online aggregation. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 171-182, Tucson, Arizona.

Hellerstein, J.M., R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth and P.J. Haas, 1999. Interactive data analysis: The control project. IEEE Computer, 32: 51-59.

Widom, J., 1995. Research problems in data warehousing. In Proc. 4th Int. Conf. Information and Knowledge Management, pages 25-30, Baltimore, Maryland.

Beyer, K. and R. Ramakrishnan, 1999. Bottom-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359-370, Philadelphia, PA.

Ross, K. and D. Srivastava, 1997. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pages 116-125, Athens, Greece.

Carey, M. and D. Kossman, 1998. Reducing the braking distance of an SQL query engine. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 158-169, New York, NY.

Fang, M., N. Shivakumar, H. Garcia-Molina, R. Motwani and J.D. Ullman, 1998. Computing iceberg queries efficiently. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pages 299-310, New York, NY.

O'Neil, P. and D. Quass, 1997. Improved query performance with variant indexes. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 38-49, Tucson, Arizona.

O'Neil, P. and G. Graefe, 1995. Multi-table joins through bitmapped join indices. SIGMOD Record, 24: 8-11.

Valduriez, P., 1987. Join indices. ACM Trans. Database Systems, 12: 218-246.

Deshpande, P., J. Naughton, K. Ramasamy, A. Shukla, K. Tufte and Y. Zhao, 1997. Cubing algorithms, storage estimation, and storage and processing alternatives for OLAP. Data Engineering Bulletin, 20: 3-11.

Agrawal, R., A. Gupta and S. Sarawagi, 1997. Modeling multidimensional databases. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), pages 232-243, Birmingham, England.

Kimball, R. and M. Ross, 2002. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2ed.). John Wiley & Sons, New York.

Agarwal, S., R. Agrawal, P.M. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan and S. Sarawagi, 1996. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 506-521, Bombay, India.

Chaudhuri, S. and U. Dayal, 1997. An overview of data warehousing and OLAP technology. SIGMOD Record, 26: 65-74.

Sarawagi, S. and M. Stonebraker, 1994. Efficient organization of large multidimensional arrays. In Proc. 1994 Int. Conf. Data Engineering (ICDE'94), pages 328-336, Houston, TX.

Shoshani, A., 1997. OLAP and statistical databases: Similarities and differences. In Proc. 16th ACM Symp. Principles of Database Systems, pages 185-196, Tucson, Arizona.

Harinarayan, V., A. Rajaraman and J.D. Ullman, 1996. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 205-216, Montreal, Canada.

Inmon, W.H., 1996. Building the Data Warehouse. John Wiley & Sons.

Zhao, Y., P.M. Deshpande and J.F. Naughton, 1997. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 159-170, Tucson, Arizona.

(1) E. Rajakumar and (2) Dr. R. Raja

(1) Assistant Professor, Department of Computer Science and Engineering, Sri Aravindar Engineering College, Sedarapet,

(2) Principal, Sri Aravindar Engineering College, Sedarapet, Aadhar ID: 301904009901

Corresponding Author: E. Rajakumar, Assistant Professor, Department of Computer Science and Engineering, Sri Aravindar Engineering College, Sedarapet.

E-mail: rajkumar30980@gmail.com

Table 1: Differences between OLTP and OLAP

Source of data
  OLTP: Operational data; OLTP systems are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and historical data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized, with many tables.
  OLAP: Typically de-normalized, with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.