Simplifying a complex data ecosystem.

Data warehousing has a reputation for being complex. So does Hadoop. So why would complexity + complexity = simplicity?

The answer lies in what each technology simplifies. In each case, one technology dramatically simplifies another, at the added (though smaller) price of some extra cost or technical complexity.

For example, data warehousing simplifies the retrieval of data by structuring it in a form organized for that purpose. The simplification and standardization it brings to query and analytics are worth the complexity of building the warehouse.

In the same way, Hadoop simplifies the process of structuring data, the kind of structuring needed to create a data warehouse. It does so by letting organizations ingest data into a highly scalable environment without changing its structure: in other words, they extract data, load it into Hadoop, and then transform it (ELT), a significant change from the traditional extract, transform, load (ETL) approach. With Hadoop and ELT, data is copied (so-called "data at rest") or streamed ("data in motion") into place with nothing lost along the way, and then manipulated from there.
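
To make the ELT pattern concrete, here is a minimal sketch using PySpark; the paths, file format, and column names are hypothetical, but the shape of the work is the point: the raw extract is landed unchanged, and the warehouse-friendly structure is derived from it afterward.

    # Minimal ELT sketch in PySpark (paths and column names are hypothetical):
    # the source extract is landed unchanged, then transformed from the raw copy.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

    # Extract and Load: copy the extract into the cluster as-is ("data at rest").
    raw = spark.read.option("header", True).csv("hdfs:///landing/orders/2017-04-01.csv")
    raw.write.mode("overwrite").parquet("hdfs:///raw/orders/dt=2017-04-01")

    # Transform: derive a warehouse-friendly structure from the untouched raw copy.
    orders = spark.read.parquet("hdfs:///raw/orders/dt=2017-04-01")
    curated = (orders
               .withColumn("order_ts", F.col("order_ts").cast("timestamp"))
               .withColumn("amount", F.col("amount").cast("double")))
    curated.write.mode("overwrite").parquet("hdfs:///curated/orders/dt=2017-04-01")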

Using ELT, every data processing step is, in theory, reproducible, while strain on operational systems is dramatically reduced. Just as the shift from on-premises to in-the-cloud has simplified operations, the move from ETL to ELT has simplified the organization of the data.

For example, data cleansing used to happen at some stage (often an independent, temporary one) in the movement of data from the source system to the operational data store or data warehouse. In that kind of process, it was often hard to track the provenance of any given record or data point. In a Hadoop-based data warehousing project, on the other hand, the data is ingested "dirt and all." Then the data is cleansed, with the dirty data and the cleansed data coexisting in the result set.
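
As a rough illustration of how the dirty and cleansed data can coexist, the following PySpark sketch (hypothetical paths, columns, and rules) keeps the raw ingest untouched, writes a cleansed set alongside it, and quarantines rather than discards the records that fail the rules, so provenance is never lost.

    # Hypothetical cleansing sketch: the raw data set stays as ingested, and the
    # cleansed and rejected sets are written next to it instead of replacing it.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

    raw = spark.read.parquet("hdfs:///raw/customers")                  # "dirt and all"

    cleansed = (raw
                .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalization rule
                .filter(F.col("customer_id").isNotNull()))             # validity rule

    rejected = raw.filter(F.col("customer_id").isNull())               # kept, not thrown away

    cleansed.write.mode("overwrite").parquet("hdfs:///cleansed/customers")
    rejected.write.mode("overwrite").parquet("hdfs:///quarantine/customers")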

Since Hadoop allows you to manage large data sets without throwing anything away, it's easy to track how the data is being manipulated. New rules can be applied and new data sets generated without worrying as much about the effect on other data sets. No data warehouse has to be updated as a "single version of the truth" as long as the Hadoop data store contains valid data sets with valid rules to get from one form of data to another.

Similarly, records can be scored with calculated measures, often statistical or predictive ones, as they're streamed into the Hadoop ecosystem. For instance, a statistical model can be created with large samples of data. Once the model is developed, it can be applied to individual incoming records as part of a Flume job, a Kafka topic stream, or a Spark pipeline. Machine learning (e.g., through Mahout) works in the same sort of way.
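
Here is a hedged sketch of that scoring step, using Spark ML as a stand-in for whichever pipeline technology carries the records; the data set, features, and label below are hypothetical. A model is fitted on a large historical sample and then applied to newly arriving records.

    # Hypothetical scoring sketch with Spark ML: build the model on a batch
    # sample, then apply it to incoming records as part of the pipeline.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("scoring-sketch").getOrCreate()

    # 1. Develop the model from a large historical sample.
    history = spark.read.parquet("hdfs:///curated/transactions/history")
    assembler = VectorAssembler(inputCols=["amount", "item_count"], outputCol="features")
    model = LogisticRegression(labelCol="is_fraud").fit(assembler.transform(history))

    # 2. Score newly arriving records with the same model.
    incoming = spark.read.parquet("hdfs:///raw/transactions/incoming")
    scored = model.transform(assembler.transform(incoming))   # adds prediction/probability
    scored.write.mode("append").parquet("hdfs:///scored/transactions")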

So ELT into the Hadoop ecosystem can simplify data warehousing, including data ingestion, data transformation, data quality, scoring, and machine learning. What can simplify the Hadoop ecosystem?

Some of the more tedious tasks in Hadoop can be simplified by software. For instance, management of environment variables and parameters often can be automated so that developers don't need to specify them by hand.

More importantly, data management jobs can be developed for multiple technologies without additional effort. For example, a job that implements the ingestion, transformation, and cleansing of a data stream in Flume can, with the right tooling, also be deployed on Map/Reduce, Spark, and other technologies. Differences between Flume, Flafka, and Kafka can be kept to a minimum. Spark pipelines can be implemented quickly and easily.
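
One way to picture that kind of portability (a hypothetical sketch, not any particular vendor's tooling): the cleansing rule is written once as an engine-neutral function, and a thin adapter hands it to whichever engine the tooling targets. A Spark batch adapter is shown; a Kafka consumer or a streaming pipeline could reuse the same function.

    # Hypothetical sketch of an engine-neutral data management rule. The rule is
    # plain Python; only the thin adapter below is Spark-specific.
    import json
    from pyspark import SparkContext

    def cleanse_record(record):
        """Engine-independent cleansing rule applied to one record (a dict)."""
        out = dict(record)
        out["email"] = (out.get("email") or "").strip().lower()
        return out

    def run_on_spark(source_path, target_path):
        """Adapter: apply the same rule as a Spark job over line-delimited JSON."""
        sc = SparkContext(appName="portable-cleansing-job")
        (sc.textFile(source_path)
           .map(json.loads)
           .filter(lambda r: r.get("customer_id") is not None)
           .map(cleanse_record)
           .map(json.dumps)
           .saveAsTextFile(target_path))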

This is important when getting data experts to work on data management jobs. They don't have to understand Hadoop, as long as they understand the data.

It's also important when a job needs to migrate from one implementation type to another, such as Map/Reduce to Spark.

The end goal is clear: Simplify query and analytics through data warehousing; simplify data warehousing through Hadoop-based data management; simplify Hadoop-based data warehousing projects through better data management tooling.