Don't Get Washed Out by the Overflowing Data Lake: Five Key Considerations
As database professionals, we know the truth: the data isn't just sitting there waiting to be analyzed. It's dirty, unformatted, and often not ready for analysis. Collecting and harnessing the power of massive amounts of data is a complicated exercise.
In many organizations, data management pros are still challenged by a collection of legacy enterprise data warehouse architectures, Hadoop, and cloud storage (to name just a few). The ever-growing volumes of data that we generate lead us to look for places to put it, such as the data lake. The lake grows and grows with data that may or may not have value, but the prevailing thought is that it may be needed someday.
Regardless of how good the platforms are at storing data, most data lakes are not full analytics platforms. This leaves many organizations with a massive amount of data and a huge usability gap when trying to perform analytics on it. It takes a solid strategy for governing data to make sure data analytics can be leveraged for continued business insights that can translate to company success.
Here are five key considerations for organizations that are trying to make more out of their data lake.
1. Consider Query Scaling, Not Just Data Scaling
When data is coming at you with volume, variety, and varying velocity, less consideration is given to the scalability of analytics than to keeping up with the data. Data management pros may find themselves working on the data conveyor belt, needing to take data from the belt and put it somewhere. Often this means putting the data conveniently in Hadoop Distributed File System (HDFS) volumes or cloud storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. These locations offer a fast and convenient way to store data at an amazingly low cost.
However, keep in mind that cloud storage and HDFS are not databases. A database management system (DBMS) is a place where you can load and store data in the most optimal way for queries. ACID (atomicity, consistency, isolation, durability) compliance, workload management, and concurrency are the foundations of a database. When your data is in cloud storage, you're more likely to use a query engine to perform analytics. A query engine is less concerned with optimization and more interested in data exploration. Use it to explore data that falls outside the constraints of service-level agreements.
You need a DBMS when your data needs a new home that delivers compliance with SQL standards, ACID guarantees, and built-in backup and restoration. A DBMS provides advanced methods for optimization and faster analytics. Most importantly, you store data in a DBMS when you expect it to meet service-level agreements on analytics. In other words, if you have to run a certain number of reports within a certain number of minutes, use a DBMS. If you have hundreds or even thousands of end users analyzing data, a query engine scanning unknown data volumes generally won't cut it for timely analytics.
2. Consider the Analytics You Need, Not Just the Storage You Need
For data management systems, the two most important factors are safely storing the data and effectively analyzing the data. Yet the analysis part is often under-scrutinized. Analytical systems vary greatly in the depth of analysis offered. Some don't support the full range of SQL: if you need to do a JOIN with a WHERE clause, for example, some setups can't handle it. If you want to do geospatial analytics, such as finding the distance between addresses or LAT/LONG points, some systems require extra add-ons that make the process clunky and onerous.
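To make the bar concrete, here is a minimal sketch of the kind of query the article says some systems can't run: a JOIN constrained by a WHERE clause. Python's built-in sqlite3 stands in for the engine, and the table names, columns, and values are illustrative assumptions, not from the article.

```python
import sqlite3

# Illustrative schema: orders joined to customers by region.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
cur.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 250.0), (2, 11, 80.0), (3, 10, 40.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "EMEA"), (11, "APAC")])

# A JOIN with a WHERE clause: spend per region, counting only orders over $50.
rows = sorted(cur.execute(
    """
    SELECT c.region, SUM(o.total)
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    WHERE o.total > 50
    GROUP BY c.region
    """
).fetchall())
print(rows)  # [('APAC', 80.0), ('EMEA', 250.0)]
```

If an analytical system can't express even this, every report built on it needs workarounds, which is exactly the usability gap to probe before committing.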
Much of today's big data is time-series data, so having specific functions for time is crucial. Whether you're looking at IoT data, financial services data, or data from your IT infrastructure, data created at regular intervals presents its own challenges for data quality and analytics. For example, a handful of systems provide gap-filling functionality, constructing new datapoints through interpolation within the range of a discrete set of known datapoints. Another example is event-based windows, which let you break time-series data into windows marked by significant events within the data. This is especially relevant in financial data, where analysis often focuses on specific events as triggers for other potentially nefarious activity. If you need to analyze time-series data, make sure your analytics system has features that can actually do it. Otherwise, you may be burdened with extra custom coding and extensive data preparation.
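The gap-filling described above can be sketched in a few lines. This is a hedged illustration of the general technique, linear interpolation between known datapoints, not any particular vendor's implementation; the timestamps and sensor values are made up.

```python
def fill_gaps(series, step):
    """series: sorted list of (timestamp, value) pairs.
    Returns the series with missing timestamps filled in at `step`
    intervals by linear interpolation between known datapoints."""
    filled = []
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        filled.append((t0, v0))
        t = t0 + step
        while t < t1:
            # New datapoint constructed within the range of known points.
            frac = (t - t0) / (t1 - t0)
            filled.append((t, v0 + frac * (v1 - v0)))
            t += step
    filled.append(series[-1])
    return filled

# A reading every 10 seconds, with two readings missing in between.
readings = [(0, 10.0), (30, 16.0)]
print(fill_gaps(readings, 10))
# Fills t=10 and t=20 with values interpolating to roughly 12.0 and 14.0.
```

A system that offers this natively saves exactly the "extra custom coding" the article warns about; event-based windows are a similar story, with window boundaries set by flagged events rather than fixed intervals.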
Predictive analytics is changing the way companies across every industry operate, grow, and stay competitive. Organizations are applying it to everything from improving machine uptime to reducing customer churn. Consider early whether you need predictive analytics now, or anticipate needing it in the future, and whether your analytics architecture supports it. With platforms that support in-database machine learning on large datasets, analysts can use SQL to natively create and deploy models against full data volumes, without downsampling, which accelerates decision making.
3. Consider Deploying Anywhere, Not Just in the Cloud
Particularly for public cloud deployments, many analytical solutions mandate that you bring the data into the database to perform analytics. That may sound reasonable, but there's a catch: moving your data out of a public cloud, even for routine operations, costs real money for every gigabyte. It harks back to the days when companies would buy a data warehousing appliance and load all of their data into it. The appliance was a locked system that made it difficult to export data, and so are many cloud platforms. Watch out for systems that lock you into one solution.
It's imperative that you choose tools with the widest range of deployment models. It shouldn't matter whether you deploy on-premises, in the cloud, or on Hadoop. You should be able to bring analytics to the data without making any copies. This can save you not only the time and cost of moving the data but also the licensing cost of storing it in a commercial database.
If most companies compared their deployment strategy of 3 years ago with their current one, they would admit they had no idea where they would be today. Cloud deployment is popular now, but tomorrow, who knows? If you can deploy anywhere, including on-premises, in the cloud, or on virtual machines, you will have little to worry about with future deployments.
4. Consider Storing Data in Multiple Tiers, Not Just One Tier
It's well known that different storage tiers come at different costs. In-memory platforms, such as Spark or SAP HANA, sit at the high end because they require lots of expensive memory and generally more expensive hardware to run. Hadoop and S3 storage are low cost by comparison but generally don't offer the analytical performance of an in-memory or columnar database. Enterprise architects need to store data on the correct tier, one that will meet the service-level agreements of the enterprise while keeping costs low. Companies frequently store data in Amazon S3 or Hadoop without knowing the value of much of it, then peel off portions of it to a database, usually a data warehouse, to perform analytics.
A powerful analytics platform can store and access all your data. Users store big data on Hadoop, in databases, and in the cloud because these tiers offer varying cost models, with performance to match. A multitiered approach offers quick access to hot data that is important to daily business and low-cost storage for data that varies in importance and timeliness.
Analytics platforms that support tiered storage can help you manage multiple storage tiers for cost-effective analytics. You should be able to perform advanced SQL queries on bulk data stored in HDFS. You should be able to access the ORC, Parquet, text, and JSON files that exist in the tiers and use them without moving the data. Move the data into a different tier when your organization requires faster performance for in-depth analytics.
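The "use the files where they sit" idea can be sketched simply. The snippet below scans newline-delimited JSON records straight from a cold tier and aggregates them, with nothing copied or converted; the in-memory dict standing in for HDFS or S3 objects, and the file names and fields, are illustrative assumptions.

```python
import io
import json

# Stand-in for files sitting in a cold storage tier (e.g., HDFS or S3).
cold_tier = {
    "events-2018-12-01.json": '{"user": "a", "ms": 120}\n{"user": "b", "ms": 340}\n',
    "events-2018-12-02.json": '{"user": "a", "ms": 90}\n',
}

def scan(files):
    """Yield records straight from storage; nothing is moved or converted."""
    for _name, blob in files.items():
        for line in io.StringIO(blob):
            yield json.loads(line)

# A simple aggregate over the raw tier: average latency per user.
totals, counts = {}, {}
for rec in scan(cold_tier):
    totals[rec["user"]] = totals.get(rec["user"], 0) + rec["ms"]
    counts[rec["user"]] = counts.get(rec["user"], 0) + 1
averages = {u: totals[u] / counts[u] for u in totals}
print(averages)  # {'a': 105.0, 'b': 340.0}
```

Columnar formats such as Parquet and ORC serve the same role with far better scan performance; the point is that the analytics engine reads the tier in place, and data moves to a hotter tier only when SLAs demand it.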
5. Consider Open Standards, Not Just Open Source
There are some amazing technologies that have been developed by the open source community for data management. Technologies such as Kafka and Spark are common in today's landscape and provide useful functions such as data ingestion and operational analytics. However, when the open source tools can't handle all of your unique needs, your commercial tools need to play nice with them. The commercial tools need to exhibit "openness" toward the open source tools.
This is why your software must support open standards. With open standards, your company can pick and choose among competing vendors and not be locked in to any one platform or technology. Many people assume open source software offers the same advantages, but it doesn't: open source simply means that the underlying code is available for inspection and modification. If you want to access your Parquet data, it shouldn't matter whether the solution's source code is open. What matters is that you can access that file format in place, without copying or moving the data between solutions and without converting it to a proprietary format.
The standards extend beyond open source as well. Can you use standard SQL to perform analytics? Can the extract-transform-load tool talk to the database to load data? Can you use a standard visualization tool to create stunning data stories? Can users who prefer Python use the data within your analytical systems, taking advantage of all of the optimizations it offers? Sticking with open standards helps everyone using the data, even when migrating systems.
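The Python question above comes down to standard interfaces. As a hedged sketch: Python's DB-API 2.0 (PEP 249) standardizes connect/cursor/execute across database drivers, so analysis code written against it can move between compliant systems. Here sqlite3 stands in for any such driver; the table and data are invented for illustration.

```python
import sqlite3

def top_n(conn, n):
    """Uses only standard DB-API 2.0 calls, so it is portable across
    PEP 249-compliant drivers: swap the connection, keep the code."""
    cur = conn.cursor()
    cur.execute("SELECT name, score FROM results ORDER BY score DESC LIMIT ?", (n,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (name TEXT, score REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [("etl", 0.7), ("viz", 0.9), ("sql", 0.8)])
print(top_n(conn, 2))  # [('viz', 0.9), ('sql', 0.8)]
```

The same portability logic applies to standard SQL for queries, ODBC/JDBC for ETL and visualization tools: code written to the standard survives a migration; code written to a proprietary interface does not.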
Our business colleagues see fictional portrayals of data analytics in the movies, so it's no surprise that they crave it. They see the Uber or GrubHub app showing how far away their driver or food is, and it's useful to them. They see how Google Maps can predict where they want to go and how long it will take to get there, and they want that information. So it's no surprise that colleagues expect more from you in managing your data architecture and delivering analytics that address business challenges.
The realization of fictional portrayals of data analytics is not far from reach. Only by avoiding some of the pitfalls of available technologies can we start to achieve success in the data lake. As we advance in technology, fiction will drive our reality. Be ready for it.
By Steve Sarsfield
Steve Sarsfield is senior director for Vertica at Micro Focus and is an author and expert in data quality and data governance. He authored the book, The Data Governance Imperative, a comprehensive exploration of data governance from the business perspective. At Micro Focus, Sarsfield is focused on data governance, data analytics, machine learning, data integration, data quality, and big data.
Title Annotation: The State of Data Lakes
Publication: Big Data Quarterly
Date: Dec 22, 2018