Cloud Data Warehouses vs Cloud Data Lakes – Where are the Lines Drawn?
Background:
When data lakes first came on the scene they were usually on-premises Hadoop clusters. Such systems could store any kind of data in its raw format at massive scale. With the Hadoop Distributed File System (HDFS) the data was stored multiple times across the cluster so there was little to chance of data loss (although it could happen). Hadoop also supported parallel processing so massive jobs could be broken down into smaller chunks and run concurrently. Performing analytics on the data was possible but took a long time and was mostly done in batch (using map reduce routines written in Java). Although on-premises Hadoop clusters ushered in the era of Big Data and supported data use cases that previously were not possible, many of these systems proved to be costly and cumbersome to maintain. One thing was for sure though – at this point in the evolution of the data lake there was essentially no integration with data warehouses. Yes the data lake was used as a staging ground for feeds into the data warehouse and, yes, offloading long running ETL to Hadoop clusters was an early favorite use case. However, from an analytics perspective there was no integration between the data warehouse platform and the data lake. They were two distinctly different planets and no one would ever confuse the two. At the time (approximately 10 years ago), there was even talk of data lakes possibly replacing data warehouses.
In today’s world data lakes and data warehouses still have very different primary objectives (data lakes store data in raw format, utilize schema-on-read, data has varying levels of accuracy and often is not verified; data warehouses store curated, governed data; utilize schema-on-write; considered a source of truth), but they are often thought of as two integrated components of a larger data architecture that addresses all the data use cases of an organization. This is a positive trend but with modern Cloud architectures sometimes the lines can get blurry between the two. In the rest of this blog post I will describe the evolution of data lakes and data warehouses and how they have become more integrated.
Evolution:
As on-premises data lakes became popular and more common, some new requirements began to emerge. People wanted to query and explore the data in the data lake with much less latency than the predominant batch analytics of the time; they wanted more transparency about what data was in the data lake; they wanted more governance with respect to what data was being put into the data lake (i.e. avoid a ‘data swamp’); they wanted to reduce storage costs for the large volume of data that was being stored and they wanted a unified analytics front end to be able to analyze data in both data lakes and data warehouses.
Concurrent to this emerging set of requirements, cloud computing started to come into the forefront with the possibility of dramatically changing the economics of data lakes most significantly with respect to object storage. All the major cloud vendors provided the capability to store any kind of data in object storage at very low cost. This data in the Cloud could then easily be made available to big data and/or analytics processing in the Cloud. So we began to see a migration of on-premises data lakes to the Cloud and a preference for the Cloud for new data lakes.
As the Cloud was becoming the new preferred platform for data lakes, Cloud data warehouses were also evolving and becoming the preferred platform for data warehouses – both in terms of migration of on-premises data warehouses and new data warehouses.
With all of an organization’s data being stored in the Cloud (i.e., unstructured, semi-structured, structured), it was only natural for Cloud vendors to begin thinking about how best to leverage that data and extract value in an integrated and unified way that served the needs of the full spectrum of users (i.e., executives, professional data scientists, data analysts, business analysts, casual end users, etc).
So the vendors of the major cloud data warehouses started to build in features and functionality to allow Cloud data warehouses to tackle some Cloud data lake use cases. This took three major forms which I have outlined below:
1. ‘Extending’ the Cloud Data Warehouse to include external tables containing Data Lake data – this is where data that is stored in the data lake, typically in object storage at one of the cloud vendors, is defined to the Cloud Data Warehouse as an external table. The data can then be queried using SQL in the same manner and using the same analytics platform as the users would use for data in the data warehouse. It should be noted that in this example the data is not ingested into the Cloud data warehouse – but rather it is defined logically to the Cloud Data Warehouse. This allows data lake data to be analyzed and explored immediately without any time required for loading and transformation.
Related Resources:
2. Ingesting semi-structured data into the data warehouse – the second way in which Cloud data warehouses begin to take on some Cloud data lake functionality is by ingesting semi structured data directly into the data warehouse. To support this scenario the major Cloud vendors have added special data types to their data warehouses to allow loading of semi-structured data directly into data warehouse tables. Once there, the data can be queried using SQL like any other data in the data warehouse. Because semi-structured data has some structure, the data warehouse is able to parse it out and infer the structure – thus enabling querying without a specific interim step. The major types of semi-structured data include JSON, Parquet, ORC, AVRO, XML, CSV. There is some variation across vendors with respect to support for these formats.
Related Resources:
3. Run Spark Processing Jobs Against Data in the Cloud Data Warehouse – The third way that Cloud Data warehouses have taken on some data lake use cases is by running Spark processing jobs against data in the Cloud data warehouse. Spark is a unified analytics engine for large-scale data processing. Spark runs in-memory and supports parallel processing so it has become almost a de facto standard for Big Data analytical processing. For many years running a Spark job was done almost exclusively against data in the data lake. Now, the Cloud data warehouses from the major vendors have been updated so that Spark can read data from and write data to them. So it is now possible to run Spark jobs directly against the data in Cloud data warehouses.
Related Resources:
As can be seen from this article, Cloud data warehouses have evolved and have been extended to take on some functions previously only performed in data lakes. Even though this may get confusing at times, it is overall a positive trend as it provides more options to end users and makes it easier to run analytics on and extract value from the full spectrum of an organization’s data. Lastly, it should be noted that Cloud Data Lakes and Cloud Data Warehouses continue to have very different primary objectives and should be considered as two separate components of an overall modern data architecture – recognizing that from an implementation standpoint the separation may only be logical and not physical (i.e., the Cloud data warehouse may serve as the data lake but not vice versa).