The Modern Data Repository Crash Course
Database → Data Warehouse → Data Lake → Data Lakehouse
By the end of this blog post, you’ll have a solid understanding of each, and finally understand (in simple terms) how the Data Lakehouse came to be and why they are all the rage.
An artist’s rendering of the invention of the Data Lakehouse
When you think of collecting data, the first thing that pops into your head is likely a database.
And that’s a great start! Databases were the first way to efficiently store and recall large amounts of data in an organized fashion. The RDBMS reigned supreme for years.
Then came this thing called the internet. As the internet and e-commerce exploded through the late 1990s and early 2000s, the amount of data being generated exploded along with them. These events gave rise to the age of “Big Data”.
Big Data gave rise to the era of analytics and reporting. These activities, under the umbrella of “Business Intelligence”, came to dominate enterprises. The ability to accurately obtain historical data for reporting, forecasting, customer analysis, market trends, etc. quickly became a key focus of every business everywhere.
But we needed a technology to power our Business Intelligence.
So you might be thinking, why not use a database? Doesn’t it store lots of data?
Databases are designed to write data really fast (transactional, or OLTP, workloads) - that’s what made them special. The problem is, reporting and analytics require reading large amounts of data fast (analytical, or OLAP, workloads). A system purpose-built for storing and reading massive amounts of historical data did not exist at the time. That is, until the Data Warehouse came on the scene at the turn of the millennium.
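To make that workload split concrete, here’s a minimal sketch in Python, using SQLite purely as a stand-in (a real setup would contrast a transactional database with a dedicated warehouse); the table and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, ts TEXT)")

# OLTP-style workload: many small, fast writes - what databases excel at.
conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", (1, "alice", 19.99, "2024-01-15"))
conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", (2, "bob", 5.00, "2024-01-16"))
conn.commit()

# OLAP-style workload: a single query scanning lots of historical rows -
# what warehouses are purpose-built for.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)
```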
The purpose of an Enterprise Data Warehouse (EDW) is to consolidate data in an organized fashion from a variety of databases to help businesses slice and dice their data. The ultimate goal of this was to use data to make better business decisions.
It’s worth noting that EDWs don’t actually do the slicing and dicing (Business Intelligence tools handle that). The EDW provides those tools a trustworthy foundation on which we can reliably slice and dice that historical data.
A data warehouse can be broken down into four components:
Ingestion - use ETL tools to bring in data from siloed sources (see the sketch after this list)
Storage - stores the data in a central database
Metadata - Data about your data, specifying things like usage, values, statistics, and other insights
Consumption - Tools to access the data within your data warehouse such as querying, reporting, development and OLAP tooling
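As a rough illustration of the ingestion component, here’s a hedged sketch of a tiny ETL job. SQLite stands in for both the siloed source system and the warehouse’s central store, and every table and column name is made up for the example:

```python
import sqlite3

# Hypothetical siloed source system (in-memory for the sketch).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, " Alice ", 1999), (2, "BOB", 500)])

# Stand-in for the warehouse's central store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_usd REAL)")

# Extract: pull rows out of the source system.
rows = source.execute("SELECT id, customer, amount_cents FROM orders").fetchall()

# Transform: cleanse and normalize *before* the data lands (schema-on-write).
cleaned = [(oid, cust.strip().lower(), cents / 100.0) for oid, cust, cents in rows]

# Load: land the conformed rows in the warehouse.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()
```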
With Data Warehouses, enterprises were now enabled with BI, analytics, and reporting. Finally, we had a solid way to slice and dice swathes of historical data!
But the continued explosion of data soon ran into a new problem…
The issue of unstructured and semi-structured data from sources like social media, IoT sensors, email, and more.
Data warehouses were designed to work really well with structured data. And while EDWs could technically work with unstructured/semi-structured data… it had to be drastically cleaned up first (which wasn’t really practical). This presented a big problem, as organizations were unable to get much (if any) value from their non-structured data sources.
This all changed around 2010, when Pentaho CTO James Dixon introduced the concept of a Data Lake.
A data lake is a repository where files or objects are stored in their original format.
In this case, there is no pre-defined schema like in a data warehouse. This allows us to consolidate and analyze data of all kinds for a variety of business purposes. The end result? Valuable insights from data that was previously unusable.
The key advantage of a data lake is being able to store almost any type/size/format of data in its original state (both structured and unstructured). The main trade-off here, however, is that data lakes can lack governance and guardrails around that data.
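To give a feel for what “storing data in its original format” looks like in practice, here’s a sketch using boto3 against S3-style object storage. The bucket name, key layout, and payloads are hypothetical, and it assumes AWS credentials are already configured:

```python
import json
import boto3

s3 = boto3.client("s3")

# A semi-structured IoT reading, landed exactly as it arrived - no schema applied.
event = {"sensor_id": "t-042", "temp_c": 21.7, "ts": "2024-01-15T10:03:00Z"}
s3.put_object(
    Bucket="acme-data-lake",              # hypothetical bucket
    Key="raw/iot/2024/01/15/t-042.json",  # partition-style key layout
    Body=json.dumps(event).encode("utf-8"),
)

# Unstructured data (images, audio, logs) lands the same way, as raw bytes.
s3.put_object(
    Bucket="acme-data-lake",
    Key="raw/audio/support_call.mp3",
    Body=b"...raw audio bytes...",
)
```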
As data lakes emerged, they were (and sometimes still are) custom-built. This gives data engineers great flexibility, as they can choose what each component is made of. The key components include:
Data ingestion - ETL tools for batch, as well as Kafka (and others) for real-time/streaming. You want to have a standardized ingestion framework.
Storage - Early data lakes were built using on-premises HDFS clusters. But the high cost of these systems ultimately ushered in the era of cloud data lakes (which are based on object storage).
Processing (trusted) zone - This is where data is transformed and enriched for use (quality checks and remediation).
Consumption zone - how the data is accessed for business use.
Data governance and management zone - Data auditing, metadata management, lineage, cataloging, security, monitoring, operations, etc. This zone applies to the other four as an overlay.
You’re probably thinking, “Hey, these components seem similar to the data warehouse’s.”
The components are similar, but the key differentiator is when the schema is applied. In a data warehouse, you had to define the schema and cleanse the data before landing it in storage (schema-on-write). With a data lake, we can simply dump copies of the original data into object storage and impose structure later, when the data is read (schema-on-read). This offers us greater flexibility and scalability.
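Here’s a small self-contained sketch of schema-on-read using pandas. The in-memory buffer stands in for a raw JSON-lines file sitting in the lake, and the field names are invented:

```python
import io
import pandas as pd

# Stand-in for a raw JSON-lines file in the lake's landing zone.
raw = io.StringIO(
    '{"sensor_id": "t-042", "temp_c": 21.7, "ts": "2024-01-15T10:03:00Z"}\n'
    '{"sensor_id": "t-043", "temp_c": 18.2, "ts": "2024-01-15T10:04:00Z"}\n'
)

# Schema-on-read: types and structure are applied at query time,
# not before the data landed in storage.
df = pd.read_json(raw, lines=True)
df["ts"] = pd.to_datetime(df["ts"])
print(df[df["temp_c"] > 20.0])
```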
Data lakes and data warehouses are typically used in tandem. Data lakes act as a catch-all system for new data, and data warehouses apply downstream structure to specific data from this system. The problem is, coordinating these systems to provide reliable data can be costly in both time and resources.
So, like Rocky Balboa and Apollo Creed in Rocky III, the data lake and data warehouse inevitably joined forces - giving us a best-of-both-worlds data repository: the Data Lakehouse.
The data lakehouse merged the best aspects of the data lake and the data warehouse: the ability to quickly land data in its original format in cheap, scalable storage, while providing the structure and reliability of a data warehouse.
Lakehouses pair the data structures of a warehouse with the object storage of a data lake. This gives companies the ability to access trusted big data quickly. Lakehouses also support structured, semi-structured, and unstructured data, allowing users to tackle BI as well as complex data science and machine learning use cases.
Data lakehouses look somewhat like data lakes, at least at the start. Typically, however, data within a lakehouse is converted to the Delta Lake format, an open-source storage layer that brings reliability, metadata management, and ACID transaction functionality (like a data warehouse) to a data lake.
The Delta Lake framework is a bit out of scope for this crash course… but for those who want to learn more, check out their website: https://delta.io/
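That said, for a quick taste of the idea, here’s a minimal sketch using the open-source deltalake Python package (the delta-rs bindings); the table path and data are hypothetical:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"sensor_id": ["t-042"], "temp_c": [21.7]})

# Each write is recorded in the Delta transaction log as an ACID commit.
write_deltalake("lake/iot_readings", df)                 # creates the table
write_deltalake("lake/iot_readings", df, mode="append")  # transactional append

# Readers always see a consistent snapshot of the table.
dt = DeltaTable("lake/iot_readings")
print(dt.version())   # two commits so far: versions 0 and 1
print(dt.to_pandas())
```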
Example of a Data Lakehouse Architecture
Why should you care?
Because 70% of surveyed enterprises say that the majority of their analytics workloads will run on a data lakehouse within three years. These same organizations project a 75% cost savings with the lakehouse architecture versus their current data repository architectures.
It’s not some fancy new architecture that makes data lakehouses all the rage… it’s the benefits this architecture brings:
Drastic cost reduction - By utilizing lower-cost cloud object storage, lakehouses keep operational costs drastically lower than data warehouses.
Better scalability - With warehouses, compute and storage are coupled together. Since lakehouses decouple the two, teams can access the same storage while bringing their own compute.
Real-time support - With the continued rise of streaming and real-time ingestion, this is huge for enterprises.
Improved governance - Plain data lakes lack governance. But with lakehouses, ingested data can be checked against defined schema requirements, reducing data quality issues and preventing data swamps (see the sketch below).
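As a sketch of that governance in action (reusing the hypothetical deltalake setup from earlier), an append whose schema doesn’t match the table is rejected instead of silently polluting the lake:

```python
import pandas as pd
from deltalake import write_deltalake

# Create a small table with a numeric temp_c column.
write_deltalake("lake/readings_demo", pd.DataFrame({"temp_c": [21.7]}))

# An append with the wrong type for temp_c violates the table's schema...
bad = pd.DataFrame({"temp_c": ["not-a-number"]})
try:
    write_deltalake("lake/readings_demo", bad, mode="append")
except Exception as err:
    # ...and is rejected outright - the bad data never lands in the table.
    print("append rejected:", err)
```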
Data lakehouses yield cost-efficiency AND are easier to use. What a win-win!
In summary - while databases, data warehouses, and data lakes offer businesses a ton of value and will remain in use…
The Data Lakehouse is the data repository of the future.