- Data Dive
- Posts
- The Benefits of Open Table Formats
The Benefits of Open Table Formats
A quick summary + the 4 main benefits
If you work with data - you’ve probably heard of the term “Open Table Format”.
If you haven’t, or if you want to learn what Open Table Formats (OTFs) are and why they are all the rage, this post is for you.
What is an Open Table Format
You’re probably familiar with a table in a relational database. It’s a grouping of columns and rows of data that we can query.
Example of a traditional relational database table
So what’s the difference between this “Table Format” and an “Open Table Format”? Let me explain.
Remember that data lakes are composed of files (Parquet, ORC, etc.) in HDFS or Object Storage. These files are visual to us as the end user. Using parquet as an example, that file could be Example.parquet.
These file formats are different from Open Table Formats like Apache Iceberg, Delta Lake, or Apache Hudi. File formats and open table formats work together.
An Open Table Format is an abstraction layer on top of a data lake’s files/storage that introduces functionality traditional database tables have.
Reference architecture for Apache Iceberg, the most popular OTF
What functionality are we talking about?
Benefits of an Open Table Format
Schema and partition evolution + CRUD operations
Relational database tables allow for C.R.U.D. operations (Create, Read, Update, and Delete). In a typical data lake, however, users can only create objects (or files) and read them. Data lakes typically utilize object storage (or HDFS), which does not provide an easy way for users to update the data. These storage mechanisms are designed to hold immutable (unchangeable) copies of data. That is, until, OTFs came onto the scene. With OTFs, you can update columns/records, schemas, and partitions across object stores without completely reprocessing the data.
Improved performance
Open Table Formats allow analytical engines (Spark, Presto) to filter by metadata BEFORE executing a query. This drastically reduces the number of compute operations and records to read through for queries over large data sets. Quick example…
Let’s say our company, Rockford Corp, has customers aged 20-70. We store all purchase transactions in parquet files, partitioned by decade of age (20-29, 30-39, etc…). We want to analyze just those individuals aged 20-29. With an Open Table Format, we have the metadata of these files stored in our catalog, allowing us to search ONLY those files that meet the condition of age = 20-29. This allows us to skip over all other age groups, drastically improving our time to query and the performance of that query, versus if we had to query all records.
ACID functionality
ACID functionality, which stands for Atomicity, Consistency, Isolation, and Durability, are four key characteristics of a database table. These four properties together ensure database operations across groups of records can happen concurrently without issues. If any singular event or transaction fails, the entire process fails and the database reverts to the last stable state. This is extremely important for certain applications and use cases where multiple reads and writes are happening concurrently. ACID stands for
Atomicity - Guarantees all commands in a transaction either succeed together or fail together
Consistency - Guarantees all transactions are following the constraints or rules set
Isolation - Transactions run in an isolated environment, allowing two transactions to run concurrently
Durability - Transactions that are completed successfully are guaranteed to persist in the database.
For more on ACID functionality, check out this great blog post from the team at IBM.
Time travel
Open Table Formats utilize metadata to create each “snapshot” or version of that table and its contents at a point in time. Each snapshot is a grouping of metadata across the files and object stores. One unique aspect of OTFs is that because these snapshots are captured and kept, users can roll back to previous snapshots. This allows for “Time travel” as we can utilize older versions of these tables whenever we need.
Note - Between the main Open Table Formats (Iceberg, Delta Lake, and Hudi) each one works slightly differently than the others. We will deep dive into the key differences between these three in our next post.
The important point here is that without an open table format, data lakes DO NOT have the critical functionality listed above. But, with data files organized under a standardized table format, we get the full data warehouse experience on the data lake.
This is where the term “Data Lakehouse” comes from.
In summary, an Open Table Format is an abstraction layer on top of modern file formats that give us functionality such as Schema/Partition evolution, CRUD operations, better performance, ACID transactions, and time travel.
Thanks for reading! If you’ve read this far, I’d love to hear your thoughts on this article and would appreciate any feedback you may have.