Apache Iceberg: The Quickstart Guide

In case you missed it… 2024 is the year of Apache Iceberg. 

Today we are going to discuss:

  1. What is Apache Iceberg?

  2. How does it work?

  3. What are some real-world use cases?

  4. The open-source community around Iceberg vs. competitors

What is Apache Iceberg?

Apache Iceberg has quickly become the most popular Open Table Format. So what is it?

Apache Iceberg is a truly open-source table format for Parquet, ORC, and Avro files. Businesses can capture data quickly and cheaply in these file formats in their data lake, then use Apache Iceberg tables as an abstraction layer over those files to add the following functionality:

  • Schema evolution

  • ACID transactions

  • SQL querying on data lakes

  • Incremental processing of data

  • Consistent, reliable data states for all users

  • Time travel - querying current or past snapshots
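The last bullet, time travel, is easiest to picture as a chain of immutable snapshots: every write produces a new snapshot, and a reader can point at any of them. Here is a minimal, hypothetical sketch of that idea in Python (the class and field names are illustrative, not Iceberg's actual API):

```python
from dataclasses import dataclass

# A hypothetical sketch of time travel: each write commits a new immutable
# snapshot, and readers pick a snapshot by id (or take the latest).
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # paths of the data files visible in this snapshot

class IcebergTableSketch:
    def __init__(self):
        self.snapshots = []

    def commit(self, data_files):
        # Writes never mutate old snapshots; they append a new one.
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # None -> current snapshot; a past id -> "time travel".
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
        return list(snap.data_files)

t = IcebergTableSketch()
t.commit(["f1.parquet"])
t.commit(["f1.parquet", "f2.parquet"])
print(t.scan())               # current: ['f1.parquet', 'f2.parquet']
print(t.scan(snapshot_id=1))  # time travel: ['f1.parquet']
```

Because old snapshots are never modified, querying the past is just reading an older pointer.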

How exactly does it work? 

Iceberg tables don’t actually house the data. Instead, the data is kept in Parquet, ORC, or Avro files, and Iceberg is used as an abstraction layer. What does that mean?

Apache Iceberg utilizes a system of pointers and metadata files to keep track of CHANGES to the underlying data files. The pointers and metadata files comprise our Apache Iceberg table! 

We will now dive into how Iceberg works under the covers, and learn how a SELECT statement would execute in this architecture. 

Architecturally, an Iceberg table format has three layers:

Layer 1 - The Iceberg catalog

The catalog is the highest level, and is the starting point for any interaction with Iceberg tables. For each table, it stores the current metadata pointer - the location of that table's current metadata file. If we have database 1 (db1) and table1 as our first Apache Iceberg table, the catalog entry for db1.table1 points to the current metadata file.

db1.table1 represents database 1, table 1
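The catalog's job really is that small: map a table name to the location of its current metadata file, and swap that pointer atomically on commit. That atomic swap is what lets concurrent writers stay consistent. A hypothetical sketch (the class and paths are illustrative, not a real catalog implementation):

```python
# Hypothetical sketch: the catalog maps a table identifier to the location
# of its current metadata file. Commits replace the pointer with a
# compare-and-swap, so a stale writer's commit is rejected instead of
# silently overwriting someone else's work.
class CatalogSketch:
    def __init__(self):
        self._pointers = {}

    def current_metadata(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata):
        # Only swap if the pointer still matches what this writer last saw.
        if self._pointers.get(table) != expected:
            return False  # conflicting commit; the caller must retry
        self._pointers[table] = new_metadata
        return True

cat = CatalogSketch()
cat.commit("db1.table1", None, "metadata/v1.json")
ok = cat.commit("db1.table1", "metadata/v1.json", "metadata/v2.json")       # succeeds
stale = cat.commit("db1.table1", "metadata/v1.json", "metadata/v3.json")    # rejected
```

The failed third commit models a writer who started from an outdated metadata file: it must re-read the pointer and try again, which is the basis of Iceberg's optimistic concurrency.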

Layer 2 - The Metadata layer

The metadata layer consists of a few reference files:

Metadata file - This file stores high-level metadata about your table at a certain point in time. The most important piece of information it contains is the current snapshot. The current snapshot gives us the CURRENT table, which consists of a manifest list, manifest files, and data files (stay with me, it will make sense by the end).

Manifest list - a list of manifest files. This list contains the path (the location) of each manifest file contained in a snapshot.

Manifest files - The purpose of a manifest file is to track the data files. Manifest files contain information about the underlying data files in object storage. Information such as file location, record count, and partition values is stored in the manifest file and can be used to make querying more efficient.

Layer 3 - The Data (Storage) layer

Data files - This layer contains the actual data in your Iceberg table. These would be in either the Parquet, ORC, or Avro file format. These data files are managed by the files in the metadata layer.
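Putting the three layers together, here is how the SELECT statement promised earlier would resolve: catalog → current metadata file → manifest list → manifest files → data files. The sketch below simulates that pointer chain with plain dictionaries; all file names and the structure of each entry are hypothetical, simplified stand-ins for Iceberg's real JSON and Avro formats:

```python
# Layer 1: the catalog maps the table name to the current metadata file.
catalog = {"db1.table1": "metadata/v2.metadata.json"}

# Layer 2: the metadata file names the current snapshot...
metadata_files = {
    "metadata/v2.metadata.json": {"current_snapshot": "snap-2"},
}

# ...the snapshot's manifest list names its manifest files...
manifest_lists = {
    "snap-2": ["manifests/m1.avro", "manifests/m2.avro"],
}

# ...and each manifest file tracks data files plus stats used for pruning.
manifest_files = {
    "manifests/m1.avro": [{"path": "data/a.parquet", "record_count": 100}],
    "manifests/m2.avro": [{"path": "data/b.parquet", "record_count": 50}],
}

def plan_select(table):
    meta_location = catalog[table]                                # layer 1
    snapshot = metadata_files[meta_location]["current_snapshot"]  # layer 2
    files = []
    for manifest in manifest_lists[snapshot]:
        for entry in manifest_files[manifest]:
            files.append(entry["path"])                           # layer 3
    return files

print(plan_select("db1.table1"))  # ['data/a.parquet', 'data/b.parquet']
```

Notice that the query engine never scans the lake blindly: it walks the metadata down to an exact list of data files, which is where Iceberg's query-planning efficiency comes from.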

Real-world use cases 

Now you understand how Iceberg tables work under the hood. But what use cases are becoming more prevalent that make 2024 the year of Apache Iceberg?

  1. Strict data privacy laws - There are data privacy laws that require deleting data after a certain period. If that data is kept in a data lake, Iceberg allows for easy deletion of relevant records or tables.

  2. Updates at the record level - If you sell something and that transaction is stored in your data lake, and then the customer returns it… what do you do? In immutable data stores, we must reprocess the entire data set. With Iceberg, we can make record-level changes in the data lake.

  3. ACID transactions - these allow data lakes to function as transactional data stores.
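Use cases 1 and 2 both rely on the same trick: rather than mutating an immutable data file in place, the engine rewrites only the affected file without the deleted or changed rows and commits a new snapshot that references the rewritten copy. This is a hypothetical copy-on-write sketch of a record-level delete, with illustrative file names (Iceberg also supports a merge-on-read strategy using delete files, which is not shown here):

```python
# Hypothetical copy-on-write sketch of a record-level delete: the affected
# data file is rewritten without the deleted rows, untouched files are
# reused, and the result would back a NEW snapshot of the table.
def delete_record(data_files, file_name, predicate):
    new_files = {}
    for name, rows in data_files.items():
        if name == file_name:
            kept = [r for r in rows if not predicate(r)]
            new_files[name.replace(".parquet", "-rewritten.parquet")] = kept
        else:
            new_files[name] = rows  # untouched files are reused as-is
    return new_files

# Snapshot v1: two data files, three order records.
v1 = {"data/sales-1.parquet": [{"order": 1}, {"order": 2}],
      "data/sales-2.parquet": [{"order": 3}]}

# The customer returns order 2: only sales-1 is rewritten.
v2 = delete_record(v1, "data/sales-1.parquet", lambda r: r["order"] == 2)
```

Because v1 is left untouched, the old snapshot remains fully queryable - which is exactly why deletes, updates, and time travel coexist so naturally in this design.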

Open Source Community

The last point, and probably the most important, is that there are only a few open table format options for a modern data repository - Iceberg, Delta Lake, and Hudi. Iceberg, in many experts' opinion, has the best open-source community contributing to its development.

Delta Lake (A competing Open Table Format) is open-source, but its two biggest contributors by far are Databricks and Microsoft. If your business works with those companies, then Delta Lake may be a good choice for your lakehouse architecture. But if you are looking for true open source, Apache Iceberg is likely the better option. Iceberg has an incredibly diverse and talented community across a plethora of companies contributing to its advancement.

It’s the most feature-rich, and arguably the most genuinely open, table format out there.

2024 is the year of Apache Iceberg. Are you ready for the iceberg takeover?