The Big 3 AI File Formats
The Jordan/Pippen/Rodman of big data
Jordan, Rodman, Pippen.
Parquet, ORC, Avro?
Big 3s are cool in basketball. Big 3s in file formats? Probably something you gloss over.
Today’s blog will convince you otherwise. Knowing the basics of these 3 file formats, and when to use which, will make you lethal in the world of AI.
So let’s dive in.
The main file formats we deal with in our day-to-day data jobs are typically CSV or JSON. They work for some use cases. But these formats weren’t designed to deal with BIG data. When they’re used for analytical workloads over massive amounts of data, compute resources get burned up much faster. Searching through millions or billions of records, value by value, to find an answer just isn’t efficient for computers. Not with a basic file format, at least!
To be specific, these older file formats lack the built-in compression, fast read/write performance, and support for nested data structures that big data projects often require. Because of these shortfalls, more sophisticated file formats eventually came to be.
Today we will run through the big 3 file formats you’ll encounter in data lakes and the world of big data and AI. These file formats are:
ORC
Parquet
Avro
Before we overview each, we need to establish two important concepts.
First concept - The structure of a data file is very important.
Namely, row-based storage vs. columnar storage. The way your data is organized can change the time it takes to answer a question or query from 6 minutes to 6 milliseconds. You should pick a format that matches both how your data is stored and how it will be used.
Second Concept - What is a Column Oriented Data File Format?
This is a fundamental concept of two of these three formats. A lot of folks think of typical row-based storage when storing records. But another helpful option is column-oriented files. In a column-oriented file, data is stored by, you guessed it, each column. So each column contains all the values for that specific attribute across all records. Let’s look at a quick example:
Let’s say our business, Rockford Corp., has customer records stored with some basic information as shown below…
A typical database table
In row-based storage, records are stored as shown below.
Row-based storage example
This is the storage format you are likely used to. However, in columnar storage, the data is stored by each column, as shown below.
Column-based storage example
Take a second to understand the differences in the way these are stored…
So now… who cares? You should! Columnar file formats are often ideal for analytics and machine learning.
A real-world example of this would be finding the average age of Rockford Corp’s customers.
Let’s find the average age by using both the row storage and column storage file formats.
ROW STORAGE - Since the “age” attribute is just one of the 4 values in each record (Firstname, Lastname, Age, ZipCode), the computer must load all 16 values across the 4 records into memory to calculate the average age of 36.25.
COLUMNAR STORAGE - Here the ages are all stored together in the same ‘record’. Answering this particular question is a lot quicker, as we only load the 4 age values into memory for our average.
That’s 4 operations instead of 16!
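To make this concrete, here’s a minimal Python sketch of the same calculation under both layouts. The names and ages are hypothetical (chosen so they average to 36.25), purely for illustration.

```python
# Hypothetical Rockford Corp. records (values invented so the ages average 36.25)
row_storage = [
    ("Alice", "Smith", 28, "61101"),
    ("Bob", "Jones", 34, "61102"),
    ("Cara", "Lee", 38, "61103"),
    ("Dan", "Kim", 45, "61104"),
]

# Row layout: every record (all 16 values) gets touched just to reach the ages
total = 0
for firstname, lastname, age, zipcode in row_storage:
    total += age
print(total / len(row_storage))  # 36.25

# Columnar layout: each column is stored together, so only the 4 age values are touched
column_storage = {
    "Firstname": ["Alice", "Bob", "Cara", "Dan"],
    "Lastname": ["Smith", "Jones", "Lee", "Kim"],
    "Age": [28, 34, 38, 45],
    "ZipCode": ["61101", "61102", "61103", "61104"],
}
print(sum(column_storage["Age"]) / len(column_storage["Age"]))  # 36.25
```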
And that’s the power of columnar storage. While this is an overly simplified example with only a few data points, you can see how this drastically cuts compute usage and improves efficiency over large data sets. Numerous machine learning and AI use cases are better served by data in columnar file formats.
Other benefits of columnar file formats and storage include:
Better compression
Faster query performance
Scalability
Got it? Good stuff. Let’s move on to the big 3 file formats.
Apache Parquet
There’s a reason I put Parquet with Michael Jordan in the opening picture. Apache Parquet is the most popular file format on this list.
Parquet is an open-source column-oriented data file format that came out in 2013. The main draw of Parquet was improved analytical query performance. That’s fancy speak for better data storage and data retrieval. It is extremely popular in Python-based projects, and Python is the most popular programming language in the world.
Parquet provides:
Efficient data compression - your data doesn’t take up tons of space
Ability to handle complex data in bulk
Available in multiple languages (Python, Java, C++, etc) - useful in lots of big data projects
Available to any project in the Hadoop ecosystem, regardless of the specific data processing framework, model, or language. This is huge as you aren’t locked into one framework.
Parquet is optimized for write-heavy workloads. It also has excellent support for complex nested data structures, making Parquet a great candidate for JSON and other nested data types.
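To give you a feel for how simple this is in practice, here’s a minimal sketch of writing and reading Parquet from pandas. The file name and data are made up, and a pyarrow installation is assumed for Parquet support.

```python
import pandas as pd  # assumes pyarrow is installed for Parquet support

# Hypothetical customer data
df = pd.DataFrame({
    "Firstname": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [28, 34, 38, 45],
})

# Write a compressed, column-oriented Parquet file
df.to_parquet("customers.parquet", engine="pyarrow", compression="snappy")

# Read back only the column we need -- cheap, because Parquet stores columns together
ages = pd.read_parquet("customers.parquet", columns=["Age"])
print(ages["Age"].mean())  # 36.25
```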
Apache ORC
The second file format you’ll run into is Apache ORC, or Optimized Row Columnar. This file is again columnar-based and is designed for big data processing systems like Hadoop.
Inside an ORC file, data is stored in stripes (which are just groupings of rows of data). Those stripes are chunked into smaller groupings of columns and then compressed into a much smaller footprint. The result is a massive data set that doesn’t take up much space!
A visual of an ORC file that I found on the internet
Need proof? Facebook (Meta) uses ORC in its data warehouse, saving tens of petabytes of storage compared to other formats.
ORC also stores indexes and rich metadata inside the file, so certain query results can be retrieved quickly instead of scanning the entire file.
The Apache ORC file format is an excellent candidate for read-heavy use cases, especially streaming, since it can locate the required records with speed.
You’re probably thinking “Hey, this sounds a lot like Parquet.” The main difference to remember is you would use ORC for read-heavy workloads, and Apache Parquet for write-heavy workloads. It’s a little more complicated than that… but if you remember this one fact, you’ll be ahead of most Data and AI learners.
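If you want to try ORC from Python, pyarrow ships an ORC module. The sketch below is a minimal example (the file name and data are hypothetical, and a pyarrow build with ORC support is assumed): it writes a small table and then reads back a single column, which is exactly the kind of selective read that ORC’s stripes and indexes speed up.

```python
import pyarrow as pa
import pyarrow.orc as orc  # assumes a pyarrow build with ORC support

# Hypothetical table
table = pa.table({
    "Firstname": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [28, 34, 38, 45],
})

# Write the table out as an ORC file
orc.write_table(table, "customers.orc")

# Read back only the Age column -- stripes and indexes make selective reads fast
orc_file = orc.ORCFile("customers.orc")
ages = orc_file.read(columns=["Age"])
print(ages.column("Age").to_pylist())  # [28, 34, 38, 45]
```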
Apache Avro
Avro is a ROW-based storage format for Hadoop. Avro stores the schema as JSON, making it easy to read by almost any program. The data itself is stored in binary, which makes it compact. One important feature of Avro is support for data schemas that change over time (this is called schema evolution). Avro can handle schema changes like missing, added, or changed fields.
The Avro format also supports rich data structures, and can even handle multiple data structures in the same record. Avro is often recommended for Kafka and for serializing data in Hadoop. Avro files are splittable and compressible, making them a really good candidate for the Hadoop ecosystem and for parallel processing.
One thing that’s special about Avro is that it is self-describing. Serialized data AND that data’s schema are bundled in the same Avro file. This allows different programs to easily deserialize messages.
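Here’s a minimal sketch using the third-party fastavro package (one of several Avro libraries for Python; the schema, records, and file name are hypothetical). Notice that the schema is just JSON-style data and gets written into the same file as the binary-encoded records.

```python
from fastavro import writer, reader  # assumes the fastavro package is installed

# The schema is plain JSON-style data, readable by almost any program
schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "Firstname", "type": "string"},
        {"name": "Age", "type": "int"},
        # The default makes this field safe to add later: readers using this
        # schema can still handle older data that lacks it (schema evolution)
        {"name": "ZipCode", "type": "string", "default": ""},
    ],
}

records = [
    {"Firstname": "Alice", "Age": 28, "ZipCode": "61101"},
    {"Firstname": "Bob", "Age": 34, "ZipCode": "61102"},
]

# Write: the schema is bundled into the same file as the binary-encoded records
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

# Read: the embedded schema is enough to deserialize -- the file is self-describing
with open("customers.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```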
So now to the important question... What use case is best for Avro?
Based on the file format’s strengths, Avro is an ideal file format candidate for your data lake’s landing zone. This is because data in this zone is typically read in its entirety downstream (row is better than column for this) PLUS those downstream systems retrieving that data can also easily retrieve the schemas (since they are stored with the file). Another great use case is standardizing data on Avro across your different systems as a consistent communication format.
And those are your big 3 file formats in the age of AI! Just like Jordan, Pippen, and Rodman dominated the 1990s, these 3 file formats dominate big data.
Quick cheat sheet for these 3 file formats
The main takeaway here is that your file format should match up to your downstream use case. Yes, that takes planning. But that planning will save you time and money 1000 fold in the future. And that, my data-driven friend, is what good engineering is.
Thanks for tuning in!
Be sure to subscribe to get our biweekly blog posts directly to your inbox.