The Rise of Object Storage

(and why you should care)

The Blob.

The Bucket. 

The almighty Object Store. 

Just under 80% of the world’s data is unstructured or held in object storage. If you work with data, this affects you. You should know what object storage is.

Give me 3 minutes of your time and I’ll explain:

  1. What object storage is

  2. Why you should care

What is Object Storage?

Object storage is the basis of the modern data repository. Yep, it’s that important.

My last post on Data Lakehouses (if you didn’t read it, you should) included this simplified architecture diagram. See what’s right in the middle?

A simplified view of a Data Lakehouse architecture

Object storage is where all of our unstructured, semi-structured, and sometimes structured files reside in a data lake or data lakehouse. Sure, there are other aspects of a data lakehouse, but object storage is king as the holder of information.

As a refresher, data lakes are where we dump data (usually unstructured) to potentially clean and analyze later. Remember how unstructured data is hard to work with? Lakehouses fixed that - think of a data lakehouse as a sort of queryable data lake. Seriously, you should check out The Modern Data Repository Crash Course.

So is object storage new or something? 

No. Object storage is actually about 30 years old. It was originally invented in the 1990s to help companies meet new compliance laws. During the 90s, a bunch of naughty companies were deleting or changing their financial records. To stop this, new laws came into effect that changed how companies could store, change, and delete data for record keeping.

Object storage initially came at a time when companies needed:

  1. Auditable data trails

  2. Unchangeable data stores

And object stores are great for those two things! But they can also:

  1. Have expansive, customizable metadata

  2. Scale cheaply

All four of these benefits together are what make object storage unique versus other storage mechanisms like block storage. If block storage is a covered parking lot with valet service, object storage is economy parking.
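To make those properties concrete, here is a minimal in-memory sketch of a write-once object store. Everything in it (the `WriteOnceStore` class, the `ObjectExistsError` exception, the example key and metadata fields) is hypothetical and invented for illustration; it is not the API of any real object storage service. It shows three of the four benefits: an audit trail, no overwrites, and free-form metadata. Cheap scaling is a property of real infrastructure and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


class ObjectExistsError(Exception):
    """Raised when a caller tries to overwrite an existing key."""


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class StoredObject:
    key: str
    data: bytes
    metadata: dict


class WriteOnceStore:
    """Toy object store: flat keys, no overwrites, every call logged."""

    def __init__(self):
        self._objects = {}
        self.audit_log = []  # list of (timestamp, action, key)

    def _log(self, action, key):
        self.audit_log.append((datetime.now(timezone.utc), action, key))

    def put(self, key, data, metadata=None):
        if key in self._objects:
            self._log("put-rejected", key)
            raise ObjectExistsError(f"{key} already exists")
        self._objects[key] = StoredObject(key, bytes(data), dict(metadata or {}))
        self._log("put", key)

    def get(self, key):
        self._log("get", key)
        return self._objects[key]


store = WriteOnceStore()
store.put(
    "finance/2024/ledger.csv",
    b"id,amount\n1,100\n",
    {"department": "finance", "retention_years": 7},  # assumed metadata fields
)

try:
    # A second write to the same key is refused, so the record cannot be altered.
    store.put("finance/2024/ledger.csv", b"id,amount\n1,999\n")
except ObjectExistsError as err:
    print("rejected:", err)

print(store.get("finance/2024/ledger.csv").metadata)
print([(action, key) for _, action, key in store.audit_log])
```

Note that the rejected write still lands in the audit log. That is the point of the design: attempts to change a record leave a trace even when they fail.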

But object storage wasn’t crazy popular at first. The need for good metadata wasn’t apparent. Object storage wasn’t yet available on the cloud.

That all changed around 2010, as enterprises increasingly needed a place to cheaply dump massive amounts of unstructured and other data to work with later. The data lake was born! It needed storage, and Cloud Object Storage was the cheap, safe, and scalable option that worked best. It became the standard storage for data lakes.

Fast forward to today: object storage dominates the world's data. According to IDC, just under 80% of the world's data is unstructured or held in object storage.

But object storage isn’t just popular because it is a cheap, scalable storage option. A bunch of dirty unstructured data together? That sounds like a data swamp! We needed the ability to search and query our object storage.

Remember how “Expansive, customizable metadata” was a benefit of object storage? The need to make sense of all this object storage demanded a solution.

Enter the Data Lakehouse. With the advent of Data Lakehouses, and more specifically the open table formats that emerged during the 2010s, object stores eventually became easier to navigate and analyze.

Open table formats made it possible to query data across object stores. There's a lot of work involved in getting there, but that's the gist of it. Open table formats will be the topic of my next blog post, so I’ll go into more detail on how they work there.

In summary, object storage is cheap, scalable storage with descriptive metadata.

It became popular in the 2010s because of the advent of Cloud Object Storage and the creation of open table formats. Together, these two advancements ushered in an explosion of unstructured data analytics and a dominant period for object storage.

Why it matters to you

This matters to you if you work with data. Specifically, unstructured data… think emails, audio files, photos, logs, videos, and other sources of “Information”. These files are likely landing in object storage to be analyzed later.

These sources don’t have rows and columns to get straightforward insights from. But they do have valuable data to analyze. Let’s look at a quick example of what I mean.

Let’s say you own a clothing store. You just released a new pair of blue jeans. You think they are great, but you know you can improve them to bring your business to new heights. You obtain an audio file of one of your customers explaining what they like and don’t like about the jeans. This is great news, as this audio has information that can be analyzed to help you improve them!

Unstructured audio data versus typical structured table data

In a perfect world, this audio file would be a spreadsheet as shown above. But it’s not. That’s not how the world works. Data is captured in all sorts of unstructured formats that you can cleanse into something you can query.

So how do we get from A to B in the figure above? How in the world do we “query” or analyze this? The answer starts with object storage.

Each object in an object store holds both the audio recording file and metadata about that file’s contents. After cleansing the audio file, users can query that metadata and the (cleansed) file contents to derive insights.
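As a rough sketch of that step, the snippet below combines an object's metadata with its cleansed contents to produce one table-style row. The `AudioObject` class, the `to_row` helper, the metadata fields, and the column names are all hypothetical. The transcript is assumed to come from a speech-to-text step that is not shown, and the keyword matching is deliberately crude.

```python
from dataclasses import dataclass


@dataclass
class AudioObject:
    key: str
    body: bytes      # the raw audio file
    metadata: dict   # descriptive metadata stored alongside it


def to_row(obj, transcript):
    """Combine object metadata with cleansed content into one table row."""
    text = transcript.lower()
    return {
        "key": obj.key,
        "product": obj.metadata.get("product"),
        "recorded_on": obj.metadata.get("recorded_on"),
        # Naive substring checks; a real pipeline would need something smarter.
        "mentions_fit": "fit" in text,
        "mentions_price": "price" in text,
    }


recording = AudioObject(
    key="feedback/customer-001.wav",
    body=b"",  # audio bytes omitted in this sketch
    metadata={"product": "blue jeans", "recorded_on": "2024-01-15"},
)

# Assumed output of a separate speech-to-text (cleansing) step.
transcript = "I love the color, but the fit is too tight at the waist."

row = to_row(recording, transcript)
print(row)
```

The resulting dictionary has the rows-and-columns shape that a query engine can work with, which is the "B" side of the jeans example.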

Utilizing object metadata also gives us improved query performance and cost. For example, if we wanted to query all audio files created in January, we could just query metadata containing a January timestamp, eliminating the need to search through the other 11 months of files.

By utilizing metadata to pre-filter our query results, we drastically speed up execution time (less data to analyze) and, as a result of fewer computations, save on compute costs. Talk about good data engineering!
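Here is a small sketch of that pre-filtering idea, under some loud assumptions: the `catalog` list stands in for whatever metadata index a real system keeps, `read_body` stands in for an expensive download, and the keys and dates are made up. The counter shows how many object bodies actually get read.

```python
from datetime import datetime

# Hypothetical metadata catalog: one entry per stored audio file, no contents.
catalog = [
    {"key": f"feedback/2024-{month:02d}.wav", "created": datetime(2024, month, 15)}
    for month in range(1, 13)
]

body_reads = 0  # counts how many objects we actually had to open


def read_body(key):
    """Stand-in for an expensive download of the object's contents."""
    global body_reads
    body_reads += 1
    return b"..."  # placeholder bytes


def keys_created_in(entries, year, month):
    """Filter on metadata only; no object body is touched here."""
    return [
        e["key"]
        for e in entries
        if e["created"].year == year and e["created"].month == month
    ]


january_keys = keys_created_in(catalog, 2024, 1)
bodies = [read_body(key) for key in january_keys]

print(january_keys)                      # ['feedback/2024-01.wav']
print(body_reads, "of", len(catalog))    # 1 of 12
```

Only one of the twelve objects is opened. The other eleven are ruled out by their timestamps alone, which is where the time and compute savings come from.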

With object storage, a whole new data repository has been created right before our eyes. By piecing together object storage, open file formats, and open table formats, we have officially entered into the golden age of unstructured data analytics and AI.

Let me know if you have any feedback on this or other articles. I love hearing from Data Dive readers!