Last week Databricks announced that it is acquiring Tabular, the company and team behind Apache Iceberg. The transaction is rumored to be in the $2B range, and Snowflake, Databricks, and Confluent all reportedly bid for Tabular. There are a few unusual elements to the deal. The first is its magnitude, which, as I mentioned, is rumored to be close to $2B. The second is that Databricks already has an equivalent to Apache Iceberg in the form of its very own Delta Lake. So why did Databricks acquire Tabular, and do so at a seemingly high price? To answer this we should revisit the data landscape and look at a few disruptive forces that have been gathering steam over the past year or so.
Disruptive forces gather steam
The first disruptive force is the cloud lakehouse approach to storing and organizing data. I wrote about this earlier:
Cloud lakehouses represent a modern architectural paradigm that harmonizes the flexibility of data lakes with the structure and performance features of data warehouses. Leveraging scalable cloud storage, lakehouses store structured, semi-structured, and unstructured data in its raw form without upfront structuring. The schema-on-read approach allows users to apply the schema at the time of analysis, adapting to the diverse data sources prevalent in modern enterprises.
Lakehouses shine in handling big data, supporting advanced analytics, machine learning, and data exploration. Seamless integration with big data processing frameworks like Apache Spark and Apache Hadoop makes them a powerhouse for processing large datasets in distributed computing environments. A game-changer is their openness, allowing data storage in open formats, steering clear of the shackles of data warehouse vendor lock-in.
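To make the schema-on-read idea above concrete, here is a minimal PySpark sketch: raw files land in object storage as-is, and a schema is only declared when a consumer reads them. The bucket path, field names, and session settings are illustrative assumptions, not details from my earlier post.

```python
# Minimal schema-on-read sketch with PySpark; paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events sit in object storage exactly as they were ingested, with no upfront structuring.
raw_path = "s3://example-lake/raw/events/"  # hypothetical location

# The schema is declared at analysis time, not at ingestion time.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = spark.read.schema(event_schema).json(raw_path)
events.createOrReplaceTempView("events")

# Different consumers can project different schemas over the same raw files.
spark.sql("SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id").show()
```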
The second disruptive force is that of tabular formats, which are one of the foundations of building cloud lakehouses. Earlier in that same post, I wrote this about tabular formats:
Now, let's talk tables, which are a critical component of the lakehouse stack.
Tabular data formats, such as Iceberg, Delta Lake, and Hudi, step into the scene to tackle the complexity of processing and analyzing large-scale data in cloud-based lakehouse architectures. Unlike traditional big data formats focusing on storage encoding, these tabular abstractions provide higher-level logical representations of data, managing storage, processing, and optimization under a unified framework.
This marriage of data warehouse strengths—structured schemas, ACID transactions, and SQL access—with the scalability and flexible schema of cloud data lakes is a game-changer. Analytics, reporting, and ETL workloads can leverage mature, database-style constructs, benefiting from nearly unlimited storage capacity. Iceberg, Delta Lake, and Hudi each bring their unique strengths and tradeoffs, catering to different use cases and scenarios.
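As a rough illustration of what those database-style constructs look like over a data lake, here is a hedged sketch that creates and transactionally updates an Iceberg table from Spark SQL. The catalog name, warehouse path, runtime package version, and table schema are all assumptions made for the example, and the Iceberg Spark runtime must be available on the classpath.

```python
# Hedged sketch: SQL DDL/DML and ACID commits over files that live in cheap object storage.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Package coordinates and version are an assumption; adjust to your Spark/Iceberg versions.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse")  # hypothetical path
    .getOrCreate()
)

# Structured schema and familiar SQL, backed by open files in the lake.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.orders (order_id BIGINT, amount DOUBLE, ts TIMESTAMP)
    USING iceberg
""")

# Writes are atomic commits against the table's metadata (ACID semantics),
# so concurrent readers never see a partially written snapshot.
spark.sql("INSERT INTO lake.db.orders VALUES (1, 42.0, current_timestamp())")
spark.sql("UPDATE lake.db.orders SET amount = 45.0 WHERE order_id = 1")
```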
The third disruptive force is AI and how lakehouses can enable this use case. The promise of the lakehouse approach is to consolidate all your data, structured, semi-structured, and unstructured, into massively scalable storage systems, and to store that data in open formats. AI is a prime example of a use case that is witnessing hyper growth and one that benefits from this approach. In addition, the lakehouse approach allows different use cases, like business intelligence, to access the same data without the need to copy or move data around.
The net result of the rapid adoption of cloud lakehouses and tabular formats is that more organizations are building a data stack that looks like the one depicted in the diagram below. One of the key decisions to make when building an architecture like this is the choice of tabular format: Iceberg, Hudi, or Delta Lake.
A few months ago I wrote an article reacting to Iceberg being prominently featured on Snowflake’s earnings call - Frank Slootman’s last earnings call as CEO of Snowflake.
Iceberg was mentioned no less than 18 times during the company’s earnings call, second only to “AI” (28) as per my very non-scientific analysis of the call. Now, why would Snowflake’s CEO, both new and past, alongside their CFO mention a somewhat obscure Apache project on the call? The answer is simple: Iceberg is moving data out of Snowflake.
We are forecasting increased revenue headwinds associated with product efficiency gains, tiered storage pricing, and the expectation that some of our customers will leverage Iceberg tables for their storage. Source: Snowflake (SNOW) Q4 2024 Earnings Call Transcript
The impact is not just the loss of storage revenue, but of the more valuable compute revenue too.
So, the amount of revenue associated with storage is coming down. But on top of that, we do expect a number of our large customers are going to adopt Iceberg formats and move their data out of Snowflake where we lose that storage revenue and also the compute revenue associated with moving that data into Snowflake. Source: Snowflake (SNOW) Q4 2024 Earnings Call Transcript
The fact that an open source project was prominently highlighted in the earnings call of one of the largest data vendors in the world signaled to me that Iceberg had won. Iceberg was already disruptive to Snowflake. And if Snowflake is feeling it, then so is Databricks.
Iceberg had won the war. It is the de facto tabular format.
Iceberg is storage, but the opportunity is compute
One of the consequences of the cloud lakehouse approach is the separation of compute from storage. Compute engines like Spark, Trino, Dremio, and others are now interoperable and can query or access the same data without the need to move or copy it. Tabular formats like Iceberg act as the common format for reading and writing data stored in lakehouses, and therefore allow for both the separation of compute from storage and compute interoperability.
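To make that separation concrete, here is a minimal sketch, assuming the Iceberg table from the earlier snippet exists under the same hypothetical warehouse path: a second, completely independent Spark application reads the table simply by pointing at the same catalog, with no export or copy step. All names and paths are illustrative assumptions.

```python
# Same storage, different compute: an independent application reads the shared Iceberg table.
from pyspark.sql import SparkSession

reader = (
    SparkSession.builder.appName("independent-consumer")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse")  # same files as the writer
    .getOrCreate()
)

reader.sql("SELECT order_id, amount FROM lake.db.orders").show()

# Other engines work the same way: a Trino or Dremio cluster configured with an Iceberg
# connector pointed at the same warehouse could run something like
#   SELECT order_id, amount FROM iceberg.db.orders;
# against the very same files, without moving or copying the data.
```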
With Iceberg winning the tabular format war, it is now the centralized repository that stores all the metadata associated with data in the cloud data lake. Said otherwise, Iceberg is now the storage (and transactional) layer for data stored in cloud lakehouses. And owning the storage layer is highly strategic, for two main reasons.
First, data is sticky: once you store and organize data in a certain way, it's very hard to move it or store it differently.
Second, storage unlocks more valuable, margin-rich compute, which is where Snowflake and Databricks derive most of their revenue and profit.
Earlier I mentioned how compute in the form of AI is the third disruptive force. Historically, the compute used atop data stored in data warehouses or lakehouses was for analytical or BI purposes. That's the core use case for Snowflake. One of the advantages of the lakehouse approach is that it enables different compute engines and use cases to access the same data. And in today's world a lot of this compute is for AI use cases. Iceberg, therefore, is the gateway to both analytical and AI compute, and the latter is a substantially larger opportunity.
Snowflake observed the impact of Iceberg through a decline in storage revenue. Snowflake customers opted not to store their data in Snowflake, but instead in cloud lakehouses, with Iceberg as the tabular format of choice. It wasn't just the decline in storage revenue that Snowflake was worried about, but the potential of losing out on the much more lucrative compute margins. If data is stored outside of Snowflake in Iceberg tables, then it can be analyzed by other compute engines like Trino, Dremio, and Databricks.
One can then see the motivation for both Snowflake and Databricks (rumor has it that Confluent was also bidding for Tabular) to own Tabular. Snowflake would own a highly strategic piece of the data platform and, more importantly, would keep Databricks locked into Delta Lake, which "lost" the tabular format war. Conversely, Databricks owning Tabular keeps that asset away from Snowflake, and still leaves the latter with an Iceberg and AI problem.
In both cases, Snowflake and Databricks are trying to own the storage layer that will unlock the most AI and business intelligence compute, and for AI in particular, that's a lot of $$$.