In recent years, we've witnessed significant strides in data platforms, with Snowflake and Databricks emerging as standout players in this evolving landscape. Snowflake has earned its stripes with its cloud-based data warehousing solution, presenting organizations with a scalable and flexible platform to efficiently manage and analyze extensive datasets. On the flip side, Databricks has taken the limelight with its unified analytics platform, anchored in Apache Spark, fostering a collaborative environment for data engineering, machine learning, and data science. A noteworthy concept, "lakehouses," has also surfaced and is heavily championed by Databricks. Lakehouses provide an alternative to traditional data warehouses by combining the strengths of data lakes and warehouses, as we will see shortly.
In this post, I'll break down the distinctions between two approaches: data warehouses and lakehouses, shedding light on the major players shaping this emerging space, including Snowflake and Databricks, among others.
Data warehouses
Data warehouses have been the backbone of data management for decades, whether as on-premises or cloud-based repositories. These warehouses excel at storing structured and organized data in a tabular format, ideal for quick and efficient querying. Designed for business intelligence (BI) and reporting applications, they follow a relational database model with tables and columns. While proficient at handling structured data, they hit their limits when dealing with unstructured or semi-structured data.
The structured nature of data warehouses is a double-edged sword. On the one hand, with a predefined schema, they excel at organizing and storing data, making them ideal for BI and reporting. On the other hand, this structure can become a constraint when faced with unstructured data. Finally, because data warehouses store data in their own proprietary formats, they carry a risk of vendor lock-in, which poses challenges in adapting to evolving business needs.
Cloud lakehouses
Enter cloud lakehouses, representing a modern architectural paradigm that harmonizes the flexibility of data lakes with the structure and performance features of data warehouses. Leveraging scalable cloud storage, lakehouses store structured, semi-structured, and unstructured data in its raw form without upfront structuring. The schema-on-read approach allows users to apply the schema at the time of analysis, adapting to diverse data sources prevalent in modern enterprises.
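To make the schema-on-read idea concrete, here is a minimal PySpark sketch. The bucket path and field names are hypothetical; the point is that the schema is applied when the data is read, not when it is written, so two analysts can project different views over the same raw files.

```python
# Schema-on-read with PySpark: the schema is supplied at query time,
# not enforced at write time. Paths and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw JSON events land in the lake as-is, with no upfront structuring.
raw_path = "s3://my-data-lake/events/"

# One analyst reads the events as clickstream data...
clicks_schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("ts", TimestampType()),
])
clicks = spark.read.schema(clicks_schema).json(raw_path)

# ...while another projects a purchases view over the very same files.
purchases_schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])
purchases = spark.read.schema(purchases_schema).json(raw_path)
```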
Lakehouses shine in handling big data, supporting advanced analytics, machine learning, and data exploration. Seamless integration with big data processing frameworks like Apache Spark and Apache Hadoop makes them a powerhouse for processing large datasets in distributed computing environments. A game-changer is their openness, allowing data storage in open formats, steering clear of the shackles of data warehouse vendor lock-in.
However, lakehouses aren't without their challenges. The decentralized nature of data storage poses data governance hurdles, demanding robust practices for metadata management, data quality, and security. Analyzing raw data can be challenging for users accustomed to structured data, and performance can lag behind data warehouses, particularly when the data sits in slower cloud object stores like S3.
Tabular data formats
Now, let's talk tables, which are a critical component of the lakehouse stack.
Tabular data formats, such as Iceberg, Delta Lake, and Hudi, step into the scene to tackle the complexity of processing and analyzing large-scale data in cloud-based lakehouse architectures. Unlike traditional big data formats focusing on storage encoding, these tabular abstractions provide higher-level logical representations of data, managing storage, processing, and optimization under a unified framework.
This marriage of data warehouse strengths - structured schemas, ACID transactions, and SQL access - with the scalability and flexible schema of cloud data lakes gives you the best of both worlds. Analytics, reporting, and ETL workloads can leverage mature, database-style constructs, benefiting from nearly unlimited storage capacity. Iceberg, Delta Lake, and Hudi each bring their unique strengths and tradeoffs, catering to different use cases and scenarios.
In summary, the choice among these major tabular formats hinges on specific organizational requirements, evaluating existing technology stacks. Iceberg excels in schema evolution and consistency, making it ideal for data warehousing. Delta Lake, with its integration with Apache Spark, unifies batch and streaming for diverse workloads. Meanwhile, Apache Hudi's focus on efficient upserts and indexing suits incremental data processing needs.
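To illustrate the database-style constructs these formats bring to a data lake, here is a small, hypothetical sketch using Delta Lake on Spark (it assumes the delta-spark package is installed; the path and rows are made up). Iceberg and Hudi expose comparable capabilities through their own APIs.

```python
# ACID writes, SQL access, and time travel on plain object storage,
# via Delta Lake on Spark. Assumes the delta-spark package; the path
# and rows are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3://my-data-lake/orders"

# ACID append: the whole batch commits atomically or not at all.
df = spark.createDataFrame(
    [(1, "shipped"), (2, "pending")], ["order_id", "status"]
)
df.write.format("delta").mode("append").save(path)

# Plain SQL over files in object storage, including a transactional UPDATE.
spark.sql(f"UPDATE delta.`{path}` SET status = 'shipped' WHERE order_id = 2")

# Time travel: read the table as of its first commit.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```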
Architectural overview: data warehouse vs. lakehouse
Let's delve into how these two platforms differ architecturally.
The diagram below illustrates the building blocks of a lakehouse stack. The storage tier, often cloud data lakes like S3, stores your data in various formats, such as Parquet, JSON, and Avro. The tabular format layer offers flexibility, and the engines in the layer above facilitate querying and analysis. The topmost layers represent the applications accessing the data through the engines.
Contrasting this with the architecture of data warehouses, illustrated in the diagram below, reveals a fundamental difference. Data warehouses encapsulate both storage and computation (querying) of the data, tightly controlling how data is stored. This structure aids performance but sacrifices flexibility and future adaptability.
An example might help illustrate the contrast between the two approaches, especially around vendor lock-in. Consider the case where you stored your data in a Snowflake data warehouse for a BI use case. A few months later, you realize you have a different use case that works better on a different product - data warehouse or otherwise. Because your data is stored (read: locked) in Snowflake, using a different product for this use case requires you to copy (ETL) the relevant data out of Snowflake and into that other product. That's cumbersome and costly - in storage as well as in time and effort.
Now consider that instead of storing your data in a data warehouse, you had it in a data lake (S3) and were using Presto as the engine. For that additional use case, you decide that Spark is better suited. Because your data is stored in open, non-proprietary formats, you can enable both use cases on the same data without having to move or copy it around. You can layer different engines, and tabular formats, over your data depending on your use cases. That's powerful.
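Here is a hedged sketch of that "same data, multiple engines" idea in Python. Spark reads the Parquet files directly, while Trino (Presto's successor, via its Python client) queries the same S3 location through a catalog table. The host, catalog, and table names are assumptions for illustration.

```python
# One copy of the data in S3, two engines on top of it.
# Host, catalog, and table names are assumptions.
from pyspark.sql import SparkSession
import trino  # pip install trino

lake_path = "s3://my-data-lake/sales/"

# Use case 1: heavy transformation with Spark, reading the files directly.
spark = SparkSession.builder.appName("etl").getOrCreate()
sales = spark.read.parquet(lake_path)
daily = sales.groupBy("day").sum("amount")

# Use case 2: interactive BI with Trino over the same files, assuming a
# Hive/Glue catalog table that points at lake_path.
conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("SELECT day, sum(amount) FROM hive.lake.sales GROUP BY day")
rows = cur.fetchall()
```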
The main players and how to choose between them
When evaluating the players in this space, I tend to differentiate between the data platform vendors like Snowflake and Databricks, and more narrowly focused players like Dremio and Starburst - more generally, the Presto/Trino-based vendors.
I am deliberately ignoring the cloud hyperscalers' products (Redshift, BigQuery, etc.) and instead focusing on pure-play third-party vendors.
The 800lb gorillas: Snowflake & Databricks
Both Snowflake and Databricks aim to become the singular enterprise data platform of the future - the former branding itself the "Data Cloud" and the latter proposing a unified data lakehouse model. Both claim to support workloads ranging from analytical to transactional to AI - store all your data for all use cases with us. Snowflake represents the cloud data warehouse approach, although it has recently been making inroads into supporting lakehouse scenarios. Witness one of Snowflake's job postings below, which shows how it is extending into supporting (querying) data stored outside of Snowflake in Iceberg tables.
On the other hand, Databricks is tooting the horn of the lakehouse approach and espousing its virtues - openness and efficiency - relative to data warehouses (read: Snowflake). Databricks offers a more comprehensive platform that leverages its heritage in Spark along with its more recent innovations in Delta Lake and Databricks SQL. This enables Databricks to support many, if not all, of the major data use cases, from BI and machine learning to ad hoc querying. Databricks' recent acquisition of MosaicML enables it to support the very in-demand and lucrative generative AI use cases too.
For an in depth comparison on those two, I highly recommend this most excellent post.
Dremio
Dremio takes a pioneering approach by directly querying data lakes without the need for ETL, caching, or optimizing data. Its columnar cloud data lake format and query acceleration technologies deliver high-performance BI and ad hoc queries directly against data stored in cost-effective object stores. Dremio's support for querying multiple data sources in a federated manner makes it a fit for self-service analytics, building dashboards, and data science.
Starburst
Starburst, building on open-source Trino, offers a distributed SQL query engine for data lake analytics. Supporting querying across multiple data sources in a federated manner through the ANSI SQL interface, Starburst adds enterprise security, access controls, and tooling for Trino. Ideal for centralized BI analytics across a multi-cloud data estate and bridging various data silos, Starburst suits organizations seeking standardization on ANSI SQL-based analytics.
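As a concrete (and hypothetical) illustration of that federation, a single Starburst/Trino query can join a table in a data lake catalog with one in an operational database, with no data movement. The catalog, host, and table names below are assumptions.

```python
# A federated join across two catalogs through Starburst/Trino:
# sales in the lake, customers in PostgreSQL. All names are assumptions.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="starburst.internal", port=8080, user="analyst"
)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, sum(s.amount) AS revenue
    FROM hive.lake.sales AS s
    JOIN postgresql.crm.customers AS c ON s.customer_id = c.id
    GROUP BY c.region
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```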
Ultimately, the choice between these players depends on use cases, data types, volumes, and required query performance. Data warehouses excel at governed SQL analytics on refined data but may lack flexibility. In contrast, lakehouses offer versatility but require additional optimization.
Therefore, many organizations today find value in using both, leveraging dedicated data warehouses for critical analytics and enabling exploration, such as ad hoc analytics, on flexible lakehouses. Lakehouses also appear to be the better choice for computational workloads that require massive amounts of data, like AI and machine learning.
Looking forward, I suspect the lakehouse approach will become the dominant model. The benefits this model offers - openness, lower cost, and interoperability - far outweigh its current limitations in complexity and performance relative to data warehouses. The major vendors in this space will continue to make their products easier to adopt, and will likewise continue to enhance their query engines, caching, and other optimizations to match the performance of a data warehouse. Please refer to the last two articles in the resources below for a fun read on the performance spat between Snowflake and Databricks :)
“With a data lake, data is stored in an open format, which makes it easier to work with different analytic services. Open format also makes it more likely for the data to be compatible with tools that don’t even exist yet. Various roles in your organization, like data scientists, data engineers, application developers, and business analysts, can access data with their choice of analytic tools and frameworks. You’re not locked in to a small set of tools, and a broader group of people can make sense of the data.” Source: Werner Vogels
Further reading and additional resources
Databricks vs Snowflake by Contrary Research
Snowflake Platform Gets Generative AI, ML, Data Lakehouse Features by TheNewStack
🔥 vs. ❄️: Databricks and Snowflake Face Off as AI Wave Approaches by Battery Ventures
What is a Data Lakehouse by Dremio
The Case for Independent Storage by Tabular
Modern Data Lakes for Dummies by Starburst
How Amazon is solving big-data challenges with data lakes by Werner Vogels
Industry Benchmarks and Competing with Integrity by Snowflake
Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast! by Databricks