Hello, and welcome to my newsletter!
I’m Karim, and every ~2 weeks I tackle questions or problems I’ve witnessed in startups, from the very early stages up to late growth stages. Much of my startup experience has been in leading engineering organizations, but I cover topics outside of engineering as well.
Send me your questions or suggested topics and, in return, I’ll try to answer them in a post to this newsletter.
If you find this post valuable, check out some of my other popular posts:
To receive this newsletter in your inbox every ~2 weeks, consider subscribing 👇
Over the years I have witnessed several evolutions of the data infrastructure landscape, or more broadly speaking, of how you store and use data. Not only is this space fast-evolving and fascinating, it is one that most companies have to grapple with. We are becoming, if we aren’t already, a truly digital species. Much of what we do ends up consuming data and creating more of it, which in turn means that the companies offering us products and services have to store and use that data somehow. If software is eating the world, then data is its fuel.
This post offers my perspective on the evolution of this space, starting with traditional data warehouses, moving on to data lakes, and finishing with the impact the cloud has had on the data space.
Let’s dive in.
Web 1.0 Data warehouses
I started my career in 2000 and back then, if you wanted to store data of meaningful size (and importance), you would likely default to a data warehouse, probably an Oracle product. Data warehouses excelled at storing and processing large volumes of data with optimized performance. They provide strong performance for structured data through techniques such as columnar storage, query optimization and caching of frequently used tables. The table-based schema of a warehouse enables ACID transactions, record-level mutations and granular privileges (at the table and column levels). The typical use case for a data warehouse was, and still is, reporting and business intelligence (BI).
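To make those capabilities concrete, here is a minimal sketch in Python using a generic Postgres-style driver rather than any specific warehouse product; the connection string, table, column and role names are all made up for illustration. It runs a record-level update inside an ACID transaction and grants a column-level privilege.

```python
# Minimal sketch: a record-level mutation inside an ACID transaction,
# plus a column-level grant. The DSN, table, column and role names
# are hypothetical; exact GRANT syntax varies by warehouse.
import psycopg2  # any DB-API-compliant warehouse driver would work similarly

conn = psycopg2.connect("dbname=warehouse user=analyst")  # placeholder DSN
try:
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Record-level mutation: correct a single customer's region
            cur.execute(
                "UPDATE sales.customers SET region = %s WHERE customer_id = %s",
                ("EMEA", 42),
            )
            # Granular privilege: expose only non-sensitive columns to BI users
            cur.execute(
                "GRANT SELECT (customer_id, region) ON sales.customers TO bi_reader"
            )
finally:
    conn.close()
```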
The data warehouses of this era would be delivered as software running on specialized hardware. When I started my career, you would receive your Teradata or Oracle database on the loading bay of your data center! In addition to the tight coupling of software with hardware, these data warehouses tightly integrated compute with storage. Finally, these data warehouses stored data in a proprietary and optimized format, which was ultimately done to attain good performance.
There were two significant implications of the tight coupling of software & hardware and storage & compute. First, it was quite difficult to scale storage or compute quickly and cost-effectively. In either case you needed more hardware, oftentimes custom hardware that you would have to buy directly from the database vendor. Second, because the database ultimately stored the data in a proprietary format, you ended up locked in with a vendor; it was very difficult to move data to another database. Databases of that era also suffered from poor performance when handling semi-structured or unstructured data.
The elephant in the room: Hadoop
Apache Hadoop was introduced as the first iteration of the data lake, aiming to tackle the data warehouse challenges of high cost and poor handling of semi-structured and unstructured data. Hadoop was largely an on-premises solution, but it provided open-format file storage at a much lower cost than data warehouses. The promise of lower-cost storage had organizations investing heavily in migrating their data out of the traditional data warehouse and into the Hadoop Distributed File System (HDFS).
However, it became evident that transforming, managing and analyzing data in HDFS was not easy, required an extensive set of tools and did not readily deliver the performance required for common use cases such as interactive queries and BI. Although the flexibility of file storage was an advantage, the lack of data management capabilities such as transactions and data mutations inhibited the success of many on-premises data lake initiatives. Ultimately, I believe that the difficulty of managing Hadoop environments, along with the major impact the cloud had on this space, led to Hadoop’s demise.
And then came the cloud
The rise of the cloud had significant implications for both the traditional data warehouse and the data lake space. On the data warehouse side, public clouds like AWS and Azure enabled the separation of compute from storage, along with the promise of infinitely scaling both. This resulted in a retooling of the traditional on-premises data warehouse and offered the flexibility to scale storage based on data volumes and compute based on query workloads, independently of one another.
Nowadays, modern cloud data warehouses such as Snowflake, AWS Redshift and Azure Synapse provide standard data management capabilities such as transactions, record-level mutations and time travel, whilst also separating the compute and storage layers. However, the data stored in these databases is still in a proprietary format.
These data warehouses, much like their predecessors, also require that any data processed by their query engine first be loaded into the warehouse. This in turn results in lengthy and cumbersome ETL processes to move data from wherever it initially resides into the data warehouse.
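As a hedged illustration of what a single one of those ETL hops looks like, the sketch below pulls a raw CSV file out of object storage and loads it into a warehouse staging table. The bucket, key, table and connection details are placeholders, and a real pipeline typically chains many steps like this, per dataset.

```python
# Minimal ETL sketch: extract a raw file from object storage and load it
# into a warehouse staging table. Bucket, key, table and DSN are placeholders.
import io
import boto3
import psycopg2

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="raw-events", Key="2023/01/orders.csv")  # extract
csv_bytes = obj["Body"].read()

conn = psycopg2.connect("dbname=warehouse user=etl")  # placeholder DSN
with conn:  # commit the load as one transaction
    with conn.cursor() as cur:
        # Load: stream the file into a staging table, ready for transformation
        cur.copy_expert(
            "COPY staging.orders FROM STDIN WITH (FORMAT csv, HEADER true)",
            io.BytesIO(csv_bytes),
        )
conn.close()
```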
In short, this new breed of databases offered the flexibility of scaling storage and compute independently, yet it still hadn’t solved the vendor lock-in problem, and you still had to deal with cumbersome ETL pipelines, which also meant copying and moving data around.
The impact that the cloud had on the data lake space, specifically Hadoop, was immense. Products like S3 offered a much cheaper, easier-to-manage and “infinitely” scalable alternative to HDFS, Hadoop’s storage layer. The early cloud data lake services like AWS EMR & Azure HDInsight focused on bringing the Hadoop stack to the cloud, providing cloud-based provisioning of open source compute engines such as Apache Spark, Hive and Presto.
While these services reduced the effort of creating a data lake in the cloud, they did not provide the performance or data management capabilities required for common use cases such as data warehousing and BI. In particular, they lacked transactional support, something that data warehouses excel at.
However, they extended the notion of separating compute from storage by taking it one step further and separating compute from data. Decoupling the architecture further in this manner makes it easy to scale the system up or down based on data volumes and workloads, while preserving the flexibility of open file formats from the traditional data lake architecture. It also meant that I could store data in different formats, such as Parquet files, CSV, JSON and more, all in the same data lake storage tier (e.g. S3).
What was different in this evolution was that I could then choose the processing engine that analyzed this data, based on both the data type and use case. That meant that even though I stored JSON and Parquet files in the same storage tier, I wasn’t necessarily using the same processing engine against both.
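For example, here is a minimal PySpark sketch of that idea: three open formats sitting side by side in the same S3 storage tier, each picked up by the reader that suits it. The paths, layout and column names are hypothetical.

```python
# Sketch: one storage tier (S3), several open formats, each with its own reader.
# Bucket paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-formats").getOrCreate()

# Columnar Parquet, well suited to analytical scans
events = spark.read.parquet("s3a://my-lake/events/")

# Semi-structured JSON, e.g. raw clickstream payloads
clicks = spark.read.json("s3a://my-lake/clickstream/")

# Plain CSV exports from an operational system
orders = spark.read.option("header", True).csv("s3a://my-lake/exports/orders/")

for df in (events, clicks, orders):
    df.printSchema()
```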
A modern data platform
Right now, I think we are in the midst of another evolution of this space. One in which we can design data platforms that leverage the benefits of the cloud, namely the separation of compute from storage and data without having to compromise on performance or the ACID guarantees of a traditional data warehouse (cloud or otherwise). In particular, the new platform that I see emerging is one that offers the following.
Decoupling of compute from data
Low cost & high performance
Open storage format
Transactional support
Limited to no data copies
The diagram, or four-layered cake, below illustrates one way of building a data platform that satisfies the above criteria. The bottom-most part of the cake is the data layer, which is based on open file formats such as Parquet, JSON, ORC and others. This data is stored in cloud-based storage such as S3 or Azure Data Lake Storage, optionally with an open table layer such as Databricks’ Delta Lake on top.
The layer above storage is that of the engines. These are the compute engines that process data stored in S3 and the like. Some of these engines, such as Dremio and Databricks, offer the ability to transactionally modify data stored in the storage tier. Dremio does that by leveraging Apache Iceberg, which enables DML (insert, update, delete) operations, transactions and time travel on a storage tier, such as S3, that doesn’t natively support these semantics. Finally, the topmost layer is that of the applications, such as traditional BI tools along with machine learning frameworks.
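To give a feel for what those table-format semantics look like in practice, here is a hedged sketch using Spark SQL over an Apache Iceberg table (Spark rather than Dremio, simply because its API is easy to show inline). It assumes a Spark session already configured with an Iceberg catalog named lake backed by S3; the table, columns and timestamp are made up.

```python
# Sketch: DML and time travel on an Iceberg table whose files live in S3.
# Assumes the session is already configured with an Iceberg catalog named `lake`;
# table, columns and timestamp are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-dml").getOrCreate()

# Record-level DML on files that physically sit in the data lake
spark.sql("UPDATE lake.sales.orders SET status = 'shipped' WHERE order_id = 1001")
spark.sql("DELETE FROM lake.sales.orders WHERE status = 'cancelled'")

# Time travel: query the table as it looked at an earlier point in time
spark.sql(
    "SELECT count(*) FROM lake.sales.orders TIMESTAMP AS OF '2023-01-01 00:00:00'"
).show()
```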
This architecture offers several advantages. First, it leverages open file standards, thereby eliminating the risk of vendor lock-in. Second, it leverages the reliability, ease of use and flexibility of cloud file-systems, which can scale as your data grows; that scaling depends on your data footprint and is not tied to compute or other resources. Third, it eliminates the need to copy or ETL data. Data is stored in a cloud file-system and processed directly by an engine such as Dremio, thereby saving time and money. Fourth, it allows for the decoupling of compute from data. You are now free to use many engines and, unlike with a data warehouse, you are not tied to just one. The choice of engine can be based on the data format and the overall use case. For example, Spark could be the engine of choice for a machine learning use case, while Dremio could be used for a BI use case.
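Here is a small, purely illustrative sketch of that decoupling: one set of Parquet files in the lake, read by two different engines depending on the job at hand. I’m using PyArrow as a stand-in for the second engine (a tool like Dremio would instead query the same files through its own SQL interface); the path and dataset are placeholders.

```python
# Sketch: one copy of the data in open Parquet files, two engines on top.
# The S3 path is a placeholder.
import pyarrow.dataset as ds
from pyspark.sql import SparkSession

# Engine 1: a lightweight Arrow scan, e.g. to feed a notebook or BI extract
table = ds.dataset("s3://my-lake/features/", format="parquet").to_table()
print(table.num_rows)

# Engine 2: Spark over the very same files, e.g. to prepare ML features
spark = SparkSession.builder.appName("ml-features").getOrCreate()
features = spark.read.parquet("s3a://my-lake/features/")
features.describe().show()
```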
Finally, this architecture doesn’t preclude you from still using data warehouses. You can still provision a cloud data warehouse like Redshift or Snowflake and have it process the data stored in your cloud data lake. You can have your cake and eat it too :)