In a previous article, I laid out a simple framework to navigate the various challenges involved with using data to build machine learning (ML) models. This framework is illustrated in the diagram below.
The previous article covered the challenges of data collection and presented a relatively simple infrastructure stack for that problem. We’ll be picking up where we left off, this time navigating the storage challenges for our cat recognition application. I will also describe the storage layer we use at Kheiron.
But first, we go back to our cat recognition app!
Chances are that in the early days of our cat recognition endeavors, storage wasn’t really a big concern. We’d get a batch of images and store them somewhere: locally, perhaps on a shared network drive, or maybe in the cloud in S3. One of the first storage challenges specific to ML that I see is having islands of data. An example might illustrate this point.
You might recall that we started off by offering a simple cat recognition app. In time, our app became quite successful and the business asked us to augment it by offering a new feature: highlighting the location of a cat in a photo (localization). This feature will require modifying our existing dataset and building a new ML model, one that recognizes and localizes cats in images. Later on, the business asked us to build models that can recognize images of cats that are common in Asian countries.
When these new requests come in, the path of least resistance is to simply copy the data that we have and use it to build the newer models. One replica will be used for the original ML model that recognizes cats. Another replica, with additional data, will be used to build the new ML model with localization. Yet another copy will later be used for the Asian cats model. Additionally, each of these requests comes with an incremental data change. The localization model needs data beyond the raw images - it needs the coordinates of cats in its training dataset. Similarly, the Asian cats model will need images of cats common in Asia and some metadata.
The thing about this replication approach is that there’s nothing inherently wrong with it. We do it all the time in software: we branch and fork codebases. The reason we do this is to give a feature development team, or a release, its own siloed playground, so to speak. However, traditional software (Software 1.0) has a very rich ecosystem of tools that facilitate this. Git and other version control systems let us branch and version our code without having to make a full copy. I should note that there are some newer tools, like DVC, that are trying to solve this problem for data. I haven’t used them and have no experience with them - yet. The main benefit of the replica approach is speed of prototyping and development, while the obvious challenge is cost, both in terms of storage and manageability. So what is one to do then?
Before I dive into the details of our approach at Kheiron, I’ll first provide a bit of context on what we do and some of the constraints that we relaxed. Kheiron builds ML models for cancer detection, more specifically breast cancer. Our dataset comprises mammography images and metadata about those images. The metadata includes demographic and clinical data like biopsy results, history of cancer, previous breast cancer examinations and so forth. The demographic data is critical for understanding the efficacy of our models across various demographic stratifications. The clinical data forms the foundation of our ground truth, which could be simplistically viewed as trying to answer this question: “does this image contain a malignancy?” When we build ML models, we rely on both the raw images and the metadata.
When thinking about storage, we came up with a few requirements and constraints that helped reduce the complexity of the problem. They are the following:
We’re going to consolidate all of our data, both unstructured (images) and structured (metadata) on AWS.
We will set read-only permissions on this data for anyone outside of the data team.
We don’t care about versioning data, more specifically images.
The first constraint is fairly obvious. Our storage infrastructure of choice will be AWS. I’ll dive into the specifics of which storage medium we chose later on. The main motivations for choosing AWS are not having to manage any on-premises storage infrastructure and the ease of scaling on AWS. We also want to have our data in close proximity to our computation cluster. We’ll cover this topic in the next article.
The second constraint sets the interface between the data storage layer and all of its consumers: read-only. Consumers, which are mostly ML engineers building models, can only read the data and not modify it. Any modifications that happen after reading from our data layer are ephemeral and not persisted. The most common examples are image transformations (windowing, filtering, rotations and so forth). If we do encounter common transformations to the data, we try to apply those before persisting the data in our storage layer. Our goal is to make the data we store readily available for ML training without any modifications.
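To make the read-only contract a bit more concrete, here is a minimal sketch of what a consumer-side read might look like: fetch an image from S3, transform it in memory, and hand it to training without ever writing anything back. This is not our actual code; the bucket name, key, and the particular transformation are made up purely for illustration.

```python
import io

import boto3
import numpy as np
from PIL import Image

# Hypothetical bucket and key names, for illustration only.
BUCKET = "example-training-images"
KEY = "cats/img_0001.png"

s3 = boto3.client("s3")


def load_image(bucket: str, key: str) -> Image.Image:
    """Read an image from S3; consumers only ever read, never write."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body))


def transform(img: Image.Image) -> np.ndarray:
    """Ephemeral, in-memory transformation (here, a small rotation).

    Nothing in this step is persisted back to the storage layer.
    """
    return np.asarray(img.rotate(15)) / 255.0


if __name__ == "__main__":
    example = transform(load_image(BUCKET, KEY))
    # `example` is now ready to be fed into a training loop.
```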
Of those, the most nuanced one is the third constraint. It eliminates the need to solve the versioning problem. The main motivation for ignoring data versioning - at least for now - is an important assumption: more data implies better ML models. Simply put, we’re assuming that if our dataset changes over time, mostly through data additions, then our models will improve. So long as this assumption holds true, we need not worry about versioning, snapshots and going back in time. The most recently available data should yield the “best” model. More on model assessment and selection in the next article. There are some cases where we actually do care about the dataset used to train a model and want to keep that set fixed over a set of experiments. For example, we might be evaluating various models and want to fix the training and test datasets to compare models without worrying about the effect of different datasets. We can accomplish this without making replicas, as will be evident shortly.
With these constraints in mind we decided to build our storage stack using S3 and Redshift - surprise surprise. We use S3 to store all our unstructured data, which are mostly images. Redshift is our data warehouse of choice for storing all the metadata about our images.
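This is also how the “fixed dataset without replicas” case mentioned above can be handled: a dataset is just a query over the metadata plus the resulting list of S3 keys, not a copy of the images. The sketch below is illustrative, not our actual pipeline; the connection details, the `image_metadata` table, and its columns are hypothetical. It relies only on the fact that Redshift speaks the Postgres wire protocol, so a standard Postgres client works.

```python
import csv

import psycopg2  # Redshift is queryable over the Postgres wire protocol

# Hypothetical connection details and schema, for illustration only.
conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="metadata",
    user="readonly_user",
    password="...",
)

# A "frozen" training set is just a query over the metadata plus the
# resulting list of S3 keys -- no image is ever copied.
QUERY = """
    SELECT s3_key, label
    FROM image_metadata
    WHERE ingested_at < '2020-01-01'
      AND label IS NOT NULL;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    rows = cur.fetchall()

# Persist the manifest (references, not data) so every experiment in a
# comparison trains and evaluates on exactly the same images.
with open("train_manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Because the manifest stores only keys and labels, fixing a dataset for a round of experiments costs a small file rather than another full copy of the images.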
It is worth noting that we also use S3 in the raw data storage layer, which is the topmost layer in the stack below. Raw data, as the name suggests, is data that we haven’t yet mapped or transformed into data we can use to train and build ML models. This data comes in a variety of formats, depending on how it was sourced. The layer beneath the raw data layer - the wrangling layer - is responsible for cleaning, mapping and transforming it into a format that we can train ML models with. These transformations are done using a combination of Lambda functions and Apache Airflow. We also visually inspect the raw data using a custom tool that we developed to catch any anomalies, which happen quite frequently with images. In short, the wrangling layer is responsible for mapping messy (raw) data into a common format and schema that is easily consumable by the ML training pipelines. We’ll cover that topic in the next article, which is concerned with using this data for analysis and ML training.
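For a rough idea of what the Airflow side of such a wrangling layer might look like, here is a minimal sketch of a DAG with two placeholder tasks. It is not Kheiron’s actual pipeline: the DAG name, task names, and schedule are invented, and the task bodies are stubs standing in for the real mapping and loading logic (Airflow 2.x style imports assumed).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task implementations; the real wrangling logic maps raw,
# heterogeneous data into the common schema used for training.


def map_to_common_schema():
    ...  # e.g. clean fields, harmonise formats, rewrite file layouts


def validate_and_load():
    ...  # e.g. run schema checks, then load metadata into the warehouse


with DAG(
    dag_id="wrangle_raw_data",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wrangle = PythonOperator(
        task_id="map_to_common_schema",
        python_callable=map_to_common_schema,
    )
    load = PythonOperator(
        task_id="validate_and_load",
        python_callable=validate_and_load,
    )

    # Raw data is mapped first, then validated and loaded downstream.
    wrangle >> load
```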
The storage stack I just presented is simple. Probably too simplistic, but that’s the point. We wanted to start with the simplest stack possible and did so by relaxing our requirements. The current stack does the job it is intended to do, which is to facilitate the training of ML models.
Main photo by Sajad Nori on Unsplash