The past decade witnessed a Cambrian explosion of SaaS and cloud services, which has resulted in very wide and fast-growing adoption of both. The recent pivot to cost optimization, however, has put this spend in check. Companies large and small are re-evaluating their spend on both SaaS products and the cloud overall.
For software engineering organizations, the majority of the spend tends to be allocated to cloud services, as illustrated in the chart below from ICONIQ.
When thinking about optimizing cloud costs, I tend to assess the problem along one of two planes. The first evaluates usage (and cost) by cloud service type: compute, storage, and networking. The second evaluates usage at the account level, which is very useful if you decompose your cloud usage by use case, and perhaps by function too. For example, you might allocate one or more accounts for dev/test, another for support, one for production, and so on. In this post I will offer some of my heuristics for optimizing usage and costs across these two lenses. The post will be AWS specific, but the principles apply to any cloud; I've applied the same ones to GCP in the past.
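To make the two lenses concrete, here is a minimal sketch using boto3 and Cost Explorer. The date range is illustrative, and it assumes credentials with Cost Explorer access; it is a starting point, not a prescription.

```python
# Sketch: view spend through the two lenses (service type vs. account).
# Assumes boto3 credentials with Cost Explorer (ce) access; dates are illustrative.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

def monthly_costs(group_key):
    """Print last month's unblended cost grouped by the given dimension."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": group_key}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        name = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{name}: ${amount:,.2f}")

# Lens 1: by service type (EC2, S3, RDS, data transfer, ...)
monthly_costs("SERVICE")
# Lens 2: by account (dev/test, support, production, ...)
monthly_costs("LINKED_ACCOUNT")
```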
For a more general read on cost optimization, I recommend the article linked below.
Service Type Optimization
The services that consume the most resources will be idiosyncratic to each organization, but they will almost certainly fall under compute, storage, or networking (duh). I'll omit networking for brevity, and because it tends to be the most difficult to reason about and optimize. Suffice it to say, ingress is cheap, egress not so much.
Compute
Compute, in my experience, tends to be the most wasteful resource. There are a few reasons for this. First, compute is the easiest to misallocate: we tend to over-allocate compute resources, or plan for peak usage. Second, compute is the easiest to leave running for days, weeks, and sometimes months after its last use. Below are some of the tactics I apply to stem the waste.
The grim reaper
Once a week, I will look at all running compute instances and find the ones that were left behind. These are easy to spot in dev/test and support environments. I will then kill them. Destroy the instance. The first time you do this, you might get yelled at, and to be honest, I ask before destroying anything the first time. I don't ask the second time. Try it, it's cathartic.
This sets the precedent of not leaving compute resources running beyond their intended lifetime. Long-running compute jobs are tagged as such to prevent accidental destruction.
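Here's a minimal sketch of that reaper, assuming boto3. The keep-alive tag name and the 14-day cutoff are placeholders I made up for illustration, not a convention.

```python
# Sketch: find (and optionally terminate) EC2 instances that have been running
# longer than a cutoff and are not tagged as long-lived.
# The "keep-alive" tag and 14-day cutoff are illustrative placeholders.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cutoff = datetime.now(timezone.utc) - timedelta(days=14)

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "keep-alive" in tags:
                continue  # explicitly marked long-running; leave it alone
            if instance["LaunchTime"] < cutoff:
                print(f"Reaping {instance['InstanceId']} (launched {instance['LaunchTime']})")
                # Dry-run first; drop DryRun=True once you trust the filter.
                # ec2.terminate_instances(InstanceIds=[instance["InstanceId"]], DryRun=True)
```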
Auto-scaling
The incoming load and demand on your cloud services will vary. This applies to production and non-production environments alike. Hence, allocating compute resources under auto-scaling groups helps with right-sizing them as the load varies. You don't need to run 100 EC2 instances for your CI/CD pipeline 24/7/365.
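For the CI/CD case, scheduled scaling actions are one simple way to do this. A sketch, assuming boto3 and an auto-scaling group of build agents; the group name and schedule are illustrative placeholders.

```python
# Sketch: scale a CI build-agent auto-scaling group down outside working hours.
# The group name and schedules are illustrative placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale down to zero every evening...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="ci-build-agents",
    ScheduledActionName="scale-down-nightly",
    Recurrence="0 20 * * *",  # 8pm UTC, cron syntax
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)

# ...and back up every weekday morning.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="ci-build-agents",
    ScheduledActionName="scale-up-weekdays",
    Recurrence="0 7 * * 1-5",  # 7am UTC, weekdays only
    MinSize=2,
    MaxSize=20,
    DesiredCapacity=2,
)
```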
Reserved Instances
This step should come after the two previous ones, or more generally once you have a steady state of demand that you can reliably predict. Once you are able to predict your compute usage, purchasing reserved instances becomes a no-brainer. The risk of not being able to predict demand is over-buying, which is wasteful.
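Cost Explorer can generate RI purchase recommendations from your historical usage, which is a useful sanity check before committing. A sketch, assuming boto3; the term and payment options shown are just one reasonable configuration.

```python
# Sketch: ask Cost Explorer for EC2 reserved instance recommendations based on
# the last 60 days of usage. Term/payment options here are illustrative.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_reservation_purchase_recommendation(
    Service="Amazon Elastic Compute Cloud - Compute",
    LookbackPeriodInDays="SIXTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
)

for rec in resp.get("Recommendations", []):
    summary = rec.get("RecommendationSummary", {})
    print(
        "Estimated monthly savings:",
        summary.get("TotalEstimatedMonthlySavingsAmount"),
        summary.get("CurrencyCode"),
    )
```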
Storage
Storage is another area that can rapidly result in huge runaway costs. The primary reason is that data is created and never destroyed. We just store data forever, regardless of its utility. The tactics I apply here are similar to compute.
The grim reaper
Similar to compute, I will scan for lingering databases, which tend to pop up in dev/test and support accounts, and zap them. S3 is another area I scrutinize across all accounts. AWS gives you good tools to track the size (and usage) of S3 buckets. I've purged petabytes of data (no joke) that was created once and never read. Sometimes this data was generated erroneously; ETL pipelines are notorious for this, and so are historical logs of all sorts that have inevitably lost any value.
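One of those tools is CloudWatch's daily S3 storage metrics. A sketch of sizing up buckets with boto3, assuming the buckets live in the client's region (the metric only covers the Standard storage class here).

```python
# Sketch: list S3 buckets by size using CloudWatch's daily storage metrics.
# Assumes buckets live in the client's region; BucketSizeBytes is reported once a day.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
for bucket in s3.list_buckets()["Buckets"]:
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": bucket["Name"]},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        StartTime=now - timedelta(days=2),
        EndTime=now,
        Period=86400,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    latest = max(points, key=lambda p: p["Timestamp"]) if points else None
    size_gb = latest["Average"] / 1e9 if latest else 0.0
    print(f"{bucket['Name']}: {size_gb:,.1f} GB")
```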
Reserved Instances
As with compute, only purchase reserved instances (for databases) once you are able to reliably predict your usage of them.
Account Type Optimization
Optimizing by account type is nothing more than applying the tactics described earlier at the service layer. The distinction is in which tactics are applicable to each account type. For example, applying a grim reaper approach to your production environment is probably not a good idea. On the other hand, doing so for dev/test and support accounts is totally permissible (modulo the yelling part).
There's another distinction with accounts, and that has to do with whether the account should exist at all or be collapsed. Dev/test accounts are a great example of this conundrum. Dev/test accounts exist to allow developers to test their code changes before they commit them to the codebase. These environments tend to be replicas of production. A developer might want to run tests against a build that contains her local changes. Once she has verified that her changes work well, she commits the code and the CI/CD pipeline takes over. There are two issues with this approach.
The first is scalability. You cannot have a single dev/test account that is shared across all developers on your team, so you will need more than one, and oftentimes a lot more than that, depending on the size of your team. As your team grows, so will the number of these environments, and therefore the cost. That's not sustainable.
The second is usage. Most tests do not require an environment that is identical to production. Unit tests don't need it, and most end-to-end tests can run on a scaled-down version of your production environment. Alternatively, you should be able to run your tests (maybe not all of them, but a huge portion) in an ephemeral environment. That could be local to your dev machine and/or spun up in the cloud and torn down immediately after use. That's where having a Platform team comes into play: they can build these environments for your development teams.
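A rough sketch of what "ephemeral" looks like in practice, assuming boto3, a hypothetical CloudFormation template (ephemeral-env.yaml) describing the scaled-down environment, and a hypothetical test command:

```python
# Sketch: spin up an ephemeral test environment, run tests, tear it down.
# The stack name, template, and test command are hypothetical placeholders.
import subprocess
import uuid
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
stack_name = f"ephemeral-test-{uuid.uuid4().hex[:8]}"

with open("ephemeral-env.yaml") as f:
    template_body = f.read()

cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

try:
    # Run the end-to-end suite against the scaled-down environment.
    subprocess.run(["pytest", "tests/e2e", "--env", stack_name], check=True)
finally:
    # Always tear the environment down, even if the tests fail.
    cfn.delete_stack(StackName=stack_name)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```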
Putting it all together
The table below summarizes which tactics I tend to apply and to which account types. I also outline the impact of each tactic.
* Auto-scaling and reserved instances are great fits for the compute resources needed for CI/CD pipelines (e.g. build agents).
There are a few other best practices that I strongly recommend.
The first is to look at your cloud costs, understand them, and optimize them consistently. That might be once a week, once a month, or once a quarter. Pick a cadence that works for you and start doing this work.
Second, don’t go overboard with cost cutting. You want to optimize spend but not cripple your team. Don’t be penny wise and pound foolish.
Third, try to delegate this task to another senior leader on your team (the head of infrastructure is a good candidate). Have them own the problem, but only after you have understood it yourself and set the tone for the rest of the organization.
Fourth, introduce a thorough review process for the adoption of new cloud services, especially in production. Adopting a new service without modeling its usage over time will backfire. For some services (databases in particular), the wrong choice can be very difficult to undo.
Fifth, share your overall cloud costs and usage with your team, and explain why optimizing them is important. Your team probably has no idea of the scale and magnitude of your cloud infrastructure spend. This transparency will help alter behaviors and give your team context on the spend.
Lastly, leverage the cloud tools to help with this effort. AWS Cost Explorer is a great tool. So is AWS Cost Anomaly Detection.
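As a sketch, Cost Anomaly Detection findings can even be pulled programmatically. This assumes boto3 and an anomaly monitor already configured in the account; the date range is illustrative, and the response fields are read defensively since they may vary.

```python
# Sketch: list recent cost anomalies flagged by AWS Cost Anomaly Detection.
# Assumes an anomaly monitor has already been set up in the account.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_anomalies(
    DateInterval={"StartDate": "2024-01-01", "EndDate": "2024-01-31"},
)
for anomaly in resp.get("Anomalies", []):
    impact = anomaly.get("Impact", {}).get("TotalImpact")
    print(anomaly.get("AnomalyStartDate"), anomaly.get("DimensionValue"), impact)
```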
The tactics you apply might differ from the ones I outline here, but the onus is on you to consistently understand and optimize your costs. Investing in this effort also sets the tone for the rest of the engineering organization: optimizing usage is important. It's an area of continuous scrutiny.
My anecdotal evidence shows that applying these tactics results in roughly a 40% reduction in cloud costs. The bulk of that reduction shows up the first time you apply them. Continuing to apply them will ensure that your costs remain flat or grow only modestly.
40% is what I have observed every time I have pushed to monitor and clean up cloud usage. The challenge is in maintaining a steady state of usage, which requires continuous hygiene and observability (tooling + people). The cloud makes it ridiculously easy to spend more money, especially if, as an org, you reduce the barriers to cloud usage (for software developer speed/productivity).
40% is a huge opportunity! And a great way to get promoted.
Thanks for the mention too.