A few weeks ago I came across a tweet from Turner Novak about Datadog’s mystery customer, who had apparently racked up a $65M bill.
I have no insight into whether the spend was warranted, although I will wager that it wasn’t. My focus in this article is on how to prevent runaway (infrastructure) costs and what checks and balances need to be applied so you don’t find yourself in a $65M hole.
First, I need to get a few things out of the way. I abhor Datadog’s pricing model. It turns out I am actually not the only one. Just hop on Reddit and see for yourself!
Second, in spite of Datadog’s complex pricing model, the onus shall always remain on me - the buyer - to understand and model it. If I can’t model it, or at least try, then I shouldn’t buy it. Every product - illustrated in the honeycomb below - has its own pricing model. Sometimes pricing is per host. Other times it is by data ingested, or per test run, per session, per custom metric, or per span. There’s a lot to understand and model. The gist is that the product portfolio and pricing are complex, but the tooling to help with modeling is available. It’s not easy to model, but it is certainly doable.
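To make that concrete, here is a minimal sketch of what such a model can look like. Every unit price below is a hypothetical placeholder, not Datadog’s actual rate - the point is only that a multi-dimensional pricing model reduces to a small table and a sum.

```python
# Hypothetical unit prices -- placeholders, NOT Datadog's actual rates.
UNIT_PRICES = {
    "infra_host": 15.00,       # $ per host per month
    "custom_metric": 0.05,     # $ per custom metric per month
    "log_ingest_gb": 0.10,     # $ per GB of logs ingested
    "apm_span_million": 1.70,  # $ per million indexed spans
}

def monthly_bill(usage: dict[str, float]) -> float:
    """Sum usage in each pricing dimension times its unit price."""
    return sum(UNIT_PRICES[dim] * qty for dim, qty in usage.items())

print(monthly_bill({
    "infra_host": 200,
    "custom_metric": 5_000,
    "log_ingest_gb": 12_000,
    "apm_span_million": 300,
}))  # -> 4960.0
```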
Preventing runaway costs like this requires both top-down and bottom-up oversight. Bottom-up means the software developers and DevOps teams who use, in this case, Datadog must model and understand their usage. Top-down means applying continuous (monthly) reviews of actual usage versus planned.
Bottom-up: Budgeting
Before your team adopts a product like Datadog - or, broadly speaking, any product priced on usage - they must both understand and model their usage. Yes, that means your software developers must be able to model how much they expect to use Datadog and how much they will spend on it each month. Below are a few of the unknowns the team must solve for; a rough projection sketch follows the list.
How many hosts will this run on?
How many custom metrics?
How much data will we ingest?
How long should we retain the data?
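As an illustration of how answers to those questions turn into a monthly number, here is a hedged sketch. Every figure is a placeholder, and modeling retention as a simple multiplier on ingest cost is an assumption - real vendors typically price retention as separate tiers - but the exercise itself is the point.

```python
# Hypothetical unit prices again -- placeholders, not real Datadog rates.
PRICES = {"infra_host": 15.00, "custom_metric": 0.05, "log_ingest_gb": 0.10}

def project_monthly_spend(hosts: int, metrics_per_host: int,
                          ingest_gb_per_day: float, retention_days: int) -> float:
    # Longer retention is modeled as a multiplier on ingest cost;
    # real vendors price retention tiers separately.
    retention_factor = retention_days / 15  # 15-day retention as baseline
    usage = {
        "infra_host": hosts,
        "custom_metric": hosts * metrics_per_host,
        "log_ingest_gb": ingest_gb_per_day * 30 * retention_factor,
    }
    return sum(PRICES[k] * v for k, v in usage.items())

print(project_monthly_spend(hosts=200, metrics_per_host=25,
                            ingest_gb_per_day=400, retention_days=30))  # -> 5650.0
```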
Not only should they project usage across the variables above, they should also answer two additional questions: “why” and “what if”.
“Why” provides the business justification and rationale for collecting this data. What is the value of these custom metrics? What are we going to derive out of them or use them for? Having your team think through and articulate the value of the data they are collecting, the metrics they are creating, and so on enables better usage and utility of the product. The value derived from the product after going through this exercise is far greater than using it without thinking about “why”.
Similarly, “what if” tries to project usage at an order of magnitude larger than what is currently projected. What happens if our custom metrics 10x? Or the data we ingest? Going through a “what if” scenario highlights the very volatile nature of these products: usage can spike overnight, and the bill can grow even faster than the usage. That in turn builds awareness within your team. Developers will understand how runaway usage can result in huge bills. This awareness is your best defense against runaway costs and finding yourself with a $65M bill.
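Here is one way to run that “what if” on paper. The commit-plus-overage scheme below is a common pattern in usage-based pricing; the specific rates and the 2x overage premium are hypothetical.

```python
# "What if" sketch: cost under a commit-plus-overage scheme, a common
# usage-pricing pattern. Rates and the 2x overage premium are hypothetical.
COMMITTED_GB = 12_000   # volume covered by the committed rate
COMMITTED_RATE = 0.10   # $ per GB within the commitment
OVERAGE_RATE = 0.20     # $ per GB beyond it (on-demand premium)

def ingest_cost(gb: float) -> float:
    within = min(gb, COMMITTED_GB)
    beyond = max(gb - COMMITTED_GB, 0)
    return within * COMMITTED_RATE + beyond * OVERAGE_RATE

for multiplier in (1, 10):
    gb = 12_000 * multiplier
    print(f"{multiplier}x usage -> ${ingest_cost(gb):,.2f}")
# 1x usage -> $1,200.00
# 10x usage -> $22,800.00  (19x the baseline cost, not 10x)
```

Note the asymmetry: usage grew 10x, but because everything beyond the commitment is billed at the premium rate, the cost grew 19x. This is exactly the volatility the “what if” exercise is meant to surface.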
If you can’t model it, you don’t understand it and thus you shouldn’t use it
Top-down: Validation and tracking
Your team has done their homework and forecasted spend. The next step is validating these projections and tracking them over time.
In my opinion, usage-based products should be subjected to more scrutiny from finance and procurement organizations. They must demand that the buyer - in this case R&D - supply a budget and mitigation plans for preventing runaway usage. The latter shows that the buyer is aware of and has thought through the volatility of these products: they know that if usage isn’t capped, spend can balloon.
With your budget approved and procurement completed comes the final step: budget to actual. This is a monthly review in which your actual spend is compared to what you projected. The review is meant to validate assumptions - of which there are many when projecting spend for a product like Datadog - and to react to any variance. Tracking actuals vs. budget over time lets you catch variances early and often.
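A budget-to-actual review can be as simple as a variance report. The sketch below, with made-up numbers and a tolerance threshold of my own choosing, flags any line item drifting more than a set percentage from plan.

```python
# Budget-to-actual sketch: flag line items whose actual spend deviates
# from plan by more than a tolerance. All numbers are illustrative.
BUDGET = {"infra_hosts": 3_000, "custom_metrics": 250, "log_ingest": 2_400}
ACTUAL = {"infra_hosts": 3_100, "custom_metrics": 900, "log_ingest": 2_350}

TOLERANCE = 0.15  # flag anything more than 15% over or under plan

for item, planned in BUDGET.items():
    actual = ACTUAL[item]
    variance = (actual - planned) / planned
    flag = "  <-- investigate" if abs(variance) > TOLERANCE else ""
    print(f"{item:<15} plan ${planned:>6,} actual ${actual:>6,} ({variance:+.0%}){flag}")
# custom_metrics comes out +260% over plan -- exactly the kind of drift
# a monthly review is meant to catch before it compounds.
```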
Putting it all together
The bottom-up and top-down approach that I advocate for can be illustrated in this simple flow. Each step is owned by a different persona. The budgeting exercise is owned by the R&D team - the end users of these products. The steps beyond that are owned by the VPE with oversight from the CFO.
The value of applying this exercise at both elevations - top and bottom - is that it drives ownership and awareness across the entire “stack”. Most important, R&D teams understand how these products should be used and what their implications are on spend.