Different Ways to Aggregate Nines
While working on SLOs, SLAs, and SLIs I have found that there are only so many ways to aggregate service metrics. I have not yet found a resource that reviews the different aggregation methods and their relative strengths and weaknesses.
Let’s do that below so I can go back and reference this post in the future.
One Big Bucket
All of the successes go in the numerator of a fraction; the total requests go in the denominator. This is an approach outlined in the Google SRE Book.
SLI = (good requests) / (slow requests + server errors + good requests)
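As a rough sketch in Python (with hypothetical counter names, and assuming you can pull these counts from your metrics backend for any period):

def one_big_bucket_sli(good_requests, slow_requests, server_errors):
    # Fraction of requests that met the objective over some period.
    total = good_requests + slow_requests + server_errors
    if total == 0:
        return 1.0  # no traffic: treat the period as meeting the objective
    return good_requests / total

# Example: 9950 good, 30 slow and 20 errored requests -> 0.995
one_big_bucket_sli(9950, 30, 20)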
Pro
Dead simple to aggregate. You can do simple division for any period.
Easy to combine metrics. As you can see in my example above, latency can be counted along with error rate or whatever else.
Works for (relatively) low request volumes. Because the calculation is a simple ratio, it is easy to compute even when traffic is low.
Con
Everything counts the same. There is no way, for example, to show that a Write is more important than a Read. When modeling customer behavior this can be very important. If I can’t view something I might just reload the page, but if I’ve spent 10 minutes working on a document and can’t save it, I leave frustrated.
Time is not captured. Using this model we can’t measure the thing a typical user cares about most: “How long was it down for?” Many times this is more important than “How many requests failed?”
Overly optimistic. This bucket approach tends to be overly optimistic because it weights everything the same. If you choose not to combine metrics, then you’re left with a large number of SLOs and no way to get a single number.
Good and Bad Minutes
When I hear people talk about nines, this is the metric I want them to talk about. This is also typically how companies publish their SLAs.
For every minute we measure some metric. Sometimes we define a lookback period (for example, the past 15 minutes). Then we compute the fraction of good minutes over total minutes (in DataDog this is a monitor based SLO).
We can then combine metrics by counting a minute as “bad” when any of its minute based metrics was “bad”.
latency minutes = minutes where 99% of requests were fast
error minutes = minutes where 99% of requests were successful
good minutes = good latency minute && good error minute
SLI = good minutes / total minutes
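A minimal sketch of this calculation in Python, assuming we already have per-minute request counts (the Minute structure and the 99% targets are hypothetical):

from dataclasses import dataclass

@dataclass
class Minute:
    total: int        # all requests in the minute
    fast: int         # requests under the latency threshold
    successful: int   # requests that did not return a server error

def is_good_minute(m: Minute, target: float = 0.99) -> bool:
    if m.total == 0:
        return True  # no traffic: count the minute as good
    return (m.fast / m.total >= target) and (m.successful / m.total >= target)

def good_minutes_sli(minutes: list[Minute]) -> float:
    return sum(is_good_minute(m) for m in minutes) / len(minutes)

# Example: a day with 14 bad minutes out of 1440 -> roughly 0.99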
Pro
Allows for cross service comparisons. Time can be compared across services, regardless of what type of requests were made. For example, you can compare a platform service’s SLO with a user facing service’s and see whether both had impact at the same time. This can be powerful for dependency analysis.
Shows the impact of picking multiple metrics to include in an SLO. Every metric added can only lower the availability measurement, so the cost of including it is visible.
Most customers care about time. This measures that dimension specifically.
Still pretty simple. Each measurement over a given minute is easy to reason about, and the combination is a simple “and”, so there isn’t much machinery needed to understand the underlying calculation.
Con
Can be pessimistic. There is a significant “punishment” for barely missing a target in any given minute. For example, if your latency target is 99% of requests and you hit 98%, the whole minute is “bad”.
Must have a high volume of data. The time buckets for this type of SLO need to be small (otherwise large chunks of time will be “red” every time there is an issue). This means you need a high volume of data to be sure it’s accurate and not flappy.
Does not represent customers impacted. This is a time based metric, not a volume based metric, so it does not mean “how many people were impacted”; it means “how often did we meet the standard”. This is a nuance that can be confusing, particularly when it is first introduced.
Weighted Average
This approach is very similar to the “One Big Bucket” approach, except that instead of counting every request equally you weight different types of requests with a fudge factor.
Using weights helps address the problem where a more important type of request (such as a Write) gets drowned out by higher-traffic request types in the “One Big Bucket” approach.
SLI = ((0.3 * successful write requests) + (0.7 * successful read requests)) / ((0.3 * all write requests) + (0.7 * all read requests))
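As a sketch, the same calculation in Python (the request classes and the 0.3 / 0.7 weights are just examples):

def weighted_sli(counts, weights):
    # counts maps a request class to (successful, total); weights maps it to a fudge factor
    numerator = sum(weights[c] * ok for c, (ok, _) in counts.items())
    denominator = sum(weights[c] * total for c, (_, total) in counts.items())
    return numerator / denominator

# Example: writes weighted 0.3 and reads weighted 0.7, both at 99% success -> 0.99
weighted_sli({"write": (990, 1000), "read": (99000, 100000)},
             {"write": 0.3, "read": 0.7})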
Pro
Helps account for “more important” scenarios. Using weights will help you account for important scenarios (like Writes) while still allowing you to combine metrics into a single number.
The end result still looks like a percentage of requests impacted rather than time. When people reason about the number they can think of it as “how much traffic was impacted” and not “did we meet the standard” (as compared to the time approach above).
Con
Fudge factor is arbitrary. To me this is a big con, since the weight you choose is essentially arbitrary. It means that the final number does not have much practical meaning. The more metrics you add to the equation, the worse this gets.
Can be overly optimistic. Because this uses weights, a single metric going to zero only impacts the final result by the weight it was given. For example, if a metric is weighted 0.1 and drops to 0, the final calculation can still only ever drop to 0.9, as in the worked example below.
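For illustration, suppose (hypothetically) equal volumes of 100 write and 100 read requests, weights of 0.1 and 0.9, and an outage that fails every write while every read succeeds:

SLI = ((0.1 * 0) + (0.9 * 100)) / ((0.1 * 100) + (0.9 * 100)) = 90 / 100 = 0.9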
Customers Impacted Over Time
This is the least common approach I have seen, but it is well documented over on Brian Harry’s blog. It is a version of the minute based approach above, except that instead of measuring request counts you measure the number of customers impacted in each time period.
This version keeps all the pros of a time based metric while avoiding some of the cons.
latency minutes = minutes where 99% of customers did not have a slow request
error minutes = minutes where 99% of customers did not encounter an error
good minutes = good latency minute && good error minute
SLI = good minutes / total minutes
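A minimal sketch in Python, assuming that for each minute we know which customers were active and which of them saw a slow request or an error (the per-minute data structures are hypothetical):

def is_good_customer_minute(active, slow, errored, target=0.99):
    # active, slow and errored are sets of customer ids seen in the minute
    if not active:
        return True  # no active customers: count the minute as good
    ok_latency = 1 - len(slow & active) / len(active) >= target
    ok_errors = 1 - len(errored & active) / len(active) >= target
    return ok_latency and ok_errors

def customers_impacted_sli(minutes):
    # minutes is a list of (active, slow, errored) tuples, one per minute
    return sum(is_good_customer_minute(a, s, e) for a, s, e in minutes) / len(minutes)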
Pro
Measures what matters (customers). I like this approach because it measures total customers and not request rate. Request rate can be flappy because a single customer could encounter an edge case and retry many times (particularly if you don’t have good rate limits).
Con
Does not report a total count of customers impacted. This is still a time based metric, although maybe a less flappy one, so the final number still means “how often did we meet the standard”. This is a nuance that can be confusing, particularly when it is first introduced.
You need to have enough customers. Just as the minute based approach needs enough data, if you have a small number of customers this metric will be too low volume to be reliable.
High data cardinality is needed. In order to calculate customers impacted you need to tag every request with a customer and be able to aggregate those requests over a given period. Tools like DataDog have trouble with this, so depending on how you aggregate and store metrics this might be hard to do.
Conclusion
Although there may be many ways to measure service health, there are not that many ways to aggregate and combine that data. Each has its own strengths and drawbacks. In a future article I’ll use this survey to make some more practical recommendations on which approach to choose and why you would even want to aggregate metrics in the first place.