All credit for the Rule of 5 Errors goes to Alex Palcuie, who presented it in his 2022 SRECon talk. It has been one of the most useful ways to frame error rates that I have seen. I have found myself referring to it multiple times, and since there isn't a standalone post on it, here we go.
The Rule of 5 Errors is a simple heuristic: we don't start tracking errors on a given metric unless it has 5 or more errors over a given time frame.
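To make the heuristic concrete, here is a minimal Python sketch of the check. The constant and function name are my own illustrations, not anything from the talk:

```python
# Illustrative sketch of the Rule of 5 Errors; names are my own.
MIN_ERRORS = 5  # below this, the sample is too small to be meaningful

def enough_errors_to_evaluate(error_count: int) -> bool:
    """Only treat a window's errors as signal once there are 5 or more."""
    return error_count >= MIN_ERRORS

print(enough_errors_to_evaluate(3))  # False: treat as noise
print(enough_errors_to_evaluate(7))  # True: worth evaluating
```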
To make this easier to visualize, here is a table that illustrates what error rate percentages look like for 5 errors at different request volumes:
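| Requests  | Errors | Error rate |
|----------:|-------:|-----------:|
| 100       | 5      | 5%         |
| 1,000     | 5      | 0.5%       |
| 10,000    | 5      | 0.05%      |
| 100,000   | 5      | 0.005%     |
| 1,000,000 | 5      | 0.0005%    |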
I have found this table to be an awesome illustration of the concept that low-volume data tends to be noisy. Yes, that sounds intuitive, but the table gets the point across far better than words alone.
For example, looking at the table above, if you only get 100 requests, 5 errors represent a 5% error rate, something you may well alert on if you are targeting 99% availability. Monitoring and alerting at that data volume will quickly cause alert fatigue.
In general, you can use this table to evaluate whether any given metric's minimum data volume is high enough to alert on reliably.
My general rule of thumb is that if you don't have 1,000 or more requests during low-traffic periods (for example, the weekend), you may want to reconsider how you are building the metric, or turn off alerting whenever volume falls below that level. In Datadog, you can implement a rule like that using composite monitors.
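Outside of any particular vendor, here is a Python sketch of the logic a composite monitor expresses: alert only when the error rate is high *and* there is enough traffic for the rate to be trustworthy. The thresholds and names below are illustrative assumptions, not Datadog's API:

```python
# Volume-gated alerting: both conditions must hold before we page anyone.
# Thresholds and names here are illustrative assumptions.
MIN_REQUESTS = 1_000          # below this volume, suppress alerting entirely
ERROR_RATE_THRESHOLD = 0.01   # alert above 1% errors, i.e. a 99% availability target

def should_alert(request_count: int, error_count: int) -> bool:
    if request_count < MIN_REQUESTS:
        return False  # too little traffic: the rate is noise, not signal
    return (error_count / request_count) > ERROR_RATE_THRESHOLD

# 5 errors out of 100 requests is a 5% rate, but the volume is too low to alert.
print(should_alert(100, 5))       # False
# 150 errors out of 10,000 requests is a 1.5% rate on solid volume.
print(should_alert(10_000, 150))  # True
```

In Datadog terms, this maps to two monitors, one on error rate and one on request volume, combined in a composite monitor so the alert only fires when both conditions trigger at the same time.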