Last time I wrote about what your 9’s are worth and how to cheat them. Let’s go nerd out on some SLAs in the wild and see what we find. The usual caveats apply:
I am not a lawyer, so take this up with your own lawyer if you’re using this advice in an actual contract.
The SLA language quoted here was copied at the time of this writing, and the analysis is my own interpretation.
Let’s get started…
AWS API Gateway SLA
This SLA is for the API gateway for AWS. It’s a large service, lots of people use it, and (most importantly) they publish an SLA.
Relevant Sections of the SLA Language
AWS will use commercially reasonable efforts to make API Gateway available with a Monthly Uptime Percentage of at least 99.95% for each AWS region…
“Availability” is calculated for each 5-minute interval as the percentage of Requests processed by API Gateway that do not fail with Errors and relate solely to the provisioned API Gateway APIs.
An “Error” is any Request that fails due to an API Gateway internal service error.
“Monthly Uptime Percentage” for a given AWS region is calculated as the average of the Availability for all 5-minute intervals in a monthly billing cycle.
Analysis
In English: On average, we will not throw internal service errors 99.95% of the time, measured every 5 minutes.
Is this good?
On the one hand, averaging the availability of 5-minute buckets (rather than marking each bucket good or bad) is pretty rigorous. Every failed request pulls the average down, so a few really bad minutes can’t hide behind an otherwise good month. This might be a better approach than good/bad minutes.
However, this SLA does not include latency and only counts internal service errors, so the definition of availability seems very narrow.
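To make the averaging concrete, here is a rough sketch of the calculation as I read the SLA language. The function names, the traffic numbers, and the choice to treat an interval with no requests as fully available are all my own assumptions, not anything AWS publishes.

```python
SLA_TARGET_PCT = 99.95
INTERVALS_PER_30_DAY_MONTH = 30 * 24 * 12  # 8640 five-minute intervals


def interval_availability(requests: int, errors: int) -> float:
    """Percentage of requests in one 5-minute interval that did not fail."""
    if requests == 0:
        # Assumption: an interval with no traffic counts as fully available.
        return 100.0
    return 100.0 * (requests - errors) / requests


def monthly_uptime_percentage(intervals: list[tuple[int, int]]) -> float:
    """Plain average of per-interval availability over the billing cycle."""
    scores = [interval_availability(r, e) for r, e in intervals]
    return sum(scores) / len(scores)


# Scenario A: one request in a thousand fails, all month long.
low_grade = [(1000, 1)] * INTERVALS_PER_30_DAY_MONTH

# Scenario B: a clean month except two intervals (10 minutes) at 50% errors.
short_outage = [(1000, 0)] * INTERVALS_PER_30_DAY_MONTH
short_outage[100] = (1000, 500)
short_outage[101] = (1000, 500)

low_grade_uptime = monthly_uptime_percentage(low_grade)        # about 99.9
short_outage_uptime = monthly_uptime_percentage(short_outage)  # about 99.9884

print(f"low-grade errors: {low_grade_uptime:.4f}% "
      f"(meets SLA: {low_grade_uptime >= SLA_TARGET_PCT})")
print(f"short outage:     {short_outage_uptime:.4f}% "
      f"(meets SLA: {short_outage_uptime >= SLA_TARGET_PCT})")
```

Both scenarios pull the average down. The low-grade one misses 99.95% even though no single interval looks like an outage, which is exactly what a good/bad bucket approach with a threshold could hide.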
Verdict: Not bad, but a narrow definition.
Slack SLA
Many of us use Slack and it’s an interesting service to consider for an SLA. On the one hand, some customers probably use it for mission critical incident communication. On the other hand, many probably use it to share cat pictures.
Relevant Sections of the SLA Language
Downtime is the overall number of minutes Slack was unavailable during a calendar quarter (i.e., January 1 through March 31 and every three-month period thereafter). We calculate unavailability using server monitoring software to measure the server side error rate, ping test results, web server tests, TCP port tests, and website tests.
…
Uptime is the percentage of total possible minutes Slack was available during a calendar quarter. Our commitment is to maintain at least 99.99% uptime:
[(total minutes in quarter - Downtime) / total minutes in quarter] > 99.99%
Analysis
In English: For 99.99% of minutes every quarter you will be able to ping us or our internal measurements will tell us you are not receiving errors.
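It helps to turn the formula into an actual downtime budget. This is a small sketch of the arithmetic only. The helper names and the example quarter are my own, and it says nothing about how Slack decides whether a given minute counts as Downtime.

```python
from datetime import date


def minutes_in_quarter(year: int, quarter: int) -> int:
    """Total minutes in a calendar quarter (1 = January through March)."""
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    if quarter == 4:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, start_month + 3, 1)
    return (end - start).days * 24 * 60


def uptime_percentage(total_minutes: int, downtime_minutes: float) -> float:
    """(total minutes in quarter - Downtime) / total minutes in quarter."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_budget_minutes(total_minutes: int, target_pct: float) -> float:
    """Minutes of Downtime that still leave uptime at the target."""
    return total_minutes * (100.0 - target_pct) / 100.0


total = minutes_in_quarter(2023, 1)             # 90 days = 129600 minutes
budget = downtime_budget_minutes(total, 99.99)  # roughly 13 minutes

print(f"minutes in quarter: {total}")
print(f"downtime budget at 99.99%: {budget:.2f} minutes")
print(f"uptime after 30 minutes down: {uptime_percentage(total, 30):.4f}%")
```

So the commitment leaves roughly 13 minutes of Downtime for the whole quarter, which is why the definition of a "down" minute matters so much.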
Is this good?
This is also a minute-based definition of availability, so it might be pretty rigorous.
On the other hand, the definition of downtime is extremely vague. Does downtime mean that there was a single error during that minute? What error rate is used to calculate “unavailability”? Does it include latency? What “web server” tests are being run?
Verdict: Too vague for my taste. Maybe the internal definitions of uptime are strict and this is great, but maybe not. I went to look at Slack’s status page just now, for example, and their uptime still says 100% even though they have posted a status update about logins. Whether that impacts their SLA, I’m not sure.
Google Cloud SLA
Let’s go back to big tech and see what Google offers for their SLA.
Relevant Sections of the SLA Language
the Covered Service will provide a Monthly Uptime Percentage to Customer of at least 99.95% … Customer must also provide Google with log files showing Downtime Periods and the date and time they occurred
"Downtime" means more than a one percent Error Rate.
"Downtime Period" means a period of one or more consecutive minutes of Downtime. Partial minutes or intermittent Downtime for a period of less than one minute will not be counted towards any Downtime Periods.
Analysis
In English: For 99.95% of minutes, we will have an error rate of 1% or less.
Is this good?
This is a worse guarantee than AWS, since minutes are scored good/bad rather than averaged, and a minute only counts as down once its error rate goes above 1%. This is also a narrow definition of downtime, since only errors are included (not latency).
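A quick sketch shows how much the 1% threshold can hide. The quoted language does not spell out the Monthly Uptime Percentage formula, so this assumes it is simply the share of minutes in the month that are not Downtime. The function names and traffic numbers are made up for illustration.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43200


def is_downtime_minute(requests: int, errors: int) -> bool:
    """'Downtime' means more than a one percent Error Rate."""
    if requests == 0:
        # Assumption: a minute with no traffic is not Downtime.
        return False
    # Integer comparison avoids floating-point edge cases at exactly 1%.
    return errors * 100 > requests


def monthly_uptime_percentage(minutes: list[tuple[int, int]]) -> float:
    """Assumed formula: share of minutes in the month that are not Downtime."""
    down = sum(1 for r, e in minutes if is_downtime_minute(r, e))
    return 100.0 * (len(minutes) - down) / len(minutes)


def average_success_percentage(minutes: list[tuple[int, int]]) -> float:
    """For contrast: the averaged approach, applied to the same minutes."""
    return sum(100.0 * (r - e) / r for r, e in minutes) / len(minutes)


# Every single minute of the month fails 0.9% of its requests.
month = [(1000, 9)] * MINUTES_PER_30_DAY_MONTH

threshold_uptime = monthly_uptime_percentage(month)
averaged_uptime = average_success_percentage(month)

print(f"good/bad minutes with 1% threshold: {threshold_uptime:.2f}%")
print(f"averaged availability:              {averaged_uptime:.2f}%")
```

Under these assumptions a month that fails 0.9% of all requests still reports 100% uptime, while the averaged number comes out near 99.1%.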
Verdict: Fine but not great. 99.95% is somewhat deceptive, because a minute with up to 1% errors still counts as fully up.
GitLab SLA
Relevant Sections of the SLA Language
For each service and feature described above, GitLab measures two service level indicators (“SLIs”)…
The Error SLI is an indication of requests that are successful, (i.e. not returning a 5xx error).
The Apdex SLI is an indicator of requests that complete with a satisfactory latency. Apdex is defined using the industry definition with two latency thresholds: satisfactory and tolerable. For Dedicated, satisfactory requests take less than 1s to complete, tolerable requests take less than 10s to complete.
Service Level Availability is then calculated using the following measurement:
For each calendar month, we calculate the sum of the combined SLI scores for all requests in that month, excluding any requests made during maintenance windows, and divide this by the total number of requests during that period (again, excluding maintenance windows)…
GitLab’s current monthly service level objective for GitLab Dedicated is 99.5% (the “Service Level Objective”).
Analysis
In English: 99.5% of requests made to our service every month will be error free and faster than 10 seconds.
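For reference, here is a minimal sketch of the industry Apdex formula using the two thresholds quoted above. In that formula a tolerable request only counts for half, so "faster than 10 seconds" is a loose paraphrase. How GitLab combines the Apdex SLI with the Error SLI is not spelled out in the excerpt, so this only covers the latency half, and the sample latencies are invented.

```python
SATISFACTORY_SECONDS = 1.0  # from the quoted SLA language
TOLERABLE_SECONDS = 10.0    # from the quoted SLA language


def apdex(latencies_seconds: list[float]) -> float:
    """Industry Apdex: (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for s in latencies_seconds if s < SATISFACTORY_SECONDS)
    tolerating = sum(
        1 for s in latencies_seconds
        if SATISFACTORY_SECONDS <= s < TOLERABLE_SECONDS
    )
    return (satisfied + tolerating / 2) / len(latencies_seconds)


# Invented sample: 90 fast requests, 8 slow-but-tolerable, 2 too slow.
sample = [0.2] * 90 + [3.0] * 8 + [12.0] * 2
score = apdex(sample)

print(f"Apdex: {score:.2f}")
```

In this sample 98% of requests finish in under 10 seconds, but the Apdex score is 0.94 because the eight tolerable requests only earn half credit.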
Is this good?
This is the most descriptive SLA and kudos to GitLab because they are open about everything. For that they get a lot of credit.
They are also the only ones above mentioning latency as part of their SLA, which I also think makes this SLO “worth more” even if it has a lower guarantee.
Verdict: High marks for including latency, but the single-bucket approach lowers the value of this SLA. Without measuring per minute, a high volume of successful requests can drown out “real” errors and outages on longer time scales.
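Here is a rough sketch of that drowning-out effect. It only looks at errors (not latency), the traffic numbers are invented, and the rule that a minute is "bad" when more than half of its requests fail is my own assumption for the per-minute comparison.

```python
SLO_PCT = 99.5
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43200


def request_weighted_availability(minutes: list[tuple[int, int]]) -> float:
    """Single bucket: successful requests / all requests for the month."""
    total = sum(requests for requests, _ in minutes)
    failed = sum(failures for _, failures in minutes)
    return 100.0 * (total - failed) / total


def per_minute_availability(minutes: list[tuple[int, int]]) -> float:
    """Share of minutes that were 'good'.

    Assumption: a minute is bad when more than half of its requests fail.
    """
    bad = sum(1 for r, f in minutes if r > 0 and 2 * f > r)
    return 100.0 * (len(minutes) - bad) / len(minutes)


# Busy and healthy most of the month: 1000 requests/minute, no failures.
# Then a 4-hour overnight outage: 10 requests/minute, every one fails.
outage_minutes = 4 * 60
month = ([(1000, 0)] * (MINUTES_PER_30_DAY_MONTH - outage_minutes)
         + [(10, 10)] * outage_minutes)

single_bucket = request_weighted_availability(month)
per_minute = per_minute_availability(month)

print(f"single bucket: {single_bucket:.4f}% (meets SLO: {single_bucket >= SLO_PCT})")
print(f"per minute:    {per_minute:.4f}% (meets SLO: {per_minute >= SLO_PCT})")
```

With these numbers the single bucket reports about 99.994% and comfortably meets 99.5%, while the per-minute view reports about 99.44% and misses it, even though both describe the same 4-hour outage.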
Conclusions
We can learn a lot about how to calculate an SLO/SLA by looking at what’s publicly available from some major vendors (as long as we are willing to wade through some legal jargon).
Keep in mind these SLAs are the lowest bar for availability (the point at which the company must give money back), so I expect the internal SLOs to be more rigorous than these.