The 3 Pillars of Service Availability
Basic things you need to drive accountability for availability.
Last time I talked about making sure we baseline what it means to operate a service with the virtuous cycle. Once people have this concept internalized it’s time to take the next step and baseline how to implement some of the cycle.
Clearly I am all about the branding when it comes to communication throughout the engineering organization, so here is another thing you can say 7 times over…
The 3 Pillars of Service Availability
Like three massive stone columns holding up a Greek monument, these are the main things you need to implement to keep your service reliable. They are the way we keep teams and services accountable for our customer experience:
Incident Response. If you do nothing else, make sure you have an incident response lifecycle that standardizes response, tracks the incidents that happen, and performs reviews with follow-up incident repairs for your service (how do we do better next time?).
SLOs. Much has been written about SLOs. Ultimately, you need to measure what the customer experience is for your service so you can understand when it starts to degrade.
Operations Review. This is not a production readiness review (PRR). An operations review is a forum to review the first two pillars (and maybe more metrics) on a monthly cadence. It is a way to look back at history and create accountability to make sure the right things are happening. Usually you’ll want to do an operations review at the team or director level (line manager or one level up from the line manager). If you go any higher in the management chain, you’re not exposing the actual people on the ground to the data. Executive-focused meetings are still important but should be different from the operations review.
You don’t need to do these things well to get better at availability. Implementing these pillars at all matters more than implementing them well. To raise the bar for availability you first need visibility and reporting, even if they aren’t that great at the start. The visibility of the data is what is most important.
Who is running these processes is as important as what they’re doing. The engineering team operating the service needs to own and operate the 3 pillars. This is how accountability can be driven within the team. The job of your SRE or platform teams should be to provide the tooling and templating necessary to enable teams to implement the 3 pillars, not to actually lead the reviews or implement the processes.
Implementing the 3 Pillars
If you don’t have these kinds of things, or you want to validate what you have, I’ve provided some resources below to create a small toolkit.
Incident Response
The most comprehensive place I can point you to is the PagerDuty documentation on incident response. After reading that you can check out the set of resources they publish for all phases of the lifecycle. All of these things are useful regardless of whether or not you actually use PagerDuty.
SLOs
The Google SRE book is a great source of wisdom on implementing SLOs. I think it is important to take it with a grain of salt, since it is based on Google’s experiences. And let’s face it, you’re probably not Google.
Another great place to start on the SLO journey is the Implementing Service Level Objectives book. It has quite a lot of great practical guidance on the subject.
On the digital side of things you can also check out the SLO development lifecycle.
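To make "measure what the customer experience is" a little more concrete, here is a minimal sketch of a request-based availability measurement and the error budget that falls out of it. The names (SloWindow, availability, error_budget_remaining) and the definition of a "good" request are assumptions for illustration, not something taken from the resources above; your own SLI will depend on what your customers actually care about.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one service over a reporting window (hypothetical shape)."""
    total_requests: int
    good_requests: int  # requests that met whatever success criteria you chose


def availability(window: SloWindow) -> float:
    """Fraction of requests that were good; treated as 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget left: 1.0 is untouched, 0.0 or below is exhausted."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0
    # Number of bad requests the SLO tolerates over this window.
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    month = SloWindow(total_requests=1_000_000, good_requests=999_500)
    print(f"availability: {availability(month):.4%}")
    print(f"error budget remaining: {error_budget_remaining(month, 0.999):.0%}")
```

Even a rough number like this is enough to start with: the point is that the team can see when the experience starts to degrade, not that the measurement is perfect.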
Operations Review
This is a trickier place to find a good toolkit. Here is a link to a very basic sample set of slides for an operations review. The point of this meeting is to make sure someone is thinking about themes and trends and raising their visibility with the rest of the team. I’ve included some fake metrics and data points to give an idea of what you might want to include in these types of slides.
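If you want to feed the slides from data rather than by hand, a monthly rollup of incident records is one place to start. The sketch below is only an illustration under assumptions: the Incident fields, the severity labels, and the monthly_summary function are all hypothetical, and your incident tracker will have its own shape.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    """One tracked incident (hypothetical fields, adapt to your tracker)."""
    opened: date
    severity: str       # illustrative labels such as "sev1", "sev2"
    review_done: bool   # has the post-incident review happened?
    open_repairs: int   # follow-up repair items still open


def monthly_summary(incidents: list[Incident], year: int, month: int) -> dict:
    """Summarize the incidents opened in one calendar month for a review slide."""
    in_month = [
        i for i in incidents
        if i.opened.year == year and i.opened.month == month
    ]
    return {
        "total": len(in_month),
        "by_severity": dict(Counter(i.severity for i in in_month)),
        "reviews_pending": sum(1 for i in in_month if not i.review_done),
        "open_repairs": sum(i.open_repairs for i in in_month),
    }


if __name__ == "__main__":
    # Fake data, in the same spirit as the sample slides.
    incidents = [
        Incident(date(2024, 3, 2), "sev2", True, 1),
        Incident(date(2024, 3, 15), "sev1", False, 3),
        Incident(date(2024, 4, 1), "sev2", True, 0),
    ]
    print(monthly_summary(incidents, 2024, 3))
```

Comparing these summaries month over month is what surfaces the themes and trends the meeting is meant to discuss.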
This is not the end.
Keep in mind, these are only the very basic things you will need. There are many more tools available to you in the service availability toolkit (for example, service tiers and a maturity model).