Tiered Availability Review
Presenting availability at different levels. Why it's important and how you can do it.
Last time I talked about the 3 pillars of availability. One of them was operations review. While I provided some detail, I didn’t go too far into specifics. This time I want to talk about a more specific set of forums you can create and use to present availability to a variety of audiences.
First, a caveat: the size of your engineering organization matters. At a startup you may only have a single availability review. At a massive company there may be many layers of complexity that already exist. For the purposes of baselining, I’m talking about a ~1500 engineer organization (mid-tier), which is the size that GitHub currently operates at.
Of course, to go back to the 3 pillars of service availability, in order to have these forums you need to start collecting the data.
Why are these forums important?
Like I mentioned in the virtuous cycle, it’s not enough to have a bunch of cool looking dashboards and metrics. It’s not even enough to have emails and reports go out with the numbers and metrics (even if they go to executives).
You need to have forums within the engineering organization that drive accountability for availability. These are how you raise the cultural bar that says “this stuff is important”. The forums are not set up for the purpose of slogging through numbers (though they contain numbers for accountability’s sake). They are set up so we can constantly discuss these questions at each level of the organization:
Do our availability metrics actually represent the reality our customers are experiencing?
Are we prioritizing enough availability work (debt) in relation to our feature work?
Are there any common themes to the data that tell us where to invest our limited resources?
For all of these meetings, once a month is probably the right initial cadence. Adjust as needed.
Engineering Team Availability Review
Meeting Audience: Individual Engineering Team
Purpose of the Meeting:
Making sure everyone understands service operation concepts.
Emphasizing the cultural importance of service availability vs. feature delivery.
Constantly thinking about debt for planning purposes.
Important Metrics: Incident counts, service SLO, incident repair items, bug count
This might be the easiest meeting to set up. Chances are there are teams in your organization that already do this well. You can usually learn from (and promote) their behavior to create templates and tools for the rest of your engineering organization.
The main goal of this meeting is to make sure that all of the engineers on the immediate team internalize the concept of availability.
Does the junior engineer understand how we measure service operation? Is the senior engineer reminding folks during a stressful project that it’s still important to test everything and ship with feature flags? This meeting should help keep these things on their radar. It does not need to be a long meeting to be effective.
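To make “how we measure service operation” concrete for a team review, here is a minimal sketch of a request-based availability number and the share of error budget left for the period. The `ServiceWindow` shape, the field names, and the request-based SLO are my assumptions for illustration; your services may define their SLOs differently (for example, by time or by latency).

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over a review period (hypothetical shape)."""
    name: str
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability(window: ServiceWindow) -> float:
    """Fraction of requests that succeeded; 1.0 if there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: ServiceWindow) -> float:
    """Share of the period's error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        # No traffic or a 100% target: any failure at all overspends the budget.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


if __name__ == "__main__":
    window = ServiceWindow("api", total_requests=1_000_000, failed_requests=250, slo_target=0.999)
    print(f"{window.name}: availability {availability(window):.5f}, "
          f"budget left {error_budget_remaining(window):.0%}")
```

A number like “budget left” tends to be easier to discuss in a short team meeting than a raw percentage with many nines, because it maps directly onto the debt-versus-features question.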
Director Level Availability Review
Purpose of the Meeting:
Sharing best practices across engineering teams.
Creating (positive) peer pressure among engineering managers to hold a high bar.
Standardizing an availability mentality across teams.
Important Metrics: SLOs for individual services, incident count and recovery time, repair items and debt, major initiatives (availability and disaster recovery)
Having meetings at the team level is great but typically those meetings don’t result in broad cultural buy-in. For that you need a space for teams to talk to each other about what they’re doing and how things are going. This meeting is to provide a safe space for the line managers to exchange ideas and standardize on concepts they will promote with their teams. It’s also a good place to promote major availability initiatives (the stuff that may not fall under incident repair).
The important thing about this meeting is that it isn’t an executive-focused “justification” type of meeting. The accountability that happens here is among peers and colleagues, not a top-down presentation, which makes it more likely that people will share failures and hold each other accountable (rather than try to paint rosy pictures or promote their work).
Executive Availability Review
Meeting Audience: Head of Engineering and direct reports
Purpose of the Meeting:
Provide high level themes and trends to executives about the current availability picture (telling a story).
Provide numbers to executives to foster discussion about where the company is at and how much investment is needed.
Requests for action in trouble areas.
Important Metrics: Product SLOs, incident frequency and repair per organization (MTTR, MTBF, etc.), incident themes across all organizations, customer support ticket numbers
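For reference, MTTR and MTBF can be derived from a plain list of incident start and resolve times. The sketch below is one common way to compute them, not a prescribed method: the `Incident` type is hypothetical, MTBF is taken as uptime in the period divided by the number of incidents, and incidents are assumed not to overlap and to fall entirely inside the period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One incident with its start and resolution time (hypothetical shape)."""
    started: datetime
    resolved: datetime


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum of incident durations, assuming incidents do not overlap."""
    return sum((i.resolved - i.started for i in incidents), timedelta(0))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to repair: average incident duration."""
    if not incidents:
        return timedelta(0)
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], period_start: datetime, period_end: datetime) -> timedelta:
    """Mean time between failures: uptime in the period divided by incident count."""
    period = period_end - period_start
    if not incidents:
        return period  # no failures observed in this period
    return (period - total_downtime(incidents)) / len(incidents)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    incidents = [
        Incident(start + timedelta(days=3), start + timedelta(days=3, hours=1)),
        Incident(start + timedelta(days=20), start + timedelta(days=20, hours=3)),
    ]
    print("MTTR:", mttr(incidents))
    print("MTBF:", mtbf(incidents, start, end))
```

Computing these per organization and comparing month over month gives you the trend lines; the story about why they moved still has to come from you.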
The main point of this meeting is to build executive fortitude around availability. This meeting is the forum to remind them of common weaknesses in the organization and let them discuss the numbers and trends you present.
It is also to present any requests you have for action on the availability trends. Sometimes the discussion may lead to actions you didn’t anticipate but will help you align more with your engineering leadership.
This is a tough meeting to get right. There are a lot of pseudo-rules to these types of presentations if you want the right outcomes. Will Larson has done a better job writing about how to run one of these meetings than I can.
The most important thing about this meeting is to bring value to the meeting for the executives. Whatever trends and insight you can bring to them (that they don’t already know) is what is going to keep them coming back to this meeting. If this is a boring by-the-numbers presentation they are not going to sacrifice their precious time to attend.
Building Momentum
Laying out a blueprint like this might sound simple, but the main challenge with creating these forums will be sustaining the momentum around them. In order to do that you will need:
Engineering team buy-in. While this is the easiest thing to achieve on some teams, other teams may not see value in this meeting. Including on-call health (how many pages did we get?) is one good way to focus the meeting on something individual engineers care about. Another way is to establish a few beachheads of teams that do this well and promote them by raising their visibility in the organization.
Executive buy-in. Getting executive buy-in on availability work is going to be your greatest challenge. Often you get this “for free” after a catastrophic outage or a series of embarrassing ones (of course, this comes with its own challenges). If you don’t have the buy-in, the best thing you can do is start preparing numbers and raising the visibility of these themes through other means. This may eventually bubble up to the executive suite and get you the buy-in you need (or eventually there will be a catastrophic outage and then people will be looking for these metrics).
Director buy-in. If you’ve accomplished the first two bullet points, this one will come from executive priorities or teams pushing their management chain from the bottom up. Similar to how you may use a particularly good engineering team as an example, you can do the same with a director or group in order to encourage others to follow.
All of this will take time and a particular skill set from your platform or SRE team, but that probably deserves its own separate post.