The other day I was sitting in a meeting with an engineering team. I had my slides up showing our SLOs for the month, how many incidents each service had, and what types of repairs the teams were working on for various services, when it dawned on me…
We are so focused on what we’re doing with these reports that we may have lost sight of why we’re talking about them in the first place.
I mean, ultimately everyone understood that incidents == bad and we want to fix them and measure things. But many of the people in the room probably thought of these metrics and presentations as a tax on their ability to ship features.
The question I needed to be able to answer for them was: Why are we doing things this way?
When you are working on availability 24x7, it can become easy to forget that most people have not read the Google SRE book, are not steeped in SLOs, and are mostly trying to ship code on a day-to-day basis.
Back to the Basics
One of the main jobs of SRE at any company is to raise the overall bar for the engineering teams in terms of how they think about availability. Often raising that bar comes back to massively over-communicating things that seem very basic to us. I have certainly been guilty of thinking “of course everyone knows this”.
The bar we are raising should be easy to briefly explain to anyone from a junior engineer all the way up to a senior executive.
And the most important, most basic thing we all need to communicate is: Why?
The Virtuous Cycle
If you have operated a high-volume production service before, I am willing to bet you already have a picture in your head of how to operate one. I like to think of it as a virtuous cycle.
The cycle itself is very simple. It doesn’t need a lot of fancy terms and it gets to the core point about service operation that most SREs already natively understand but may fail to externalize to everyone else.
It is the why that everyone needs to grasp in order to really understand what the incident reviews, SLOs, and other processes are all about.
Here is my own simplified diagram with only four stages (similar to those described in Seeking SRE or the SLO development lifecycle):
Measure. We measure metrics that tell us when customers are impacted by problems in our service. If you’re a platform service you still have customers; they’re just internal. Usually these metrics are expressed as SLOs.
Respond. When the metrics tell us something bad happened we create incidents and respond to them so we can fix the things that break. We do it in a standardized way so we can be as fast as possible.
Review. Once an incident is over we go back to it and figure out why it happened and how we can do better.
Improve. Based on what we learned we improve our metrics or improve our code so the failure that happened has less impact on our customers the next time.
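For readers who think in code, the four stages above can be sketched as a tiny loop. This is only an illustration under stated assumptions, not how any particular team or tool implements it: the function names, the `Incident` record, the "checkout" service, the request counts, and the 99.9% target are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record used only for this sketch."""
    service: str
    observed: float                                   # availability when opened
    findings: list[str] = field(default_factory=list)  # filled in at review
    actions: list[str] = field(default_factory=list)   # follow-up improvements


def measure(good_requests: int, total_requests: int) -> float:
    """Measure: fraction of requests that served customers successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so no customer was impacted
    return good_requests / total_requests


def respond(service: str, observed: float, slo_target: float) -> Incident | None:
    """Respond: open an incident when the measurement misses the target."""
    if observed < slo_target:
        return Incident(service=service, observed=observed)
    return None


def review(incident: Incident, findings: list[str]) -> Incident:
    """Review: record why it happened once the incident is over."""
    incident.findings.extend(findings)
    return incident


def improve(incident: Incident, actions: list[str]) -> Incident:
    """Improve: record the metric or code changes that reduce future impact."""
    incident.actions.extend(actions)
    return incident


# One turn of the cycle with made-up numbers.
availability = measure(good_requests=99_820, total_requests=100_000)
incident = respond("checkout", availability, slo_target=0.999)
if incident is not None:
    review(incident, ["Retry storm after a dependency timeout"])
    improve(incident, ["Add jittered backoff to the client"])
```

The point of the sketch is the shape, not the details: each stage feeds the next, and the output of Improve changes what Measure sees the next time around.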
Keeping It Simple
Why are we doing this? is the most important question we can answer in SRE. And even though it may seem obvious to those of us who are staying up at night reading books on this stuff, for many others in the development trenches it is not.
There are many more advanced concepts we can build into this simple model, but the basic concept needs to be in everyone’s brain first. Level-setting this very simple model of service operation through repetitive communication is one of the most important things we can do as SREs to raise the collective consciousness, from the earliest-in-career engineer to the most senior executive.