I recently watched Lorin Hochstein's LFI talk "Your Understanding of Reality is Wrong". It made me want to write about how the key function of SRE is to help shape engineering’s perception of reality rather than to act as a gatekeeper.
An engineering organization’s perception of reality determines what it does and does not prioritize, which is why it matters so much. SRE is the main lever in driving cultural accountability for availability (just as the security team should be for security).
How should SRE use their unique position to shape the software engineering organization’s perception of reality? Here are some of the things we do, gathered from my time at GitHub.
SRE is the sunlight.
SRE should be shining light on availability issues through data and storytelling.
Providing availability data. SRE is the team that provides accurate and consistent reports on availability. The data should be correct (this is harder to do than it may seem). It needs to be published consistently (ideally in real time). The reports also need to be easy enough to use to glean actionable insight.
Providing availability examples. It’s not enough to provide reporting and tooling. SRE will need to perform a certain amount of manual curation so they can present incidents as part of a story and manually identify deeper trends on a monthly basis. This is the storytelling value SRE can provide that no one else can.
Being visible. SRE needs to establish a platform in the company for being consistently visible. For instance, I write a weekly internal newsletter that highlights availability work around the company and shamelessly steals from SRE Weekly and Software Lead Weekly. This provides a constant reminder to people across the engineering teams that availability is important and gives us a platform to highlight changes.
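To make the "availability data" point above a little more concrete, here is a minimal sketch of the arithmetic behind a single line of an availability report. It assumes availability is measured as the fraction of successful requests and that each service has an SLO target; the names `AvailabilityReport`, `build_report`, and the `checkout` service are hypothetical and only for illustration, not a description of any tooling we actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityReport:
    """One row of a hypothetical availability report."""
    service: str
    availability: float     # fraction of successful requests, 0.0 to 1.0
    budget_consumed: float  # fraction of error budget used; above 1.0 means overspent


def build_report(service: str, total: int, failed: int, slo_target: float) -> AvailabilityReport:
    """Compute request-based availability and error budget consumption.

    Assumes a request-count SLO: the error budget is the share of requests
    allowed to fail, i.e. (1 - slo_target) * total.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    availability = (total - failed) / total
    allowed_failures = (1.0 - slo_target) * total
    budget_consumed = failed / allowed_failures
    return AvailabilityReport(service, availability, budget_consumed)


if __name__ == "__main__":
    report = build_report("checkout", total=1_000_000, failed=500, slo_target=0.999)
    print(f"{report.service}: {report.availability:.4%} available, "
          f"{report.budget_consumed:.0%} of error budget consumed")
```

With a 99.9% target, 500 failures out of a million requests uses roughly half of the budget. The arithmetic is the easy part; the hard part, as noted above, is getting trustworthy `total` and `failed` counts and publishing them consistently.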
SRE is the lifeline.
SRE should be the lifeline team available for other engineering teams to utilize when they have problems.
Lending SREs as a priority over other projects. If a team asks for an SRE’s help and you have any chance of sparing an SRE to help, do it. Developing a close relationship with a team, and helping them with a challenging problem, is something that will give you a lot of trust to spend in the future with that team.
Spending time with junior engineers. There tends to be a barrier to cross-team pairing with more junior engineers. However, spending SRE time mentoring more junior engineers on another team is a great way to spread cultural change. More junior engineers are often more open to new information and don’t yet have a firm stance (or history) on the way things are done. Teaching them early can help propagate new cultural ideas quickly.
Taking a lot of meetings. Not every availability challenge is a technical one (most are not, in my experience). Listening to others’ problems and plans is a great way to understand the organizational challenges you are likely to have to overcome at some point.
Management is the gatekeeper.
This last section is about what SRE is not. Error budget enforcement, availability review follow-up, and asking for resources on availability tasks should not be things the SRE team does. When SRE does them, the problems just go underground. The management chains of the respective engineering teams need to perform these tasks using the data provided by the SRE team.
Creating information conduits with executives. Just because SRE doesn’t hold the “stick” in this model doesn’t mean it doesn’t help push the levers. It is the job of the SRE team to create and maintain good information conduits with the right executives. This may mean making sure to have direct messages with the right VPs or it may mean special executive-focused meetings. Becoming a trusted resource to the executive team is crucial to SRE being successful.
Understanding that maybe availability isn’t the priority right now. I know… an SRE team saying that?! SRE will have to be willing to accept that leadership in a particular organization may decide that availability isn’t the most important thing.
Maybe this is the right decision for a variety of reasons (revenue, customer expectations, etc.) or maybe it’s not (“we are too busy”, noise around feature development). Here is my (possibly) controversial opinion:
It’s better to let a team fail and help them back up than it is to “cry wolf” about availability to a team that doesn’t think they should prioritize it.
When a team that SRE has provided data to fails, it usually fails in a public (and painful) way. If SRE has been supportive and trustworthy, that team will usually reach out and become an advocate for availability in the future. It is better to let them learn that lesson on their own than to create an adversarial relationship.
Helping executives understand how to use the “stick” and what it might do. Since SRE has seen it before, a key role is coaching executives on what a “red is good” mentality means and how punitive action instead of support can cause watermelon metrics (green on the outside, red on the inside). This is another unique value SRE provides that other teams may not be able to.
Perception shaping takes time and can seem less impactful at first.
Much of the reality-building work you undertake is going to take a while. In this recent SRE in the real world post, Niall Murphy calls out benefit timelines:
You might be hired as the vanguard or the hindmost, but the key point is that the benefits don’t tend to be felt until 18-24 months in.
While I think you can (and absolutely must) show incremental value along the way, the real benefits of shaping reality will take a long time to show up on the teams you work with.
The goal should always be to roll the availability boulder up the hill just a little further, not to force massive seismic shifts or block teams from having their own accountability models.
This will take longer. This will involve cultural fortitude on the SRE side (being willing to work with “okay” and not “perfect”). This will involve management buy-in. Ultimately though, this will be a more successful approach in the longer term.