Over the last year I rebuilt an SRE team. It made me start to think a lot about what an SRE is and, maybe more importantly, what they are at GitHub specifically.
What the heck is a Site Reliability Engineer?
Unfortunately, I don’t think anyone actually knows what an SRE is. I certainly don’t, even though it’s in my blog description and I just said I built a team of them.
Depending on the company, it can be anything from Ops+, to Developer+, to 24x7 on-call incident responder (I don’t recommend the last one). It can be a centralized role across many teams or a distinct role on a specific team. More details on various ways to set up SRE here.
In fact, the scariest part for me about calling myself an SRE is wondering if people will make assumptions about my skillset that aren’t true because their idea of what an SRE is differs from my own.
What I do know about Site Reliability Engineers, though, is that they are specifically focused on reliability and, at least at GitHub, they are developers.
What makes a successful SRE?
Regardless of the actual job, I do concretely believe one thing about it:
In order to be truly great at their jobs, SREs need to be great evangelists.
Good SREs can be great at a lot of other things: writing code, debugging the network stack, deploying applications, understanding the build system… but to be truly great you will absolutely need to be a great evangelist.
Maybe not quite like this guy, but with some definite communication and passion.
Availability is boring.
Coming from the world of feature development, I can tell you firsthand that there is not a lot of glitz and glamor in making things more reliable. As much as we would like our product managers to focus on the business impact of low reliability, the reality is that even if you have good metrics to justify reliability improvements, a lot of the time features are just more fun.
How often do you do a launch party for a successful month of availability? Usually availability involves writing up an incident post-mortem or some other amount of paperwork and meetings. Sounds… great.
One of the main challenges for an SRE in cultural situations like these is to make reliability more interesting. For those already in the space, it actually is interesting, but we need to be able to externalize (dare I say evangelize?) that feeling.
SREs must be force multipliers.
Most of the time your average feature team is not going to be jazzed about making reliability improvements. Even when there is some important and obvious reliability work to prioritize, the team may have a lot of other external pressure to ship features.
What’s more, there is typically a very small SRE-to-team ratio. Maybe there is one SRE per team (if you’re really lucky). More likely there is one SRE to many teams. There is no way that single SRE is going to be able to accomplish all the reliability work needed by a highly available, large-scale service.
And here is where the evangelism comes in…
The real trick is not in doing the work yourself (even if you are great at it), but in making sure the rest of the team/org/company internalizes the value of reliability so that they start to work on it too (and maybe even think it’s cool?).
Developers are usually bad at evangelism.
Okay, not always, but the first book you read in your freshman computer science curriculum was probably not How to Win Friends and Influence People. Some people are naturally great at working with (and motivating) others, but most of them don’t choose a career of quietly sitting in front of a computer every day.
Working with teams in an organization, understanding what their motivations are, and helping to nudge them in the right direction is a learned skill. And it is likely one that you will fail at the first time you try it.
More than most other technical positions though, SREs need to have some ability in this area (even the more junior ones).
Examples of Evangelism
Incident Review. There is certainly an art to working with a team on writing up an incident review. It can be as simple as helping them understand the goal of the write-up (hint: it’s not just paperwork) or as complicated as figuring out how to get them to be the ones to suggest the right repair so they feel like they own it.
Executive Escalation. The reality of working in SRE is that there are going to be times that things break so badly that you will need to deal with an executive (or figure out how to escalate a problem to an executive). This is not a “normal” developer skill. In fact, I think this is one of the perks of working as an SRE. You are exposed to a lot more people outside of the team than you would be in typical software development.
Debugging Complex Failure. Many of the hardest reliability problems to solve involve complex system boundaries (rate limits, multiple systems intersecting to fail, unexplained resource contention). In those cases it is very likely you will need the ability to get multiple people from different teams in a room (sometimes those teams may not have the best relationship) and figure things out.
Implementing SLOs. Does the team understand why they’re doing it? Are they considering the right customer scenarios? Do they know the implications of the SLO being breached (are they getting paged? is an executive looking at the report?)? These types of communications are an art more than a science.
In all of these scenarios (and more), to be truly great at being an SRE you will constantly need to understand how to work with people in the organization, how to set expectations, and how to move the needle on people’s understanding of reliability.
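For the SLO conversation in particular, it can help to turn the target into a number the team can feel. Here is a minimal sketch, assuming a simple time-based availability SLO over a rolling window; the function names and the 30-day default are my own illustration, not anything a specific team or tool prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO allows over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed format).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means breached)."""
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes cannot be negative")
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical team: 99.9% target, 20 minutes of downtime so far this window.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 20.0), 2))
```

A 99.9% target over 30 days works out to roughly 43 minutes of allowed downtime. "We have about 43 minutes a month, and that one incident ate half of it" tends to land better with a feature team than "we need three nines."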
It is extremely likely that not everyone on an SRE team will be an evangelist, or that they will have varying levels of evangelism skill. Even for the best evangelist, of course, there is always the opportunity to improve.
What that means for SREs (and for their managers) is training through practice: giving people opportunities to work on their communication skills in a safe space where it is okay to fail. Whether you’re a manager looking for ways to help your team level up or an SRE looking to improve, I recommend:
Getting opportunities to present to groups (technical presentations or lightning talks are great for this). Make sure to get feedback.
Doing softer work with “safe” teams. There are going to be teams within your organization that you have a higher degree of trust with. Work with the “safe” team on a softer project (less technical, more communication) and make sure to get feedback.
Attending meetings with executives. Even if you are just there to observe, watching a few meetings with execs can be extremely enlightening in terms of expectations and how those meetings go. Bonus points if you go to the prep meeting beforehand (there almost always is one).
Pairing with good evangelists. Maybe this one is a bit more obvious, but pairing two SREs and letting the evangelist lead by example is another good way to pick up habits and techniques.
Ultimately, you may find there is a ceiling on evangelism. In my experience this is due to… a desire not to be an evangelist. People who are not evangelists can definitely still be very successful SREs, but I think their success will always be capped in this role because of its unique requirements.