Managing Risk as an SRE
Last time I talked about evangelism and why it is an important additional skill for an SRE, beyond what an infrastructure or software engineer typically needs.
Evangelism isn’t the only skill an SRE needs that a typical engineer may not have. In fact, risk management is even more essential to success as an SRE than evangelism. While weak evangelism skills can cap your advancement as an SRE, risk management ability is a hard requirement.
What is Risk Management in the SRE world?
Chapter 3 of the Google SRE book is actually called “Embracing Risk”, so it may come as no surprise that this is a skill worth highlighting for the job. That chapter describes only one type of risk management, though, and I’ve identified a number of other areas while working in SRE:
Understanding Risk of Failure. This is one of the most obvious categories of risk management (called out in the Google book) and relates particularly to SLO implementation and the math behind understanding risk tolerance of a service. That chapter goes into a lot more detail than I will in this article.
Real Time Risk Management. Another obvious risk management situation for SREs is the high pressure of an incident. An SRE should help design the decision-making process for incidents, which means understanding the risk trade-offs. They may also play a pivotal role in deciding which trade-offs to make during an incident, and will often have to help make the hard calls: Security or availability? Investigate or mitigate? Break some customers to fix a feature for the majority?
Balancing Risk Investment. SREs also need to be able to build risk registers (or their equivalent) to understand which technical investments will most improve availability. As advisors on availability, they will inevitably be asked for their opinion on what to do next. They should be able to provide sound judgement and use data to help decide which investments to make.
Reputation Risk. One major job of SRE is to bring visibility to availability, often through numbers and reports. Of course, the first question you’ll get from teams and execs in these cases is, “Are these numbers right? How right?” Being an SRE requires attention to detail and an understanding of the implications of what we publish. An SRE needs to understand the risks associated with publishing this type of data and how to communicate its potential flaws.
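To make the math behind Understanding Risk of Failure a little more concrete, here is a minimal sketch of an error budget calculation. It assumes a simple request-based availability SLO; the names (BudgetStatus, error_budget, allowed_downtime_minutes) are hypothetical and are not taken from the Google book or from any tooling at GitHub.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float    # failed requests the SLO tolerates in the window
    remaining_failures: float  # budget left (negative once the SLO is breached)
    consumed_fraction: float   # 1.0 means the budget is fully spent


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage an availability SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> BudgetStatus:
    """Compare observed failures with what the SLO allows (illustrative only)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo_target) * total_requests
    return BudgetStatus(
        allowed_failures=allowed,
        remaining_failures=allowed - failed_requests,
        consumed_fraction=failed_requests / allowed,
    )


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO, one million requests, 250 failures.
    print(allowed_downtime_minutes(0.999))
    print(error_budget(0.999, 1_000_000, 250))
```

With these assumptions, a 99.9% target over 30 days tolerates roughly 43 minutes of full outage, and 250 failures out of a million requests uses about a quarter of the budget. Real SLOs are usually more nuanced than this (multiple indicators, rolling windows), so treat this as a starting point for discussion rather than a finished model.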
Teachable Risk Management
As with evangelism, much of risk management is a learned skill. With more junior engineers you will need to invest in order to see them grow these skills. Here are some strategies to teach the above skills:
Understanding Risk of Failure - This is probably the easiest skill to teach. Simply operating a service for a period of time, or participating in an SLO workshop with another team who understands the service, can teach a more junior engineer a lot about service operation and the risks associated with it.
Real Time Risk Management - Experience is the best teacher. Your SREs need to be on call and put into high-pressure situations. At GitHub we run a number of game days with fake incidents and user personas to simulate incident experiences for new engineers. This helps them gain experience in a low-stakes environment.
Balancing Risk Investment - This is a much harder skill to teach and usually comes with experience and time. In order to make strategic decisions, you need to have been exposed to those types of trade-offs before. One thing we have tried at GitHub is having SREs give weekly presentations on risks across GitHub and creating a risk register based on those presentations. This gives every SRE hands-on experience evaluating a risk they may be less familiar with and lets them watch their colleagues go through the same process.
Reputation Risk - The main teacher in this area is team culture. This means making sure the team itself reviews the things it’s putting out and that everyone knows an eye for detail is expected, particularly when it comes to the statistics and data being presented. Whenever a new measurement mechanism is introduced, the team should question it and discuss its pros and cons. One way to think about this is using the concept of completed staff work.
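For the risk register mentioned under Balancing Risk Investment, a small sketch may help show what “using data to decide” can look like. This is not how the register at GitHub is structured; the 1 to 5 likelihood and impact scales, the likelihood-times-impact score, and the names Risk and rank_risks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    likelihood: int               # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int                   # assumed scale: 1 (minor) to 5 (severe)
    mitigation_cost_weeks: float  # rough engineering estimate

    def __post_init__(self) -> None:
        # Reject values outside the assumed 1-5 scales.
        for field_name in ("likelihood", "impact"):
            value = getattr(self, field_name)
            if not 1 <= value <= 5:
                raise ValueError(f"{field_name} must be between 1 and 5")

    @property
    def score(self) -> int:
        """Simple likelihood x impact score (one common convention)."""
        return self.likelihood * self.impact


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Highest score first; ties go to the cheaper mitigation."""
    return sorted(risks, key=lambda r: (-r.score, r.mitigation_cost_weeks))


if __name__ == "__main__":
    # Entirely made-up entries, for illustration only.
    register = [
        Risk("Single-region database", likelihood=2, impact=5,
             mitigation_cost_weeks=12),
        Risk("Expiring TLS certificates", likelihood=4, impact=4,
             mitigation_cost_weeks=1),
        Risk("Untested backup restore", likelihood=3, impact=5,
             mitigation_cost_weeks=3),
    ]
    for risk in rank_risks(register):
        print(risk.score, risk.name)
```

The scoring is deliberately crude. Its value is less in the number itself and more in forcing each presenter to state a likelihood, an impact, and a cost that colleagues can then challenge.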
There are real consequences to the decisions we make in the SRE realm. This is one of the things that makes the job attractive to those who enjoy making a larger impact. Because of that, there is not only a lot of responsibility in what we do but also a need to exercise good judgement when it comes to risk.