If there is one thing I have heard about more than anything else in the past two years of working on "availability all the time", it's the struggle to get a company or a team to prioritize a specific availability investment.
These are some common refrains you may hear from engineers on your teams:
"We always focus on shipping new features and we don't have time to address debt."
"I have been asking for us to do this for X amount of time and we never do it even though it's clear we should."
Understanding the benefit of an availability investment, and then communicating it, is not an easy thing to do. It can be relatively easy to suggest a new feature because a competitor has it, or because it may sell a particular large customer. Justifying an availability investment (especially a complicated engineering one) can be a lot harder.
Here is my particular playbook for understanding and then selling new availability investments:
1. Gather concrete metrics on the problem.
2. Measure the benefits of fixing the problem.
3. Propose a solution with alternatives.
4. Present a conclusion supported by data.
5. Review your solution with a trusted group.
6. Take your proposal on a road show.
Over the past few years, having and following this type of playbook has helped me advocate for availability investments.
Without further ado, let’s look at these steps in more detail.
1. Gather concrete metrics on the problem.
Without concrete numbers related to availability, justifying an investment is going to devolve into a "probably/maybe" conversation or, worse, a holy war about productivity, which will only further entrench people in the beliefs they held before the discussion.
If you really want to convince people to invest, you need cold hard numbers and a discussion about facts to prove your point. Here are some things you can measure:
What is the overall customer impact (SLOs)?
Overall customer impact should be something you can easily get numbers on, whether that means measuring the impact on your SLOs or, if you don't have any, counting raw statistics for a given error or failure mode.
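To make "impact on your SLOs" concrete, here is a minimal sketch of the arithmetic, assuming a request-based SLO. The function names, the request counts, and the 99.9% target are all hypothetical; plug in whatever your own monitoring reports.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_consumed(
    total_requests: int, failed_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget used; values above 1.0 mean it is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = total_requests * (1.0 - slo_target)
    return failed_requests / allowed_failures


# Hypothetical numbers for one failure mode over a 30-day window.
total = 1_000_000
failed = 1_500
slo_target = 0.999  # assumed 99.9% availability SLO

print(f"availability: {availability(total, failed):.2%}")
print(f"error budget consumed: {error_budget_consumed(total, failed, slo_target):.0%}")
```

A statement like "this one failure mode overspends our monthly error budget by half" tends to land better than a raw error count.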
Which large enterprise customers experience the problem?
While raw numbers are important, which customers experience the impact is an often overlooked part of this equation. There is a reason why large enterprise customers who complain get a lot of attention -- they are the ones who make money for the business. Having a few strategic enterprise customers who are impacted, even if the total impact is small, is a good reason to make an investment.
How many customer support tickets does this generate?
If you have a hard time quantifying impact with metrics, the other place you should be able to go is your customer support organization. Statistics on support ticket volume for a particular type of issue can yield a lot of valuable data, even if you don't have good production measurements.
How does this impact our ability to respond to incidents (or how many incidents has it generated)?
It's tricky to measure this, but if you are tracking incidents with postmortems and follow-ups, you should be able to look at how many were related to the issue you are interested in. Sometimes the incident may not be the same as the issue, but the type of failure may be related and should be "counted" as similar.
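As a rough sketch of that incident count, assume your postmortems can be exported as records tagged with failure modes. The `Incident` type, the tags, and the sample data below are made up for illustration; the point is only that "related" incidents are the ones sharing a failure-mode tag, not necessarily the ones with the same title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    tags: frozenset[str]
    minutes_to_mitigate: int


def related_incidents(incidents, failure_tags):
    """Incidents that share at least one tag with the failure modes of interest."""
    return [i for i in incidents if i.tags & failure_tags]


# Hypothetical postmortem export.
incidents = [
    Incident("Checkout errors after deploy", frozenset({"database", "connection-pool"}), 95),
    Incident("Search latency spike", frozenset({"cache"}), 40),
    Incident("API timeouts in one region", frozenset({"connection-pool"}), 120),
    Incident("Expired certificate", frozenset({"tls"}), 30),
]

related = related_incidents(incidents, frozenset({"connection-pool"}))
share = len(related) / len(incidents)
total_minutes = sum(i.minutes_to_mitigate for i in related)

print(
    f"{len(related)} of {len(incidents)} incidents ({share:.0%}) "
    f"and {total_minutes} minutes of mitigation relate to this issue"
)
```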
2. Measure the benefits of fixing the problem.
Hopefully by now you've got some data on the impact of your problem. Now it's time to spell out what the benefit of fixing it will be. You need to answer the question "If we fix this problem what will we gain?"
How will this improve customer experience?
Sometimes this will be an easy 1:1 answer where you can plug in customer impact numbers or call out some large customers. Other times this might be related to how many hours your engineering teams are spending on this issue and could spend on other things if this problem went away.
How will this improve the internal developer experience or save on support hours?
A truly painful availability issue may not have any direct customer impact, but it may still have a ton of impact on the engineering team. An alert that pages your developers after hours because exceptions are too noisy in production might be an example of this. Needing observability improvements because your time to mitigate incidents is long is another example of a way to quantify impact based on developer experience.
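The after-hours paging example can be quantified with very little data. This is a minimal sketch, assuming you can export page timestamps in the on-call engineer's local time; the business hours, the 30 minutes of disruption per page, and the sample timestamps are all assumptions you should replace with your own.

```python
from datetime import datetime

BUSINESS_START_HOUR = 9        # assumed start of the working day
BUSINESS_END_HOUR = 18         # assumed end of the working day
ASSUMED_MINUTES_PER_PAGE = 30  # assumed disruption per page, not a measured value


def is_after_hours(ts: datetime) -> bool:
    """True for weekends and for weekday times outside business hours."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    outside_hours = not (BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR)
    return is_weekend or outside_hours


# Hypothetical pages caused by the noisy alert.
pages = [
    datetime(2024, 3, 4, 2, 15),    # Monday, middle of the night
    datetime(2024, 3, 4, 14, 0),    # Monday, during business hours
    datetime(2024, 3, 9, 11, 30),   # Saturday
    datetime(2024, 3, 12, 23, 45),  # Tuesday, late evening
]

after_hours_count = sum(is_after_hours(ts) for ts in pages)
estimated_hours = after_hours_count * ASSUMED_MINUTES_PER_PAGE / 60

print(f"{after_hours_count} after-hours pages, roughly {estimated_hours:.1f} engineer-hours")
```

Even a crude estimate like this turns "the alert is annoying" into a number you can put next to the cost of fixing it.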
3. Propose a solution with alternatives.
You've gathered some data and now you have an idea of the benefits of a solution. Still, this will not be enough to justify an investment. Now you need to understand your proposed solution, along with any alternatives.
Some of your analysis during this phase needs to answer questions like:
What if we live with this problem for a while?
Are there cheaper (or more expensive) fixes for this?
You may even find during this phase that the solution you thought was a good one isn't that impactful. Maybe it's not the right time to invest.
Make sure you aren't approaching the problem with a solution already in mind. I often see engineers come with a solution and then build the problem around it. This can often lead to rabbit holing on one solution when many alternatives might be "fine" but not "perfect".
4. Present a conclusion supported by data.
After you have gathered data, shown the benefit, and settled on a solution with a list of alternatives, you are ready for a conclusion.
If you are writing this all up in a document (and you should be), put a single paragraph with the conclusion at the very top. Maybe add some bullet points with your data. I can’t stress enough that even the smartest people don’t read things.
People will skim a document unless the first 3 or 4 lines have a good "hook", so put the hook first.
5. Review your solution with a trusted group.
It was a lot of work to compile all this data and consider alternatives. You might be tempted to blast this solution out into the world, or immediately take it to your review committee or planning session. Avoid the temptation! Make sure you get at least a few outside opinions from people you trust.
You only get one chance to make a first impression, so spend an extra day or two getting critical feedback before you take this idea on the road. A confusing presentation, or one that doesn't have enough data, is enough to sink an idea at a company for months or even years.
6. Take your proposal on a road show.
Often, it will not be enough to bring your idea to your direct manager or organization. Many availability projects I have seen take multiple teams to accomplish or are larger efforts. If this is the case for your idea, you will want to seed it in as many places as possible.
This step can take a lot of forms but here are some ideas for where you can present your proposal:
"Local" presentations to a few teams who are interested.
Company lightning talks or office hours.
Architect or planning meetings.
Send your proposal to influential people in the area and schedule 1:1s with them to discuss it.
The most important part of this step is to get people talking about the idea and, ideally, to have someone else championing it (it helps if that person has a track record of success).
Good ideas will snowball, and you will find many paths cleared before you even reach them. If you don’t find this happening, then you should re-examine your communication or your idea.
Following a Playbook
Even at companies with established processes for planning and communication, I have found that following the above playbook, and sometimes going outside “normal” channels, is the most effective way to generate enthusiasm (and ultimately investment) in an idea.
The above playbook has served me pretty well over the last few years and I would recommend adopting something similar, especially if you are a more senior engineer.
With availability investments in particular, how much data you bring to the table and how you show the benefit of the investment can make or break a project.