When I first read the Google SRE book, I assumed it was a template for everyone. As an engineer, I thought that the way people measured and calculated availability for similar technologies would be the same: create an availability template for a web application hosted in Kubernetes and stamp it everywhere…
Now that I have seen the perspective of a bunch of different companies in the wild, I have come to a different conclusion. Companies focus their availability efforts on the things that are most important to their business model (whether they do it consciously or not).
Understanding what archetype a company falls into helps me understand what availability investments they are willing to make and why. Below are some broad categories I have come up with. Assumptions about strengths and weaknesses are based on my own past experiences.
Maybe some of these archetypes will help you as well.
Singular Focus Availability - Do one thing extremely well.
Stripe or Okta come to mind. They have very specific customer paths that are extraordinarily important to their business. For Stripe, the model is focused around their payments API and they advertise rigorous availability in that specific scenario. Okta is an identity company. If their authentication doesn’t work then their core business model doesn’t work.
Companies like this tend to focus intensely on a few key metrics. They invest significant effort in those metrics, and even a tiny change in them is a big deal. They get great at detecting change quickly and responding.
One thing they may have less experience with is running a portfolio of products that all need to be tracked for availability (many SLOs, many customer journeys, etc).
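To make "even a tiny change is a big deal" concrete, here is a rough sketch of the downtime arithmetic behind a single availability target. The targets and the 30-day window are assumptions picked for illustration, not numbers published by any company mentioned here.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of full outage an availability target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Hypothetical targets, chosen only to show how quickly the budget shrinks.
for target in (0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days: {minutes:.1f} minutes of downtime")
```

Going from three nines to four nines cuts the budget from roughly 43 minutes to roughly 4 minutes a month, which is why a small dip in one metric gets so much attention at these companies.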
Commerce Focused Availability - Measure the funnel that makes money.
Amazon (the website, not AWS), Walmart or Expedia are some examples. Companies that rely heavily on ecommerce for their business model fall into a different bucket.
For these companies there isn’t necessarily a singular experience that matters. The most important thing for them is to measure the funnel of customers actually buying things.
If a change goes out that impacts this funnel, it’s time to hit the panic button and revert it. Making changes outside this funnel is less critical to availability. Companies like this make it easy to justify availability investments in the funnel because uptime directly equates to dollars.
These companies may have less rigorous needs when it comes to the rest of their technology, which might make it harder to fund availability investments in other areas.
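As a minimal sketch of what measuring the funnel could look like, the snippet below compares per-step conversion before and after a change and flags steps that dropped. The step names, the counts, and the 5% threshold are all hypothetical, and a real system would also have to account for traffic mix and normal variance.

```python
from dataclasses import dataclass


@dataclass
class FunnelStep:
    name: str
    entered: int    # sessions that reached this step
    completed: int  # sessions that moved on to the next step


def conversion_rates(steps):
    """Return {step name: completed / entered} for each funnel step."""
    return {s.name: (s.completed / s.entered if s.entered else 0.0) for s in steps}


def regressed_steps(baseline, current, max_relative_drop=0.05):
    """List steps whose conversion fell more than max_relative_drop versus baseline."""
    base = conversion_rates(baseline)
    now = conversion_rates(current)
    flagged = []
    for name, base_rate in base.items():
        if base_rate == 0 or name not in now:
            continue  # nothing meaningful to compare against
        drop = (base_rate - now[name]) / base_rate
        if drop > max_relative_drop:
            flagged.append(name)
    return flagged


# Hypothetical numbers: the same funnel before and after a deploy.
before = [
    FunnelStep("search", 10_000, 4_000),
    FunnelStep("cart", 4_000, 2_000),
    FunnelStep("checkout", 2_000, 1_600),
]
after = [
    FunnelStep("search", 10_000, 3_950),
    FunnelStep("cart", 4_000, 1_980),
    FunnelStep("checkout", 2_000, 1_200),
]

print(regressed_steps(before, after))
```

With these made-up numbers only the checkout step is flagged, which is the kind of signal that would justify reverting a change.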
Infrastructure Focused Availability - Meet the needs of large enterprises.
Companies like AWS, Microsoft (Azure) and GitHub (source control) provide a platform to paying customers, many of which are large enterprises. Any time a major feature of that platform stops working, it can directly impact the business of a customer that is paying a lot of money. Going even further, many of these customers require audit and regulatory compliance.
It isn’t enough to measure the funnel or a specific scenario at these companies. More effort needs to go into measurement across the board because there isn’t a single scenario to focus on.
One nice thing about these companies, at least in Big Tech, is that there are more resources to spend on focused teams (like SRE) and this can help seed large scale availability programs that move the needle.
On the other hand, it is often harder to build up a focused availability effort around a single area in these types of companies because of the breadth of scenarios that need to be measured, the number of engineering teams involved, and the different ways those scenarios need to be measured.
Startup Availability - Only measure what you need.
When you’re working at a startup, your number one goal is to make money and grow by building things quickly, even if this means incurring debt down the road. Often that debt comes in the form of having rudimentary alerts and measures of customer experiences and accepting that things may break sometimes.
I think this is a totally fine place to be, since strict measurements of availability are not likely to be necessary until you transition to a different archetype. Things like SLOs are great to have, but making sure you know if things are “up” or “down” is more important than following industry best practices.
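For a sense of how little is needed at this stage, here is a sketch of an "up or down" signal built from nothing but a list of health check results. The probe history and the three-failures-in-a-row rule are assumptions for illustration. How the probes are collected is left out on purpose.

```python
def availability(probes):
    """Fraction of successful probes, or None if there is no data yet."""
    return sum(probes) / len(probes) if probes else None


def is_down(probes, consecutive_failures=3):
    """Report "down" only after several failures in a row, to ignore single blips."""
    tail = probes[-consecutive_failures:]
    return len(tail) == consecutive_failures and not any(tail)


# Hypothetical probe history: True means the health check succeeded.
history = [True] * 97 + [False] * 3

print("down" if is_down(history) else "up")
print(f"availability: {availability(history):.2%}")
```

This is nowhere near an SLO program, but it answers the question that matters most early on: is the product working right now?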
Understanding Your Archetype
Understanding which archetype your company falls into might help you with two things:
Applying availability literature to the scenarios the company is willing to invest in. If you don’t work for an infrastructure provider it may be unlikely that you will succeed in implementing the SRE book for everyone, but you may have a lot of success in applying those techniques to your funnel.
Understanding a transition in archetypes. Perhaps the most common transition is from startup to another archetype, but you may also see a transition from a single product to a portfolio of products. Realizing that the transition is happening gives you a good framework for talking to others about why it may be necessary to change the culture around availability in a new way.
These are broad categories so they may not apply in every scenario, or perhaps I’m missing an archetype. If so, I would love to hear from you.