What playing Magic: the Gathering taught me about incidents.
In this post I hope to expose the true depth of my nerd-dom to my readers (and maybe provide some incident response thoughts at the same time).
Here is a collection of things I learned after getting back into Magic: the Gathering over the past 10 years or so. They apply to both the MTG scene and your friendly neighborhood incident response process.
Doing it live is the scariest part, so practice in advance.
MTG has these in-store events for every new set release called prereleases. You get to go into the store and play a tournament against other players with the new cards before they are officially released.
The thing is… you have to build a deck live, within a certain amount of time, and play in a high pressure environment against people you don’t know. The first time I played in a prerelease I was terrible and surprisingly nervous (this is just a game, right?).
It turns out that participating in a prerelease has much the same feel as a Zoom call for a major incident:
You only go once in a while, which means you don’t get a lot of practice before the major event.
“Doing it live” in front of people you don’t know opens you up to all kinds of embarrassing mistakes that they can see and point out.
If you aren’t very good at building a deck you’ll run out of time during the deck building phase and not have a very good deck.
The same things that make for a successful prerelease also make for a successful incident:
Practice in advance. A few times. Study the cards before you go so you don’t have to read them all as you unwrap them. Know how many lands and creatures you need before you walk in the door. Practicing for an incident means knowing how the environment works and how to use the observability tooling. Understand how to escalate and who your dependencies are before you go on-call. Going in cold and trying to follow a playbook will mean a longer mitigation time.
A supportive environment helps take the pressure off. Some players are competitive, but others notice your relative lack of experience and help you out. You generally play better when the person across the table from you isn’t critical and putting pressure on you every turn. Similarly, incidents are high stakes but keeping our cool and making sure the environment is supportive (even if someone makes a mistake) makes for a better outcome and lets everyone work at their best.
There is a lot of value to be had in something others find off-putting.
One of the most famous Reddit posts of all time is about a Magic: The Gathering tournament (it’s not flattering). Magic is a kid’s game (if you ask my parents) or a neckbeard’s paradise (if you ask many people, even in the gaming scene).
Maybe it is the era in which I grew up, or just the nature of the game, but I still enjoy it — and I still love going to the prereleases — even if there do seem to be a lot of neckbeards.
Availability and incident response have a similar “ick” factor for many developers. They feel like a tax you have to pay to do the fun work. There is a small set of people (and they don’t typically seem to be neckbeards) who relish fixing the things that might blow up the site, want to dig deep into the error data, and maybe even find that they never feel more alive than during a two-hour, high-stakes incident when the site is down and the problem is unsolved.
Maybe I’m one of them.
Don’t be afraid of the data. Use it.
Did I mention Magic: the Gathering has changed a lot? There is this site called 17lands. It turns out that by tracking win rates among their user base, they can pretty quickly identify Magic cards in a new set that look awesome but are actually terrible, or cards that look terrible and actually turn out to be awesome. Even a few years ago this data just plain didn’t exist, nor did the ability to use it. Now it is everywhere, and it can be a useful tool if you’re interested in optimizing your play.
Incident response and the data around it are similar. To get the most value from the game, you need to collect and classify your incidents and repairs (even if it’s tedious work). Having that data will help you distinguish the follow-ups that look awesome but actually provide no value from the ones that seem terrible but are actually valuable.
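Here is a minimal sketch of what that classification can buy you, assuming you record a cause category and minutes of user impact for each incident. The records, the category names, and the impact_by_category helper are all made up for illustration; your own schema will look different.

```python
from collections import defaultdict

# Made-up incident records for illustration only:
# (cause category, minutes of user impact)
incidents = [
    ("bad deploy", 42),
    ("dependency outage", 95),
    ("bad deploy", 18),
    ("capacity", 30),
    ("bad deploy", 25),
]


def impact_by_category(records):
    """Total up incident count and impact minutes per category,
    sorted so the categories with the most impact come first."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0})
    for category, minutes in records:
        totals[category]["count"] += 1
        totals[category]["minutes"] += minutes
    return sorted(totals.items(), key=lambda kv: kv[1]["minutes"], reverse=True)


ranked = impact_by_category(incidents)
for category, stats in ranked:
    print(f"{category}: {stats['count']} incidents, {stats['minutes']} minutes")
```

In this toy data the most frequent category is not the one with the most impact, which is exactly the kind of thing that is hard to see without the tally.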
Additionally, in incident response, just like in MTG, you really do need to do the math to be successful. You can’t build a good 40 card deck without understanding the mathematical probability of drawing a land you’re splashing for. The same is true of building a useful SLO. There is no shortcut to understanding the probability calculation behind the SLO, or the accompanying burn rate, or the dependency math.
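To make the math a little more concrete, here is a rough sketch of three of those calculations. It assumes the usual definitions: drawing without replacement (hypergeometric) for the deck, burn rate as the observed error rate divided by the error budget (1 minus the SLO target), and serial dependencies whose availabilities multiply. The function names and the example numbers (3 splash sources, a 99.9% target) are mine, not from any particular tool.

```python
from math import comb, prod


def p_at_least_one(deck_size, copies, cards_seen):
    """Chance of seeing at least one of `copies` cards among `cards_seen`
    cards drawn without replacement from a deck of `deck_size`."""
    p_none = comb(deck_size - copies, cards_seen) / comb(deck_size, cards_seen)
    return 1 - p_none


def burn_rate(error_rate, slo_target):
    """How fast the error budget is being spent.
    1.0 means the budget lasts exactly the SLO window."""
    error_budget = 1 - slo_target
    return error_rate / error_budget


def serial_availability(availabilities):
    """Best-case availability of a service that needs every one of its
    dependencies to be up (assumes independent failures)."""
    return prod(availabilities)


# 40 card deck, 3 sources of the splash color, 7 card opening hand
splash_in_opener = p_at_least_one(40, 3, 7)

# 1.44% of requests failing against a 99.9% SLO
current_burn = burn_rate(0.0144, 0.999)

# Three hard dependencies, each at 99.9%
combined = serial_availability([0.999, 0.999, 0.999])

print(f"splash land in opening hand: {splash_in_opener:.3f}")
print(f"burn rate: {current_burn:.1f}")
print(f"combined availability: {combined:.4f}")
```

With three splash sources you see one in your opening hand a bit less than half the time, and three dependencies at 99.9% each already put the service below 99.9% before it makes a single mistake of its own.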
The game has changed since Richard Garfield invented it. Embrace that.
I remember excitedly racing to the gaming section of my local KB Toy Store (not a thing anymore) and looking for a sweet Revised Starter Deck. Things were simple back then: jam a bunch of powerful high cost flyers into a dual color deck and follow the directions on the included instruction booklet.
Fast forward to today’s MTG scene, where we have multiple game formats (Modern, Commander, Standard, etc.) along with foil etched cards and an online game. You still start with a deck, but from there you have a lot of options and tools at your disposal.
Incident response presents the same challenge these days. Long gone are the days of staying the weekend and shutting down an entire intranet so we could migrate from one version of our web portal to another, or living in blissful ignorance of 500s blasting the website until a user sends an email.
Our versions of fancy foil are the incident.io products, the SLOs and monitors in Datadog/Honeycomb, and the other new trappings of incident response and observability. They add complexity that we have to understand and implement, but they take us so much further in the incident world, and they have evolved as we’ve learned more about how to “play the game”.
Conclusions
High-stakes environments with math. Or… adrenaline sources that still involve sitting in the same chair for many hours.
I’m not sure what exactly draws me to both MTG and Availability, but I hope to see you out there (either in an incident or at your LGS).