Incident Response

Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events.

– John Allspaw¹

Preventing future incidents requires learning from past ones. However, simply following a template and a process will not magically produce that learning. As John Allspaw notes¹, we must examine why we are doing incident response in the first place. This raises an interesting question: at what scale do the costs of an incident justify engineers spending large amounts of time learning from it? At FAANG scale it certainly makes sense. But if an hour-long outage costs you only a few thousand dollars in lost revenue, is it worth many thousands of dollars of engineering time for a group to meet and study an incident that is unlikely to occur again (assuming the root cause is fixed)?
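The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function names and every number in it (headcount, hours, hourly rate, recurrence probability) are assumptions chosen for the example, not figures from any real incident.

```python
def review_cost(engineers, hours, hourly_rate):
    """Cost of the time engineers spend preparing for and attending a review."""
    return engineers * hours * hourly_rate


def expected_recurrence_loss(outage_cost, recurrence_probability):
    """Expected loss from the same incident happening once more."""
    return outage_cost * recurrence_probability


def review_pays_off(outage_cost, recurrence_probability,
                    engineers, hours, hourly_rate):
    """True if the avoided expected loss exceeds the cost of the review.

    Deliberately narrow: it only counts a repeat of the same outage and
    ignores anything the team learns that applies elsewhere.
    """
    avoided = expected_recurrence_loss(outage_cost, recurrence_probability)
    return avoided > review_cost(engineers, hours, hourly_rate)


if __name__ == "__main__":
    # Hypothetical numbers: a $3,000 outage, a 10% chance of recurrence,
    # and six engineers spending eight hours each at $100 per hour.
    print(review_cost(6, 8, 100))
    print(review_pays_off(3000, 0.10, 6, 8, 100))
```

Under these assumed numbers the review costs more than the single outage it might prevent, which is exactly the situation the question above describes. The model counts only a repeat of the same outage, so it says nothing about the value of learning that carries over to other incidents.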

The Five Whys

The idea is to keep asking why until you reach the root cause of the problem. A cited example²:

The car didn’t start… because the battery is dead… because the alternator wasn’t charging it… because the alternator belt broke… because the belt was beyond its useful life but wasn’t replaced… because it wasn’t maintained according to recommended schedule.
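The chain in this example can be modelled as a lookup from each problem to its immediate cause, followed until no deeper cause is recorded. This is a minimal sketch: `CAUSES` and `ask_why` are hypothetical names, and in a real investigation each cause would come from evidence rather than a predefined table.

```python
# Each key is a problem; its value is the answer to "why?".
CAUSES = {
    "the car didn't start": "the battery is dead",
    "the battery is dead": "the alternator wasn't charging it",
    "the alternator wasn't charging it": "the alternator belt broke",
    "the alternator belt broke":
        "the belt was beyond its useful life but wasn't replaced",
    "the belt was beyond its useful life but wasn't replaced":
        "it wasn't maintained according to the recommended schedule",
}


def ask_why(problem, causes, max_whys=5):
    """Follow the cause chain until it ends or max_whys questions are asked."""
    chain = [problem]
    while len(chain) - 1 < max_whys and chain[-1] in causes:
        chain.append(causes[chain[-1]])
    return chain


if __name__ == "__main__":
    for step in ask_why("the car didn't start", CAUSES):
        print(step)
```

The last entry in the returned chain is the deepest cause reached, here the missed maintenance schedule. The `max_whys` limit only caps how many times the question is asked; five is the conventional number, not a guarantee that the root cause has been found.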


References

1. Allspaw, J. 📌 Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events. Right now, in most companies, this ROI is left sitting in the dark because of the “template-driven” approaches and “action item” myopia. Tweet by @allspaw at https://twitter.com/allspaw/status/1051252775311613952 (2018).
2. Kroll, R. More than five whys and “layer eight” problems. at https://rachelbythebay.com/w/2023/02/13/broken/ (2023).
