Incident Response
Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events.
– John Allspaw¹
Learning from incidents is how we prevent them in the future. However, simply following a template and a process will not magically produce that learning. As John Allspaw notes,¹ we must examine why we are doing incident response in the first place. This raises an interesting question: when do the costs of an incident justify engineers spending large amounts of time learning from it, and at what scale? At FAANG scale it certainly makes sense. But if an hour-long outage costs you only a few thousand dollars in lost revenue, is it worth many thousands of dollars for a group of engineers to meet and try to learn from an incident that is unlikely to recur (assuming the root cause identified in the RCA is fixed)?
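The trade-off above can be made concrete with some rough arithmetic. The following is only a sketch under stated assumptions: the function names, the hourly rate, the head count, and the recurrence probability are all hypothetical placeholders, not figures from any real incident. It also deliberately uses the narrow framing from the question above, counting only the expected loss from the same incident recurring and ignoring any broader learning a review might produce.

```python
# Hypothetical break-even sketch: every number below is an illustrative
# assumption, not data from a real organization.

def review_cost(engineers: int, hours: float, hourly_rate: float) -> float:
    """Cost of holding a post-incident review, in dollars."""
    return engineers * hours * hourly_rate


def expected_avoided_loss(outage_cost: float, recurrence_probability: float) -> float:
    """Expected loss avoided if the review prevents a repeat of the incident."""
    return outage_cost * recurrence_probability


def review_pays_off(outage_cost: float, recurrence_probability: float, cost: float) -> bool:
    """True when the expected avoided loss exceeds the cost of the review."""
    return expected_avoided_loss(outage_cost, recurrence_probability) > cost


# Assumed review: 6 engineers, 4 hours each, $100 per hour.
cost = review_cost(engineers=6, hours=4, hourly_rate=100)

# Small shop: assumed $3,000 outage, 10% chance of recurrence.
print(review_pays_off(3_000, 0.1, cost))

# Large shop: assumed $500,000 outage, same 10% chance of recurrence.
print(review_pays_off(500_000, 0.1, cost))
```

With these assumed inputs the review does not pay for itself in the small case and does in the large one, which is the intuition behind asking "at what scale?". The result is only as good as the recurrence estimate, which is usually the hardest number to defend.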