On Call Runbooks: There’s a Better Way

On call runbooks are a subset of a team’s runbooks used to assist on call responders. I’ve recently had conversations with a few folks about on call runbooks and I thought my view warranted a blog post.

In this post, I’ll describe on call runbooks, their pros and cons, and better places to invest developer time.

My views in this post assume two-pizza software engineering teams that are on call for their own code.

Description

An on call runbook usually follows the Issue->Problem->Resolution->Validation pattern.

Examples of On Call runbook sections:

  • Alarm: HTTP 5xx % exceeded threshold, here are the common problems/resolutions, here’s a way to assert success
  • Alarm: Disk Usage exceeded threshold, here are the common problems/resolutions, here’s a way to assert success
  • Alarm: Queue processing rate is too slow, here are the common problems/resolutions, here’s a way to assert success

The primary goal of an on call runbook is to provide a responder a way to resolve an alarm.
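To make the Issue->Problem->Resolution->Validation pattern concrete, here is a minimal sketch of one runbook entry as data. The class names, field names, and the log rotation example are my own illustrative assumptions, not a standard runbook format.

```python
from dataclasses import dataclass, field


@dataclass
class KnownProblem:
    """One known cause of an alarm, its fix, and a way to confirm the fix."""
    problem: str
    resolution: str
    validation: str


@dataclass
class RunbookEntry:
    """A runbook section keyed by the alarm (the issue) it responds to."""
    issue: str
    known_problems: list[KnownProblem] = field(default_factory=list)

    def render(self) -> str:
        # Lay the entry out in Issue -> Problem -> Resolution -> Validation order.
        lines = [f"Alarm: {self.issue}"]
        for number, known in enumerate(self.known_problems, start=1):
            lines.append(f"  {number}. Problem: {known.problem}")
            lines.append(f"     Resolution: {known.resolution}")
            lines.append(f"     Validation: {known.validation}")
        return "\n".join(lines)


# Hypothetical entry; the problem and fix are made up for illustration.
entry = RunbookEntry(
    issue="Disk Usage exceeded threshold",
    known_problems=[
        KnownProblem(
            problem="Old log files were never rotated",
            resolution="Rotate and compress the logs",
            validation="Disk usage drops back under the threshold",
        )
    ],
)
print(entry.render())
```

Note that every field here is free text written against the system as it looked on the day the entry was created.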

Pros/Cons of On Call Runbooks

Pros:

  • Less time to ramp up a new hire to on call.
  • Standard process for responding to “common alarms”.
  • Responders document alarm resolution.
  • Can document workarounds for dependencies outside of the team’s control (e.g., cloud provider, datastores, other teams, etc.)

Cons:

  • Good runbooks require upkeep as the system changes. It is hard to maintain a good runbook because of difficulties detecting when a change to a system will need a corresponding runbook update.
  • On call runbook entries that don’t get updates become harmful to responders (especially new hires), the runbook, and the system itself. At best, a stale entry still resolves the issue. At worst, it does nothing or actually harms the system, which lowers responder confidence in the entry and in runbooks as a whole.
  • After its initial creation, any subsequent execution of a runbook entry is toil.
  • Runbooks only cover known-known failure modes. This is not comprehensive. There is no runbook for known-unknown and unknown-unknown failure modes.

🚨On Alarms🚨

Let’s not forget an important idea: Alarms should be rare and under exceptional circumstances. If there are common scenarios when an alarm fires, either the alarming is too sensitive, the service alarms on the wrong signals, or the service is wildly deficient and not meeting expectations. All of these are bad.
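One rough way to check whether alarms are actually rare is to count how often each one fired over a rotation. The sketch below assumes you can export fired alarm names as a plain list; the alarm names and the `max_fires` cutoff are hypothetical and would need tuning per team.

```python
from collections import Counter


def find_noisy_alarms(fired_alarm_names, max_fires):
    """Return (name, count) pairs for alarms that fired more than max_fires times."""
    counts = Counter(fired_alarm_names)
    # most_common() orders by count, so the noisiest alarm comes first.
    return [(name, count) for name, count in counts.most_common() if count > max_fires]


# Hypothetical page history for one on call rotation.
history = [
    "http-5xx", "disk-usage", "http-5xx", "queue-lag",
    "http-5xx", "disk-usage", "http-5xx",
]
print(find_noisy_alarms(history, max_fires=2))
```

Anything this returns is a candidate for retuning the alarm or fixing the service, rather than for a new runbook entry.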

An on call runbook concedes that some alarms are not exceptional. That is bad. I’ll echo Charity’s tweet here:

Even with runbooks, exceptional alarms happen. Exceptional alarms have no runbook entry. A responder has to determine the problem and its resolution.

Better Investments: Observability and CI/CD

Since runbooks don’t cover all failure modes, responders already must be able to diagnose/mitigate/resolve issues. This is a sunk cost.

Since responders have to diagnose/mitigate/resolve issues, they need to be able to make assertions about the system’s behavior. They also need to be able to change the system’s state. This need is compounded when the data needed to mitigate/resolve is not already available and it has to be created in situ (via patch/deployment).

Making assertions about a system’s behavior maps directly to observability: namely tracing, logs, and metrics.
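As one small example on the logging side, structured log events let a responder filter and aggregate instead of grepping free text. This is a minimal sketch using only Python's standard `logging` and `json` modules; the logger name, the `context` key, and the request fields are assumptions made up for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # "context" is a hypothetical key passed through logging's `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger(stream)
log.info(
    "request finished",
    extra={"context": {"route": "/checkout", "status": 503, "duration_ms": 912}},
)
print(stream.getvalue().strip())
```

With events shaped like this, a question such as "which routes are returning 503s, and how slow are they?" becomes a query rather than a guess.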

Changing a system’s running state maps directly to CI/CD: quickly testing and deploying application changes. Frequently, exercising an application rollback mitigates a production issue!
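Here is a hedged sketch of what an automated rollback check in a deployment pipeline might look like. The `deploy`, `roll_back`, and `measure_error_rate` callables are hypothetical stand-ins for whatever your pipeline provides, and the `tolerance` default is an arbitrary assumption.

```python
def should_roll_back(baseline_error_rate, current_error_rate, tolerance=0.01):
    """Roll back when the error rate rose by more than the tolerance."""
    return current_error_rate > baseline_error_rate + tolerance


def deploy_with_rollback(deploy, roll_back, measure_error_rate, tolerance=0.01):
    """Deploy, compare error rates before and after, and roll back on regression."""
    baseline = measure_error_rate()
    deploy()
    current = measure_error_rate()
    if should_roll_back(baseline, current, tolerance):
        roll_back()
        return "rolled back"
    return "deployed"


# Stand-ins for a real pipeline so the sketch runs on its own.
state = {"version": "v1"}
error_rates = {"v1": 0.002, "v2": 0.08}  # made-up numbers

result = deploy_with_rollback(
    deploy=lambda: state.update(version="v2"),
    roll_back=lambda: state.update(version="v1"),
    measure_error_rate=lambda: error_rates[state["version"]],
)
print(result, state["version"])
```

A real pipeline would wait for traffic before measuring and would look at more than one signal, but the shape is the same: the mitigation is a well-exercised code path, not a page of instructions.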

Summary

Invest in observability and CI/CD over on call runbooks.