all posts
Reliability 6 min read

On-call you can sleep through: our SLO playbook

The best on-call shift is the one where nothing pages you, because the things that would have paged you were caught by a budget, not a human.

Alert on symptoms, not causes

We page on user-visible symptoms tied to an SLO — error rate, latency, freshness. We do not page on CPU at 80%. A saturated CPU that is still meeting its SLO is not an incident; it’s a Tuesday.

  • Every alert maps to an SLO and a runbook.
  • If an alert can’t be acted on, it’s deleted, not muted.
  • Error budgets, not vibes, decide when we slow down to fix reliability.