all posts
Reliability
On-call you can sleep through: our SLO playbook
The best on-call shift is the one where nothing pages you, because the things that would have paged you were caught by a budget, not a human.
Alert on symptoms, not causes
We page on user-visible symptoms tied to an SLO — error rate, latency, freshness. We do not page on CPU at 80%. A saturated CPU that is still meeting its SLO is not an incident; it’s a Tuesday.
- Every alert maps to an SLO and a runbook.
- If an alert can’t be acted on, it’s deleted, not muted.
- Error budgets, not vibes, decide when we slow down to fix reliability.