Procedure name
How to use this template
- Copy this file to docs/runbooks/<runbook-slug>.md (kebab-case, descriptive).
- Update the frontmatter:
  - id: short uppercase identifier (e.g. RB-RESTART-WORKERS).
  - owner: team or person responsible for the procedure. Accepts handles (@user) or teams (@org/team).
  - severity: incident level this runbook applies to.
    - P0: total outage / data loss / customer impact
    - P1: severe production degradation
    - P2: partial or recurring degradation
    - P3: minor issue / planned maintenance
  - last_tested: last time the procedure was executed (in a drill or production) and verified to work. Wrap in quotes to prevent YAML from interpreting it as a Date.
  - on_call: handle of the person currently on-call for this service (optional; may rotate).
- Update sidebar_label to a short name.
- Delete this :::note block.
- Complete the sections below. The H1 and metadata are rendered automatically from the frontmatter.
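As a rough illustration of the frontmatter fields described above, the following shell sketch prints a frontmatter block from four arguments. The function name runbook_frontmatter is hypothetical, and every value passed to it is a placeholder supplied by the caller; it is not part of any existing tooling.

```bash
# Sketch only: prints runbook frontmatter to stdout.
# Arguments: id, owner, severity, last_tested (all caller-supplied placeholders).
runbook_frontmatter() {
  # Reject severities outside the four documented levels.
  case "$3" in
    P0|P1|P2|P3) ;;
    *) echo "severity must be one of P0, P1, P2, P3" >&2; return 1 ;;
  esac
  # owner is quoted because a YAML plain scalar cannot start with "@";
  # last_tested is quoted so it stays a string instead of becoming a Date.
  cat <<EOF
---
id: $1
owner: "$2"
severity: $3
last_tested: "$4"
---
EOF
}
```

For example, `runbook_frontmatter RB-RESTART-WORKERS @org/team P1 2024-01-15` prints a block in which both the owner and the (placeholder) date are quoted strings. The optional on_call and sidebar_label fields are left out of the sketch and would be added by hand.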
When to use
Describe the condition that triggers this runbook. E.g.: "Alert worker_queue_depth > 1000 for more than 5 min".
Preconditions
- Access to kubectl for the prod cluster.
- Active credentials in @eigenoid/platform.
- ...
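The kubectl precondition can be checked before starting. The sketch below assumes the kubectl context for the production cluster is literally named prod and that the procedure acts on the workers namespace used in the examples of this template; both are assumptions to adjust. It does not check membership in @eigenoid/platform, which has to be confirmed separately.

```bash
# Sketch of a precondition check. The default context name "prod" and the
# "workers" namespace are assumptions taken from this template's examples.
check_preconditions() {
  expected_context="${1:-prod}"
  current_context="$(kubectl config current-context)" || return 1
  if [ "$current_context" != "$expected_context" ]; then
    echo "kubectl context is '$current_context', expected '$expected_context'" >&2
    return 1
  fi
  # A rollout restart patches the Deployment, so verify that permission.
  if ! kubectl auth can-i patch deployments -n workers >/dev/null; then
    echo "current credentials cannot patch deployments in namespace workers" >&2
    return 1
  fi
}
```

Running `check_preconditions` with no argument checks against the assumed prod context; pass another context name as the first argument if the cluster is registered under a different one.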
Procedure
- Step 1: check the current state.
    kubectl get pods -n workers
- Step 2: apply the mitigation.
    kubectl rollout restart deployment/workers -n workers
- Step 3: verify that the system stabilized.
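Step 3 has no command in the template. One possible way to check that the restart settled is to wait on the rollout, sketched below. The function name wait_for_rollout and the 120s default timeout are assumptions for illustration, not values from this runbook.

```bash
# Sketch: block until the restarted Deployment finishes rolling out.
# The 120s default timeout is an arbitrary assumption; tune it per service.
wait_for_rollout() {
  rollout_timeout="${1:-120s}"
  if kubectl rollout status deployment/workers -n workers --timeout="$rollout_timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not complete within $rollout_timeout" >&2
    return 1
  fi
}
```

A non-zero exit here is a reasonable signal to move on to Rollback or Escalation rather than retrying the restart.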
Verification
How to confirm the procedure worked. Metrics, logs, endpoints to check.
Rollback
If something goes wrong, how to revert.
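For a Deployment-based mitigation, a revert might look like the sketch below. It assumes the mitigation changed the workers Deployment used in the examples of this template; a plain restart leaves the spec unchanged apart from a restart annotation, so undoing it mostly triggers another rolling update. The function name rollback_workers and the 120s default timeout are hypothetical.

```bash
# Sketch: revert the workers Deployment to its previous revision.
rollback_workers() {
  # Show recorded revisions first so the operator can see what will be restored.
  kubectl rollout history deployment/workers -n workers || return 1
  # With no --to-revision flag, undo returns to the previous revision.
  kubectl rollout undo deployment/workers -n workers || return 1
  # Wait for the reverted revision to finish rolling out.
  kubectl rollout status deployment/workers -n workers --timeout="${1:-120s}"
}
```

If the rollback itself fails or times out, the function returns non-zero, which is a natural point to escalate.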
Escalation
Who to escalate to if the runbook does not resolve the issue. E.g.: @oncall-lead or the @eigenoid/platform team.
References
- Related dashboards.
- Alerts that trigger it.
- Relevant ADRs.