Procedure name

How to use this template

Copy this file to docs/runbooks/<runbook-slug>.md (kebab-case, descriptive).
Update the frontmatter:
- id: short uppercase identifier (e.g. RB-RESTART-WORKERS).
- owner: team or person responsible for the procedure. Accepts handles (@user) or teams (@org/team).
- severity: incident level this runbook applies to.
  - P0: total outage / data loss / customer impact
  - P1: severe production degradation
  - P2: partial or recurring degradation
  - P3: minor issue / planned maintenance
- last_tested: last time the procedure was executed (in a drill or production) and verified to work. Wrap in quotes to prevent YAML from interpreting it as a Date.
- on_call: handle of the person currently on-call for this service (optional; may rotate).
Update sidebar_label to a short name.
Delete this :::note block.
Complete the sections below. The H1 and metadata are rendered automatically from the frontmatter.

When to use

Describe the condition that triggers this runbook. E.g.: "Alert worker_queue_depth > 1000 for more than 5 min".

Step 2: apply the mitigation.

kubectl rollout restart deployment/workers -n workers
bash

How to confirm the procedure worked. Metrics, logs, endpoints to check.

If something goes wrong, how to revert.

Who to escalate to if the runbook does not resolve the issue. E.g.: @oncall-lead or the @eigenoid/platform team.