
Procedure name

How to use this template
  1. Copy this file to docs/runbooks/<runbook-slug>.md (kebab-case, descriptive).
  2. Update the frontmatter:
    • id: short uppercase identifier (e.g. RB-RESTART-WORKERS).
    • owner: team or person responsible for the procedure. Accepts handles (@user) or teams (@org/team).
    • severity: incident level this runbook applies to.
      • P0: total outage / data loss / customer impact
      • P1: severe production degradation
      • P2: partial or recurring degradation
      • P3: minor issue / planned maintenance
    • last_tested: date the procedure was last executed (in a drill or in production) and verified to work. Wrap the value in quotes so YAML parses it as a string rather than a date.
    • on_call: handle of the person currently on-call for this service (optional; may rotate).
  3. Update sidebar_label to a short name.
  4. Delete this :::note block.
  5. Complete the sections below. The H1 and metadata are rendered automatically from the frontmatter.
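Steps 1 and the kebab-case rule can be sketched as a small helper. This is only an illustration: the template location (docs/runbooks/_template.md), the TEMPLATE and RUNBOOK_DIR variables, and the function names is_kebab_case and new_runbook are assumptions, not part of this repository.

```bash
# Sketch: validate a runbook slug and copy the template into place.
# TEMPLATE is a hypothetical path; point it at wherever this template lives.
TEMPLATE="${TEMPLATE:-docs/runbooks/_template.md}"
RUNBOOK_DIR="${RUNBOOK_DIR:-docs/runbooks}"

# Succeeds only for kebab-case: lowercase alphanumeric words joined by single hyphens.
is_kebab_case() {
  local LC_ALL=C   # make [a-z] match ASCII lowercase only
  local re='^[a-z0-9]+(-[a-z0-9]+)*$'
  [[ "$1" =~ $re ]]
}

# Copies the template to $RUNBOOK_DIR/<slug>.md and prints the new path.
# Refuses to overwrite an existing runbook.
new_runbook() {
  local slug="$1"
  if ! is_kebab_case "$slug"; then
    echo "error: '$slug' is not kebab-case" >&2
    return 1
  fi
  local target="$RUNBOOK_DIR/$slug.md"
  if [[ -e "$target" ]]; then
    echo "error: $target already exists" >&2
    return 1
  fi
  cp "$TEMPLATE" "$target" && echo "$target"
}
```

Calling new_runbook restart-workers would create docs/runbooks/restart-workers.md under these assumptions; the frontmatter still has to be edited by hand.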

When to use

Describe the condition that triggers this runbook. E.g.: "Alert worker_queue_depth > 1000 for more than 5 min".

Preconditions

  • Access to kubectl for the prod cluster.
  • Active credentials in @eigenoid/platform.
  • ...
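The kubectl precondition can be checked before starting. The following is a minimal sketch, assuming the prod cluster's kubeconfig context is named prod (EXPECTED_CONTEXT is a hypothetical variable) and that the procedure needs permission to patch deployments. It does not check membership in @eigenoid/platform.

```bash
# Sketch: fail fast if the kubectl preconditions are not met.
# EXPECTED_CONTEXT is a hypothetical context name; replace it with your prod cluster's.
EXPECTED_CONTEXT="${EXPECTED_CONTEXT:-prod}"
NAMESPACE="${NAMESPACE:-workers}"

check_preconditions() {
  # kubectl must be installed and on PATH.
  command -v kubectl >/dev/null 2>&1 || {
    echo "kubectl not found" >&2
    return 1
  }

  # The active context must be the cluster this runbook targets.
  local ctx
  ctx="$(kubectl config current-context)" || return 1
  if [[ "$ctx" != "$EXPECTED_CONTEXT" ]]; then
    echo "wrong context: $ctx (expected $EXPECTED_CONTEXT)" >&2
    return 1
  fi

  # The current credentials must be allowed to patch deployments,
  # which is what a rollout restart does under the hood.
  kubectl auth can-i patch deployments -n "$NAMESPACE" >/dev/null || {
    echo "missing permission to patch deployments in $NAMESPACE" >&2
    return 1
  }
}
```

Running check_preconditions at the top of a session returns non-zero with a short message on stderr if any check fails.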

Procedure

  1. Step 1: check the current state.
    kubectl get pods -n workers
  2. Step 2: apply the mitigation.
    kubectl rollout restart deployment/workers -n workers
  3. Step 3: verify that the system has stabilized.
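One way to make step 3 concrete is to block until the restart finishes. The sketch below wraps kubectl rollout status; the function name wait_for_rollout and the 5m default timeout are assumptions chosen for illustration.

```bash
# Sketch: wait for the restarted deployment to finish rolling out.
# Defaults match the example deployment and namespace used in the procedure.
wait_for_rollout() {
  local deployment="${1:-workers}" namespace="${2:-workers}" timeout="${3:-5m}"
  # rollout status exits non-zero if the rollout fails or the timeout expires.
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not complete within $timeout" >&2
    return 1
  fi
}
```

A completed rollout only shows that the new pods became ready; the metric or alert that triggered the runbook still needs to be checked separately.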

Verification

How to confirm the procedure worked. Metrics, logs, endpoints to check.

Rollback

If something goes wrong, how to revert.
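For a mitigation that changed the deployment's pod template, a rollback might look like the sketch below. It is an assumption about what a concrete runbook would put here, and rollback_deployment is a hypothetical helper name. A plain restart changes only a restart annotation on the pod template, so this is mainly relevant when the mitigation also changed images or configuration.

```bash
# Sketch: revert a deployment to its previous revision.
# Defaults match the example deployment and namespace used in the procedure.
rollback_deployment() {
  local deployment="${1:-workers}" namespace="${2:-workers}"
  # Show the revision history first so the operator can see what will be restored.
  kubectl rollout history "deployment/$deployment" -n "$namespace" || return 1
  # Revert the pod template to the previous revision.
  kubectl rollout undo "deployment/$deployment" -n "$namespace"
}
```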

Escalation

Who to escalate to if the runbook does not resolve the issue. E.g.: @oncall-lead or the @eigenoid/platform team.

References

  • Related dashboards.
  • Alerts that trigger it.
  • Relevant ADRs.