Runbook step recipe

What this recipe does

Removes the "where in the runbook am I?" tax during a paging event. Paste the alert + what you've already tried; get back the next step plus a section citation that maps to a real runbook page. The verifier reads the runbook corpus at compile time and rejects any output whose citation does not resolve. No hallucinated runbook sections.

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["action", "runbook_path", "section_anchor", "why"],
    "properties": {
      "action": { "type": "string", "maxLength": 240 },
      "runbook_path": { "type": "string" },
      "section_anchor": { "type": "string" },
      "why": { "type": "string", "maxLength": 200 },
      "escalate": { "type": "boolean" }
    }
  },
  "verifier": {
    "corpus": "runbooks/",
    "citation_must_resolve": true,
    "escalate_when_no_match": true
  }
}

Gold pairs (1 of 90 shown)

input

alert: kafka consumer-lag > 30s on orders.fulfilled
prev_actions:
  - confirmed broker health (3/3 up)
  - checked downstream service (orders-worker) — running
  - tailed orders-worker logs — last commit was 15 min ago
symptom_now: lag still climbing, currently 2m18s

output

{
  "action": "Restart orders-worker pods one-by-one with kubectl rollout restart deploy/orders-worker -n prod. Verify lag drops within 90 seconds.",
  "runbook_path": "runbooks/kafka/consumer-lag.md",
  "section_anchor": "#stalled-consumer-no-recent-commits",
  "why": "Broker is healthy and consumer is alive but committing no offsets — matches stalled-consumer pattern. Section says rolling restart is the cheapest first move.",
  "escalate": false
}

Compile

kolm compile "suggest next runbook step with citation" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --corpus runbooks/ \
  --verifier citation-resolves \
  --k-floor 0.85 \
  --output runbook-step.kolm

ok wrote runbook-step.kolm
   k_score=0.88  signature=hmac-sha256

K-score gate

K-score 0.88 held-out 45 pairs · citation-resolves 100% · oncall-rated correct 84% · escalate-when-uncertain 96%

The escalate flag is the safety valve: when the model is below confidence threshold or no runbook section matches the symptom, it flips escalate: true and returns the on-call escalation chain instead of inventing a step. This is the only way to ship "next step" suggestions during real pages without making things worse.

Run-time profile

M2 MacBook

2.1s

RTX 5090

490ms

iPhone 15 Pro

5.6s

CPU x86 (server)

7.0s

Deploy

# pagerduty webhook handler:
on_alert() {
  state=$(echo "$1" | jq -c '{alert, prev_actions, symptom_now}')
  step=$(kolm run runbook-step.kolm --input-stdin <<< "$state")
  if [ "$(echo "$step" | jq -r .escalate)" = "true" ]; then
    pagerduty escalate
  else
    slack post "#oncall-$(date +%s)" "$step"
  fi
}

"What's the next step?"