Recipe · ops

"What's the next step?"

A local .kolm file that takes the current incident state (alert text, last 3 actions taken, current symptom) and returns the next runbook step plus a citation to the runbook section. Trained on your team's actual runbooks; the verifier rejects any suggestion that is not grounded in a real section of the corpus.

base model: qwen2.5-7b-instruct
gold pairs: 90 (45 train / 45 eval)
k-score floor: 0.85
artifact size: 2.6 GB
compile time: ~30 min
spec source: citation verifier

What this recipe does

Removes the "where in the runbook am I?" tax during a paging event. Paste the alert + what you've already tried; get back the next step plus a section citation that maps to a real runbook page. The verifier reads the runbook corpus at compile time and rejects any output whose citation does not resolve. No hallucinated runbook sections.
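One way to picture the "citation must resolve" check is the sketch below. It is not kolm's actual verifier, whose internals this page does not describe. It assumes runbooks are Markdown files, that runbook_path is relative to the directory containing runbooks/, and that section_anchor follows GitHub-style heading slugs. The names heading_anchors and citation_resolves, and the heading text in the demo, are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def heading_anchors(markdown_text: str) -> set[str]:
    """Return GitHub-style anchors ("#like-this") for every ATX heading.

    Simplification: lines starting with "#" inside fenced code blocks are
    also treated as headings.
    """
    anchors = set()
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*?)\s*#*\s*$", line)
        if not match:
            continue
        title = match.group(1).strip().lower()
        slug = re.sub(r"[^\w\s-]", "", title)  # drop punctuation
        slug = re.sub(r"\s+", "-", slug)       # spaces become hyphens
        anchors.add("#" + slug)
    return anchors


def citation_resolves(base_dir: Path, runbook_path: str, section_anchor: str,
                      corpus: str = "runbooks") -> bool:
    """True only if the cited file is inside the corpus and has that heading."""
    corpus_root = (base_dir / corpus).resolve()
    target = (base_dir / runbook_path).resolve()
    # Reject paths that escape the corpus (for example "../secrets.md").
    if corpus_root not in target.parents or not target.is_file():
        return False
    return section_anchor in heading_anchors(target.read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        page = base / "runbooks" / "kafka" / "consumer-lag.md"
        page.parent.mkdir(parents=True)
        # Hypothetical runbook content, written only for this demo.
        page.write_text(
            "# Consumer lag\n\n## Stalled consumer: no recent commits\n",
            encoding="utf-8",
        )
        path = "runbooks/kafka/consumer-lag.md"
        print(citation_resolves(base, path, "#stalled-consumer-no-recent-commits"))
        print(citation_resolves(base, path, "#made-up-section"))
```

Run as a script, the demo prints True for the real heading and False for the invented one, which is the property the recipe depends on.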

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["action", "runbook_path", "section_anchor", "why"],
    "properties": {
      "action": { "type": "string", "maxLength": 240 },
      "runbook_path": { "type": "string" },
      "section_anchor": { "type": "string" },
      "why": { "type": "string", "maxLength": 200 },
      "escalate": { "type": "boolean" }
    }
  },
  "verifier": {
    "corpus": "runbooks/",
    "citation_must_resolve": true,
    "escalate_when_no_match": true
  }
}
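To make the schema half of the spec concrete, here is a minimal sketch of the checks it implies: required fields, field types, and maxLength. It uses only the Python standard library and is not kolm's own validator. The function name schema_errors and the error wording are assumptions for illustration.

```python
import json

# Mirrors the "schema" block of the spec above.
SCHEMA = {
    "required": ["action", "runbook_path", "section_anchor", "why"],
    "properties": {
        "action": {"type": "string", "maxLength": 240},
        "runbook_path": {"type": "string"},
        "section_anchor": {"type": "string"},
        "why": {"type": "string", "maxLength": 200},
        "escalate": {"type": "boolean"},
    },
}

# Only the two types the spec uses.
PY_TYPES = {"string": str, "boolean": bool}


def schema_errors(raw: str, schema: dict = SCHEMA) -> list[str]:
    """Return human-readable problems; an empty list means the output passes."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level value must be an object"]
    errors = [
        f"missing required field: {key}"
        for key in schema["required"]
        if key not in doc
    ]
    for key, rules in schema["properties"].items():
        if key not in doc:
            continue
        value = doc[key]
        if not isinstance(value, PY_TYPES[rules["type"]]):
            errors.append(f"{key}: expected {rules['type']}")
        elif "maxLength" in rules and len(value) > rules["maxLength"]:
            errors.append(f"{key}: longer than {rules['maxLength']} characters")
    return errors
```

A well-formed output yields an empty list. A 241-character action yields a single "action: longer than 240 characters" entry, so an over-long step is flagged rather than silently truncated.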

Gold pairs (1 of 90 shown)

input
alert: kafka consumer-lag > 30s on orders.fulfilled
prev_actions:
  - confirmed broker health (3/3 up)
  - checked downstream service (orders-worker) — running
  - tailed orders-worker logs — last commit was 15 min ago
symptom_now: lag still climbing, currently 2m18s
output
{
  "action": "Restart orders-worker pods one-by-one with kubectl rollout restart deploy/orders-worker -n prod. Verify lag drops within 90 seconds.",
  "runbook_path": "runbooks/kafka/consumer-lag.md",
  "section_anchor": "#stalled-consumer-no-recent-commits",
  "why": "Broker is healthy and consumer is alive but committing no offsets — matches stalled-consumer pattern. Section says rolling restart is the cheapest first move.",
  "escalate": false
}

Compile

kolm compile "suggest next runbook step with citation" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --corpus runbooks/ \
  --verifier citation-resolves \
  --k-floor 0.85 \
  --output runbook-step.kolm

ok wrote runbook-step.kolm
   k_score=0.88  signature=hmac-sha256

K-score gate

K-score 0.88 on 45 held-out pairs · citation-resolves 100% · on-call-rated correct 84% · escalate-when-uncertain 96%

The escalate flag is the safety valve: when the model falls below its confidence threshold, or no runbook section matches the symptom, it sets escalate: true and returns the on-call escalation chain instead of inventing a step. Without that fallback, a wrong "next step" suggestion during a real page can make the incident worse.
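The fallback described above can be sketched as a small gate around the model's suggestion. Several things here are assumptions, not taken from kolm: the threshold value, the contents of the escalation chain, the escalation_chain key, and the idea that a numeric confidence is available to the caller. The resolves argument stands in for whatever citation check is in use.

```python
from typing import Callable, Optional

# Hypothetical values: the recipe does not state the threshold or chain format.
CONFIDENCE_THRESHOLD = 0.7
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "engineering manager"]


def apply_safety_valve(
    suggestion: Optional[dict],
    confidence: float,
    resolves: Callable[[str, str], bool],
) -> dict:
    """Pass a grounded, confident suggestion through; otherwise escalate."""
    grounded = suggestion is not None and resolves(
        suggestion.get("runbook_path", ""),
        suggestion.get("section_anchor", ""),
    )
    if grounded and confidence >= CONFIDENCE_THRESHOLD:
        return {**suggestion, "escalate": False}
    # Never return an ungrounded or low-confidence step to the responder.
    reason = "low confidence" if grounded else "no matching runbook section"
    return {"escalate": True, "escalation_chain": ESCALATION_CHAIN, "why": reason}
```

In this sketch the responder always gets something actionable: a cited step, or the people to call, but never a guess.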

Run-time profile

M2 MacBook: 2.1s
RTX 5090: 490ms
iPhone 15 Pro: 5.6s
CPU x86 (server): 7.0s

Deploy

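# pagerduty webhook handler: $1 is the alert payload as a JSON string
on_alert() {
  local state step
  # keep only the fields the model was trained on
  state=$(jq -c '{alert, prev_actions, symptom_now}' <<< "$1")
  # fail safe: if the model call itself fails, escalate rather than post an empty step
  step=$(kolm run runbook-step.kolm --input-stdin <<< "$state") || { pagerduty escalate; return; }
  if [ "$(jq -r '.escalate' <<< "$step")" = "true" ]; then
    pagerduty escalate
  else
    slack post "#oncall-$(date +%s)" "$step"
  fi
}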