What this recipe does
Removes the "where in the runbook am I?" tax during a paging event. Paste the alert + what you've already tried; get back the next step plus a section citation that maps to a real runbook page. The verifier reads the runbook corpus at compile time and rejects any output whose citation does not resolve. No hallucinated runbook sections.
The spec
{
  "output_kind": "json",
  "schema": {
    "required": ["action", "runbook_path", "section_anchor", "why"],
    "properties": {
      "action": { "type": "string", "maxLength": 240 },
      "runbook_path": { "type": "string" },
      "section_anchor": { "type": "string" },
      "why": { "type": "string", "maxLength": 200 },
      "escalate": { "type": "boolean" }
    }
  },
  "verifier": {
    "corpus": "runbooks/",
    "citation_must_resolve": true,
    "escalate_when_no_match": true
  }
}
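To make citation_must_resolve concrete, here is a minimal sketch of what "resolves" could mean: the runbook file exists, and one of its markdown headings slugs to the cited anchor. This is not kolm's actual verifier. The function names heading_slugs and citation_resolves are hypothetical, and the slug rule (lowercase, drop punctuation, spaces to hyphens) is an assumed GitHub-style convention that your runbook renderer may not share.

```shell
# Hypothetical helper: print one anchor slug per markdown heading in file $1.
# Assumed slug rule: lowercase, drop everything except a-z 0-9 space _ -,
# then turn each space into a hyphen.
heading_slugs() {
  sed -n 's/^#\{1,6\} \{1,\}//p' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g'
}

# citation_resolves RUNBOOK_PATH SECTION_ANCHOR
# Exit 0 only if the file exists and some heading slugs to the anchor.
citation_resolves() {
  path=$1
  anchor=${2#\#}                     # accept "#foo" or "foo"
  [ -f "$path" ] || return 1         # cited page must exist
  [ -n "$anchor" ] || return 1       # an empty anchor never resolves
  heading_slugs "$path" | grep -qxF -- "$anchor"
}
```

Under this rule, a heading such as "## Stalled consumer (no recent commits)" would yield the anchor #stalled-consumer-no-recent-commits. The sketch does not skip headings that appear inside code fences, so a stricter checker would need a real markdown parser.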
Gold pairs (1 of 90 shown)
alert: kafka consumer-lag > 30s on orders.fulfilled
prev_actions:
  - confirmed broker health (3/3 up)
  - checked downstream service (orders-worker) — running
  - tailed orders-worker logs — last commit was 15 min ago
symptom_now: lag still climbing, currently 2m18s
{
  "action": "Restart orders-worker pods one-by-one with kubectl rollout restart deploy/orders-worker -n prod. Verify lag drops within 90 seconds.",
  "runbook_path": "runbooks/kafka/consumer-lag.md",
  "section_anchor": "#stalled-consumer-no-recent-commits",
  "why": "Broker is healthy and consumer is alive but committing no offsets — matches stalled-consumer pattern. Section says rolling restart is the cheapest first move.",
  "escalate": false
}
Compile
kolm compile "suggest next runbook step with citation" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --corpus runbooks/ \
  --verifier citation-resolves \
  --k-floor 0.85 \
  --output runbook-step.kolm

ok wrote runbook-step.kolm k_score=0.88 signature=hmac-sha256
K-score gate
The escalate flag is the safety valve: when the model falls below its confidence threshold, or no runbook section matches the symptom, it sets escalate: true and returns the on-call escalation chain instead of inventing a step. This is the only way to ship "next step" suggestions during real pages without making things worse.
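One way to keep that safety valve intact on the consuming side is to fail closed: treat anything other than an explicit false as a reason to escalate. The schema does not list escalate under required, so a handler can receive a missing or null value, and a failed run can produce empty output. The sketch below is an assumption about how a caller might handle those cases, not part of kolm; route_for is a hypothetical name.

```shell
# route_for ESCALATE_VALUE
# Prints "post" only when the value is exactly the string "false".
# Anything else (true, null, empty, unparseable output) prints "escalate".
route_for() {
  case "$1" in
    false) echo post ;;
    *)     echo escalate ;;
  esac
}

# Possible use inside a webhook handler, assuming jq is installed and
# $step holds the model's JSON output:
#   route=$(route_for "$(printf '%s' "$step" | jq -r '.escalate')")
```

With this rule, a crash or malformed response pages a human instead of posting an empty suggestion to the channel.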
Run-time profile
Deploy
# pagerduty webhook handler:
on_alert() {
  state=$(echo "$1" | jq -c '{alert, prev_actions, symptom_now}')
  step=$(kolm run runbook-step.kolm --input-stdin <<< "$state")
  if [ "$(echo "$step" | jq -r .escalate)" = "true" ]; then
    pagerduty escalate
  else
    slack post "#oncall-$(date +%s)" "$step"
  fi
}