What this recipe does
SREs spend 20-40 minutes after every incident copying messages from #incident-* into a Notion template. The same five sections every time. The same redaction every time (pasted tokens, customer IDs, internal hostnames). This recipe automates the whole pass: paste in the channel transcript and the metrics dump, get back a structured draft you can refine in 5 minutes.
The verifier enforces (a) that all five sections are present, (b) that the timeline rows carry ISO-8601 timestamps in chronological order, and (c) that the redactor pass scrubs anything matching your secrets.regex file before the model ever sees it.
The spec
{
  "output_kind": "json",
  "schema": {
    "required": ["summary", "timeline", "impact", "root_cause", "follow_ups"],
    "properties": {
      "summary": { "type": "string", "maxLength": 280 },
      "timeline": {
        "type": "array",
        "items": {
          "required": ["ts", "event"],
          "properties": {
            "ts": { "type": "string", "format": "date-time" },
            "event": { "type": "string" }
          }
        }
      },
      "impact": { "type": "string" },
      "root_cause": { "type": "string" },
      "follow_ups": { "type": "array", "items": { "type": "string" } }
    }
  },
  "verifier": {
    "redact_before_train": true,
    "redact_pattern_file": "secrets.regex",
    "timeline_chronological": true,
    "min_follow_ups": 1
  }
}
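For intuition, here is roughly what the structural checks reduce to. A minimal sketch in Python, assuming the verifier sees the parsed draft; the function name and failure strings are illustrative, not kolm's internals:

from datetime import datetime

REQUIRED = ["summary", "timeline", "impact", "root_cause", "follow_ups"]

def check_draft(draft: dict) -> list[str]:
    """Return verifier failures; an empty list means the draft passes."""
    failures = []
    # (a) all five sections present
    failures += [f"missing section: {k}" for k in REQUIRED if k not in draft]
    # (b) timestamps parse as ISO-8601 and appear in chronological order
    try:
        stamps = [datetime.fromisoformat(row["ts"].replace("Z", "+00:00"))
                  for row in draft.get("timeline", [])]
        if stamps != sorted(stamps):
            failures.append("timeline not chronological")
    except (KeyError, ValueError) as exc:
        failures.append(f"unparseable timeline row: {exc}")
    # min_follow_ups: 1
    if len(draft.get("follow_ups", [])) < 1:
        failures.append("need at least one follow-up")
    return failures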
The redactor runs at compile time against every gold pair and every candidate output, so the model never learns to leak. secrets.regex is a one-line-per-pattern file; ours hits 14 patterns (Stripe keys, JWT shapes, internal hostname suffixes, raw IPv4, etc.).
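A few illustrative lines, not our real 14 (the hostname suffix here is a made-up stand-in):

sk_live_[0-9a-zA-Z]{24}
eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+
\b(?:\d{1,3}\.){3}\d{1,3}\b
[a-z0-9-]+\.corp\.example

The first matches Stripe-style live secret keys, the second the three-segment base64url shape of a JWT, the third raw IPv4, the fourth an internal hostname suffix.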
Gold pairs (1 of 80 shown)
[14:02] @oncall: paging — checkout p95 latency 8.2s, normal 240ms
[14:03] @oncall: alert id PD-AC8FR2, datadog board attached
[14:05] @sara: looking — db pool saturated, 200/200 in use
[14:06] @luis: just deployed v3.41.0 at 13:55, rolling back now
[14:09] @luis: rollback done, p95 dropping
[14:14] @sara: back to baseline 280ms
[14:21] @oncall: confirming all clear, postmortem to follow

datadog: db.connections.active 14:00 → 200 sustained 14:01-14:09 → drop 14:09
deploys: v3.41.0 deployed-prod 13:55:18Z, rolled-back 14:06:47Z
{
  "summary": "v3.41.0 deploy held db connections open without releasing them, saturating the pool within 6 minutes and pushing checkout p95 from 240ms to 8.2s. Rollback at 14:07 brought p95 back to the 280ms baseline by 14:14.",
  "timeline": [
    {"ts": "2026-05-09T13:55:18Z", "event": "v3.41.0 deployed"},
    {"ts": "2026-05-09T14:00:00Z", "event": "db.connections.active reaches pool ceiling (200)"},
    {"ts": "2026-05-09T14:02:00Z", "event": "PagerDuty fires; on-call paged"},
    {"ts": "2026-05-09T14:06:47Z", "event": "v3.41.0 rolled back"},
    {"ts": "2026-05-09T14:09:00Z", "event": "connections drain; latency recovering"},
    {"ts": "2026-05-09T14:14:00Z", "event": "p95 returns to 280ms baseline"}
  ],
  "impact": "~12 min of degraded checkout for all customers; estimated 4.1k abandoned carts based on baseline conversion.",
  "root_cause": "v3.41.0 introduced a code path that opened a per-request db connection but missed the close on the success branch. Connection pool saturated 6 min after deploy.",
  "follow_ups": [
    "Add connection-leak test to checkout integration suite",
    "Wire pool-utilization alert at 80% (currently fires at 100%)",
    "Confirm rollback budget documented for deploys after 13:00 PT"
  ]
}
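Each gold pair is one line of pairs.jsonl: the raw transcript-plus-metrics text in, the JSON above out. This recipe doesn't show kolm's exact field names, so treat the shape below as a hypothetical illustration of the convention:

{"input": "[14:02] @oncall: paging — checkout p95 latency 8.2s ... rolled-back 14:06:47Z", "output": {"summary": "v3.41.0 deploy held db connections open ...", "timeline": [...], "impact": "...", "root_cause": "...", "follow_ups": [...]}}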
Compile
kolm compile "summarize incident channel into 5-section postmortem draft" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --redactor secrets.regex \
  --verifier schema:postmortem.json \
  --k-floor 0.82 \
  --output incident-summarizer.kolm

ok wrote incident-summarizer.kolm k_score=0.86 signature=hmac-sha256
K-score gate
"SRE-rated useful" was a manual pass: an oncall engineer rated each output 1-5; the model needs a mean of 3.5 or higher across the held-out set. We landed at 4.1.
Run-time profile
Deploy
# slack slash command — runs inside the BAA boundary:
/postmortem #incident-2026-05-09-checkout

# or pipe straight from the cli:
slack export #incident-2026-05-09-checkout \
  | kolm run incident-summarizer.kolm --input-stdin \
  > postmortem-draft.json
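The draft lands as the five sections in JSON; the last step is still a human paste into Notion. A throwaway rendering script, assuming the output shape from the spec above (nothing here is kolm tooling):

import json

# load the draft produced by kolm run
with open("postmortem-draft.json") as f:
    draft = json.load(f)

# flatten the five sections into paste-ready plain text
lines = ["Summary", draft["summary"], "", "Timeline"]
lines += [f'{row["ts"]}  {row["event"]}' for row in draft["timeline"]]
lines += ["", "Impact", draft["impact"], "", "Root cause", draft["root_cause"], "", "Follow-ups"]
lines += [f"- {item}" for item in draft["follow_ups"]]
print("\n".join(lines))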