incident-summarizer
Recipe · ops

Incidents in, postmortems out.

A local .kolm file that takes the raw Slack thread plus the relevant Datadog / Grafana / PagerDuty timeline and writes a 5-section postmortem draft: summary, timeline, impact, root cause, follow-ups. Trained on 80 of your past postmortems. Runs inside your VPC; no incident text ever leaves.

base model       qwen2.5-7b-instruct
gold pairs       80 (40 train / 40 eval)
k-score floor    0.82
artifact size    2.6 GB
compile time     ~28 min
spec source      JSON Schema + redactor

What this recipe does

SREs spend 20-40 minutes after every incident copying messages from #incident-* into a Notion template. The same five sections every time. The same redaction every time (paste tokens, customer ids, internal hostnames). This recipe automates the whole pass: paste the channel transcript and the metrics dump, get back a structured draft you can refine in 5 minutes.

The verifier enforces (a) that all five sections are present, (b) that every timeline row carries an ISO-8601 timestamp and the rows are in chronological order, and (c) that the redactor pass scrubs anything matching your secrets.regex file before the model ever sees it.

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["summary", "timeline", "impact", "root_cause", "follow_ups"],
    "properties": {
      "summary": { "type": "string", "maxLength": 280 },
      "timeline": { "type": "array", "items": {
        "required": ["ts", "event"],
        "properties": {
          "ts": { "type": "string", "format": "date-time" },
          "event": { "type": "string" }
        }
      } },
      "impact": { "type": "string" },
      "root_cause": { "type": "string" },
      "follow_ups": { "type": "array", "items": { "type": "string" } }
    }
  },
  "verifier": {
    "redact_before_train": true,
    "redact_pattern_file": "secrets.regex",
    "timeline_chronological": true,
    "min_follow_ups": 1
  }
}
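
The schema half is plain JSON Schema; the verifier keys are kolm-specific. A minimal sketch of what those extra checks amount to, with hypothetical helper names rather than kolm's internals:

from datetime import datetime

REQUIRED = ["summary", "timeline", "impact", "root_cause", "follow_ups"]

def check_postmortem(draft: dict) -> list[str]:
    """Rough equivalent of the verifier block above; returns a list of failures."""
    errors = [f"missing section: {k}" for k in REQUIRED if k not in draft]
    try:
        # timeline rows carry ISO-8601 timestamps, in chronological order
        ts = [datetime.fromisoformat(row["ts"].replace("Z", "+00:00"))
              for row in draft.get("timeline", [])]
        if ts != sorted(ts):
            errors.append("timeline not chronological")
    except (KeyError, ValueError) as exc:
        errors.append(f"bad timeline timestamp: {exc}")
    if len(draft.get("follow_ups", [])) < 1:          # min_follow_ups
        errors.append("needs at least one follow-up")
    if len(draft.get("summary", "")) > 280:           # maxLength from the schema
        errors.append("summary over 280 chars")
    return errors

An empty list means the candidate clears the structural gate; the redactor pass runs separately.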

The redactor runs at compile time against every gold pair and every candidate output, so the model never learns to leak. secrets.regex is a one-line-per-pattern file; ours hits 14 patterns (Stripe keys, JWT shapes, internal hostname suffixes, raw IPv4, etc.).
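
A sketch of that redactor pass, assuming nothing beyond the one-pattern-per-line format described above (the placeholder token is a guess, not kolm's):

import re

def load_patterns(path: str = "secrets.regex") -> list[re.Pattern]:
    """One compiled regex per non-empty line of the patterns file."""
    with open(path) as f:
        return [re.compile(line.strip()) for line in f if line.strip()]

def redact(text: str, patterns: list[re.Pattern]) -> str:
    """Scrub every match before the text reaches training or inference."""
    for pat in patterns:
        text = pat.sub("[REDACTED]", text)
    return text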

Gold pairs (1 of 80 shown)

input - slack transcript + datadog timeline
[14:02] @oncall: paging — checkout p95 latency 8.2s, normal 240ms
[14:03] @oncall: alert id PD-AC8FR2, datadog board attached
[14:05] @sara: looking — db pool saturated, 200/200 in use
[14:06] @luis: just deployed v3.41.0 at 13:55, rolling back now
[14:09] @luis: rollback done, p95 dropping
[14:14] @sara: back to baseline 280ms
[14:21] @oncall: confirming all clear, postmortem to follow

datadog: db.connections.active 14:00 → 200 sustained 14:01-14:09 → drop 14:09
deploys: v3.41.0 deployed-prod 13:55:18Z, rolled-back 14:06:47Z

output
{
  "summary": "v3.41.0 deploy held db connections open without releasing them, saturating the pool within 6 minutes and pushing checkout p95 from 240ms to 8.2s. Rollback restored baseline at 14:09.",
  "timeline": [
    {"ts": "2026-05-09T13:55:18Z", "event": "v3.41.0 deployed"},
    {"ts": "2026-05-09T14:00:00Z", "event": "db.connections.active reaches pool ceiling (200)"},
    {"ts": "2026-05-09T14:02:00Z", "event": "PagerDuty fires; on-call paged"},
    {"ts": "2026-05-09T14:06:47Z", "event": "v3.41.0 rolled back"},
    {"ts": "2026-05-09T14:09:00Z", "event": "connections drain; latency recovering"},
    {"ts": "2026-05-09T14:14:00Z", "event": "p95 returns to 280ms baseline"}
  ],
  "impact": "~12 min of degraded checkout for all customers; estimated 4.1k abandoned carts based on baseline conversion.",
  "root_cause": "v3.41.0 introduced a code path that opened a per-request db connection but missed the close on the success branch. Connection pool saturated 6 min after deploy.",
  "follow_ups": [
    "Add connection-leak test to checkout integration suite",
    "Wire pool-utilization alert at 80% (currently fires at 100%)",
    "Confirm rollback budget documented for deploys after 13:00 PT"
  ]
}
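
pairs.jsonl, referenced in the compile step below, holds one gold pair per line. The field names here ("input" / "output") and both filenames are assumptions about kolm's pairs format, not taken from its docs:

import json

# assumed field names; check kolm's pairs format before relying on this shape
pair = {
    "input": open("incident-2026-05-09-checkout.txt").read(),           # transcript + metrics dump
    "output": json.load(open("incident-2026-05-09-checkout.pm.json")),  # the draft shown above
}
with open("pairs.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")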

Compile

kolm compile "summarize incident channel into 5-section postmortem draft" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --redactor secrets.regex \
  --verifier schema:postmortem.json \
  --k-floor 0.82 \
  --output incident-summarizer.kolm

ok wrote incident-summarizer.kolm
   k_score=0.86  signature=hmac-sha256

K-score gate

k-score (held-out 40 pairs)   0.86
schema-pass                   100%
chronological                 98%
redactor catches              100%
SRE-rated useful              88%

"SRE-rated useful" was a manual pass: an oncall engineer rated each output 1-5; the model needs a mean of 3.5 or higher across the held-out set. We landed at 4.1.

Run-time profile

M2 MacBook          2.4s
RTX 5090            540ms
iPhone 15 Pro       6.1s
CPU x86 (server)    7.8s

Deploy

# slack slash command (runs inside your VPC boundary):
/postmortem #incident-2026-05-09-checkout

# or pipe straight from the cli:
slack export #incident-2026-05-09-checkout \
  | kolm run incident-summarizer.kolm --input-stdin \
  > postmortem-draft.json
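
The slash command is just glue around that same CLI call. A minimal sketch using slack_bolt; fetch_transcript is a placeholder for however you pull the channel history and metrics dump inside your VPC:

import json
import subprocess
from slack_bolt import App

app = App()  # reads SLACK_BOT_TOKEN / SLACK_SIGNING_SECRET from the environment

def fetch_transcript(channel: str) -> str:
    """Placeholder: return the #incident-* history plus the metrics dump as plain text."""
    raise NotImplementedError

@app.command("/postmortem")
def postmortem(ack, command, respond):
    ack()
    raw = fetch_transcript(command["text"].strip())
    out = subprocess.run(
        ["kolm", "run", "incident-summarizer.kolm", "--input-stdin"],
        input=raw, capture_output=True, text=True, check=True,
    )
    draft = json.loads(out.stdout)
    respond(f"postmortem draft ready: {draft['summary']}")

if __name__ == "__main__":
    app.start(port=3000)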