Recipe · ops

"Should this wake someone up?"

A local .kolm file that classifies an incoming alert into actionable, suppress, or escalate. Trained on 1,000 of your team's historical pages, each labeled by what actually happened next: real fix, false alarm, or deferred to morning. The verifier rejects classifications that disagree with the audit trail.

base model: qwen2.5-coder-3b
gold pairs: 1,000 (700 train / 300 eval)
k-score floor: 0.88
artifact size: 1.6 GB
compile time: ~38 min
spec source: labeled-class verifier

What this recipe does

The cheapest way to fix on-call burnout is to stop waking people up for things that are not actionable. Most teams' alert rules accumulate cruft — flapping disk-space pages, retried-and-recovered network blips, cron jobs that always run a minute late and always fire one warning before recovering. This recipe reads each incoming alert and decides: page now (actionable), file for morning (suppress), or get a second pair of eyes (escalate).

"Suppress" does not mean "ignore." The classifier writes a row to a queue that the morning shift reviews; nothing is dropped, just deferred.

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["class", "confidence", "reason"],
    "properties": {
      "class": { "enum": ["actionable", "suppress", "escalate"] },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
      "reason": { "type": "string", "maxLength": 200 },
      "similar_alerts_30d": { "type": "integer" }
    }
  },
  "verifier": {
    "min_confidence_for_suppress": 0.8,
    "reason_must_cite_history": true
  }
}

The min_confidence_for_suppress gate is critical: a suppress call with confidence below 0.8 defaults to actionable. A missed page costs far more than an unnecessary one, and the verifier enforces this asymmetry at compile time.
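The same asymmetry can be re-checked at run time as defense in depth. The following is a minimal sketch, assuming the class and confidence have already been extracted from the classifier output. gate_class is a hypothetical helper, not part of kolm, and its 0.8 default mirrors min_confidence_for_suppress from the spec.

```shell
# Hypothetical helper (not part of kolm): downgrade a low-confidence
# "suppress" to "actionable"; every other class passes through unchanged.
# Usage: gate_class <class> <confidence> [min_confidence]
gate_class() {
  local cls="$1" conf="$2" min="${3:-0.8}"
  # awk handles the floating-point comparison; it exits 0 when conf < min.
  if [ "$cls" = "suppress" ] &&
     awk -v c="$conf" -v m="$min" 'BEGIN { exit !(c + 0 < m + 0) }'; then
    echo "actionable"
  else
    echo "$cls"
  fi
}
```

For example, gate_class suppress 0.72 prints actionable, while gate_class suppress 0.94 prints suppress.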

Gold pairs (2 of 1,000 shown)

input
alert_name: disk-usage-warn
host: build-worker-04
metric: disk_pct=92
fired_at: 2026-05-09T03:14:22Z
prior_30d: same alert fired 14 times, all auto-cleared by nightly cleanup cron at 03:30Z
output
{
  "class": "suppress",
  "confidence": 0.94,
  "reason": "Recurring pattern; same alert auto-cleared by nightly cleanup cron 14/14 in last 30d. Defer to morning queue for cleanup-cron tuning.",
  "similar_alerts_30d": 14
}
input
alert_name: db-replication-lag
host: db-replica-prod-2
metric: replica_lag_s=42
fired_at: 2026-05-09T02:08:11Z
prior_30d: 0 occurrences
output
{
  "class": "actionable",
  "confidence": 0.92,
  "reason": "First occurrence of this alert in 30d. Replication lag > 30s on a primary read replica is customer-facing if it persists. Page on-call now.",
  "similar_alerts_30d": 0
}

Compile

kolm compile "classify on-call page as actionable, suppress, or escalate" \
  --base qwen2.5-coder-3b \
  --pairs pairs.jsonl \
  --verifier confidence-asymmetric \
  --k-floor 0.88 \
  --output page-classifier.kolm

ok wrote page-classifier.kolm
   k_score=0.91  signature=hmac-sha256

K-score gate

K-score 0.91 on 300 held-out alerts · agreement-with-audit 91% · suppress-precision 96% · missed-actionable 0% (zero false negatives)

The single hardest constraint: zero false negatives on actionable. We tolerate calling a real-fix page escalate (slightly slower response) but never suppress (silent failure). The training set is weighted to penalize false negatives 10x.
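The two safety numbers above can be recomputed from any labeled evaluation run. This is a sketch under assumptions: the input is a tab-separated stream with the audited class first and the predicted class second on each line. That layout and the eval_metrics name are hypothetical, not a kolm format. A page counts as missed only when an audited actionable alert is predicted suppress, matching the tolerance described above.

```shell
# Hypothetical helper: reads "expected<TAB>predicted" lines on stdin and
# prints suppress precision and the missed-actionable rate as fractions.
eval_metrics() {
  awk -F'\t' '
    $2 == "suppress"                       { pred_sup++ }  # predicted suppress
    $2 == "suppress" && $1 == "suppress"   { true_sup++ }  # ...and audit agrees
    $1 == "actionable"                     { act++ }       # audited actionable
    $1 == "actionable" && $2 == "suppress" { missed++ }    # silent failure
    END {
      printf "suppress_precision=%.2f\n", (pred_sup ? true_sup / pred_sup : 0)
      printf "missed_actionable=%.2f\n",  (act ? missed / act : 0)
    }'
}
```

Run it as eval_metrics < eval-results.tsv, where eval-results.tsv is a placeholder file name. A release check could then refuse any artifact whose missed_actionable is above 0.00.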

Run-time profile

M2 MacBook: 540 ms
RTX 5090: 160 ms
iPhone 15 Pro: 1.7 s
CPU x86 (server): 2.2 s

Deploy

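# alertmanager webhook receiver: classify before paging.
on_webhook() {
  cls=$(printf '%s' "$1" | kolm run page-classifier.kolm --input-stdin)
  # Quote "$cls" so the JSON reaches jq intact (no word splitting or globbing).
  case "$(printf '%s' "$cls" | jq -r .class)" in
    actionable) pagerduty trigger ;;
    suppress)   queue write morning-review "$cls" ;;
    escalate)   slack post "#oncall-leads" "$cls" ;;
    # Unknown or empty class (for example, the classifier failed): fail open and page.
    *)          pagerduty trigger ;;
  esac
}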

# 30-day result on our team: 41% fewer 3am pages, 0 missed-actionable.