What this recipe does
The cheapest way to fix on-call burnout is to stop waking people up for things that are not actionable. Most teams' alert rules accumulate cruft — flapping disk-space pages, retried-and-recovered network blips, cron jobs that always run a minute late and always fire one warning before recovering. This recipe reads each incoming alert and decides: page now (actionable), file for morning (suppress), or get a second pair of eyes (escalate).
"Suppress" does not mean "ignore." The classifier writes a row to a queue that the morning shift reviews; nothing is dropped, just deferred.
The spec
{
  "output_kind": "json",
  "schema": {
    "required": ["class", "confidence", "reason"],
    "properties": {
      "class": { "enum": ["actionable", "suppress", "escalate"] },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
      "reason": { "type": "string", "maxLength": 200 },
      "similar_alerts_30d": { "type": "integer" }
    }
  },
  "verifier": {
    "min_confidence_for_suppress": 0.8,
    "reason_must_cite_history": true
  }
}
The min_confidence_for_suppress gate carries the safety argument: a suppress call with confidence below 0.8 defaults to actionable instead. A missed page costs far more than an unnecessary one, and the verifier enforces this asymmetry at compile time.
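The rule the gate describes can be written as a plain function. This is a minimal sketch, not kolm's verifier: the helper name apply_suppress_gate is hypothetical, and the sketch makes no assumption about whether kolm applies the rule at compile time, at run time, or both. Only the 0.8 threshold and the class names come from the spec above.

```python
from typing import Any

# Threshold taken from the spec's "verifier" block.
MIN_CONFIDENCE_FOR_SUPPRESS = 0.8


def apply_suppress_gate(
    result: dict[str, Any],
    threshold: float = MIN_CONFIDENCE_FOR_SUPPRESS,
) -> dict[str, Any]:
    """Hypothetical helper: turn a low-confidence 'suppress' into 'actionable'.

    Other classes pass through unchanged, so the gate can only add pages,
    never remove them.
    """
    if result["class"] == "suppress" and result["confidence"] < threshold:
        gated = dict(result)  # copy so the caller's dict is not mutated
        gated["class"] = "actionable"
        return gated
    return result


borderline = {"class": "suppress", "confidence": 0.79, "reason": "seen before"}
print(apply_suppress_gate(borderline)["class"])
```

Because the gate only ever moves a result from suppress to actionable, an error in the confidence estimate costs an extra page rather than a missed one.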
Gold pairs (2 of 1,000 shown)
alert_name: disk-usage-warn
host: build-worker-04
metric: disk_pct=92
fired_at: 2026-05-09T03:14:22Z
prior_30d: same alert fired 14 times, all auto-cleared by nightly cleanup cron at 03:30Z
{
  "class": "suppress",
  "confidence": 0.94,
  "reason": "Recurring pattern; same alert auto-cleared by nightly cleanup cron 14/14 in last 30d. Defer to morning queue for cleanup-cron tuning.",
  "similar_alerts_30d": 14
}
alert_name: db-replication-lag
host: db-replica-prod-2
metric: replica_lag_s=42
fired_at: 2026-05-09T02:08:11Z
prior_30d: 0 occurrences
{
  "class": "actionable",
  "confidence": 0.92,
  "reason": "First occurrence of this alert in 30d. Replication lag > 30s on a primary read replica is customer-facing if it persists. Page on-call now.",
  "similar_alerts_30d": 0
}
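With 1,000 gold pairs, a hand-written output that drifts from the spec is easy to miss. The sketch below checks one gold output against the schema shown earlier, using only the Python standard library. The function name schema_errors is an assumption for illustration, and nothing here reflects how kolm itself validates pairs or how pairs.jsonl is laid out.

```python
import json

# Values copied from the spec's "schema" block.
ALLOWED_CLASSES = ("actionable", "suppress", "escalate")
REQUIRED_FIELDS = ("class", "confidence", "reason")
MAX_REASON_LENGTH = 200


def schema_errors(output: dict) -> list[str]:
    """Hypothetical checker: return a list of schema violations (empty if valid)."""
    errors = []
    for key in REQUIRED_FIELDS:
        if key not in output:
            errors.append(f"missing required field: {key}")

    if "class" in output and output["class"] not in ALLOWED_CLASSES:
        errors.append(f"class must be one of {ALLOWED_CLASSES}")

    conf = output.get("confidence")
    if conf is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not is_number or not 0 <= conf <= 1:
            errors.append("confidence must be a number in [0, 1]")

    reason = output.get("reason")
    if reason is not None:
        if not isinstance(reason, str) or len(reason) > MAX_REASON_LENGTH:
            errors.append(f"reason must be a string of at most {MAX_REASON_LENGTH} chars")

    similar = output.get("similar_alerts_30d")
    if similar is not None:
        if not isinstance(similar, int) or isinstance(similar, bool):
            errors.append("similar_alerts_30d must be an integer")

    return errors


gold = json.loads(
    '{"class": "actionable", "confidence": 0.92, '
    '"reason": "First occurrence of this alert in 30d.", '
    '"similar_alerts_30d": 0}'
)
print(schema_errors(gold))
```

Returning a list of messages rather than raising on the first problem makes it practical to report every bad pair in one pass before compiling.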
Compile
kolm compile "classify on-call page as actionable, suppress, or escalate" \
  --base qwen2.5-coder-3b \
  --pairs pairs.jsonl \
  --verifier confidence-asymmetric \
  --k-floor 0.88 \
  --output page-classifier.kolm
ok wrote page-classifier.kolm k_score=0.91 signature=hmac-sha256
K-score gate
The single hardest constraint is zero false negatives on actionable. We tolerate classifying a genuinely actionable page as escalate (a slightly slower response) but never as suppress (a silent failure). The training set is weighted to penalize false negatives 10x.
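One way to check this constraint on a held-out set is to count the one error that must never happen and to score everything else with the same 10x asymmetry. This is a sketch under stated assumptions: it is not how kolm computes its K-score, and the function names and the cost of 1.0 for ordinary mistakes are illustrative choices. Only the class names and the 10x weight come from the text above.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (gold_class, predicted_class)


def missed_actionable(pairs: Iterable[Pair]) -> int:
    """Count actionable alerts that were predicted as suppress (silent failures)."""
    return sum(1 for gold, pred in pairs if gold == "actionable" and pred == "suppress")


def weighted_cost(pairs: Iterable[Pair], fn_weight: float = 10.0) -> float:
    """Sum misclassification costs; an actionable-to-suppress miss costs fn_weight."""
    total = 0.0
    for gold, pred in pairs:
        if gold == pred:
            continue  # correct call, no cost
        if gold == "actionable" and pred == "suppress":
            total += fn_weight  # the error the recipe must never make
        else:
            total += 1.0  # any other mistake, including actionable -> escalate
    return total


held_out = [
    ("actionable", "actionable"),  # correct
    ("actionable", "escalate"),    # tolerated: slower, but a human still sees it
    ("suppress", "actionable"),    # unnecessary page
    ("actionable", "suppress"),    # silent failure
]
print(missed_actionable(held_out), weighted_cost(held_out))  # prints: 1 12.0
```

A release check would require missed_actionable to return 0 on the held-out set, with the weighted cost used only to compare candidates that already meet that bar.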
Run-time profile
Deploy
# alertmanager webhook receiver: classify before paging
on_webhook() {
  cls=$(printf '%s\n' "$1" | kolm run page-classifier.kolm --input-stdin)
  case "$(printf '%s\n' "$cls" | jq -r .class)" in
    actionable) pagerduty trigger ;;
    suppress)   queue write morning-review "$cls" ;;
    escalate)   slack post "#oncall-leads" "$cls" ;;
    # fail toward paging if the classifier output is missing or unrecognized
    *)          pagerduty trigger ;;
  esac
}
# 30-day result on our team: 41% fewer 3am pages, 0 missed-actionable.