cookbook · ops · log-grep

"Show me 5xx from checkout in the last hour."

A local .kolm file that turns a one-line English query into a structured log search. Output is the canonical query string for your stack (LogQL, Datadog grammar, CloudWatch Logs Insights) plus a parsed AST so a runner can actually execute it. The verifier parses every candidate output against the target grammar at compile time.

base model: qwen2.5-coder-3b
gold pairs: 70 (35 train / 35 eval) per backend
k-score floor: 0.85
artifact size: 1.6 GB
compile time: ~22 min
spec source: grammar parser

What this recipe does

Stops the cycle of "I know what I want to find but I can't remember the LogQL syntax." The model reads the query, the available log streams (passed as context), and emits a structured query in the dialect you configured. The verifier parses every output through the real grammar — outputs that fail to parse never make it into the corpus.

Trained per-backend: one .kolm for Loki, one for Datadog, one for CloudWatch. Same prompt, different output dialect.

The spec

{
  "output_kind": "json",
  "schema": {
    "required": ["query", "backend", "time_window"],
    "properties": {
      "query": { "type": "string" },
      "backend": { "enum": ["loki", "datadog", "cloudwatch"] },
      "time_window": { "type": "string", "pattern": "^[0-9]+[smhd]$" },
      "limit": { "type": "integer", "maximum": 5000 }
    }
  },
  "verifier": {
    "grammar_parse": true,
    "reject_unbounded_window": true
  }
}

Gold pairs (3 of 70 shown)

input: show me 5xx from the checkout service in the last hour
output (loki):
{
  "query": "{service=\"checkout\"} | json | status_code >= 500 | status_code < 600",
  "backend": "loki",
  "time_window": "1h",
  "limit": 1000
}

input: slowest 50 db queries today on prod-rw
output (datadog):
{
  "query": "service:db host:prod-rw @duration_ms:>1000 sort:@duration_ms desc",
  "backend": "datadog",
  "time_window": "24h",
  "limit": 50
}

input: cold-start lambdas in us-east-1 last 15 min
output (cloudwatch):
{
  "query": "fields @timestamp, @message | filter @message like /Init Duration/ | filter @message like /us-east-1/",
  "backend": "cloudwatch",
  "time_window": "15m",
  "limit": 200
}

Compile

kolm compile "natural language to log query" \
  --base qwen2.5-coder-3b \
  --pairs pairs.jsonl \
  --verifier grammar:loki \
  --k-floor 0.85 \
  --output log-grep-loki.kolm

ok wrote log-grep-loki.kolm
   k_score=0.89  signature=hmac-sha256

K-score gate

K-score 0.89 on 35 held-out pairs · grammar-pass 100% · bounded-window 100% · semantic-match 86%

"Semantic match" is human-rated: does the query actually find what the request asked for? The grammar gate is hard (100% must parse); the semantic gate is soft (86% lands on the user's intent on the first shot, which beats every "ChatGPT writes me a LogQL query" workflow we measured).

Run-time profile

M2 MacBook: 820 ms
RTX 5090: 220 ms
iPhone 15 Pro: 2.5 s
CPU x86 (server): 3.2 s

Deploy

# oncall cli — `lg "<question>"` runs the model once and pipes to the backend.
# Running `kolm run` a single time matters: two invocations could return two
# different queries, and you'd pay the latency twice.
lg() {
  out=$(kolm run log-grep-loki.kolm --input "$*")
  q=$(printf '%s' "$out" | jq -r '.query')
  w=$(printf '%s' "$out" | jq -r '.time_window')
  logcli query "$q" --since="$w" --limit=200
}

lg "5xx from checkout in the last hour"