PR review recipe

What this recipe does

Replaces the "ask GPT to review this PR" prompt that lives in your CI with a signed local model. The deliverable is a .kolm file that runs offline, returns a JSON object you can render in a comment, and never leaks the diff to a third-party API.

The shape: diff in, structured review out. Each issue carries a severity (error / warning / info), a category (correctness / style / security / perf), and a precise file:line pointer. The verifier rejects free-form prose and enforces the schema before any output ships.
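A file:line pointer is only useful if it refers to the new side of the diff. As a minimal sketch (not part of kolm; the function name `added_line_numbers` and the sample diff are hypothetical), new-file line numbers for added lines can be derived from the hunk headers of a unified diff like this. It assumes well-formed input and does not handle content lines that themselves begin with `--- ` or `+++ `.

```python
import re

# "@@ -old_start[,old_count] +new_start[,new_count] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_line_numbers(diff_text):
    """Return (file, line) for every added line, numbered in the new file."""
    results = []
    current_file = None
    new_line = None  # None means "not inside a hunk"
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            # "+++ b/path" names the new-side file; strip the "b/" prefix.
            path = raw[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
            new_line = None
            continue
        if raw.startswith("--- "):
            new_line = None
            continue
        match = HUNK_RE.match(raw)
        if match:
            new_line = int(match.group(2))
            continue
        if new_line is None or raw.startswith("\\"):
            continue  # outside a hunk, or a "\ No newline at end of file" marker
        if raw.startswith("+"):
            results.append((current_file, new_line))
            new_line += 1
        elif raw.startswith("-"):
            pass  # removed lines do not exist in the new file
        else:
            new_line += 1  # context line
    return results


SAMPLE_DIFF = "\n".join([
    "--- a/app/util.py",
    "+++ b/app/util.py",
    "@@ -10,3 +10,4 @@ def load(path):",
    "     data = read(path)",
    "-    return data",
    "+    data = data.strip()",
    "+    return data",
    " # end",
])

print(added_line_numbers(SAMPLE_DIFF))
# [('app/util.py', 11), ('app/util.py', 12)]
```

A check like this can be run over gold pairs to confirm that every issue's line points at a line the diff actually touches.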

The spec

{
  "type": "object",
  "required": ["summary", "issues"],
  "properties": {
    "summary": { "type": "string", "maxLength": 280 },
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["severity", "category", "file", "line", "detail"],
        "properties": {
          "severity": { "enum": ["error", "warning", "info"] },
          "category": { "enum": ["correctness", "style", "security", "perf"] },
          "file": { "type": "string" },
          "line": { "type": "integer", "minimum": 1 },
          "detail": { "type": "string", "maxLength": 480 }
        }
      }
    }
  }
}
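To make the constraints concrete, here is a minimal sketch of a check that mirrors the schema above using only the Python standard library. It is not the kolm verifier and not a full JSON Schema implementation; the function name `validate_review` is hypothetical, and the sketch is slightly stricter than the schema in that it rejects floats such as 3.0 for line.

```python
import json

SEVERITIES = {"error", "warning", "info"}
CATEGORIES = {"correctness", "style", "security", "perf"}


def validate_review(raw):
    """Parse raw model output; return a list of violations (empty if valid)."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not JSON: {exc.msg}"]  # free-form prose ends up here
    if not isinstance(review, dict):
        return ["top level must be an object"]

    errors = []
    summary = review.get("summary")
    if not isinstance(summary, str):
        errors.append("summary must be a string")
    elif len(summary) > 280:
        errors.append("summary exceeds 280 characters")

    issues = review.get("issues")
    if not isinstance(issues, list):
        errors.append("issues must be an array")
        return errors

    for n, issue in enumerate(issues):
        where = f"issues[{n}]"
        if not isinstance(issue, dict):
            errors.append(f"{where} must be an object")
            continue
        severity = issue.get("severity")
        if not isinstance(severity, str) or severity not in SEVERITIES:
            errors.append(f"{where}.severity is missing or not allowed")
        category = issue.get("category")
        if not isinstance(category, str) or category not in CATEGORIES:
            errors.append(f"{where}.category is missing or not allowed")
        if not isinstance(issue.get("file"), str):
            errors.append(f"{where}.file must be a string")
        line = issue.get("line")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(line, bool) or not isinstance(line, int) or line < 1:
            errors.append(f"{where}.line must be an integer >= 1")
        detail = issue.get("detail")
        if not isinstance(detail, str) or len(detail) > 480:
            errors.append(f"{where}.detail must be a string of at most 480 characters")
    return errors
```

Calling `validate_review` on a plain sentence such as "Looks good to me!" returns a single "not JSON" violation, which is the free-form prose case the verifier is described as rejecting.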

Gold pairs (3 of 120 shown)

input - diff #017

--- a/src/auth.js
+++ b/src/auth.js
@@ -42,6 +42,6 @@ function checkPassword(req, res) {
   const stored = db.users[req.body.email];
-  if (stored.password === req.body.password) {
+  if (stored.password == req.body.password) {
     return res.json({ token: sign(stored) });
   }
   return res.status(401).end();
}

output

{
  "summary": "Loose equality on credential check; missing bcrypt compare.",
  "issues": [
    {
      "severity": "error", "category": "security",
      "file": "src/auth.js", "line": 43,
      "detail": "Plaintext password comparison with == coerces types and leaks timing information. Use bcrypt.compare or crypto.timingSafeEqual."
    }
  ]
}

input - diff #042

--- a/lib/sum.py
+++ b/lib/sum.py
@@ -1,3 +1,3 @@
 def sum_pos(xs):
-    return sum(x for x in xs if x > 0)
+    return sum([x for x in xs if x > 0])

output

{
  "summary": "List comprehension is materialized unnecessarily; minor perf regression.",
  "issues": [
    {
      "severity": "info", "category": "perf",
      "file": "lib/sum.py", "line": 2,
      "detail": "Generator already lazy; wrapping it in a list allocates an intermediate list. Drop the brackets."
    }
  ]
}

input - diff #088

--- a/api/users.go
+++ b/api/users.go
@@ -10,3 +10,5 @@ func ListUsers(w http.ResponseWriter, r *http.Request) {
   page, _ := strconv.Atoi(r.URL.Query().Get("page"))
+  if page < 0 { page = 0 }
+  if page > 1000 { page = 1000 }
   users := db.ListUsers(page * 50)
}

output

{
  "summary": "Defensive bounds added; correct.",
  "issues": []
}
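Pairs like the three above need to be serialized one per line before compiling. The sketch below shows one way to do that in Python. The record layout is an assumption: the field names "input" and "output" follow the "(input, output)" wording used for pairs.jsonl, but the exact keys kolm expects are not stated here. The helper names are hypothetical.

```python
import json


def to_jsonl_line(diff_text, review):
    """Serialize one gold pair as a single JSON Lines record."""
    # Assumed layout: {"input": <diff text>, "output": <review object>}.
    # json.dumps escapes the newlines inside the diff, so the record
    # always occupies exactly one physical line.
    return json.dumps({"input": diff_text, "output": review}, ensure_ascii=False)


def write_pairs(pairs, path):
    """Write an iterable of (diff_text, review) tuples to a .jsonl file."""
    with open(path, "w", encoding="utf-8") as handle:
        for diff_text, review in pairs:
            handle.write(to_jsonl_line(diff_text, review) + "\n")
```

For example, `write_pairs([(diff_text, {"summary": "Defensive bounds added; correct.", "issues": []})], "pairs.jsonl")` writes a one-line file for a clean diff with no issues.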

Compile

# pairs.jsonl: 120 (input, output) pairs of the shape above
kolm compile "PR review with structured issues" \
  --base qwen2.5-coder-7b \
  --pairs pairs.jsonl \
  --spec spec.json \
  --k-floor 0.85 \
  --output pr-review.kolm

ok wrote pr-review.kolm
   k_score=0.89  signature=hmac-sha256
   artifact_hash=sha256:8a4f...e612

K-score gate

K-score 0.89 on 60 held-out pairs · verifier-pass 91% · recipe-coverage 88% · latency-ratio 0.84

The floor was 0.85 and the run came in at 0.89. The verifier rejected three Pass-1 outputs that emitted free-form prose instead of schema-conforming JSON; those were recompiled, and the gate held.

Run-time profile

Device             p50 latency
M2 MacBook         1.4 s
RTX 5090           340 ms
iPhone 15 Pro      3.8 s
CPU x86 (server)   6.1 s

Numbers are p50 latency on a 32-line diff. Long diffs (200+ lines) take roughly 2x as long. The artifact is 2.4 GB on disk; on phone-class hardware, cold-start time to first token is dominated by loading the artifact.
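To reproduce a p50 figure on your own hardware, a timing harness along these lines is one option. This is a sketch under stated assumptions: `run_once` is a hypothetical zero-argument callable that wraps however you invoke the artifact on a fixed diff, and the warm-up runs are there so the cold-load cost does not skew the median.

```python
import statistics
import time


def p50_latency(run_once, repeats=20, warmup=2):
    """Return the median wall-clock time in seconds of run_once()."""
    for _ in range(warmup):
        run_once()  # discard warm-up runs (model load, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Measuring cold-load separately (a single timed first call with `warmup=0`) keeps the two costs from being mixed into one number.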

Deploy

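# GitHub Action steps that post review comments on every pull request update:
- uses: kolm-ai/run-action@v1
  with:
    artifact: pr-review.kolm
    input: ${{ github.event.pull_request.diff_url }}
    output: review.json
- uses: actions/github-script@v7
  with:
    script: |
      // Relative require paths resolve against the workspace directory.
      const r = require('./review.json');
      for (const i of r.issues) {
        // One inline comment per issue, anchored to the PR's head commit.
        await github.rest.pulls.createReviewComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          pull_number: context.issue.number,
          commit_id: context.payload.pull_request.head.sha,
          path: i.file,
          line: i.line,
          side: 'RIGHT',
          body: `[${i.severity}/${i.category}] ${i.detail}`,
        });
      }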
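If you would rather post one summary comment than one inline comment per issue, the review JSON can be flattened into plain text first. The sketch below is one possible rendering, not something kolm ships; the function name `render_review` and the bullet layout are hypothetical.

```python
def render_review(review):
    """Render a schema-conforming review dict as a plain-text comment body."""
    lines = [review["summary"], ""]
    if not review["issues"]:
        lines.append("No issues found.")
    for issue in review["issues"]:
        lines.append(
            f'- [{issue["severity"]}/{issue["category"]}] '
            f'{issue["file"]}:{issue["line"]} {issue["detail"]}'
        )
    return "\n".join(lines)
```

The function assumes its input has already passed schema validation, so it indexes the required fields directly instead of guarding against missing keys.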

