test-gen
Recipe · coding

Test cases that actually run.

A local .kolm file that writes pytest / vitest / go test cases for a target function. The verifier executes each generated test at compile time and rejects any that fail to assemble or that don't actually exercise the function. Trained on 100 paired examples per framework.

base model     qwen2.5-coder-7b
gold pairs     100 (50 train / 50 eval) per framework
k-score floor  0.85
artifact size  2.4 GB
compile time   ~32 min
spec source    execution verifier

What this recipe does

Reads a target function and writes 3-7 test cases in your chosen framework. The output is the test file's source, ready to drop into tests/. The trick is the verifier: at compile time, every candidate output is run through pytest (or vitest, or go test) against the actual function. Tests that fail to import, fail to find the function, or pass with zero assertions are rejected.

The training set ends up containing only tests that exercise real behavior on real code. The model learns the shape of "tests that actually do something."
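
A minimal sketch of that compile-time gate, assuming pytest as the executor (the helper and its exact checks are illustrative; the real verifier ships inside the kolm compiler):

import ast
import subprocess
import tempfile
from pathlib import Path

def verify_candidate(test_source: str, target_name: str,
                     min_assertions: int = 2) -> bool:
    # must_assemble: candidates that don't parse are rejected outright
    try:
        tree = ast.parse(test_source)
    except SyntaxError:
        return False

    # must_call_target: the target has to appear as an actual call
    # (plain-name calls only here; a real check would also handle attributes)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    if target_name not in called:
        return False

    # min_assertions: for pytest this means bare `assert` statements;
    # other executors would count their own idioms instead
    if sum(isinstance(n, ast.Assert) for n in ast.walk(tree)) < min_assertions:
        return False

    # executor: finally, the tests must pass against the real function
    # (assumes the target's package is importable, e.g. on PYTHONPATH)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test_candidate.py"
        path.write_text(test_source)
        result = subprocess.run(["pytest", str(path), "-q"],
                                capture_output=True)
    return result.returncode == 0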

The spec

{
  "output_kind": "text",
  "verifier": {
    "executor": "pytest",
    "must_assemble": true,
    "must_call_target": true,
    "min_assertions": 2,
    "max_lines": 80
  }
}

Swap the executor for vitest, jest, go test, or cargo test. The verifier knows each framework's idioms: what counts as an assertion, what counts as a test boundary, what kind of import is required.
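
For instance, the same spec retargeted at vitest; only the executor field changes, and assertion counting presumably follows expect() calls:

{
  "output_kind": "text",
  "verifier": {
    "executor": "vitest",
    "must_assemble": true,
    "must_call_target": true,
    "min_assertions": 2,
    "max_lines": 80
  }
}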

Gold pairs (2 of 100 shown)

input - python
def normalize_email(s: str) -> str:
    s = s.strip().lower()
    if "+" in s.split("@")[0]:
        local, domain = s.split("@")
        local = local.split("+")[0]
        s = f"{local}@{domain}"
    return s
output - pytest
import pytest
from app.email import normalize_email

def test_lowercases():
    assert normalize_email("Alice@Example.com") == "alice@example.com"

def test_strips_plus_alias():
    assert normalize_email("alice+spam@example.com") == "alice@example.com"

def test_strips_whitespace():
    assert normalize_email("  alice@example.com  ") == "alice@example.com"

@pytest.mark.parametrize("raw", ["", "  "])
def test_handles_empty(raw):
    assert normalize_email(raw) == raw.strip().lower()
input - typescript
export function clamp(x: number, lo: number, hi: number): number {
  return Math.max(lo, Math.min(x, hi));
}
output - vitest
import { describe, it, expect } from "vitest";
import { clamp } from "./clamp";

describe("clamp", () => {
  it("returns x when in range", () => {
    expect(clamp(5, 0, 10)).toBe(5);
  });
  it("clamps below", () => {
    expect(clamp(-3, 0, 10)).toBe(0);
  });
  it("clamps above", () => {
    expect(clamp(99, 0, 10)).toBe(10);
  });
  it("handles equal bounds", () => {
    expect(clamp(7, 5, 5)).toBe(5);
  });
});

Compile

kolm compile "write pytest tests that exercise the target" \
  --base qwen2.5-coder-7b \
  --pairs pairs.jsonl \
  --verifier executor=pytest,min-assertions=2 \
  --k-floor 0.85 \
  --output test-gen.kolm

ok wrote test-gen.kolm
   k_score=0.88  signature=hmac-sha256
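
pairs.jsonl carries the gold pairs shown above. The exact schema isn't documented in this recipe; a plausible shape, assuming one object per line with input and output fields (field names are an assumption):

{"input": "def normalize_email(s: str) -> str: ...", "output": "import pytest\nfrom app.email import normalize_email\n..."}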

K-score gate

k-score             0.88 on the 50 held-out pairs
assemble-pass       96%
calls-target        100%
min-assertions met  92%

The "calls-target" check is what catches the common failure mode: tests that import the function and assert True == True without ever calling it. Verifier-rejected outputs never make it into the corpus, so the model learns to write tests that actually exercise the code under test.

Run-time profile

M2 MacBook        1.7s
RTX 5090          410ms
iPhone 15 Pro     4.4s
CPU x86 (server)  5.8s

Deploy

# bulk-generate tests for every function in files touched by the last commit:
for fn in $(git diff --name-only HEAD~1 -- "*.py"); do
  for func in $(kolm extract-functions "$fn"); do
    out=$(kolm run test-gen.kolm --input "$func")
    test_path="tests/test_$(basename "$fn" .py).py"
    echo "$out" >> "$test_path"   # appends; re-running duplicates tests
  done
done

# run the generated tests immediately:
pytest tests/ -x -q
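
For a one-off against a single function, the same artifact can be piped directly. The sed range is a rough stand-in for kolm extract-functions and assumes the function body ends at the first blank line:

# one-off: generate tests for a single function
kolm run test-gen.kolm \
  --input "$(sed -n '/^def normalize_email/,/^$/p' app/email.py)" \
  > tests/test_email.py
pytest tests/test_email.py -q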