test-gen
Recipe · coding

Test cases that actually run.

A local .kolm file that writes pytest / vitest / go test cases for a target function. The verifier executes each generated test at compile time and rejects any that fail to assemble or that don't actually exercise the function. Trained on 100 paired examples per framework.

base model     qwen2.5-coder-7b
gold pairs     100 (50 train / 50 eval) per framework
k-score floor  0.85
artifact size  2.4 GB
compile time   ~32 min
spec source    execution verifier

What this recipe does

Reads a target function and writes 3-7 test cases in your chosen framework. The output is the test file's source, ready to drop into tests/. The trick is the verifier: at compile time, every candidate output is run through pytest (or vitest, or go test) against the actual function. Tests that fail to import, fail to find the function, or pass with zero assertions are rejected.

The training set ends up containing only tests that exercise real behavior on real code. The model learns the shape of "tests that actually do something."
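
A minimal sketch of that compile-time gate, assuming pytest as the executor (the helper and its exact checks are illustrative; the real verifier ships inside the kolm compiler):

import ast
import subprocess
import tempfile
from pathlib import Path

def verify_candidate(test_source: str, target_name: str,
                     min_assertions: int = 2) -> bool:
    # must_assemble: candidates that don't parse are rejected outright
    try:
        tree = ast.parse(test_source)
    except SyntaxError:
        return False

    # must_call_target: the target has to appear as an actual call
    # (plain-name calls only here; a real check would also handle attributes)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    if target_name not in called:
        return False

    # min_assertions: for pytest this means bare `assert` statements;
    # other executors would count their own idioms instead
    if sum(isinstance(n, ast.Assert) for n in ast.walk(tree)) < min_assertions:
        return False

    # executor: finally, the tests must pass against the real function
    # (assumes the target's package is importable, e.g. on PYTHONPATH)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test_candidate.py"
        path.write_text(test_source)
        result = subprocess.run(["pytest", str(path), "-q"],
                                capture_output=True)
    return result.returncode == 0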

The spec

{
  "output_kind": "text",
  "verifier": {
    "executor": "pytest",
    "must_assemble": true,
    "must_call_target": true,
    "min_assertions": 2,
    "max_lines": 80
  }
}

Swap the executor for vitest, jest, go test, or cargo test. The verifier knows each framework's idioms: what counts as an assertion, what counts as a test boundary, what kind of import is required.
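
For instance, the same spec retargeted at vitest; only the executor field changes, and assertion counting presumably follows expect() calls:

{
  "output_kind": "text",
  "verifier": {
    "executor": "vitest",
    "must_assemble": true,
    "must_call_target": true,
    "min_assertions": 2,
    "max_lines": 80
  }
}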

Gold pairs (2 of 100 shown)

input - python
def normalize_email(s: str) -> str:
    s = s.strip().lower()
    if "+" in s.split("@")[0]:
        local, domain = s.split("@")
        local = local.split("+")[0]
        s = f"{local}@{domain}"
    return s
output - pytest
import pytest
from app.email import normalize_email

def test_lowercases():
    assert normalize_email("Alice@Example.com") == "alice@example.com"

def test_strips_plus_alias():
    assert normalize_email("alice+spam@example.com") == "alice@example.com"

def test_strips_whitespace():
    assert normalize_email("  alice@example.com  ") == "alice@example.com"

@pytest.mark.parametrize("raw", ["", "  "])
def test_handles_empty(raw):
    assert normalize_email(raw) == raw.strip().lower()
input - typescript
export function clamp(x: number, lo: number, hi: number): number {
  return Math.max(lo, Math.min(x, hi));
}
output - vitest
import { describe, it, expect } from "vitest";
import { clamp } from "./clamp";

describe("clamp", () => {
  it("returns x when in range", () => {
    expect(clamp(5, 0, 10)).toBe(5);
  });
  it("clamps below", () => {
    expect(clamp(-3, 0, 10)).toBe(0);
  });
  it("clamps above", () => {
    expect(clamp(99, 0, 10)).toBe(10);
  });
  it("handles equal bounds", () => {
    expect(clamp(7, 5, 5)).toBe(5);
  });
});

Compile

kolm compile "write pytest tests that exercise the target" \
  --base qwen2.5-coder-7b \
  --pairs pairs.jsonl \
  --verifier executor=pytest,min-assertions=2 \
  --k-floor 0.85 \
  --output test-gen.kolm

ok wrote test-gen.kolm
   k_score=0.88  signature=hmac-sha256
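
pairs.jsonl carries the gold pairs shown above. The exact schema isn't documented in this recipe; a plausible shape, assuming one object per line with input and output fields (field names are an assumption):

{"input": "def normalize_email(s: str) -> str: ...", "output": "import pytest\nfrom app.email import normalize_email\n..."}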

K-score gate

k-score             0.88 on the 50 held-out pairs
assemble-pass       96%
calls-target        100%
min-assertions met  92%

The "calls-target" check is what catches the common failure mode: tests that import the function and assert True == True without ever calling it. Verifier-rejected outputs never make it into the corpus, so the model learns to write tests that actually exercise the code under test.

Run-time profile

M2 MacBook        1.7s
RTX 5090          410ms
iPhone 15 Pro     4.4s
CPU x86 (server)  5.8s

Deploy

# bulk-generate tests for every function in files touched by the last commit:
for fn in $(git diff --name-only HEAD~1 -- "*.py"); do
  for func in $(kolm extract-functions "$fn"); do
    out=$(kolm run test-gen.kolm --input "$func")
    test_path="tests/test_$(basename "$fn" .py).py"
    echo "$out" >> "$test_path"   # appends; re-running duplicates tests
  done
done

# run the generated tests immediately:
pytest tests/ -x -q
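
For a one-off against a single function, the same artifact can be piped directly. The sed range is a rough stand-in for kolm extract-functions and assumes the function body ends at the first blank line:

# one-off: generate tests for a single function
kolm run test-gen.kolm \
  --input "$(sed -n '/^def normalize_email/,/^$/p' app/email.py)" \
  > tests/test_email.py
pytest tests/test_email.py -q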