What this recipe does
Reads a target function and writes 3-7 test cases in your chosen framework. The output is the test file's source, ready to drop into tests/. The trick is the verifier: at compile time, every candidate output is run through pytest (or vitest, or go test) against the actual function. Tests that fail to import, fail to find the function, or pass with zero assertions are rejected.
The training set ends up containing only tests that exercise real behavior on real code. The model learns the shape of "tests that actually do something."
The spec
{
  "output_kind": "text",
  "verifier": {
    "executor": "pytest",
    "must_assemble": true,
    "must_call_target": true,
    "min_assertions": 2,
    "max_lines": 80
  }
}
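The recipe doesn't show the verifier's internals, but the checks in the spec are mechanical. Here is a minimal sketch for the pytest executor, assuming the candidate arrives as a source string and the target package is importable from the project root; verify_candidate and count_assertions are illustrative names, not kolm APIs, and must_call_target is sketched separately under "K-score gate" below.

import ast
import subprocess
import sys
import tempfile
from pathlib import Path

def count_assertions(source: str) -> int:
    """pytest idiom: an assertion is a bare `assert` statement."""
    return sum(isinstance(n, ast.Assert) for n in ast.walk(ast.parse(source)))

def verify_candidate(test_source: str, min_assertions: int = 2, max_lines: int = 80) -> bool:
    if len(test_source.splitlines()) > max_lines:        # max_lines
        return False
    try:
        ast.parse(test_source)                           # must_assemble: file has to parse
    except SyntaxError:
        return False
    if count_assertions(test_source) < min_assertions:   # min_assertions
        return False
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test_candidate.py"
        path.write_text(test_source)
        # Failing to import, failing to find the target, and failing asserts
        # all surface as a nonzero exit code. Run from the project root so
        # imports like `from app.email import ...` resolve.
        result = subprocess.run(
            [sys.executable, "-m", "pytest", str(path), "-q"],
            capture_output=True,
        )
    return result.returncode == 0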
Swap executor for vitest, jest, go test, or cargo test. The verifier knows the framework's idioms - what counts as an assertion, what counts as a test boundary, what kind of import is required.
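The exact idiom table kolm uses isn't documented in this recipe. As a rough illustration, it might pair a per-executor assertion pattern with a test-boundary pattern; the patterns below are assumptions, not kolm's actual configuration.

FRAMEWORK_IDIOMS = {
    "pytest":     {"assertion": r"^\s*assert\b",        "test": r"^def test_"},
    "vitest":     {"assertion": r"\bexpect\(",          "test": r"\b(it|test)\("},
    "jest":       {"assertion": r"\bexpect\(",          "test": r"\b(it|test)\("},
    "go test":    {"assertion": r"t\.(Error|Fatal)",    "test": r"^func Test"},
    "cargo test": {"assertion": r"\bassert(_eq|_ne)?!", "test": r"#\[test\]"},
}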
Gold pairs (2 of 100 shown)
Input:

def normalize_email(s: str) -> str:
    s = s.strip().lower()
    if "+" in s.split("@")[0]:
        local, domain = s.split("@")
        local = local.split("+")[0]
        s = f"{local}@{domain}"
    return s
Output:

import pytest
from app.email import normalize_email

def test_lowercases():
    assert normalize_email("Alice@Example.com") == "alice@example.com"

def test_strips_plus_alias():
    assert normalize_email("alice+spam@example.com") == "alice@example.com"

def test_strips_whitespace():
    assert normalize_email(" alice@example.com ") == "alice@example.com"

@pytest.mark.parametrize("raw", ["", " "])
def test_handles_empty(raw):
    assert normalize_email(raw) == raw.strip().lower()
Input:

export function clamp(x: number, lo: number, hi: number): number {
  return Math.max(lo, Math.min(x, hi));
}
Output:

import { describe, it, expect } from "vitest";
import { clamp } from "./clamp";

describe("clamp", () => {
  it("returns x when in range", () => {
    expect(clamp(5, 0, 10)).toBe(5);
  });
  it("clamps below", () => {
    expect(clamp(-3, 0, 10)).toBe(0);
  });
  it("clamps above", () => {
    expect(clamp(99, 0, 10)).toBe(10);
  });
  it("handles equal bounds", () => {
    expect(clamp(7, 5, 5)).toBe(5);
  });
});
Compile
kolm compile "write pytest tests that exercise the target" \
  --base qwen2.5-coder-7b \
  --pairs pairs.jsonl \
  --verifier executor=pytest,min-assertions=2 \
  --k-floor 0.85 \
  --output test-gen.kolm

ok  wrote test-gen.kolm  k_score=0.88  signature=hmac-sha256
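Each line of pairs.jsonl is one gold pair. This recipe doesn't spell out the schema, so the field names below are an assumption; with a simple input/output shape, the normalize_email pair would be a single line like (abbreviated):

{"input": "def normalize_email(s: str) -> str:\n    s = s.strip().lower()\n    ...", "output": "import pytest\nfrom app.email import normalize_email\n\ndef test_lowercases():\n    assert normalize_email(\"Alice@Example.com\") == \"alice@example.com\"\n..."}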
K-score gate
The "calls-target" check is what catches the common failure mode: tests that import the function and assert True == True without ever calling it. Verifier-rejected outputs never make it into the corpus, so the model learns to write tests that actually exercise the code under test.
Deploy
# bulk-generate tests for newly-added functions:
for fn in $(git diff --name-only HEAD~1 -- "*.py"); do
  for func in $(kolm extract-functions "$fn"); do
    out=$(kolm run test-gen.kolm --input "$func")
    test_path="tests/test_$(basename "$fn" .py).py"
    echo "$out" >> "$test_path"
  done
done

# run the generated tests immediately:
pytest tests/ -x -q