What this recipe does
A PM gets a 3-paragraph issue. Before they can scope it, they spend 20 minutes restructuring it into a spec. Same template every time. This recipe does the restructure pass and stops there: it doesn't decide product strategy or set priorities; it just turns prose into the shape your team reviews.
The verifier enforces three gates: (a) all four sections are present, (b) each success criterion is testable (it must contain a measurable verb), and (c) the brand-voice gate from the docstring recipe holds: your prevailing tone, not a generic LLM voice.
The spec
{
"output_kind": "json",
"schema": {
"required": ["problem", "scope", "success_criteria", "open_questions"],
"properties": {
"problem": { "type": "string", "maxLength": 600 },
"scope": { "type": "object",
"required": ["in", "out"],
"properties": {
"in": { "type": "array", "items": {"type":"string"} },
"out": { "type": "array", "items": {"type":"string"} }
}
},
"success_criteria": { "type": "array", "items": {"type":"string"}, "minItems": 2 },
"open_questions": { "type": "array", "items": {"type":"string"} }
}
},
"verifier": {
"success_criterion_must_contain_verb": true,
"style_corpus": "prior-specs/",
"max_words_per_section": 200
}
}
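Gate (a) is plain JSON Schema, so you can smoke-test a draft spec against it outside kolm. A minimal sketch using Python's jsonschema package (the standalone check_spec helper is mine, not part of the recipe):

import json
from jsonschema import ValidationError, validate

# The "schema" object from the spec above, verbatim.
SPEC_SCHEMA = {
    "required": ["problem", "scope", "success_criteria", "open_questions"],
    "properties": {
        "problem": {"type": "string", "maxLength": 600},
        "scope": {
            "type": "object",
            "required": ["in", "out"],
            "properties": {
                "in": {"type": "array", "items": {"type": "string"}},
                "out": {"type": "array", "items": {"type": "string"}},
            },
        },
        "success_criteria": {"type": "array", "items": {"type": "string"}, "minItems": 2},
        "open_questions": {"type": "array", "items": {"type": "string"}},
    },
}

def check_spec(raw: str) -> list[str]:
    # Returns schema violations; an empty list means the draft passes gate (a).
    try:
        draft = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    try:
        validate(instance=draft, schema=SPEC_SCHEMA)
    except ValidationError as e:
        return [e.message]
    return []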
Gold pairs (1 of 80 shown)
title: feed loads slow for users with >200 saved items
body: Customers on the high-volume saved list (top 5%) report the feed taking 4-7 seconds to render. Telemetry confirms: p95 feed-render at 3.8s for users above 200 saved items, vs 480ms for everyone else. We've seen it in support tickets twice this week. Likely the N+1 query in feed.py:118 — we know about it but never prioritized it. Should we?
{
"problem": "Customers with >200 saved items see p95 feed-render at 3.8s vs 480ms baseline. Affects ~5% of active users; surfaced in 2 support tickets this week. Source is the N+1 query at feed.py:118.",
"scope": {
"in": [
"Eliminate the N+1 query at feed.py:118 (use prefetch_related)",
"Add p95-by-saved-count chart to the feed perf dashboard"
],
"out": [
"Pagination redesign (separate work)",
"Caching layer (consider after this fix lands)"
]
},
"success_criteria": [
"p95 feed-render reduces below 800ms for users with 200+ saved items",
"DB query count for the feed endpoint drops below 12 per render (currently scales with saved count)",
"Zero new feed-perf support tickets in the 14 days after release"
],
"open_questions": [
"Do we backfill prefetch hints into the existing serializer, or write a new one?",
"What's the rollout plan — flag-gated for high-volume users first?"
]
}
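Each gold pair ships as one line of pairs.jsonl. The exact field names kolm expects aren't shown in this recipe, so treat "input" and "output" below as assumptions; the point is that the raw issue text and the target JSON travel together as a single record:

import json

# Hypothetical record shape; the "input"/"output" keys are assumed, not kolm-documented.
pair = {
    "input": (
        "title: feed loads slow for users with >200 saved items\n"
        "body: Customers on the high-volume saved list (top 5%) report ..."
    ),
    "output": {
        "problem": "Customers with >200 saved items see p95 feed-render at 3.8s ...",
        "scope": {"in": ["..."], "out": ["..."]},
        "success_criteria": ["...", "..."],
        "open_questions": ["..."],
    },
}

with open("pairs.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")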
Compile
kolm compile "github issue to 4-section feature spec" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --style-corpus prior-specs/ \
  --verifier success-criterion-testable \
  --k-floor 0.80 \
  --output feature-spec.kolm

ok  wrote feature-spec.kolm  k_score=0.84  signature=hmac-sha256
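The compile log reports signature=hmac-sha256. The recipe doesn't say what the signature covers or who holds the key, but if it's a plain HMAC-SHA256 over the artifact bytes, an integrity check before deploy could look like this (the KOLM_SIGNING_KEY variable and the byte scope are assumptions):

import hashlib
import hmac
import os

def artifact_ok(path: str, expected_hex: str) -> bool:
    # Assumes the signature is HMAC-SHA256 over the raw file bytes,
    # keyed by a secret you hold; kolm's real scheme may differ.
    key = os.environ["KOLM_SIGNING_KEY"].encode()  # hypothetical env var
    with open(path, "rb") as f:
        digest = hmac.new(key, f.read(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, expected_hex)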
K-score gate
The "testable verb" check is the hardest gate — it kills outputs like "feed should be faster" (no measurable threshold) and forces "p95 feed-render reduces below 800ms" (concrete, observable, falsifiable).
Run-time profile
Deploy
# github action: runs on every issue labeled `needs-spec`
on:
  issues:
    types: [labeled]

jobs:
  draft-spec:
    if: github.event.label.name == 'needs-spec'
    runs-on: ubuntu-latest
    permissions:
      issues: write  # lets `gh issue comment` post with the default token
    env:
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
      ISSUE_BODY: ${{ github.event.issue.body }}
      NUM: ${{ github.event.issue.number }}
    steps:
      # assumes kolm and feature-spec.kolm are already present on the runner
      - run: |
          spec=$(kolm run feature-spec.kolm --input "$ISSUE_BODY")
          gh issue comment "$NUM" --body "$spec"
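Before wiring the action, smoke-test locally through the same entrypoint: feed a saved issue body to kolm run feature-spec.kolm --input and run the result through the schema check above.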