What this recipe does
A PM gets a 3-paragraph issue. Before they can scope it, they spend 20 minutes restructuring it into a spec. Same template every time. This recipe does the restructure pass and stops there: it doesn't decide product strategy or set priorities; it just turns prose into the shape your team reviews.
The verifier enforces three gates: (a) all four sections are present, (b) each success criterion is testable (it must contain a measurable verb), and (c) the brand-voice gate from the docstring recipe holds: your prevailing tone, not a generic LLM voice.
The spec
{
"output_kind": "json",
"schema": {
"required": ["problem", "scope", "success_criteria", "open_questions"],
"properties": {
"problem": { "type": "string", "maxLength": 600 },
"scope": { "type": "object",
"required": ["in", "out"],
"properties": {
"in": { "type": "array", "items": {"type":"string"} },
"out": { "type": "array", "items": {"type":"string"} }
}
},
"success_criteria": { "type": "array", "items": {"type":"string"}, "minItems": 2 },
"open_questions": { "type": "array", "items": {"type":"string"} }
}
},
"verifier": {
"success_criterion_must_contain_verb": true,
"style_corpus": "prior-specs/",
"max_words_per_section": 200
}
}
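Gate (a) is plain JSON Schema, so you can smoke-test a draft spec against it outside kolm. A minimal sketch using Python's jsonschema package (the standalone check_spec helper is mine, not part of the recipe):

import json
from jsonschema import ValidationError, validate

# The "schema" object from the spec above, verbatim.
SPEC_SCHEMA = {
    "required": ["problem", "scope", "success_criteria", "open_questions"],
    "properties": {
        "problem": {"type": "string", "maxLength": 600},
        "scope": {
            "type": "object",
            "required": ["in", "out"],
            "properties": {
                "in": {"type": "array", "items": {"type": "string"}},
                "out": {"type": "array", "items": {"type": "string"}},
            },
        },
        "success_criteria": {"type": "array", "items": {"type": "string"}, "minItems": 2},
        "open_questions": {"type": "array", "items": {"type": "string"}},
    },
}

def check_spec(raw: str) -> list[str]:
    # Returns schema violations; an empty list means the draft passes gate (a).
    try:
        draft = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    try:
        validate(instance=draft, schema=SPEC_SCHEMA)
    except ValidationError as e:
        return [e.message]
    return []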
Gold pairs (1 of 80 shown)
title: feed loads slow for users with >200 saved items
body: Customers on the high-volume saved list (top 5%) report the feed taking 4-7 seconds to render. Telemetry confirms: p95 feed-render at 3.8s for users above 200 saved items, vs 480ms for everyone else. We've seen it in support tickets twice this week. Likely the N+1 query in feed.py:118 — we know about it but never prioritized it. Should we?
{
"problem": "Customers with >200 saved items see p95 feed-render at 3.8s vs 480ms baseline. Affects ~5% of active users; surfaced in 2 support tickets this week. Source is the N+1 query at feed.py:118.",
"scope": {
"in": [
"Eliminate the N+1 query at feed.py:118 (use prefetch_related)",
"Add p95-by-saved-count chart to the feed perf dashboard"
],
"out": [
"Pagination redesign (separate work)",
"Caching layer (consider after this fix lands)"
]
},
"success_criteria": [
"p95 feed-render reduces below 800ms for users with 200+ saved items",
"DB query count for the feed endpoint drops below 12 per render (currently scales with saved count)",
"Zero new feed-perf support tickets in the 14 days after release"
],
"open_questions": [
"Do we backfill prefetch hints into the existing serializer, or write a new one?",
"What's the rollout plan — flag-gated for high-volume users first?"
]
}
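Each gold pair ships as one line of pairs.jsonl. The exact field names kolm expects aren't shown in this recipe, so treat "input" and "output" below as assumptions; the point is that the raw issue text and the target JSON travel together as a single record:

import json

# Hypothetical record shape; the "input"/"output" keys are assumed, not kolm-documented.
pair = {
    "input": (
        "title: feed loads slow for users with >200 saved items\n"
        "body: Customers on the high-volume saved list (top 5%) report ..."
    ),
    "output": {
        "problem": "Customers with >200 saved items see p95 feed-render at 3.8s ...",
        "scope": {"in": ["..."], "out": ["..."]},
        "success_criteria": ["...", "..."],
        "open_questions": ["..."],
    },
}

with open("pairs.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")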
Compile
kolm compile "github issue to 4-section feature spec" \
  --base qwen2.5-7b-instruct \
  --pairs pairs.jsonl \
  --style-corpus prior-specs/ \
  --verifier success-criterion-testable \
  --k-floor 0.80 \
  --output feature-spec.kolm

ok  wrote feature-spec.kolm  k_score=0.84  signature=hmac-sha256
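The compile log reports signature=hmac-sha256. The recipe doesn't say what the signature covers or who holds the key, but if it's a plain HMAC-SHA256 over the artifact bytes, an integrity check before deploy could look like this (the KOLM_SIGNING_KEY variable and the byte scope are assumptions):

import hashlib
import hmac
import os

def artifact_ok(path: str, expected_hex: str) -> bool:
    # Assumes the signature is HMAC-SHA256 over the raw file bytes,
    # keyed by a secret you hold; kolm's real scheme may differ.
    key = os.environ["KOLM_SIGNING_KEY"].encode()  # hypothetical env var
    with open(path, "rb") as f:
        digest = hmac.new(key, f.read(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, expected_hex)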
K-score gate
The "testable verb" check is the hardest gate — it kills outputs like "feed should be faster" (no measurable threshold) and forces "p95 feed-render reduces below 800ms" (concrete, observable, falsifiable).
Run-time profile
Deploy
# github action: runs on every issue labeled `needs-spec`
on:
  issues:
    types: [labeled]

jobs:
  draft-spec:
    if: github.event.label.name == 'needs-spec'
    runs-on: ubuntu-latest
    permissions:
      issues: write  # lets `gh issue comment` post with the default token
    env:
      GH_TOKEN: ${{ github.token }}
      GH_REPO: ${{ github.repository }}
      ISSUE_BODY: ${{ github.event.issue.body }}
      NUM: ${{ github.event.issue.number }}
    steps:
      # assumes kolm and feature-spec.kolm are already present on the runner
      - run: |
          spec=$(kolm run feature-spec.kolm --input "$ISSUE_BODY")
          gh issue comment "$NUM" --body "$spec"
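Before wiring the action, smoke-test locally through the same entrypoint: feed a saved issue body to kolm run feature-spec.kolm --input and run the result through the schema check above.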