The problem.
A health insurer wants a PHI redactor. A logistics company wants an invoice extractor. A SaaS support team wants a ticket classifier. All three teams come to us with the same thing: a handful of real examples in a Google Doc, maybe a CSV with 8 rows, and the very reasonable question, "how do I train on this?"
The bad answer in 2025 is "paste your data into ChatGPT and ask for variations." That leaks the data, and any K-score you compute on the resulting set is a story about ChatGPT, not about your task.
The other bad answer is "use a synthetic data generator that calls a frontier model." Same problem in a fancier wrapper. The seeds leave the machine, the K-score on the synthetic data is not the K-score on real data, and the team that owns the compliance binder cannot sign the artifact.
The constraint set we landed on for kolm seeds:
- No hallucination from thin air. Every strategy needs real seeds as input. The only exception is the public seed datasets shipped with kolm, which are openly licensed and tagged as such.
- No fake K-score. We never emit a K-score. K-score is computed by the train loop, with provenance, so a templated row is never counted as if it were a user-authored row.
- No third-party API. The only network egress allowed is `--strategy local-llm`, which is gated to localhost and prompts the user for explicit confirmation.
- Local-first. Templates, public datasets, and mutation rules ship on disk. The verb works on an airgapped laptop.
- Deterministic. Same `--from` + same `--strategy` + same `--seed-rng` produces byte-identical output.
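The determinism rule is mechanical: one integer seeds the RNG and every random choice flows from it. A minimal sketch of the contract, using a mulberry32-style PRNG as a stand-in (the RNG kolm actually uses is not specified here):

```javascript
// mulberry32: a tiny deterministic PRNG seeded by one 32-bit integer.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seeds + same strategy + same seed-rng integer => identical bytes.
function expand(seedRows, seedRng, count) {
  const rng = mulberry32(seedRng);
  const rows = [];
  for (let i = 0; i < count; i++) {
    // Every random choice is drawn from the seeded stream, never Math.random.
    const pick = Math.floor(rng() * seedRows.length);
    rows.push({ from_seed: pick, variant: Math.floor(rng() * 1000) });
  }
  return JSON.stringify(rows);
}
```

Running `expand` twice with the same arguments yields the same string, so the output can be diffed byte for byte.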
What we do.
Templated mutation. Rule-based, deterministic. The mutators rotate punctuation, casing, sentence order, and minor connectives. They never invent a fact. The result is a stylistic variation of a real seed, used to test the model's robustness, not to ground-truth a new claim.
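A sketch of what such a mutator can look like. The four rules below are hypothetical stand-ins for kolm's actual rule set, but they share the key property that no rule can invent a fact:

```javascript
// Each mutator changes surface form only: casing, punctuation, connectives.
const mutators = [
  (s) => s.toLowerCase(),
  (s) => s.toUpperCase(),
  (s) => s.replace(/\.$/, '!'),                // rotate terminal punctuation
  (s) => s.replace(/\bhowever\b/g, 'though'),  // swap a minor connective
];

// Deterministic: the variant index, not a random draw, picks the rule.
function templatedVariant(input, variantIndex) {
  return mutators[variantIndex % mutators.length](input);
}
```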
PII-aware mutation. For redaction tasks, we permute the identifier itself. The MRN becomes a different 7-digit number; the SSN uses 9xx area codes that the SSA has reserved as never-issued; the phone uses 555 prefix (the Hollywood-fiction range); names come from a 100-entry dictionary shipped with kolm. The output (the redacted form) is preserved exactly. The model learns the redaction shape on a synthetic identifier without ever seeing a real one.
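The reserved ranges make the permutation safe by construction. A sketch, with a three-entry stand-in for the shipped name dictionary and the caller supplying a seeded `rng` function:

```javascript
const NAMES = ['Bailey Brown', 'Casey Clark', 'Devon Diaz']; // stand-in dictionary

// SSNs with a 9xx area number were reserved by the SSA and never issued.
function fakeSSN(rng) {
  const area = 900 + Math.floor(rng() * 100);
  const group = String(1 + Math.floor(rng() * 99)).padStart(2, '0');
  const serial = String(1 + Math.floor(rng() * 9999)).padStart(4, '0');
  return `${area}-${group}-${serial}`;
}

// 555-01xx is the fictional telephone range.
function fakePhone(rng) {
  return `555-01${String(Math.floor(rng() * 100)).padStart(2, '0')}`;
}

function fakeName(rng) {
  return NAMES[Math.floor(rng() * NAMES.length)];
}
```

Every identifier a trained model ever sees from this path is invalid by definition, so a leak of the training set leaks nothing real.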
Classifier mutation. Synonym substitution from a small hand-built dictionary, sentence-order swap when there are two clauses, case variation. The class label is preserved verbatim because that is the ground-truth signal.
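In sketch form, with a hypothetical three-word synonym table; note that the label is copied through untouched:

```javascript
const SYNONYMS = { refund: 'reimbursement', broken: 'defective', urgent: 'time-sensitive' };

function classifyMutate(row) {
  const input = row.input.replace(
    /\b(refund|broken|urgent)\b/g,
    (word) => SYNONYMS[word]
  );
  // The class label is the ground-truth signal: preserved verbatim.
  return { input, output: row.output };
}
```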
Extractor mutation. Field-order permutation in the input, JSON key-rotation in the output. Same fields, different surface form. The model learns the extraction is independent of the order the source text presented the fields.
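Key rotation is the simplest way to vary the JSON surface form without touching the values. A sketch, assuming the extractor's output is a flat object:

```javascript
// Rotate the key order by `offset`: same fields, same values, new surface form.
function rotateKeys(obj, offset) {
  const keys = Object.keys(obj);
  const shift = offset % keys.length;
  const rotated = keys.slice(shift).concat(keys.slice(0, shift));
  return Object.fromEntries(rotated.map((k) => [k, obj[k]]));
}
```

Serializing the rotated object with `JSON.stringify` reflects the new insertion order, so the output row looks different on disk while remaining semantically identical.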
Public seed datasets. Where a public-domain dataset already exists (NIST Safe Harbor examples, the Enron emails for classification, our own synthetic invoice rows marked as such), we ship a small starter set with kolm. Bootstrap with one verb, then add your own real examples on top.
Optional local LLM. If you have Ollama running on localhost:11434 or llama.cpp on localhost:8080, you can call it via --strategy local-llm. The URL is validated to be localhost (one of localhost, 127.0.0.1, ::1); any other hostname is refused at parse time. The first invocation prompts the user for confirmation. No data leaves the machine.
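The gate is a plain hostname allowlist checked before any request is built. A sketch of the shape (kolm's actual parser may differ):

```javascript
// WHATWG URL serializes an IPv6 hostname with brackets, hence '[::1]'.
const ALLOWED_HOSTS = new Set(['localhost', '127.0.0.1', '[::1]']);

function validateLocalUrl(raw) {
  const url = new URL(raw); // throws on malformed input
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    // Refused at parse time, before any socket is opened.
    throw new Error(`refusing non-localhost endpoint: ${url.hostname}`);
  }
  return url;
}
```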
What we do not do.
The list is short and load-bearing.
- We do not call OpenAI, Anthropic, Google, or any third-party API. There is no code path in `kolm seeds` that opens a network connection to a non-localhost endpoint. The CLI ships without API keys for these providers; adding them yourself does not change the behavior.
- We do not generate a K-score from purely synthetic data. `kolm seeds` emits input/output pairs only. The K-score is computed downstream by the train loop, which reads the per-row provenance tag and reports the score broken out by source.
- We do not hallucinate data from thin air. Every strategy that emits rows requires real seeds as input. The only path to rows-with-no-seeds is `kolm seeds bootstrap`, which copies a public-domain dataset shipped with kolm, and the dataset's header notes its provenance explicitly.
- We do not silently fall back to a less-private path. If the local-LLM endpoint is unreachable, we error with a useful hint. We do not switch to a remote endpoint or to a templated mutation without telling the user.
Provenance and the K-score split.
Every row written by kolm seeds generate carries two fields the train loop reads.
```jsonc
// a seed row, written verbatim from the user's input file:
{"input": "patient John Doe MRN 1234567 visited on 2024-03-12",
 "output": "patient [REDACTED] MRN [REDACTED] visited on [DATE]",
 "source": "seed",       // from the user's seeds.jsonl directly
 "from_seed": 0}         // 0-indexed pointer into seeds.jsonl

// a templated mutation derived from seed 0:
{"input": "Patient Bailey Brown, MRN 4291807, visited on 02/14/2023.",
 "output": "patient [redacted], mrn [redacted], visited on [date].",
 "source": "templated",  // produced by mutatePIIRow()
 "from_seed": 0}         // derived from seed 0
```
The train loop reads the source field and reports two K-scores: K_seed on the rows tagged source: "seed" (your real data), and K_templated on the rows tagged source: "templated" (the robustness test set). The artifact's compliance binder reports both. The gate is set on K_seed, not on a blend, because that is the score about your task.
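Illustrated with plain accuracy standing in for the K-score (the real metric belongs to kolm's train loop and is not shown here), the split is just a group-by on the provenance tag:

```javascript
// Score rows per provenance tag: one bucket per `source` value.
function kScoreBySource(rows, predict) {
  const buckets = {};
  for (const row of rows) {
    const b = (buckets[row.source] ??= { hit: 0, total: 0 });
    b.total += 1;
    if (predict(row.input) === row.output) b.hit += 1;
  }
  // One score per provenance tag: K_seed, K_templated, ...
  return Object.fromEntries(
    Object.entries(buckets).map(([source, b]) => [source, b.hit / b.total])
  );
}
```

A templated row can never inflate the seed bucket: its tag routes it to its own score.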
A team shipping with 5 user seeds and 195 templated rows might see K_seed = 0.94, K_templated = 0.91. A team that only trained on templated mutations of one or two seeds will see K_seed on a tiny holdout and the binder will say so. The numbers tell the truth.
The five strategies.
The table below is the full set. Each strategy is deterministic given a seed file and a --seed-rng integer.
| strategy | best for | what it does | provenance tag |
|---|---|---|---|
| templated | generic | rotates punctuation, casing, connectives. no semantic invention. | templated |
| redact-pii-templated | PHI, PII, GDPR redactors | permutes identifier numbers (within reserved ranges), names from a 100-entry dictionary, dates in 5 formats. | templated |
| classify-mutate | ticket and email classifiers | synonym substitution from a small dictionary, sentence-order swap, case variation. label preserved. | templated |
| extract-permute | invoice and form extractors | input field-order swap, JSON key rotation in output. same fields, different surface form. | templated |
| local-llm | users who already run ollama or llama.cpp locally | POSTs to localhost (validated). user must confirm. seeds never leave the machine. | local-llm |
None of these are large-language-model-grade variation. They are deterministic, rule-based, and small. The point is to expand a handful of real examples into a few hundred rows the train loop can use to estimate robustness, while keeping every row traceable to a user seed.
Recipe: three commands.
The full flow from "I have 8 examples in a CSV" to "I have a .kolm artifact" is three CLI calls. None of them touch the network unless you opt in to --strategy local-llm.
```console
# 1. scaffold a seeds.jsonl with starter rows for your task type.
#    edit it to replace placeholders with your real examples.
$ kolm seeds new phi-redactor
wrote 5 starter rows to ./seeds.jsonl

# 2. expand to 200 rows via rule-based mutation. deterministic.
$ kolm seeds generate --from seeds.jsonl --count 200 \
    --strategy redact-pii-templated --seed-rng 42
generating 200 rows from 8 seeds using strategy 'redact-pii-templated'...
  composition: 8 user seeds + 192 templated mutations + 0 hallucinated
  provenance tags written: yes (every row has source + from_seed)
  determinism: --seed-rng 42 (reproducible)

# 3. train. the loop reads provenance and reports K-score per source.
$ kolm train --spec phi-redactor.spec.json \
    --seeds ~/.kolm/seeds/expanded-1234-42.jsonl
K_seed      = 0.94  (8 rows)
K_templated = 0.91  (192 rows)
gate 0.85 passed; binder written to ./hipaa-binder.html
```
The artifact's compliance binder records both numbers, the seed count, the strategy, and the --seed-rng integer. Anyone reproducing the build can re-run the three commands and confirm the bytes match.
Honest limits.
Three places this story breaks, named on purpose.
Templated mutation does not substitute for real data. If you start with 3 real examples, the train loop has 3 real examples to score against. Templated rows test the model's tolerance for surface variation; they do not introduce new semantic cases. A team shipping a redactor that has never seen a real lab report should be told its K_seed is a small-sample estimate. The binder says so.
Some tasks have no public seed dataset. A construction-claims classifier or a proprietary supply-chain extractor is not a public dataset problem. For these, kolm seeds bootstrap has nothing to offer; the user has to bring their own seeds, full stop. The verb refuses with a useful error rather than fabricating.
Local-LLM strategy is only as private as your local LLM. If the user points --local-llm-url at http://localhost:8000 and that endpoint is a reverse proxy to a frontier model, the privacy guarantee is broken. We validate the URL hostname is localhost; we cannot validate the daemon behind that port. The user is responsible for what their localhost is.
The honest path is small. Five constraints, four rule-based strategies, one optional opt-in. The point is not to generate impressive numbers; the point is that the numbers the binder reports are the numbers the auditor can trust.
The verb is part of kolm v11. The full grammar is documented at kolm seeds --help. The CLI source is on the public mirror at cli/kolm.js. The public-domain phi-redactor seed dataset is in data/public-seeds/.