research . data . 10 min read

Document ingestion.

The buyer's data is in PDFs, EMLs, mailboxes, HTML exports, and CSVs. The training pipeline needs uniform JSONL with prompt/response pairs, deduplicated, redacted of PII, and tagged with a task mode. The conversion is unglamorous and decisive: a 20% bad ingest means a 20% worse adapter, no matter how good the trainer is downstream. kolm runs a single multi-format pipeline that turns the buyer's raw documents into training pairs without a retriever, an embedding store, or a vector database.

May 14, 2026 · Kolmogorov research · apps/data/ingest.py

The RAG replacement story

The conventional architecture for "ask questions about the buyer's documents" is retrieval-augmented generation: embed every chunk into a vector store, run the user's query through the same embedder, retrieve the top-k chunks, and stuff them into the model's context window at inference time. RAG works. It also requires a vector database, an embedder, an embedding refresh job whenever the documents change, a retrieval-quality eval, and a per-query latency cost that grows with k.

The alternative kolm pursues is to train the document content into a LoRA adapter, gate the result against a held-out QA eval set, and serve the adapter without retrieval at inference. The buyer pays the training cost once (and re-pays it when the documents change) instead of paying retrieval latency forever. For a fixed and slowly-changing document corpus (compliance manuals, internal policies, product specs, dictation templates), the math favors training. For a high-churn corpus or one where the buyer needs precise citation back to a document id, retrieval is the right tool; kolm does not block it.

Paragraph-aware chunking with overlap

The dumbest chunker (fixed character window with a sliding stride) corrupts every paragraph that crosses a chunk boundary and produces training pairs where the response references content that is no longer in the prompt. The fix is small and decisive: split on paragraph breaks first, then pack paragraphs into chunks up to a target token count, with an overlap of N tokens between adjacent chunks. The overlap is enough to recover most cross-boundary references.

import re

def _approx_token_count(text: str) -> int:
    # ~4 characters per token; close enough for chunking (see note below).
    return len(text) // 4

def _tail_tokens(text: str, n_tokens: int) -> str:
    # Approximate tail: the last ~n_tokens worth of characters.
    return text[-(n_tokens * 4):] if n_tokens > 0 else ""

def chunk_paragraphs(text: str, target_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for p in paragraphs:
        p_tokens = _approx_token_count(p)
        if current and current_tokens + p_tokens > target_tokens:
            chunks.append("\n\n".join(current))
            # Carry a tail of the finished chunk into the next one so that
            # cross-boundary references survive the split.
            tail = _tail_tokens("\n\n".join(current), overlap_tokens)
            current = [tail] if tail else []
            current_tokens = _approx_token_count(tail)
        current.append(p)
        current_tokens += p_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

The token counter is approximate (4 characters per token is close enough for chunking purposes); the trainer's actual tokenizer runs later. The overlap is calibrated per task mode: 64 tokens for QA, 32 for instruction, 128 for completion (longer because cross-boundary context matters more for autoregressive continuation).
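Expressed as configuration, the per-mode calibration is a lookup applied when the chunker is invoked (the constant and wrapper are illustrative names, not kolm's actual API):

OVERLAP_BY_MODE = {"qa": 64, "instruction": 32, "completion": 128}

def chunk_for_mode(text: str, mode: str) -> list[str]:
    return chunk_paragraphs(text, target_tokens=512,
                            overlap_tokens=OVERLAP_BY_MODE[mode])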

PII redaction

The ingest pipeline is the right layer to redact PII because it runs on raw documents before any model sees them. The redactor uses regex patterns for the structured fields (email, phone, Social Security number, account number, ICD-10 code with a leading patient name) and a token-class denylist for the unstructured ones (name + DOB combinations, signature blocks at the end of emails). Matched spans are replaced with stable placeholder tokens ([EMAIL_1], [PHONE_2]) so the model learns to ignore them rather than reproduce them.
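A minimal sketch of the stable-placeholder mechanism; the pattern set here is illustrative, not kolm's production list:

import re
from collections import defaultdict

PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, dict[str, int]]:
    placeholders: dict[str, dict[str, str]] = defaultdict(dict)
    hits: dict[str, int] = defaultdict(int)
    for label, pattern in PATTERNS.items():
        def replace(m: re.Match, label: str = label) -> str:
            value = m.group(0)
            # Stable placeholder: the same value maps to the same token
            # on every occurrence within the document.
            if value not in placeholders[label]:
                placeholders[label][value] = f"[{label}_{len(placeholders[label]) + 1}]"
            hits[label.lower()] += 1
            return placeholders[label][value]
        text = pattern.sub(replace, text)
    return text, dict(hits)

The per-type hit counts feed the hit-density flag described next.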

The structured-regex layer catches roughly 95% of PII; the residual 5% (free-text names, addresses inside narrative paragraphs) needs an entity recognizer or a human-in-the-loop pass. kolm reports the regex-hit count per document and flags documents with hit-density above a threshold for the buyer's review. The redactor is conservative: a false positive (redacting a non-PII token) costs only some training signal; a false negative (leaking real PII into the training set) is the bug the pipeline must not have.

Three task modes

qa. Prompt: a synthetic question over the chunk (Self-Instruct-style, generated by a teacher model). Response: the relevant span from the chunk. Use when the buyer wants to ask questions and get cited answers.

instruction. Prompt: a task description plus the chunk as context. Response: the action the task description specifies (summarize, rewrite, classify). Use when the buyer wants the model to perform a task over their documents.

completion. Prompt: the first half of the chunk. Response: the second half of the chunk. Use when the buyer wants the model to write in the style of their documents.

kolm's default for an enterprise distill job is qa+instruction at a 3:1 mix, with completion added only when the buyer's brief explicitly names "style" or "voice" as the objective. The Self-Instruct technique (Wang et al., 2022) generates the QA prompts: a teacher model is shown the chunk and asked to produce five plausible questions, and each question is then paired with the answer it implies. Pairs that fail a simple round-trip check (the answer does not appear in the chunk) are dropped.
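A minimal version of the round-trip filter, assuming a hypothetical teacher interface that returns (question, answer) tuples:

def make_qa_pairs(chunk: str, teacher) -> list[dict]:
    pairs = []
    for question, answer in teacher.generate_questions(chunk, n=5):
        # Round-trip check: keep the pair only if the answer is literally
        # present in the chunk, so every response is a grounded span.
        if answer.strip().lower() in chunk.lower():
            pairs.append({"prompt": question, "response": answer.strip()})
    return pairs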

Per-format readers

Each format has its own reader that returns plain text; everything downstream is format-agnostic. The four hot-path readers: PDF (PyMuPDF, with a pdfplumber fallback; see the edge cases below), EML and mailbox exports, HTML (BeautifulSoup, with an optional content extractor), and CSV. A dispatch sketch follows after the next paragraph.

Three more formats are wired but rare: DOCX via python-docx, TXT via direct read, JSONL via line-by-line parse with explicit schema validation.
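A sketch of the dispatch contract, with minimal stand-in readers for three of the formats (kolm's actual readers are not shown; the CSV row flattening here is an assumption):

import csv
from email import policy
from email.parser import BytesParser
from pathlib import Path

def _read_txt(path: Path) -> str:
    return path.read_text(errors="replace")

def _read_eml(path: Path) -> str:
    with path.open("rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body else ""

def _read_csv(path: Path) -> str:
    with path.open(newline="") as f:
        return "\n\n".join(", ".join(row) for row in csv.reader(f))

READERS = {".txt": _read_txt, ".eml": _read_eml, ".csv": _read_csv}

def read_document(path: Path) -> str:
    # Every reader returns plain text; downstream stages never see the format.
    return READERS[path.suffix.lower()](path)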

Deduplication

Two passes. The first is exact-hash deduplication: chunks with identical SHA-256 are collapsed to one. The second is near-duplicate detection via MinHash + locality-sensitive hashing with a Jaccard threshold of 0.85 on shingled 5-grams. Both passes run before mode-tagging because a duplicate chunk in qa mode produces a duplicate prompt-answer pair and inflates the training set with no new signal.
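One way to implement both passes, using the datasketch library for MinHash + LSH (an implementation assumption; kolm's code is not shown):

import hashlib
from datasketch import MinHash, MinHashLSH

def _shingles(text: str, n: int = 5) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def dedup(chunks: list[str], threshold: float = 0.85) -> list[str]:
    # Pass 1: exact SHA-256 collapse.
    seen, unique = set(), []
    for c in chunks:
        h = hashlib.sha256(c.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(c)
    # Pass 2: near-duplicates via MinHash + LSH at Jaccard ~0.85.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, c in enumerate(unique):
        m = MinHash(num_perm=128)
        for s in _shingles(c):
            m.update(s.encode("utf-8"))
        if not lsh.query(m):  # nothing similar kept yet
            lsh.insert(str(i), m)
            kept.append(c)
    return kept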

The dedup statistics ship in the receipt: n_documents, n_chunks_pre_dedup, n_chunks_post_dedup, dedup_ratio. A high dedup ratio is a flag (the buyer may have ingested the same set of templates twice); a low one is normal.

Edge cases worth naming

Scanned PDFs. PyMuPDF returns empty strings for scanned PDFs. The reader falls back to pdfplumber, which is more aggressive about pulling text from form fields and tagged structure but still fails on actual image scans. The OCR fallback exists; the buyer must opt in because Tesseract takes roughly 60 seconds on a 200-page contract, vs sub-second for a text-layer PDF.
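A sketch of the fallback chain, OCR branch omitted:

import fitz  # PyMuPDF
import pdfplumber

def read_pdf(path: str) -> str:
    # Fast path: the text layer via PyMuPDF. Empty output usually means a scan.
    with fitz.open(path) as doc:
        text = "\n\n".join(page.get_text() for page in doc)
    if text.strip():
        return text
    # Fallback: pdfplumber pulls harder on form fields and tagged structure.
    with pdfplumber.open(path) as pdf:
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)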

The signature block at the end of every email. A 50,000-message mailbox where every message ends with the same 6-line signature contributes 50,000 copies of that signature to the training set. The signature-block detector looks for a delimiter (-- or three or more newlines) followed by a recognized name pattern; everything after it is stripped. Stripping is opt-in for the training set (the buyer may want signatures kept if the model needs to learn to sign emails) but on by default in the redaction pipeline (the signature contains PII).
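A sketch of the detector; the name pattern here is a crude stand-in for the real matcher:

import re

SIG_DELIMITER = re.compile(r"(?m)^--\s*$|\n{3,}")
NAME_LINE = re.compile(r"^[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+")  # crude "First Last"

def strip_signature(body: str) -> str:
    m = SIG_DELIMITER.search(body)
    if m and NAME_LINE.match(body[m.end():].lstrip()):
        return body[:m.start()].rstrip()
    return body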

HTML with heavy boilerplate. Most HTML pages are 80% navigation, footer, sidebar, and 20% content. BeautifulSoup naively returns all of it. The reader includes a heuristic content extractor (Mozilla's Readability port) that the buyer can enable; without it, the chunker tends to fill chunks with menu items. The heuristic is not always right; the receipt records which extractor was used so a reviewer can spot-check.
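A sketch of the opt-in extraction path, using the readability-lxml package as the Readability port (the package choice is an assumption):

from bs4 import BeautifulSoup
from readability import Document  # readability-lxml

def read_html(html: str, extract_content: bool = False) -> str:
    if extract_content:
        # Keep only the main-content subtree; drops nav, footer, sidebar.
        html = Document(html).summary()
    return BeautifulSoup(html, "html.parser").get_text("\n")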

Encoding. Real-world PDFs and emails carry mixed encodings that crash naive UTF-8 readers. Every reader runs chardet on the raw bytes and decodes with errors="replace" as a last resort. Documents where the replacement rate exceeds 1% are flagged in the receipt.
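A sketch of the decode path and the replacement-rate flag:

import chardet

def decode_document(raw: bytes) -> tuple[str, float]:
    guess = chardet.detect(raw)
    try:
        return raw.decode(guess["encoding"] or "utf-8"), 0.0
    except (UnicodeDecodeError, LookupError):
        text = raw.decode("utf-8", errors="replace")
        # U+FFFD density drives the >1% receipt flag.
        return text, text.count("\ufffd") / max(len(text), 1)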

What the receipt records

"ingest": {
  "method": "multi_format_paragraph_chunk_with_overlap",
  "n_documents": 1820,
  "n_chunks_pre_dedup": 41612,
  "n_chunks_post_dedup": 38904,
  "dedup_ratio": 0.065,
  "redaction_hits": {
    "email": 1284,
    "phone": 712,
    "ssn": 31,
    "signature_block": 14076
  },
  "modes": {"qa": 29178, "instruction": 9726, "completion": 0},
  "target_tokens": 512,
  "overlap_tokens": 64,
  "papers": [
    "arXiv:2212.10560",
    "arXiv:2009.03300"
  ]
}

The canonical-JSON manifest hash covers the block, so a tampered receipt invalidates the artifact signature. A reviewer can check the dedup ratio, the redaction-hit distribution, and the mode mix against the buyer's brief.
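Canonical JSON here means the usual normalization (sorted keys, minimal separators) before hashing; a reviewer could recompute it along these lines, though the exact canonicalization kolm uses is an assumption:

import hashlib
import json

def canonical_hash(block: dict) -> str:
    canon = json.dumps(block, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()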

Where ingest sits in the kolm compile loop

Ingest is the first stage of every training run that starts from documents instead of captured API traffic. The pipeline is documents → ingest → training pairs (JSONL) → SFT → (optional preference) → K-score gate → sign → ship. The captured-traffic path is shorter (no ingest needed); the document path is the one enterprises pick when they want to train on their corpus rather than on their live traffic. The same K-score gate, the same audit log, the same receipt format on the way out.

Citations

Wang, Y. et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560, 2022. The QA-pair generation pattern.

Hendrycks, D. et al. Measuring Massive Multitask Language Understanding. arXiv:2009.03300, 2020. MMLU, the canonical knowledge eval that ingest pipelines should at least match on shape.

Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, 2020. The architecture this pipeline is the alternative to.

Borgeaud, S. et al. Improving language models by retrieving from trillions of tokens. arXiv:2112.04426, 2022. RETRO.