kolm / the frontier stack
A category-by-category inventory of every frontier ML technique kolm runs in production. Each one is wired into the compiler or runtime, auto-gated on the hardware capability that justifies it, and recorded in the signed receipt of every run that uses it. No flags to flip, no plumbing to write.
Training (19) · Decoding (6) · Serving (5) · Evaluation (6) · Data (2) · Operational (4)
Group-relative policy optimization (GRPO) with verifiable rewards. DeepSeek R1 pattern, no value head. Wired to the K-score evaluator.
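A minimal sketch of the group-relative advantage, assuming a group of sampled completions per prompt and a verifiable 0/1 reward; names here are illustrative, not the kolm API:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [prompts, group_size] verifiable rewards per sampled completion.
    # No value head: each completion is scored against its own group's
    # mean and std, DeepSeek-R1 style.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# 2 prompts x 4 completions; advantages are zero-centered within each group.
adv = grpo_advantages(torch.tensor([[1., 0., 0., 1.], [0., 0., 1., 0.]]))
```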
Five-method preference router. ORPO when there's SFT data and tight VRAM; SimPO for preference pairs with no reference model.
Shifting reference policy each round; 4 judges (learned RM, LLM judge, self-rewarding, verifiable). apps/trainer/online_dpo.py.
Token-level KL forward / reverse / JSD plus α-CE. Top-k logit pruning, on-policy option. MiniLLM lineage.
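A sketch of the token-level objectives under top-k pruning, assuming logits shaped [batch, seq, vocab]; the function name is illustrative:

```python
import math
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets,
                 kind="forward", alpha_ce=0.1, top_k=64):
    # Top-k logit pruning: keep only the teacher's top-k vocabulary slots
    # and renormalize both distributions over that set.
    idx = teacher_logits.topk(top_k, dim=-1).indices
    t = F.log_softmax(teacher_logits.gather(-1, idx), dim=-1)
    s = F.log_softmax(student_logits.gather(-1, idx), dim=-1)
    if kind == "forward":    # KL(teacher || student), mass-covering
        kl = F.kl_div(s, t, log_target=True, reduction="batchmean")
    elif kind == "reverse":  # KL(student || teacher), mode-seeking (MiniLLM)
        kl = F.kl_div(t, s, log_target=True, reduction="batchmean")
    else:                    # JSD: symmetric, via the mixture M = (P+Q)/2
        m = torch.logsumexp(torch.stack([s, t]), dim=0) - math.log(2)
        kl = 0.5 * (F.kl_div(m, t, log_target=True, reduction="batchmean") +
                    F.kl_div(m, s, log_target=True, reduction="batchmean"))
    # α-CE: blend in plain cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)
    return kl + alpha_ce * ce
```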
Top-1 (Switch) or top-k (Mixtral) router, z-loss + load-balance aux. Expert CIDs in the receipt.
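The two auxiliary terms, sketched for a generic router; shapes and names are assumptions:

```python
import torch
import torch.nn.functional as F

def router_aux(router_logits: torch.Tensor, top_k: int = 2):
    # router_logits: [tokens, experts]
    n_exp = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    topi = probs.topk(top_k, dim=-1).indices
    # Load-balance aux (Switch / Mixtral): fraction of tokens dispatched to
    # each expert times that expert's mean router probability.
    dispatch = F.one_hot(topi, n_exp).float().sum(dim=1)  # [tokens, experts]
    load_balance = n_exp * ((dispatch.mean(0) / top_k) * probs.mean(0)).sum()
    # z-loss: penalizes large logsumexp of the router logits for stability.
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return load_balance, z_loss
```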
Composable adapter variants. --quality enables all four; the optimizer splits A/B groups when LoRA+ is on.
Embedding-space noise during SFT. Auto-applied with the quality preset; receipt records the noise magnitude.
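The noise rule, per the standard NEFTune scaling; alpha is the magnitude the receipt records:

```python
import math
import torch

def neftune(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # embeds: [batch, seq_len, hidden]. Uniform noise scaled by
    # alpha / sqrt(seq_len * hidden), applied to inputs during SFT only.
    _, seq_len, hidden = embeds.shape
    noise = torch.empty_like(embeds).uniform_(-1, 1)
    return embeds + noise * (alpha / math.sqrt(seq_len * hidden))
```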
bitsandbytes NF4 + double-quant on the base model during adapter training. Lets 14B fit a 24GB consumer card.
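The equivalent hand-rolled setup in HF Transformers; the model name is an example:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 weights
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# 4-bit frozen base; LoRA adapters train in bf16 on top.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B", quantization_config=bnb_cfg, device_map="auto")
```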
bitsandbytes paged optimizer states. 14B SFT on a single A100-80GB. Auto-on when CUDA is detected.
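Under the hood this is one line of bitsandbytes; the learning rate is illustrative:

```python
import bitsandbytes as bnb

# Optimizer states live in paged memory and spill to CPU under pressure
# instead of OOMing the GPU. 'model' is any torch.nn.Module, e.g. the
# QLoRA snippet above.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)
```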
Optional kernel path for ≤14B SFT. Same adapter shape on disk, smaller VRAM envelope.
Patches RMSNorm, SwiGLU, RoPE, LM head for Qwen / Llama / Gemma / Phi. ~1.6–2.0x faster training on Ampere+.
_attn_impl_for() picks fa3 on Hopper / Blackwell, fa2 on Ampere, sdpa elsewhere. No user knob.
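A sketch of the capability gate, not the actual function body:

```python
import torch

def _attn_impl_for() -> str:
    if not torch.cuda.is_available():
        return "sdpa"
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:      # Hopper (sm_90) and Blackwell (sm_120)
        return "fa3"
    if major >= 8:      # Ampere / Ada
        return "fa2"
    return "sdpa"       # portable PyTorch fallback
```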
Auto-enabled on sm_120 (Blackwell, RTX 50-series) when torch ≥ 2.8 and cuBLASLt 12.9 are present. bf16 fallback.
TransformerEngine fp8 autocast on Hopper. Receipt records the recipe (delayed scaling, hybrid).
Optional pre-SFT mixed-denoising pass. R-denoising, S-denoising, X-denoising at calibrated ratios. apps/trainer/span_objective.py.
Whisper-distill + speech-LLM pairing. Captions in, transcript-receipt out. Reuses the same compile pipeline.
FSDP2 + Megatron-style TP for multi-GPU SFT. Auto-on for >14B. Receipt records the topology.
Activations recomputed on the backward pass; trades ~1.3x step time for ~4x less activation memory. Auto-on past a per-arch threshold.
Step-N + RNG-state + optimizer-state snapshots. Pre-empted spot run resumes from the last checkpoint.
Target→draft pairings: Qwen 7B→1.5B, Llama 3B→1B, Gemma 12B→1B. Wired in apps/runtime/serve.py.
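In vLLM terms, assuming the 0.7-era keyword arguments; model names mirror the first pairing above:

```python
from vllm import LLM, SamplingParams

# Draft/target speculative decoding: the target model verifies the
# draft's proposed tokens in a single forward pass.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_model="Qwen/Qwen2.5-1.5B-Instruct",
    num_speculative_tokens=5,
)
out = llm.generate(["One-line summary of speculative decoding:"],
                   SamplingParams(max_tokens=48))
```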
Multi-head spec decoding. ~2.3–2.8x on H100. vLLM 0.7+ speculative_model.
Tree-verified spec decoding. ~3–4x on H100. Pairs with the same draft inventory as the pairings entry above.
Outlines + lm-format-enforcer logits processors. JSON-schema, regex, CFG, choice modes.
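The mechanics in miniature: a logits processor that masks every token the constraint disallows. The production path drives Outlines / lm-format-enforcer FSMs over the tokenizer; this toy shows only the simplest ("choice") case:

```python
import torch

class ChoiceLogitsProcessor:
    """Toy 'choice' mode: only tokens on an allowed list survive.
    Real grammar / JSON-schema modes advance an FSM per decoding step."""
    def __init__(self, allowed_token_ids: list[int]):
        self.allowed = torch.tensor(allowed_token_ids)

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor):
        masked = torch.full_like(scores, float("-inf"))
        masked[..., self.allowed] = scores[..., self.allowed]
        return masked  # sampling can now only pick allowed tokens
```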
Native render for Qwen / Llama / Hermes; union schema fallback. Every emission parses.
Auto-on for compute capability sm_90+. vLLM kv_cache_dtype="fp8". Half the KV bytes per token, ~2x context in the same VRAM.
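The whole switch, model name as an example:

```python
from vllm import LLM

# FP8 KV cache: half the bytes per cached token, so the same VRAM
# holds roughly twice the context.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", kv_cache_dtype="fp8")
```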
Continuous batching + paged KV. ~2,200 tok/s 7B INT8 on A100. Production default.
Prefix-cache sharing across requests. ~1.15x over vLLM on verifier-chain workloads.
FlashAttention-2 default, OpenAI-compatible shim. Easy if you're already on HF Hub.
Compiled engines + FP8 KV. ~3,000 tok/s 7B INT8 on H100. Fastest backend we ship.
DistServe / Mooncake pattern. INPROC single-node 2x-GPU or CROSSHOST with NVLink / RDMA / TCP. Receipt records the transport.
K = 0.40·A + 0.15·S + 0.15·L + 0.15·C + 0.15·V. Binding compile-time gate; no override flag.
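The arithmetic, verbatim; what A/S/L/C/V denote and the gate's cutoff are kolm internals not restated here, so treat the argument names and threshold as placeholders:

```python
def k_score(a: float, s: float, l: float, c: float, v: float) -> float:
    # Weights sum to 1.00; the A component dominates at 0.40.
    return 0.40 * a + 0.15 * s + 0.15 * l + 0.15 * c + 0.15 * v

THRESHOLD = 0.75  # placeholder; the real gate value is set at compile time
if k_score(0.9, 0.8, 0.7, 0.8, 1.0) < THRESHOLD:
    raise SystemExit("K-score gate failed: compile refused")
```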
1–10 rubric, configurable judge model. Reported alongside K-score; never replaces the gate. Budget-capped per run.
A vs B comparisons with position-bias correction. Bradley-Terry aggregation across the eval set.
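A minimal Bradley-Terry fit via the standard minorization-maximization update; position bias is handled upstream by judging each pair in both A/B orders before counting wins:

```python
def bradley_terry(wins: list[list[float]], iters: int = 200) -> list[float]:
    # wins[i][j] = number of times model i beat model j.
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            if den > 0:
                p[i] = w_i / den
        total = sum(p)
        p = [x * n / total for x in p]  # normalize for identifiability
    return p  # higher = stronger
```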
Vectara HHEM-2.1 grounding score + claim decomposition for RAG. Per-claim min as conservative aggregate.
MedQA, FinQA, LegalBench, GPQA, HumanEval, MT-Bench, AlpacaEval, Arena-Hard. Run alongside K-score.
Every receipt carries input_sha + output_sha + CID. Replay later from the receipt alone with byte-equality check.
Five-step chain over canonical-JSON manifest. CID cidv1:sha256 pins the exact bytes. Pure-Rust verifier ships in packages/runtime-rs/.
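The CID rule in miniature; the exact canonicalization lives in the versioned spec and the Rust verifier in packages/runtime-rs/ is authoritative, so take this as a sketch:

```python
import hashlib
import json

def cid_for(manifest: dict) -> str:
    # Canonical JSON: sorted keys, no whitespace, UTF-8 bytes.
    canon = json.dumps(manifest, sort_keys=True,
                       separators=(",", ":"), ensure_ascii=False)
    return "cidv1:sha256:" + hashlib.sha256(canon.encode("utf-8")).hexdigest()
```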
Nitro COSE_Sign1, SEV-SNP, TDX quote, GCP / Azure CVM, Docker. Quote travels with the artifact.
Run with egress=0; the harness blackholes the network at the syscall layer. Receipt records the syscall counter.
ZIP+manifest.json with the canonical-JSON→sha256 CID rule. Spec is versioned and frozen; no proprietary opcodes.
Each technique above links to the in-codebase implementation or the published research note. Receipts cite by name; the manifest carries the enabled set.