kolm  /  the frontier stack

Forty-two frontier techniques. All shipped. All in the receipt.

A category-by-category inventory of every frontier ML technique kolm runs in production. Each one is wired into the compiler or runtime, auto-gated on the hardware capability that justifies it, and recorded in the signed receipt of every run that uses it. No flags to flip, no plumbing to write.

19 Training · 6 Decoding · 5 Serving · 6 Evaluation · 2 Data · 4 Operational

Section 01

Training

19 techniques

01 GRPO (trainer)

Group-relative policy optimization with verifiable rewards. DeepSeek-R1 pattern: no value head, advantages normalized within each sampled group. Wired to the K-score evaluator.
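
A minimal sketch of the group-relative advantage step, assuming a (num_groups, group_size) reward tensor; kolm's actual trainer internals are not shown here:

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """rewards: (num_groups, group_size) verifiable rewards, one row per prompt.
        GRPO drops the value head: each completion's advantage is its reward
        normalized against the other samples drawn for the same prompt."""
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + eps)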

02 DPO · KTO · ORPO · SimPO · IPO (preference)

Five-method preference router. ORPO when SFT data is on hand and VRAM is tight; SimPO when you have preference pairs but no reference model.
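
A hypothetical reconstruction of the routing logic: the ORPO and SimPO branches follow the two rules stated above, the rest are illustrative defaults, not kolm's actual policy:

    def pick_preference_method(has_ref_model: bool, has_sft_data: bool,
                               vram_tight: bool, has_pairs: bool) -> str:
        """Illustrative router; function and flag names are assumptions."""
        if has_sft_data and vram_tight:
            return "orpo"   # reference-free, single-stage: fits tight VRAM budgets
        if has_pairs and not has_ref_model:
            return "simpo"  # reference-free, length-normalized implicit reward
        if has_pairs and has_ref_model:
            return "dpo"    # classic pairwise loss against a frozen reference
        return "kto"        # unpaired thumbs-up / thumbs-down signal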

03 Online DPO (trainer)

Reference policy refreshed each round; four judge backends (learned RM, LLM judge, self-rewarding, verifiable). apps/trainer/online_dpo.py.
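
One round, sketched; `policy.sample`, `judge.prefers`, and `dpo_step` are assumed names standing in for the real interfaces in apps/trainer/online_dpo.py:

    def online_dpo_round(policy, ref_policy, judge, prompts, dpo_step):
        """Sample two on-policy drafts per prompt, let the judge order them,
        then run a standard DPO update on the fresh pairs."""
        pairs = []
        for prompt in prompts:
            a, b = policy.sample(prompt), policy.sample(prompt)
            chosen, rejected = (a, b) if judge.prefers(a, b, prompt) else (b, a)
            pairs.append((prompt, chosen, rejected))
        dpo_step(policy, ref_policy, pairs)
        return policy  # the caller refreshes ref_policy from this each round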

04 Response distillation (trainer)

Token-level KL in forward, reverse, or JSD form, plus an α-weighted CE term. Top-k logit pruning and an on-policy option. MiniLLM lineage.
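
A sketch of the three divergence modes plus the CE term; the α mixing and reduction choices here are assumptions, not kolm's exact loss:

    import math
    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, mode="forward"):
        """Logits: (batch, seq, vocab); labels: (batch, seq)."""
        s = F.log_softmax(student_logits, dim=-1)
        t = F.log_softmax(teacher_logits, dim=-1)
        if mode == "forward":    # KL(teacher || student)
            kl = F.kl_div(s, t, log_target=True, reduction="batchmean")
        elif mode == "reverse":  # KL(student || teacher), mode-seeking (MiniLLM)
            kl = F.kl_div(t, s, log_target=True, reduction="batchmean")
        else:                    # JSD: both directions against the log-mixture
            m = torch.logaddexp(s, t) - math.log(2.0)
            kl = 0.5 * (F.kl_div(m, t, log_target=True, reduction="batchmean")
                        + F.kl_div(m, s, log_target=True, reduction="batchmean"))
        ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
        return alpha * ce + (1 - alpha) * kl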

05 MoE with LoRA experts (trainer)

Top-1 (Switch) or top-k (Mixtral) router, z-loss + load-balance aux. Expert CIDs in the receipt.
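
A sketch of the two auxiliary terms, following the standard Switch / ST-MoE formulations named above; tensor shapes are assumptions:

    import torch
    import torch.nn.functional as F

    def router_aux_losses(router_logits, top1_idx, num_experts):
        """router_logits: (tokens, num_experts); top1_idx: (tokens,) chosen experts."""
        # z-loss: penalizes large router logits to keep the softmax well-conditioned
        z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
        # load balance: fraction of tokens per expert x mean router prob per expert
        probs = F.softmax(router_logits, dim=-1)
        frac_tokens = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
        frac_probs = probs.mean(dim=0)
        balance = num_experts * (frac_tokens * frac_probs).sum()
        return z_loss, balance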

06 LoRA + DoRA + rsLoRA + LoRA+ (adapter)

Composable adapter variants. --quality enables all four; the optimizer splits the A and B matrices into separate learning-rate groups when LoRA+ is on.
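
A sketch of the LoRA+ split, assuming peft's lora_A / lora_B parameter naming; the 16x ratio is the LoRA+ paper's suggested default, not a kolm-confirmed value:

    def loraplus_param_groups(model, base_lr=2e-4, b_lr_ratio=16.0):
        """LoRA+: the B matrices train with a larger learning rate than A."""
        a_params, b_params = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            (b_params if "lora_B" in name else a_params).append(p)
        return [
            {"params": a_params, "lr": base_lr},
            {"params": b_params, "lr": base_lr * b_lr_ratio},
        ]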

07 NEFTune noise (adapter)

Embedding-space noise during SFT. Auto-applied with the quality preset; receipt records the noise magnitude.
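
A sketch of the noise injection, following the NEFTune paper's scaling; alpha=5.0 is an assumed default, and the grad-enabled check is a stand-in for a proper training-mode hook:

    import torch

    def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
        """embeds: (batch, seq_len, dim). Adds uniform noise scaled by
        alpha / sqrt(seq_len * dim), applied during SFT only."""
        if not torch.is_grad_enabled():  # skip at eval / generation time
            return embeds
        _, seq_len, dim = embeds.shape
        scale = alpha / (seq_len * dim) ** 0.5
        return embeds + torch.empty_like(embeds).uniform_(-scale, scale)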

08 QLoRA 4-bit NF4 (quant)

bitsandbytes NF4 plus double quantization on the base model during adapter training. Lets a 14B model fit on a 24 GB consumer card.
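
The standard transformers-side load for this configuration; the model name is a placeholder, and kolm wires the equivalent up internally:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 weights + double quantization, bf16 compute dtype.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-14B", quantization_config=bnb, device_map="auto"
    )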

09 Paged AdamW 8-bit (optimizer)

bitsandbytes paged optimizer states. 14B SFT on a single A100-80GB. Auto-on when CUDA is detected.
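
The drop-in swap, assuming a `model` already in scope and a placeholder learning rate:

    import bitsandbytes as bnb

    # Paged 8-bit AdamW: optimizer states live in paged memory and spill to
    # host RAM under pressure instead of OOM-ing. Drop-in for torch AdamW.
    optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)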

10 Unsloth fast-LoRA path (trainer)

Optional kernel path for ≤14B SFT. Same adapter shape on disk, smaller VRAM envelope.

11 Liger fused kernels (kernels)

Patches RMSNorm, SwiGLU, RoPE, and the LM head for Qwen / Llama / Gemma / Phi. ~1.6–2.0x training speedup on Ampere+.
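
How the patching looks for the Llama family (call it before instantiating the model; the exact kwarg set may vary across liger-kernel versions):

    from liger_kernel.transformers import apply_liger_kernel_to_llama

    # Monkey-patches the HF Llama modeling code in place. Analogous helpers
    # exist for the other supported families.
    apply_liger_kernel_to_llama(
        rope=True, rms_norm=True, swiglu=True, fused_linear_cross_entropy=True
    )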

12 FlashAttention 3 (attn)

_attn_impl_for() picks fa3 on Hopper / Blackwell, fa2 on Ampere, sdpa elsewhere. No user knob.
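
A reconstruction of that gate from the rules above; this is not kolm's source, just the capability check it implies:

    import torch

    def _attn_impl_for(device_index: int = 0) -> str:
        """Hopper is sm_90, Blackwell sm_100/sm_120, Ampere sm_80/sm_86."""
        if not torch.cuda.is_available():
            return "sdpa"
        major, _ = torch.cuda.get_device_capability(device_index)
        if major >= 9:   # Hopper and Blackwell
            return "fa3"
        if major >= 8:   # Ampere / Ada
            return "fa2"
        return "sdpa"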

13 NVFP4 mixed precision (quant)

Auto-enabled on sm_120 (Blackwell, RTX 50-series) when torch ≥ 2.8 and cuBLASLt 12.9 are present; falls back to bf16 otherwise.

14 FP8 mixed precision (quant)

TransformerEngine fp8 autocast on Hopper. Receipt records the recipe (delayed scaling, hybrid).
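
What that setup looks like with TransformerEngine directly; `model` and `batch` are placeholders, and the amax history length is an assumed value:

    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # Hybrid format (e4m3 forward, e5m2 backward) with delayed scaling;
    # this recipe is the kind of detail the receipt records.
    recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(batch)  # model built from te.Linear / te.TransformerLayer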

15 UL2 span corruption (objective)

Optional pre-SFT mixed-denoising pass. R-denoising, S-denoising, X-denoising at calibrated ratios. apps/trainer/span_objective.py.
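
For orientation, the mixture the UL2 paper describes; span lengths and corruption rates follow the paper, while the sample weights here are illustrative only (kolm's calibrated ratios live in apps/trainer/span_objective.py):

    DENOISERS = [
        # (name, mean_span_len, corruption_rate, sample_weight)
        ("R", 3,    0.15, 0.50),  # regular spans, T5-style
        ("S", None, 0.25, 0.25),  # sequential / prefix-LM denoising
        ("X", 32,   0.50, 0.25),  # extreme: long spans or heavy corruption
    ]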

16 Voice / speech SFT (modality)

Whisper-distill + speech-LLM pairing. Captions in, transcript-receipt out. Reuses the same compile pipeline.

17 FSDP2 + tensor parallel (scale)

FSDP2 + Megatron-style TP for multi-GPU SFT. Auto-on for >14B. Receipt records the topology.
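
The shape of that topology in torch terms; the 4x2 mesh sizing is an example, and the sharding calls are described in comments rather than shown:

    from torch.distributed.device_mesh import init_device_mesh

    # 2-D mesh: FSDP sharding on one axis, tensor parallel on the other
    # (8 GPUs = 4-way shard x 2-way TP here).
    mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))
    dp_mesh, tp_mesh = mesh["dp"], mesh["tp"]
    # kolm then applies fully_shard(...) over dp_mesh and parallelize_module(...)
    # over tp_mesh; the resulting topology is what the receipt records.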

18 Gradient checkpointing (memory)

Activations are recomputed on the backward pass, trading ~1.3x compute time for ~4x less activation memory. Auto-on past a per-arch size threshold.

19 Resumable checkpoints (scale)

Step-N, RNG-state, and optimizer-state snapshots. A pre-empted spot run resumes from the last checkpoint.
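
A sketch of what such a snapshot has to capture for the resumed run to pick up where it left off; the function and key names are assumptions:

    import torch

    def snapshot(step, model, optimizer, scheduler, path):
        """Weights plus everything needed to resume deterministically."""
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "cpu_rng": torch.get_rng_state(),
            "cuda_rng": torch.cuda.get_rng_state_all(),
        }, path)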

Section 02

Decoding

6 techniques

Section 03

Serving

5 techniques

Section 04

Evaluation

6 techniques

Section 05

Data

2 techniques

Section 06

Operational

4 techniques

Each technique above links to the in-codebase implementation or the published research note. Receipts cite by name; the manifest carries the enabled set.

Full reference at /research →