kolm  /  the frontier stack

Forty-two frontier techniques. All shipped. All in the receipt.

A category-by-category inventory of every frontier ML technique kolm runs in production. Each one is wired into the compiler or runtime, auto-gated on the hardware capability that justifies it, and recorded in the signed receipt of every run that uses it. No flags to flip, no plumbing to write.

19 Training · 6 Decoding · 5 Serving · 6 Evaluation · 2 Data · 4 Operational

Section 01

Training

19 techniques

01 GRPO (trainer)

Group-relative policy optimization with verifiable rewards. DeepSeek-R1 pattern: no value head, advantages normalized within each sampled group. Wired to the K-score evaluator.
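
A minimal sketch of the group-relative advantage step, assuming a (num_groups, group_size) reward tensor; kolm's actual trainer internals are not shown here:

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """rewards: (num_groups, group_size) verifiable rewards, one row per prompt.
        GRPO drops the value head: each completion's advantage is its reward
        normalized against the other samples drawn for the same prompt."""
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + eps)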

02 DPO · KTO · ORPO · SimPO · IPO (preference)

Five-method preference router. ORPO when SFT data is on hand and VRAM is tight; SimPO when you have preference pairs but no reference model.
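
A hypothetical reconstruction of the routing logic: the ORPO and SimPO branches follow the two rules stated above, the rest are illustrative defaults, not kolm's actual policy:

    def pick_preference_method(has_ref_model: bool, has_sft_data: bool,
                               vram_tight: bool, has_pairs: bool) -> str:
        """Illustrative router; function and flag names are assumptions."""
        if has_sft_data and vram_tight:
            return "orpo"   # reference-free, single-stage: fits tight VRAM budgets
        if has_pairs and not has_ref_model:
            return "simpo"  # reference-free, length-normalized implicit reward
        if has_pairs and has_ref_model:
            return "dpo"    # classic pairwise loss against a frozen reference
        return "kto"        # unpaired thumbs-up / thumbs-down signal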

03 Online DPO (trainer)

Reference policy refreshed each round; four judge backends (learned RM, LLM judge, self-rewarding, verifiable). apps/trainer/online_dpo.py.
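
One round, sketched; `policy.sample`, `judge.prefers`, and `dpo_step` are assumed names standing in for the real interfaces in apps/trainer/online_dpo.py:

    def online_dpo_round(policy, ref_policy, judge, prompts, dpo_step):
        """Sample two on-policy drafts per prompt, let the judge order them,
        then run a standard DPO update on the fresh pairs."""
        pairs = []
        for prompt in prompts:
            a, b = policy.sample(prompt), policy.sample(prompt)
            chosen, rejected = (a, b) if judge.prefers(a, b, prompt) else (b, a)
            pairs.append((prompt, chosen, rejected))
        dpo_step(policy, ref_policy, pairs)
        return policy  # the caller refreshes ref_policy from this each round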

04 Response distillation (trainer)

Token-level KL in forward, reverse, or JSD form, plus an α-weighted CE term. Top-k logit pruning and an on-policy option. MiniLLM lineage.
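
A sketch of the three divergence modes plus the CE term; the α mixing and reduction choices here are assumptions, not kolm's exact loss:

    import math
    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, mode="forward"):
        """Logits: (batch, seq, vocab); labels: (batch, seq)."""
        s = F.log_softmax(student_logits, dim=-1)
        t = F.log_softmax(teacher_logits, dim=-1)
        if mode == "forward":    # KL(teacher || student)
            kl = F.kl_div(s, t, log_target=True, reduction="batchmean")
        elif mode == "reverse":  # KL(student || teacher), mode-seeking (MiniLLM)
            kl = F.kl_div(t, s, log_target=True, reduction="batchmean")
        else:                    # JSD: both directions against the log-mixture
            m = torch.logaddexp(s, t) - math.log(2.0)
            kl = 0.5 * (F.kl_div(m, t, log_target=True, reduction="batchmean")
                        + F.kl_div(m, s, log_target=True, reduction="batchmean"))
        ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
        return alpha * ce + (1 - alpha) * kl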

05 MoE with LoRA experts (trainer)

Top-1 (Switch) or top-k (Mixtral) router, z-loss + load-balance aux. Expert CIDs in the receipt.
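
A sketch of the two auxiliary terms, following the standard Switch / ST-MoE formulations named above; tensor shapes are assumptions:

    import torch
    import torch.nn.functional as F

    def router_aux_losses(router_logits, top1_idx, num_experts):
        """router_logits: (tokens, num_experts); top1_idx: (tokens,) chosen experts."""
        # z-loss: penalizes large router logits to keep the softmax well-conditioned
        z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
        # load balance: fraction of tokens per expert x mean router prob per expert
        probs = F.softmax(router_logits, dim=-1)
        frac_tokens = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
        frac_probs = probs.mean(dim=0)
        balance = num_experts * (frac_tokens * frac_probs).sum()
        return z_loss, balance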

06 LoRA + DoRA + rsLoRA + LoRA+ (adapter)

Composable adapter variants. --quality enables all four; the optimizer splits the A and B matrices into separate learning-rate groups when LoRA+ is on.
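
A sketch of the LoRA+ split, assuming peft's lora_A / lora_B parameter naming; the 16x ratio is the LoRA+ paper's suggested default, not a kolm-confirmed value:

    def loraplus_param_groups(model, base_lr=2e-4, b_lr_ratio=16.0):
        """LoRA+: the B matrices train with a larger learning rate than A."""
        a_params, b_params = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            (b_params if "lora_B" in name else a_params).append(p)
        return [
            {"params": a_params, "lr": base_lr},
            {"params": b_params, "lr": base_lr * b_lr_ratio},
        ]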

07 NEFTune noise (adapter)

Embedding-space noise during SFT. Auto-applied with the quality preset; receipt records the noise magnitude.
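
A sketch of the noise injection, following the NEFTune paper's scaling; alpha=5.0 is an assumed default, and the grad-enabled check is a stand-in for a proper training-mode hook:

    import torch

    def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
        """embeds: (batch, seq_len, dim). Adds uniform noise scaled by
        alpha / sqrt(seq_len * dim), applied during SFT only."""
        if not torch.is_grad_enabled():  # skip at eval / generation time
            return embeds
        _, seq_len, dim = embeds.shape
        scale = alpha / (seq_len * dim) ** 0.5
        return embeds + torch.empty_like(embeds).uniform_(-scale, scale)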

08 QLoRA 4-bit NF4 (quant)

bitsandbytes NF4 plus double quantization on the base model during adapter training. Lets a 14B model fit on a 24 GB consumer card.
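
The standard transformers-side load for this configuration; the model name is a placeholder, and kolm wires the equivalent up internally:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 weights + double quantization, bf16 compute dtype.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-14B", quantization_config=bnb, device_map="auto"
    )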

09 Paged AdamW 8-bit (optimizer)

bitsandbytes paged optimizer states. 14B SFT on a single A100-80GB. Auto-on when CUDA is detected.
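
The drop-in swap, assuming a `model` already in scope and a placeholder learning rate:

    import bitsandbytes as bnb

    # Paged 8-bit AdamW: optimizer states live in paged memory and spill to
    # host RAM under pressure instead of OOM-ing. Drop-in for torch AdamW.
    optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)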

10 Unsloth fast-LoRA path (trainer)

Optional kernel path for ≤14B SFT. Same adapter shape on disk, smaller VRAM envelope.

11 Liger fused kernels (kernels)

Patches RMSNorm, SwiGLU, RoPE, and the LM head for Qwen / Llama / Gemma / Phi. ~1.6–2.0x training speedup on Ampere+.
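
How the patching looks for the Llama family (call it before instantiating the model; the exact kwarg set may vary across liger-kernel versions):

    from liger_kernel.transformers import apply_liger_kernel_to_llama

    # Monkey-patches the HF Llama modeling code in place. Analogous helpers
    # exist for the other supported families.
    apply_liger_kernel_to_llama(
        rope=True, rms_norm=True, swiglu=True, fused_linear_cross_entropy=True
    )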

12 FlashAttention 3 (attn)

_attn_impl_for() picks fa3 on Hopper / Blackwell, fa2 on Ampere, sdpa elsewhere. No user knob.
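
A reconstruction of that gate from the rules above; this is not kolm's source, just the capability check it implies:

    import torch

    def _attn_impl_for(device_index: int = 0) -> str:
        """Hopper is sm_90, Blackwell sm_100/sm_120, Ampere sm_80/sm_86."""
        if not torch.cuda.is_available():
            return "sdpa"
        major, _ = torch.cuda.get_device_capability(device_index)
        if major >= 9:   # Hopper and Blackwell
            return "fa3"
        if major >= 8:   # Ampere / Ada
            return "fa2"
        return "sdpa"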

13 NVFP4 mixed precision (quant)

Auto-enabled on sm_120 (Blackwell, RTX 50-series) when torch ≥ 2.8 and cuBLASLt 12.9 are present; falls back to bf16 otherwise.

14 FP8 mixed precision (quant)

TransformerEngine fp8 autocast on Hopper. Receipt records the recipe (delayed scaling, hybrid).
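
What that setup looks like with TransformerEngine directly; `model` and `batch` are placeholders, and the amax history length is an assumed value:

    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # Hybrid format (e4m3 forward, e5m2 backward) with delayed scaling;
    # this recipe is the kind of detail the receipt records.
    recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(batch)  # model built from te.Linear / te.TransformerLayer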

15 UL2 span corruption (objective)

Optional pre-SFT mixed-denoising pass. R-denoising, S-denoising, X-denoising at calibrated ratios. apps/trainer/span_objective.py.
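
For orientation, the mixture the UL2 paper describes; span lengths and corruption rates follow the paper, while the sample weights here are illustrative only (kolm's calibrated ratios live in apps/trainer/span_objective.py):

    DENOISERS = [
        # (name, mean_span_len, corruption_rate, sample_weight)
        ("R", 3,    0.15, 0.50),  # regular spans, T5-style
        ("S", None, 0.25, 0.25),  # sequential / prefix-LM denoising
        ("X", 32,   0.50, 0.25),  # extreme: long spans or heavy corruption
    ]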

16 Voice / speech SFT (modality)

Whisper-distill + speech-LLM pairing. Captions in, transcript-receipt out. Reuses the same compile pipeline.

17 FSDP2 + tensor parallel (scale)

FSDP2 + Megatron-style TP for multi-GPU SFT. Auto-on for >14B. Receipt records the topology.
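
The shape of that topology in torch terms; the 4x2 mesh sizing is an example, and the sharding calls are described in comments rather than shown:

    from torch.distributed.device_mesh import init_device_mesh

    # 2-D mesh: FSDP sharding on one axis, tensor parallel on the other
    # (8 GPUs = 4-way shard x 2-way TP here).
    mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))
    dp_mesh, tp_mesh = mesh["dp"], mesh["tp"]
    # kolm then applies fully_shard(...) over dp_mesh and parallelize_module(...)
    # over tp_mesh; the resulting topology is what the receipt records.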

18 Gradient checkpointing (memory)

Activations are recomputed on the backward pass, trading ~1.3x compute time for ~4x less activation memory. Auto-on past a per-arch size threshold.

19 Resumable checkpoints (scale)

Step-N, RNG-state, and optimizer-state snapshots. A pre-empted spot run resumes from the last checkpoint.
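
A sketch of what such a snapshot has to capture for the resumed run to pick up where it left off; the function and key names are assumptions:

    import torch

    def snapshot(step, model, optimizer, scheduler, path):
        """Weights plus everything needed to resume deterministically."""
        torch.save({
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "cpu_rng": torch.get_rng_state(),
            "cuda_rng": torch.cuda.get_rng_state_all(),
        }, path)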

Section 02

Decoding

6 techniques

Section 03

Serving

5 techniques

Section 04

Evaluation

6 techniques

Section 05

Data

2 techniques

Section 06

Operational

4 techniques

Each technique above links to the in-codebase implementation or the published research note. Receipts cite by name; the manifest carries the enabled set.

Full reference at /research →