Long context, one config knob.
RoPE extension is the cheapest way to take a 4k or 8k base model to 32k or 128k. Three published methods, three perplexity profiles, one trainer config. The boring part is making sure the patch lands on both the top-level config and the nested text_config when the model is a VLM.
Why RoPE needs scaling at all
Rotary Position Embedding encodes position by rotating each two-dimensional pair of query/key features through an angle proportional to the token's position, at frequencies that decay geometrically along the head dimension. The model is trained at some max_position_embeddings; at inference time, positions beyond that window are out-of-distribution and the model degrades sharply. The fix is to remap those longer positions back into the trained band so that the rotations stay in-distribution. That's what context extension is.
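Concretely, the out-of-distribution problem and its fix are a few lines of tensor arithmetic. A minimal sketch (the head_dim, base, and window sizes are illustrative, not pulled from any particular model):

```python
import torch

# illustrative only: the standard RoPE inverse-frequency schedule
head_dim, base = 128, 10000.0
inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)  # (64,)

positions = torch.arange(65536).float()          # positions past a 32k trained window
angles = positions[:, None] * inv_freq[None, :]  # rotation angle per (position, dim pair)

# context extension remaps positions so these angles stay inside the trained
# range; linear PI, the simplest mapping, just divides by the extension factor s
s = 2.0
angles_pi = (positions / s)[:, None] * inv_freq[None, :]
```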
There are three published mappings, each with a different trade-off between simplicity, perplexity at the new length, and faithfulness to the original short-context behavior.
The three methods
| Method | Mapping | Strength | Weakness |
|---|---|---|---|
| Linear PI | Compress all positions by factor s | Trivial to implement, broadly supported | High perplexity at 4x+; degrades short-context behavior |
| NTK-aware | Scale the frequency base by s^(d/(d-2)), d = head dim | Cheap; preserves high-frequency rotations | Caps out around 2-4x extension |
| YaRN | Piecewise NTK-by-parts with attention temperature scaling | Stable at 4-32x; matches or beats trained-long models | More config; needs the original training window declared |
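The NTK-aware row is closed-form enough to show inline. A sketch of the base adjustment from the table, with d the head dimension and s the extension factor (the function name is ours, not a library API):

```python
def ntk_base(base: float, s: float, head_dim: int) -> float:
    # NTK-aware: stretch the frequency base so low-frequency dims cover
    # the longer window while high-frequency dims barely move
    return base * s ** (head_dim / (head_dim - 2))

print(ntk_base(10000.0, 2.0, 128))  # ~20221: a 2x extension of a base-10000 model
```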
YaRN (Peng et al., 2023) is the production default. It partitions the RoPE dimensions into three bands: high-frequency (left untouched, preserving attention to nearby tokens), low-frequency (linearly compressed, extending range), and a smooth interpolation between the two. YaRN also applies a temperature to the attention logits to compensate for the longer context's effect on softmax sharpness.
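A sketch of the piecewise mapping, following the reference formulation in Peng et al. (2023). The beta_fast/beta_slow rotation-count cutoffs are the ones that land in the receipt below; the function name and the base-10000 default in the comments are illustrative:

```python
import math
import torch

def yarn_inv_freq(base, head_dim, factor, orig_max_pos, beta_fast=32, beta_slow=1):
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

    def dim_at(num_rot):
        # dim-pair index whose frequency completes `num_rot` rotations
        # over the original trained window
        return (head_dim * math.log(orig_max_pos / (num_rot * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(dim_at(beta_fast)), 0)                  # end of untouched band
    high = min(math.ceil(dim_at(beta_slow)), head_dim // 2 - 1)  # start of compressed band

    # ramp is 0 in the high-frequency band (extrapolated as-is), 1 in the
    # low-frequency band (interpolated by `factor`), linear in between
    ramp = torch.clamp(
        (torch.arange(head_dim // 2).float() - low) / max(high - low, 1), 0, 1
    )
    # the attention-logit temperature (sqrt scale = 0.1 * ln(factor) + 1) is
    # folded into the cos/sin cache separately and omitted here
    return inv_freq * (1 - ramp) + (inv_freq / factor) * ramp
```

With, e.g., base 10000, head_dim 128, factor 4, and a 32768-token original window, the untouched band is roughly the first 35 dim pairs and full compression starts around pair 60.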
The minimal call
from transformers import AutoModelForCausalLM

from apps.trainer.long_context import LongContextConfig, apply_rope_scaling

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
apply_rope_scaling(
    model,
    LongContextConfig(
        method="yarn",
        factor=4.0,  # 32k -> 131k
        original_max_position_embeddings=32768,  # the base's trained window
    ),
)
# proceed with SFT/DPO/GRPO at the new length
The trainer then runs SFT (or any of the other heads) at the new context length. The usable window grows by the extension factor, but attention compute remains O(L^2) in sequence length; what FlashAttention-2/3 changes is the activation memory, which drops from O(L^2) to O(L). With FA3 on Hopper, 128k context on a 7B model is tractable on a single H100.
The VLM gotcha
VLMs nest their language config inside model.config.text_config. Patching model.config.rope_scaling alone does nothing, because the language tower reads its own text_config.rope_scaling. The kolm trainer patches both, with a guard that detects a nested text_config and walks into it. This is the difference between a 32k-context VLM and a model that silently runs at its base 8k window with confusingly degraded outputs.
from transformers import PretrainedConfig

def _patch(cfg: PretrainedConfig, scaling: dict) -> None:
    # copy so later mutation of the caller's dict can't skew one level
    cfg.rope_scaling = dict(scaling)
    # VLMs and some MoE models nest the language config
    inner = getattr(cfg, "text_config", None)
    if inner is not None:
        inner.rope_scaling = dict(scaling)
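A quick way to watch the guard do its job (the checkpoint is illustrative, and the exact rope_scaling key names vary across transformers versions):

```python
from transformers import AutoConfig

# any VLM whose config nests a text_config behaves the same way
cfg = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")
scaling = {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 4096}
_patch(cfg, scaling)

# both levels must agree, or the language tower silently keeps its base window
assert cfg.rope_scaling == scaling
assert cfg.text_config.rope_scaling == scaling
```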
What lands in the receipt
"long_context": {
"method": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768,
"new_max_position_embeddings": 131072,
"attn_factor": 1.0,
"beta_fast": 32,
"beta_slow": 1,
"papers": ["arXiv:2309.00071", "arXiv:2306.15595"]
}
The buyer's auditor can confirm which scaling method ran and at what factor. The receipt also pins the original_max_position_embeddings the trainer believed it was extending from, which catches the case where a buyer's spec accidentally targets a base model that was already long-context-trained upstream (in which case the multiplier compounds and the result is unstable).
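The consistency check an auditor might run against that block is one line of arithmetic. A hypothetical sketch (the receipt path is an assumption; the schema is as shown above):

```python
import json

with open("receipt.json") as f:  # hypothetical path
    rec = json.load(f)["long_context"]

expected = int(rec["factor"] * rec["original_max_position_embeddings"])
assert rec["new_max_position_embeddings"] == expected, (
    f"window/factor mismatch: {rec['new_max_position_embeddings']} != {expected}"
)
```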
Choosing between methods
Use Linear PI only when the upstream model card explicitly demands it (some Llama variants ship a rope_scaling.type="linear" field, and their long-context behavior only works in that mode).
Use NTK-aware for cheap 2-4x extensions when the buyer's spec only needs 16k from an 8k base and there's no time for a long fine-tune. The mapping is closed-form; no extra hyperparameters.
Use YaRN for anything 4x and above, and as the default whenever the buyer cares about preserving short-context behavior. YaRN's piecewise mapping leaves the high-frequency band untouched, which means attention-to-recent-tokens (what short prompts depend on) does not degrade.
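Collapsed into a dispatch rule, the guidance above is a few lines. A sketch; pick_method is our shorthand, not a kolm API:

```python
def pick_method(factor: float, card_requires_linear: bool = False) -> str:
    if card_requires_linear:  # upstream card pins rope_scaling.type="linear"
        return "linear"
    if factor < 4.0:          # cheap closed-form band, no extra hyperparameters
        return "ntk"
    return "yarn"             # 4x and above, or when short-context behavior matters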
How much fine-tuning the extended model actually needs
This is the part the published papers underplay. A bare YaRN patch with no fine-tuning works in the sense that it produces coherent output at the new length, but the model has not seen long-context attention patterns and quality lags the trained-long alternative. The kolm default is 200-2000 steps of SFT on long-context data drawn from the buyer's captures (or, if the buyer has no long-context captures yet, the public BookCorpus / Books3 mixture at the new length) before declaring the artifact ready.
The K-score evaluator runs the buyer's eval pack at the new length, which is the actual release gate. If the pack does not contain long-context examples, the K-score will be undefined on that dimension and the trainer warns. The fix is to ask the buyer to add 50-200 long examples to their pack before re-running.
Edge cases worth naming
Compounding factors. If a base model already shipped with rope_scaling={"factor": 8.0} (some Llama 3.1 70B variants), and the buyer's spec asks for 4x, the trainer detects this and refuses to compound silently. The compounded factor would be 32x, which is outside the stable band for YaRN. The error message names the existing factor and asks the buyer to either explicitly request the compounded factor or pick a different base.
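A sketch of that refusal logic, assuming the existing factor lives under cfg.rope_scaling as in stock transformers configs (the function name and flag are illustrative):

```python
def check_compounding(cfg, requested: float, allow_compound: bool = False) -> float:
    existing = float((getattr(cfg, "rope_scaling", None) or {}).get("factor", 1.0))
    combined = existing * requested
    if existing > 1.0 and not allow_compound:
        raise ValueError(
            f"base model already ships rope_scaling factor={existing}; the requested "
            f"{requested}x would compound to {combined}x. Request the compounded "
            "factor explicitly or pick a different base."
        )
    return combined
```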
FlashAttention is mandatory. Long-context training without FA2 or FA3 is memory-quadratic and unusable past 16k. The trainer's _attn_impl_for() picks FA3 on Hopper and FA2 on Ampere automatically; if neither is available the trainer refuses to extend beyond 16k and recommends a different SKU.
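A sketch of that selection, keyed off CUDA compute capability. The helper name matches the prose; the return strings follow transformers' attn_implementation convention, and FA3 support depends on your transformers version:

```python
import torch

def _attn_impl_for() -> str:
    if not torch.cuda.is_available():
        return "eager"
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:   # Hopper (H100/H200)
        return "flash_attention_3"
    if major >= 8:   # Ampere/Ada (A100, L40S)
        return "flash_attention_2"
    return "eager"   # caller refuses to extend past 16k in this branch
```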
Quantization interacts. int4 weight quantization (AWQ, GPTQ) is fine at long context; FP8 KV cache doubles the effective context per byte and stacks cleanly with YaRN. The combined upper bound on a single H100 with a 7B model is ~128k context at int4 + FP8 KV.
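The ~128k bound is easy to sanity-check with KV-cache arithmetic. A back-of-envelope sketch, assuming Qwen2.5-7B-class geometry (28 layers, 4 KV heads via GQA, head_dim 128; all assumptions):

```python
layers, kv_heads, head_dim, seq_len = 28, 4, 128, 131072

def kv_bytes(bytes_per_elem: float) -> float:
    # K and V caches, per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_bytes(2) / 2**30)  # fp16 KV: ~7.0 GiB at 128k
print(kv_bytes(1) / 2**30)  # fp8 KV:  ~3.5 GiB, the "doubles context per byte" claim
```

At int4, the 7B weights take roughly another 3.5 GB, which leaves comfortable headroom on an 80 GB H100 even before activation savings from FlashAttention.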
Where this fits in the kolm compile loop
Long-context extension is a pre-SFT step. The spec names the target context length; the trainer picks the method (defaults to YaRN), applies the patch before instantiating the data collator, and the K-score gate at 0.85 covers the result. The receipt's long_context block lets a buyer who needs to reproduce the artifact match the exact extension parameters; the verifier ignores the block for signature checks but it's covered by the canonical-JSON manifest hash.
Citations
Peng, B. et al. YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071, 2023.
Chen, S. et al. Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595, 2023.
Liu, H. et al. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889, 2023.