Research · Architecture · 2026-05-14 · 11 min read

Why we don't ship a checkpoint: the case for adapters.

A merged checkpoint is 10 to 70 GB. An adapter is 30 to 200 MB. The size argument is the easy one; the provenance argument is the load-bearing one. Why every .kolm file carries a LoRA, not a merged copy of the base.

By kolm · Tags: LoRA · provenance · architecture

Adapter size economics.

Every shipped .kolm contains a base-model pointer, a LoRA adapter, a recall index, an eval pack, and a receipt chain. The adapter is the only part with weights. It is 30 to 200 MB depending on rank and target module set. The base model it adapts is 4 to 70 GB. The artifact ships the smaller half; the runtime resolves the larger half from a local cache.

The arithmetic is short. A LoRA on a 7B base with rank r=8 and 4 attention targets (q_proj, k_proj, v_proj, o_proj) introduces roughly 16 million trainable parameters. At fp16 that is 32 MB. At rank 16 it doubles. Above rank 32 the adapter starts to lose its size advantage, but the gains in expressivity flatten too, so we cap most production compiles at r=16.
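For readers who want that arithmetic as a function rather than a sentence, the sketch below computes an adapter's size from the shapes involved. It is illustrative, not trainer code; the per-matrix dimensions depend on the base architecture (under grouped-query attention, k_proj and v_proj are narrower than q_proj), so the exact count varies by model.

# illustrative Python: adapter size from LoRA shapes (not trainer code)
def lora_param_count(target_shapes, r, n_layers):
    # target_shapes: list of (d_out, d_in) for each adapted matrix in one layer.
    # Each matrix gains A (r x d_in) and B (d_out x r) trainable parameters.
    per_layer = sum(r * (d_out + d_in) for d_out, d_in in target_shapes)
    return per_layer * n_layers

def adapter_size_mb(params, bytes_per_param=2):   # fp16 = 2 bytes per parameter
    return params * bytes_per_param / (1024 ** 2)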

A merged checkpoint for the same artifact is the full base plus the LoRA weights folded into the base's attention matrices. For a 7B model at fp16 that is roughly 14 GB. The fold is mathematically equivalent (the output token distribution is identical to within numerical noise), but the on-disk footprint is 400x larger.

artifact              base               rank   adapter   merged   ratio
refund-flagger        qwen2.5-0.5b       r=8    32 MB     1.0 GB   31x
ticket-router         qwen2.5-3b         r=16   64 MB     6.0 GB   94x
contract-summarizer   llama-3.1-8b       r=16   96 MB     16 GB    167x
code-reviewer         qwen2.5-coder-7b   r=32   192 MB    14 GB    73x

The ratio is not the only number that matters. The download story matters too. A team that compiles five specialized models on top of the same base model ships five adapters totaling under a gigabyte, plus one base cached once. The merged-checkpoint equivalent ships five copies of the base plus the diffs, weighing in at 80+ GB. Bandwidth is a real cost; reproducible-fetch time is a real cost; the laptop's free disk is a real cost. The adapter format respects all three.

Composability.

An adapter is additive over the base. Two adapters compiled from independent tasks can be stacked at inference time, applied serially or in a weighted blend, and detached cleanly. A merged checkpoint cannot.

The composition is straightforward when the adapters target the same modules on the same base. The kolm runtime exposes this as an inference-time option:

# kolm run with two adapters layered on the same base
$ kolm run support-ticket-router-1.0.0.kolm \
  --stack refund-flagger-2.1.0.kolm \
  --blend 0.7,0.3
[runtime]  base: qwen2.5-3b (cached)
[runtime]  adapter 1: support-ticket-router  r=16, weight 0.7
[runtime]  adapter 2: refund-flagger          r=8,  weight 0.3
[runtime]  inference: ready

This is not exotic. It is the standard PEFT loading pattern, exposed through the kolm CLI so that artifact stacking does not require dropping into Python. The use case is concrete: a support team that has compiled a router and a refund-classifier independently can run them together without recompiling either.
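For teams that do want the Python, the pattern underneath is roughly the one below. It is a sketch of the standard PEFT multi-adapter API, not the kolm runtime's internals, and the adapter directory paths and slot names here are hypothetical.

# reference Python: the PEFT stacking pattern (a sketch; adapter paths are hypothetical)
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# load each adapter under a named slot
model = PeftModel.from_pretrained(base, "./ticket-router-adapter", adapter_name="router")
model.load_adapter("./refund-flagger-adapter", adapter_name="flagger")

# build a weighted combination; "cat" concatenates the low-rank factors,
# so it works even when the two adapters were trained at different ranks
model.add_weighted_adapter(
    adapters=["router", "flagger"],
    weights=[0.7, 0.3],
    adapter_name="blend",
    combination_type="cat",
)
model.set_adapter("blend")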

The composition has limits. The two adapters have to share the same base model and the same target-module list, and the combined adapter weights have to fit in the runtime's adapter cache. A 7B base with three r=16 adapters and one r=32 adapter (~480 MB total adapter footprint) is the practical upper bound on a 16 GB consumer GPU; beyond that the runtime starts swapping. The CLI logs a warning when you approach the cache limit.

A merged checkpoint has none of this flexibility. Every variant is a separate file, every variant is a separate download, every variant occupies its own slot in the GPU memory budget. The adapter format treats the base as a shared resource; the merged format treats it as a duplicated cost.

Provenance integrity.

This is the argument that does not show up in adapter-vs-checkpoint surveys but is the load-bearing one for the kolm artifact format.

An adapter plus a named base is a reproducible artifact. A merged checkpoint is not.

When a regulator asks "what model produced this output", the adapter-plus-base answer is two hashes and a pointer: the base model is a published string (Qwen/Qwen2.5-7B-Instruct@v1.0.3), the adapter is a 96 MB file with a known sha256, the runtime composed them deterministically. The audit chain is short and inspectable. A third party can re-pull the base from its public registry, re-load the adapter, run the eval pack, and reproduce the K-score within the deterministic-decoding tolerance.

A merged checkpoint loses this. The base model is no longer separable from the adapter; the resulting weights are a function of (base, adapter, fold-precision, fold-implementation). Two different inference runtimes folding the same LoRA into the same base produce checkpoints whose weights differ in the 5th decimal place. The output token distributions are operationally identical, but the file hashes are not. The chain of custody, in cryptographic terms, has a join with no inverse.

The receipt chain inside a .kolm records the base as a content-addressed pointer, the adapter as a sha256, and the K-score as the output of running the eval pack against (base + adapter). If you ship a merged copy, the receipt would have to record the merged weights as a single sha256, and the audit story collapses to "this is the model; we have lost the structure that produced it".
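As a concrete illustration of how short that audit is, the sketch below re-checks an adapter against its recorded hash. The receipt field names are hypothetical, chosen for the example; the point is that the check is a file hash and a string comparison, nothing more.

# illustrative Python: re-checking an adapter against its receipt
# (the receipt field names are hypothetical)
import hashlib
import json

with open("receipt.json") as f:
    receipt = json.load(f)   # e.g. {"base": "Qwen/Qwen2.5-7B-Instruct@v1.0.3", "adapter_sha256": "..."}

with open("adapter_model.safetensors", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == receipt["adapter_sha256"], "adapter bytes do not match the receipt"
print("base:", receipt["base"], "| adapter verified:", digest[:12])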

An adapter is an artifact with a parent. A merged checkpoint is an orphan. The receipt chain only makes sense when the parent is still named.

The adapter math.

For the engineers who want the exact shape of the thing the compiler emits, the LoRA used inside every .kolm is the standard low-rank decomposition from Hu et al. 2021. A target weight matrix W of shape (d_out, d_in) is augmented by a learned low-rank update:

# at inference time
W_eff = W + (alpha / r) * B @ A

# where
A: shape (r, d_in)         random gaussian init, learnable
B: shape (d_out, r)        zero init, learnable
r: rank                    8, 16, or 32 in production
alpha: scaling             16 or 32 in production

The target_modules list is the set of weight matrices the adapter touches. The compiler infers this from the base model's architecture; for Llama, Qwen, and Mistral families that is ["q_proj", "k_proj", "v_proj", "o_proj"]. For GPT-style models it is ["c_attn"]. For Falcon and some legacy bases it is ["query_key_value"]. The inference is in apps/trainer/trainer_local.py and falls back through a small ordered list of candidates.
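A minimal sketch of that fallback follows. It is not the trainer_local.py code, just the shape of the idea: match the base model's module names against an ordered list of known target sets and take the first hit.

# illustrative Python: target-module inference by ordered fallback
# (a sketch of the idea, not the apps/trainer/trainer_local.py implementation)
CANDIDATE_TARGET_SETS = [
    ["q_proj", "k_proj", "v_proj", "o_proj"],   # Llama / Qwen / Mistral families
    ["c_attn"],                                 # GPT-style fused attention
    ["query_key_value"],                        # Falcon and some legacy bases
]

def infer_target_modules(model):
    leaf_names = {name.rsplit(".", 1)[-1] for name, _ in model.named_modules()}
    for candidates in CANDIDATE_TARGET_SETS:
        if all(target in leaf_names for target in candidates):
            return candidates
    raise ValueError("no known target-module set matches this architecture")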

The rank-alpha pair encodes the strength of the update. Higher rank means more degrees of freedom; higher alpha means each unit of rank pushes the base distribution harder. The product (alpha / r) is what shows up in the actual forward pass, so a rank-8 alpha-16 adapter has the same effective scaling as a rank-16 alpha-32 adapter, but the latter has twice the trainable parameters and roughly twice the expressivity ceiling.
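To make the scaling concrete, here is the forward pass for one adapted matrix, written out in plain PyTorch. It is the textbook LoRA update rather than kolm runtime code; the (alpha / r) factor is the only part of the pair that appears at inference time.

# illustrative PyTorch: one adapted linear layer (textbook LoRA, not runtime code)
import torch

def lora_forward(x, W, A, B, r=8, alpha=16):
    # x: (batch, d_in)   W: (d_out, d_in)   A: (r, d_in)   B: (d_out, r)
    base_out = x @ W.T                    # frozen base path
    lora_out = (x @ A.T) @ B.T            # low-rank update path
    return base_out + (alpha / r) * lora_out

d_in, d_out, r = 64, 64, 8
x = torch.randn(1, d_in)
W = torch.randn(d_out, d_in)
A = torch.randn(r, d_in)                  # Gaussian init
B = torch.zeros(d_out, r)                 # zero init: the update starts as a no-op
y = lora_forward(x, W, A, B, r=r, alpha=16)
# (alpha / r) is identical for r=8, alpha=16 and r=16, alpha=32: both scale by 2.0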

The compile loop picks defaults from a small table. Most artifacts ship at r=8, alpha=16; the trainer escalates to r=16, alpha=32 when the K-score on the first attempt lands under 0.80 and there is room in the budget. The chosen values are written into the receipt chain at training_stats.lora_r and training_stats.lora_alpha so that a third party can reproduce the run exactly.

When you do want a merged copy.

Three legitimate reasons to want a merged checkpoint instead of an adapter.

Sealed-runtime environments. Some inference stacks (older vLLM builds, certain edge inference pipelines, in-process llama.cpp embeds) do not load adapters at all. They expect a single checkpoint file with merged weights. If your deployment target is one of these, the adapter format is a non-starter.

Simplified deploy. A single file is simpler to ship than two files. For a small team with no operational complexity tolerance, the merged checkpoint reduces the number of moving parts at the cost of disk and audit. This is a tradeoff with a real price tag; it is not the wrong tradeoff for every team.

Vendor handoff. If you are handing the model to a third party who runs their own inference stack, an adapter requires them to also have the base model at the exact pinned commit. A merged checkpoint requires them to have nothing. The handoff is simpler.

The kolm escape hatch for these cases is a documented one-line PEFT fold. The trainer artifacts include the canonical adapter_config.json and adapter_model.safetensors alongside the base pointer. A user who wants the merged version runs:

# reference Python: merge the LoRA into the base
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
peft_model = PeftModel.from_pretrained(base, "./adapter")
merged = peft_model.merge_and_unload()
merged.save_pretrained("./merged-checkpoint")

The output is a standard transformers checkpoint, loadable by any inference stack that does not know about LoRA. The provenance is lost (the receipt chain no longer applies to the merged bytes), but the deployment compatibility is gained. This is the escape hatch, not the default.

The honest tradeoffs.

The adapter format is the right default for most kolm artifacts. It is not universally right.

Cold-start cost. The runtime has to load the base model and the adapter separately, then compose them at first inference. For a 7B base this is ~6 seconds on a recent laptop GPU. A pre-merged checkpoint loads in one pass and is ready faster. For interactive use this is invisible; for batch-of-one workloads it can matter.

Toolchain dependency. The PEFT library has to be present in the inference runtime. Every runtime kolm supports (transformers, mlx-lm, llama.cpp via gguf-with-lora, vLLM) handles this; a custom inference stack might not. If you cannot guarantee PEFT support at the deployment site, the merged path is safer.

Adapter stacking complexity. Composability is a feature, but it is not free. Two adapters can interact in ways that neither was trained for. The kolm runtime applies stacked adapters in a deterministic order (alphabetical by artifact ID, then by load time), but the order matters for the output and is documented in the receipt. Teams that stack adapters should re-evaluate the stack the same way they evaluate a single artifact.

The adapter format is a stance about what an AI artifact is: a small, named delta on top of a public, named base. The merged-checkpoint format treats the same task as a private monolithic blob. The former is a build artifact. The latter is a tarball.