Four ways to merge an adapter.
A LoRA adapter is a stack of small tensors. Merging two adapters is a tensor operation. Choosing how those tensors combine is a design decision with four published answers, each with a different failure mode. Below: when each one is right.
What you actually merge
A PEFT LoRA adapter is a dict mapping layer names to weight tensors of two shapes: lora_A (low-rank input projection) and lora_B (low-rank output projection). For two adapters to be mergeable they must share the base model and have the same target modules at the same shapes. The merge operates on the corresponding tensors; the merged result is a new adapter with the same shape signature.
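The compatibility check can be sketched in a few lines. This is a hypothetical helper operating on plain `{layer name: shape}` dicts, not the PEFT API:

```python
def check_mergeable(shapes_a, shapes_b):
    """Two adapters are mergeable iff they target the same modules
    with identical tensor shapes (hypothetical helper, not PEFT API)."""
    if shapes_a.keys() != shapes_b.keys():
        missing = shapes_a.keys() ^ shapes_b.keys()
        raise ValueError(f"target modules differ: {sorted(missing)}")
    for name in shapes_a:
        if shapes_a[name] != shapes_b[name]:
            raise ValueError(
                f"shape mismatch at {name}: {shapes_a[name]} vs {shapes_b[name]}"
            )

a = {"q_proj.lora_A": (16, 4096), "q_proj.lora_B": (4096, 16)}
b = {"q_proj.lora_A": (16, 4096), "q_proj.lora_B": (4096, 16)}
check_mergeable(a, b)  # passes silently: same modules, same shapes
```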
The four methods differ in how they reduce N input tensors to one. Below, w_i are the per-adapter weights (defaulting to 1/N), and t_i are the tensors being combined at one layer.
The four ops
| Method | Per-layer op | Best for |
|---|---|---|
| Linear | weighted sum: sum(w_i · t_i) | Two adapters trained on disjoint subsets of one task |
| SLERP | spherical interpolation along the unit sphere | Exactly two adapters; preserves norm |
| DARE | Bernoulli mask + linear: drop with probability 1-density, rescale survivors by 1/density | Combining adapters with redundant capacity |
| TIES | trim by magnitude topk, sign-elect by weighted sum, merge only agreeing entries | Combining adapters trained on different tasks |
Linear is the floor
```python
def linear(tensors, weights):
    return sum(w * t for w, t in zip(weights, tensors))
```
The simplest case. Use this when two adapters were trained on the same task with different subsets of the same dataset, and you want their combined behavior. It does not survive task interference: merge an adapter trained for SQL generation with one trained for English summarization and the result does both badly and neither well.
SLERP traces the unit sphere
```python
def slerp(t1, t2, alpha):
    cos = dot(t1, t2) / (norm(t1) * norm(t2))
    theta = acos(clamp(cos, -1, 1))
    if sin(theta) < 1e-6:
        return (1 - alpha) * t1 + alpha * t2  # linear fallback
    return (sin((1 - alpha) * theta) / sin(theta)) * t1 + \
           (sin(alpha * theta) / sin(theta)) * t2
```
SLERP is the right interpolation when you want to preserve the norm of the tensor. The midpoint of a linear interpolation between two orthogonal unit vectors has norm sqrt(2)/2, not 1. The model behaves differently at lower norms; SLERP avoids that drift.
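That norm claim is easy to verify numerically. A stdlib-only sketch, independent of the kolm code:

```python
import math

t1 = [1.0, 0.0]  # two orthogonal unit vectors
t2 = [0.0, 1.0]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# linear midpoint: norm drops to sqrt(2)/2
lerp_mid = [0.5 * a + 0.5 * b for a, b in zip(t1, t2)]
print(norm(lerp_mid))  # 0.7071...

# SLERP midpoint: norm stays 1
dot = sum(a * b for a, b in zip(t1, t2))
theta = math.acos(max(-1.0, min(1.0, dot)))
k = math.sin(0.5 * theta) / math.sin(theta)
slerp_mid = [k * (a + b) for a, b in zip(t1, t2)]
print(norm(slerp_mid))  # 1.0
```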
The kolm implementation falls back to linear when sin(theta) < 1e-6, which catches the degenerate cases of two near-parallel or near-anti-parallel tensors (including the same tensor merged with itself). SLERP is two-input only by definition; for N>2 inputs the implementation pairs adjacent inputs and merges left-to-right, which is what mergekit does.
DARE drops then rescales
```python
def dare(tensor, density, seed):
    mask = bernoulli(p=density, shape=tensor.shape, seed=seed)
    return tensor * mask / density
```
DARE (Drop And REscale; Yu et al., 2023) operates on a single adapter relative to the base: each parameter is dropped with probability 1-density and survivors are rescaled by 1/density so the expectation is preserved. The intuition is that fine-tuned adapters carry a lot of redundant capacity; dropping 50% rarely hurts, and the surviving signal can combine more cleanly with another adapter's surviving signal.
In a multi-adapter merge, DARE is the prep step: each input adapter is DARE'd first, then the results are linearly combined. The published recipe combines DARE with TIES (DARE-TIES) and is the recommended default for combining three or more task adapters.
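The expectation-preserving property can be sanity-checked on a toy scalar parameter. A pure-Python Monte Carlo sketch (stdlib PRNG standing in for the trainer's):

```python
import random

def dare_scalar(value, density, rng):
    # keep with probability `density`, rescale survivors by 1/density
    return value / density if rng.random() < density else 0.0

rng = random.Random(42)
value, density, n = 2.0, 0.5, 200_000
mean = sum(dare_scalar(value, density, rng) for _ in range(n)) / n
print(mean)  # close to 2.0: expectation preserved despite 50% drop
```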
TIES is the most powerful op
```python
def ties(tensors, weights, density):
    # 1. Trim each tensor to keep only the top-K magnitudes
    trimmed = [keep_topk(t, density) for t in tensors]
    # 2. Sign-elect: the elected sign is the sign of the weighted sum
    elected = sign(sum(w * t for w, t in zip(weights, trimmed)))
    # 3. Merge only entries agreeing with the elected sign
    agreement = [(sign(t) == elected) for t in trimmed]
    n_agree = sum(agreement)  # per-position
    merged = sum(w * t * agree for w, t, agree in zip(weights, trimmed, agreement))
    # 4. Normalise by agreement count (avoid div-by-zero)
    return merged / max(n_agree, 1) * sum(weights)
```
TIES (TrIm, Elect Sign & Merge; Yadav et al., 2023) is the published winner for cross-task merges. The idea: most parameters are tied to a specific task; trim away the small-magnitude ones (they're noise), then for each remaining position pick the sign that the majority of adapters agree on (weighted by adapter importance), and average only the agreeing entries.
The kolm implementation rescales the final result by sum(weights) to preserve the expected magnitude regardless of density. This is a deviation from the original paper, which left the magnitude unrescaled; the rescaled version composes more cleanly with downstream inference because the merged adapter's effective LR matches the sum of input adapter weights.
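Sign election is easiest to see on scalars. A toy per-position sketch (density trimming omitted; the final sum(weights) rescale is the kolm deviation, not the paper's formula):

```python
def ties_position(values, weights):
    """Sign-elect, then average only the agreeing entries (scalar sketch)."""
    def sign(x):
        return (x > 0) - (x < 0)
    elected = sign(sum(w * v for w, v in zip(weights, values)))
    agreeing = [(w, v) for w, v in zip(weights, values) if sign(v) == elected]
    merged = sum(w * v for w, v in agreeing)
    n_agree = max(len(agreeing), 1)
    return merged / n_agree * sum(weights)

# two adapters push positive, one pushes negative: the minority-sign
# entry is discarded entirely rather than averaged in
print(ties_position([0.8, 0.6, -0.9], [1.0, 1.0, 1.0]))
```

Contrast with linear, where the -0.9 entry would partially cancel the others at this position.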
The minimal call
```python
from apps.trainer.merge import merge_adapters, MergeConfig

merged = merge_adapters(
    adapters=[adapter_sql, adapter_python, adapter_chat],
    config=MergeConfig(
        method="ties",
        weights=[1.0, 1.0, 0.5],  # chat gets half-weight
        density=0.5,              # keep top-50% magnitude
        seed=42,
    ),
)
merged.save_pretrained("./merged-adapter")
```
The output is a PEFT adapter directory ready to load against the same base model. The merged adapter participates in the registry the same way a trained one does: receipt, CID, signature, K-score gate.
What lands in the receipt
```json
"merge": {
  "method": "ties",
  "papers": ["arXiv:2306.01708", "arXiv:2311.03099"],
  "source_adapters": [
    {"cid": "cidv1:sha256:...sql", "weight": 1.0},
    {"cid": "cidv1:sha256:...python", "weight": 1.0},
    {"cid": "cidv1:sha256:...chat", "weight": 0.5}
  ],
  "density": 0.5,
  "seed": 42,
  "n_layers": 64,
  "n_params": 18874368
}
```
Each source adapter is pinned by CID. The verifier can reconstruct the merge bit-for-bit from the sources, the method, the weights, and the seed; this is what makes adapter merges reproducible across deployments.
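The verification step amounts to re-running the merge and comparing digests. A stdlib sketch with a hypothetical `tensor_digest` helper (not the kolm CID scheme, which works on registry artifacts):

```python
import hashlib
import struct

def tensor_digest(layers):
    """Deterministic digest of a merged adapter: hash layer names and
    float values in sorted-name order (a sketch, not the kolm CID scheme)."""
    h = hashlib.sha256()
    for name in sorted(layers):
        h.update(name.encode())
        for v in layers[name]:
            h.update(struct.pack("<d", v))  # fixed little-endian encoding
    return h.hexdigest()

rederived = {"q_proj.lora_A": [0.5, -0.25]}
published = {"q_proj.lora_A": [0.5, -0.25]}
print(tensor_digest(rederived) == tensor_digest(published))  # True
```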
Determinism
Linear, SLERP, and TIES are deterministic by definition (no random numbers). DARE uses a Bernoulli mask, so the trainer pins a seed and the receipt records it. Re-running the merge with the same inputs, method, weights, density, and seed produces byte-identical output. This is the property that makes merge artifacts admissible into the registry alongside trained artifacts.
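The seed-pinning property is easy to demonstrate. A sketch with Python's stdlib PRNG standing in for the trainer's mask generator (an assumption; the trainer's actual RNG is not specified here):

```python
import random

def dare_mask(n, density, seed):
    # seeded Bernoulli keep/drop mask over n positions
    rng = random.Random(seed)
    return [1 if rng.random() < density else 0 for _ in range(n)]

m1 = dare_mask(1000, 0.5, seed=42)
m2 = dare_mask(1000, 0.5, seed=42)
m3 = dare_mask(1000, 0.5, seed=43)
print(m1 == m2)  # True: same seed, identical mask, identical merge
print(m1 == m3)  # False: a different seed is a different merge
```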
When to merge vs. retrain
Merging is cheap (seconds, no GPU) and reproducible (the merged adapter is a function of its inputs). Retraining is expensive (GPU-hours) and produces a new artifact with new captures. The honest decision rule is:
- If the buyer has a single coherent task with multiple capture batches, retrain on the union. K-score will be higher.
- If the buyer has multiple distinct tasks that should coexist in one adapter, merge with TIES. K-score on each task individually will be 80-95% of the standalone adapter; combined coverage exceeds either.
- If the buyer has two adapters trained on the same task with different data subsets, merge with linear. Treat the result as an ensemble.
- If the buyer wants to interpolate between two checkpoints (e.g., for an A/B), SLERP is the right primitive.
Edge cases worth naming
Shape mismatches. Adapters trained on different bases (Qwen2.5-7B vs Llama-3.1-8B) cannot be merged because their tensor shapes differ. The trainer detects this from the PEFT config and fails loudly. There is no automatic projection step; the buyer must pick one base and retrain.
Rank mismatches. Two adapters with different LoRA rank (r=16 vs r=32) cannot be merged directly. The kolm implementation pads the smaller to the larger with zeros in the second half of the rank dimension; the result preserves the lower-rank adapter's behavior exactly and adds zero contribution from the missing rank slots. This is a deviation from mergekit, which refuses; the kolm path is more useful and the receipt records that padding occurred.
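The zero-padding step can be sketched with plain lists standing in for tensors (`lora_A` is (r, d_in), `lora_B` is (d_out, r); a sketch, not the kolm code):

```python
def pad_rank(lora_A, lora_B, target_r):
    """Pad the rank dimension with zeros. The product B @ A is unchanged
    because the new rows of A only ever meet the new zero columns of B."""
    r, d_in = len(lora_A), len(lora_A[0])
    A = lora_A + [[0.0] * d_in for _ in range(target_r - r)]   # extra rank rows
    B = [row + [0.0] * (target_r - r) for row in lora_B]       # extra rank cols
    return A, B

# rank-1 adapter padded to rank 2: delta = B @ A is preserved exactly
A, B = pad_rank([[1.0, 2.0]], [[3.0], [4.0]], target_r=2)
```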
The merged adapter has no captures. Because merging is a tensor operation, there is no capture pack associated with the merged result. The K-score gate runs against a buyer-supplied eval pack instead. If the buyer does not have one, the trainer falls back to the union of the source adapters' eval packs (which is a fair test) and warns.
Where this fits in the kolm compile loop
Merging is a post-training step. The buyer's spec names the source adapters by CID and the merge method; the trainer materialises the inputs from the registry, runs the merge, gates with K-score, signs, and publishes. The merged artifact is registry-equivalent to a trained one and can be a source for further merges (the source-adapter chain in the receipt is recursive).
Citations
Yadav, P. et al. TIES-Merging: Resolving Interference When Merging Models. arXiv:2306.01708, 2023.
Yu, L. et al. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. arXiv:2311.03099, 2023 (DARE).
Goddard, C. et al. Arcee's MergeKit: A Toolkit for Merging Large Language Models. arXiv:2403.13257, 2024.