research . training . 12 min read

Federated compile.

Hospitals cannot pool patient notes. Banks cannot pool dispute transcripts. Law firms cannot pool draft pleadings. The data is exactly where the model improvement would come from, and exactly where it cannot leave. Federated training fixes the locality problem; secure aggregation and differential privacy fix the leakage problem on top. kolm runs FedAvg over LoRA adapters with Bonawitz pairwise masks and a clip-and-noise privacy accountant on each client's update.

May 14, 2026 · Kolmogorov research · apps/trainer/federated.py

FedAvg in one paragraph

McMahan et al (2017) proposed Federated Averaging: each client trains for some local steps on its private data, sends only the parameter delta to the coordinator, and the coordinator averages the deltas weighted by the number of examples each client saw. The global model takes a step; round ends; next round picks a fresh set of clients and repeats. No raw rows ever leave the client. The communication payload is the model size, not the dataset size, and the wall-clock is dominated by the slowest client (the "straggler" problem).
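The averaging step can be sketched in a few lines (`fedavg` here is a minimal illustration of the weighted average, not the trainer's actual API):

```python
import numpy as np

def fedavg(global_w: np.ndarray, deltas: list[np.ndarray],
           n_examples: list[int]) -> np.ndarray:
    """One FedAvg round: take the example-weighted average of the client
    deltas and apply it as the global model's step."""
    weights = np.array(n_examples, dtype=np.float64)
    weights /= weights.sum()
    step = sum(w * d for w, d in zip(weights, deltas))
    return global_w + step
```

A client that saw three times as many examples pulls the average three times as hard, which is the paper's weighting.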

Naive FedAvg over a 7B base model would send 7 GB per client per round. That is the first thing LoRA fixes.

Why LoRA is the right adapter shape for federation

LoRA factorizes the per-layer weight update as ΔW = B · A where A is r-by-d and B is d-by-r. For a 7B base with r=16 and the standard attention-projection target set, the adapter is roughly 50 MB. The base model is loaded once on each client from a public mirror and never moves. The round-trip is now 50 MB up, 50 MB down, not 7 GB. On a 100-Mbps clinic uplink that is four seconds per round instead of nine minutes.
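The 50 MB figure checks out with back-of-the-envelope arithmetic, assuming Llama-7B-like shapes (d=4096, 32 layers, LoRA on the four square attention projections; these dimensions are assumptions, not stated in the text): the adapter is about 32 MiB serialized in fp16 and 64 MiB in fp32, so ~50 MB is the right order either way.

```python
# Assumed Llama-7B-like shapes: hidden size d, rank r, 32 layers,
# LoRA on the four attention projections (q, k, v, o), each d x d.
d, r, layers, projs = 4096, 16, 32, 4
params = 2 * r * d * projs * layers          # A (r x d) + B (d x r) per projection
mib_fp16 = params * 2 / 2**20
mib_fp32 = params * 4 / 2**20
print(f"{params:,} params, {mib_fp16:.0f} MiB fp16, {mib_fp32:.0f} MiB fp32")
```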

Two structural properties of LoRA make it well-suited to federation that the original LoRA paper did not call out. First, the update lives in a small, fixed-shape parameter space, so everything downstream scales with the adapter rather than the model: the pairwise masks cover tens of megabytes instead of gigabytes, and the clip-and-noise mechanism perturbs a short vector instead of seven billion coordinates. Second, the base model is frozen and fetched from a public mirror, so the only client-derived bytes that ever cross the wire are the adapter deltas, which is exactly the surface the masking and privacy machinery has to protect.

Pairwise PRG-derived masks (Bonawitz et al, 2017)

The coordinator should never see any client's individual update; the privacy guarantee is that only the sum (or weighted average) is revealed. Bonawitz et al introduced a clean construction. For each ordered pair of clients (i, j) with i < j, derive a shared seed from a Diffie-Hellman exchange (or from a pre-distributed key for kolm's enterprise deployments), expand the seed into a pseudo-random mask the size of the model, and have client i add the mask while client j subtracts it. When the coordinator sums the masked updates, every mask cancels with its twin and the result is the clean sum.
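The cancellation property can be checked numerically, with random masks standing in for the PRG-derived ones (a minimal sketch, not kolm's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim = 3, 8
updates = [rng.normal(size=dim) for _ in range(n_clients)]

# One mask per ordered pair (i, j) with i < j: client i adds it, client j subtracts it.
masks = {(i, j): rng.normal(size=dim)
         for i in range(n_clients) for j in range(i + 1, n_clients)}

masked = []
for c in range(n_clients):
    u = updates[c].copy()
    for (i, j), m in masks.items():
        if i == c:
            u += m
        elif j == c:
            u -= m
    masked.append(u)

# The coordinator sees only masked updates, but their sum is the clean sum.
assert np.allclose(sum(masked), sum(updates))
```

Each individual masked update is statistically scrambled by its masks; only the aggregate survives.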

kolm's implementation derives each pairwise mask as HMAC-SHA256(round_seed, sorted(id_i, id_j)) → expand to LoRA-shape. The expansion is done in 32-byte blocks and reshaped to match each (A, B) tensor in order; the sign is determined by whether the client's id sorts before or after its peer's. This produces a deterministic mask without an interactive key exchange, at the cost of requiring the coordinator to know which clients are participating in a round (it does anyway).

import hmac
from hashlib import sha256

import numpy as np

def _pairwise_mask(seed: bytes, lo: str, hi: str, shape: tuple, sign: int) -> np.ndarray:
    # Both peers derive the same key; the client whose id sorts first uses sign=+1.
    key = hmac.new(seed, f"{lo}|{hi}".encode(), sha256).digest()
    nbytes = int(np.prod(shape)) * 4
    stream = b"".join(hmac.new(key, i.to_bytes(4, "big"), sha256).digest()
                      for i in range((nbytes + 31) // 32))
    # Map uint32 words to uniform floats in [-0.5, 0.5); reinterpreting raw bytes
    # as float32 would occasionally yield inf/NaN and break exact cancellation.
    words = np.frombuffer(stream[:nbytes], dtype=np.uint32)
    flat = (words.astype(np.float64) / 2**32 - 0.5).astype(np.float32)
    return sign * flat.reshape(shape)

Differential privacy on the per-client update

Secure aggregation hides the individual update from the coordinator but does not hide the sum, and a determined adversary can still infer membership from the trajectory of the global model. The defense is differential privacy on each client's contribution: clip the LoRA delta vector to L2 norm C and add Gaussian noise with standard deviation σ · C. The (eps, delta) accountant tracks the privacy budget consumed across rounds.

def _clip_and_noise(delta: dict, clip_norm: float, sigma: float, rng) -> dict:
    # Clip the whole delta (all tensors jointly) to L2 norm C, then add
    # per-coordinate Gaussian noise with std sigma * C -- the Gaussian mechanism.
    flat = np.concatenate([t.reshape(-1) for t in delta.values()])
    norm = float(np.linalg.norm(flat))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    out = {}
    for name, tensor in delta.items():
        clipped = tensor * scale
        noise = rng.normal(0.0, sigma * clip_norm, size=tensor.shape).astype(tensor.dtype)
        out[name] = clipped + noise
    return out
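A quick check of the clipping invariant the function relies on: scaling by min(1, C/‖x‖) guarantees the output norm never exceeds C, while a delta already inside the ball passes through unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 1.0
big = rng.normal(size=1000) * 5.0      # norm far above C: gets scaled down to C
small = rng.normal(size=1000) * 1e-3   # norm below C: scale factor is exactly 1

for x in (big, small):
    scale = min(1.0, C / (np.linalg.norm(x) + 1e-12))
    assert np.linalg.norm(x * scale) <= C + 1e-9

# An in-bounds delta is untouched by clipping.
assert np.allclose(small * min(1.0, C / (np.linalg.norm(small) + 1e-12)), small)
```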

Per-round accountant: (eps_round, delta_round) = analyze_gaussian(sigma, q=clients_per_round/total_clients). The composition over T rounds is reported in the manifest so a security reviewer can audit the budget against the buyer's policy.
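analyze_gaussian's internals are not shown here, but the classic analytic bound for the Gaussian mechanism gives a rough sanity check on the receipt's numbers (this sketch ignores the subsampling rate q and multi-round composition, both of which a real moments/RDP accountant handles and tightens):

```python
from math import log, sqrt

def gaussian_eps(sigma: float, delta: float) -> float:
    """Classic (eps, delta) bound for the Gaussian mechanism at sensitivity 1.
    The clipped delta has sensitivity C and the noise std is sigma * C, so C
    cancels and only the noise multiplier sigma matters."""
    return sqrt(2 * log(1.25 / delta)) / sigma

eps = gaussian_eps(1.1, 1e-5)   # roughly 4.4 for the receipt's parameters
```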

Round orchestration

Step | Coordinator | Client
1. Select | Picks a subset of clients (typically 30-100 out of the cohort) and broadcasts the round seed | Receives the seed; loads the current global adapter
2. Train | Idle | Trains LoRA for E local epochs on its private data (typical: 1-3)
3. Clip + noise | Idle | Computes ΔA, ΔB; clips to L2 norm C; adds Gaussian noise
4. Mask | Idle | Adds Bonawitz pairwise masks for every peer in this round
5. Upload | Receives masked, noised delta | Sends masked, noised delta; HMAC-signs the payload
6. Aggregate | Sums the deltas weighted by n_examples; masks cancel pairwise; emits new global adapter + round receipt | Idle

The round receipt records round_id, seed, n_clients, eps_cumulative, delta_cumulative, k_score, adapter_cid. The chain is HMAC-linked round to round so a reviewer can verify the trajectory.
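The HMAC link can be sketched in a few lines (`chain_tag` and `verify_chain` are illustrative names; the real receipt layout is kolm's):

```python
import hmac
import json
from hashlib import sha256

def chain_tag(key: bytes, prev_tag: bytes, receipt: dict) -> bytes:
    """Bind one round's receipt to the previous round's tag. Canonical JSON
    (sorted keys, no whitespace) keeps the hashed bytes stable."""
    payload = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_tag + payload, sha256).digest()

def verify_chain(key: bytes, genesis: bytes,
                 receipts: list[dict], tags: list[bytes]) -> bool:
    """Replay from the genesis tag; editing any receipt breaks every later tag."""
    prev = genesis
    for receipt, tag in zip(receipts, tags):
        if chain_tag(key, prev, receipt) != tag:
            return False
        prev = tag
    return True
```

The chaining is what lets a reviewer verify the whole trajectory from a single trusted starting tag.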

Edge cases worth naming

Dropout during a round. If client j goes offline after the seed exchange but before upload, every peer's mask that included j is still in the sum and there is no twin to cancel it. The fix is the secure-aggregation paper's standard remedy: have clients secret-share their PRG seeds with t-of-n peers at the start of the round, and have the coordinator reconstruct the seeds of dropped clients to unmask the survivors. kolm uses a Shamir t-of-n scheme with t=ceil(n/2)+1; the receipt records how many reconstructions happened.
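Shamir t-of-n sharing itself fits in a short sketch over a prime field (the field choice and API here are illustrative, not kolm's wire format; note that with n=32 clients per round, t = ceil(32/2)+1 = 17 matches the shamir_threshold in the receipt):

```python
import secrets

P = 2**127 - 1  # a Mersenne prime; the field is an illustrative choice

def share(secret: int, t: int, n: int) -> list[tuple[int, int]]:
    """Split a secret into n points on a random degree-(t-1) polynomial;
    any t shares reconstruct it, t-1 reveal nothing."""
    coeffs = [secret % P] + [secrets.randbelow(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over GF(P)."""
    total = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total
```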

The DP accountant is the limit, not the loss. Buyers think the federated job runs until the loss converges. With DP it runs until the privacy budget is exhausted, then it stops. The trainer reports the budget burn rate at every round so the buyer can either lower σ (less noise, faster budget burn, larger per-round footprint) or accept fewer rounds.

Distribution drift between clients. Clinic A's notes and clinic B's notes are not drawn from the same distribution, and naive FedAvg can stall or oscillate on non-IID data. The standard mitigations are FedProx (a proximal regularizer that pulls local weights toward the global model) and per-client adapter warmup. kolm defaults to a 1-epoch warmup against a small shared seed dataset; the seed dataset itself is in the manifest.
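FedProx adds a proximal term μ/2‖w − w_global‖² to each client's local objective; its only effect on the update is an extra gradient term pulling toward the global weights. A one-function sketch (`fedprox_grad` and the μ value are hypothetical):

```python
import numpy as np

def fedprox_grad(grad_local: np.ndarray, w: np.ndarray,
                 w_global: np.ndarray, mu: float = 0.01) -> np.ndarray:
    """Local task gradient plus the proximal pull mu * (w - w_global),
    which damps client drift under non-IID data."""
    return grad_local + mu * (w - w_global)
```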

Adversarial clients. Secure aggregation prevents an honest-but-curious coordinator from seeing individual updates, not a malicious client from sending a poisoned one. kolm clips before noising (which bounds any single client's contribution) and reports the per-round update-norm distribution so an outlier client can be flagged offline.
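The offline norm-distribution check can be sketched with a robust z-score (`flag_outliers` and the 3-MAD threshold are illustrative choices, not kolm's exact rule):

```python
import numpy as np

def flag_outliers(norms: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag clients whose pre-clip update norm deviates from the round median
    by more than k robust standard deviations (1.4826 * MAD ~ one std under
    a normal distribution)."""
    med = np.median(norms)
    mad = np.median(np.abs(norms - med)) + 1e-12
    return np.abs(norms - med) > k * 1.4826 * mad
```

Because clipping already bounds what a poisoned update can do to the aggregate, flagging is a forensic signal rather than a live defense.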

What the receipt records

"federated": {
  "method": "fedavg_lora_bonawitz_dp",
  "n_rounds": 24,
  "clients_per_round": 32,
  "total_clients": 117,
  "local_epochs": 2,
  "clip_norm": 1.0,
  "sigma_dp": 1.1,
  "eps_cumulative": 4.8,
  "delta_cumulative": 1.0e-5,
  "shamir_threshold": 17,
  "dropouts_recovered": 6,
  "papers": [
    "arXiv:1602.05629",
    "Bonawitz-CCS-2017",
    "arXiv:2007.14390",
    "Abadi-CCS-2016"
  ]
}

The canonical-JSON manifest hash covers this block, so a tampered receipt invalidates the artifact signature. The auditor checks the privacy budget against the buyer's stated policy and the dropout-recovery count against the secure-aggregation paper's threshold.

Where federated compile fits in the stack

A central regulator (Mayo Clinic's data office, a bank's CISO, a law firm's risk committee) cannot move data out, and a single tenant's data is not enough to make the model competitive against a frontier baseline. Federation is the workaround. kolm's stance is that federation is one tool among several: a single-tenant on-prem job is simpler and should be tried first; federation is the layer you add when the model needs cross-site signal and the cross-site signal cannot move. The federated adapter ships through the same K-score gate and HMAC chain as a single-tenant adapter; the difference is that the receipt has the privacy accountant attached.

Citations

McMahan, H. B. et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv:1602.05629, 2017. The FedAvg paper.

Bonawitz, K. et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning. CCS 2017. Pairwise PRG-derived masks and t-of-n Shamir recovery for dropouts.

Beutel, D. J. et al. Flower: A Friendly Federated Learning Framework. arXiv:2007.14390, 2020.

Abadi, M. et al. Deep Learning with Differential Privacy. CCS 2016. The clip-and-noise mechanism and the moments accountant.