
Serving 1,000 LoRAs from one model

A LoRA adapter is roughly 1% the size of the base model. If serving one model costs you 80 GB of VRAM, serving a thousand adapters on top of it should not cost you another 80,000 GB. That is the entire premise.

2026-05-14 · kolm research

The shape of the problem

Kolm artifacts are LoRA adapters. A buyer with twenty captures gets one adapter. A buyer with twenty thousand captures still gets one adapter. After a year, the same tenant has dozens of adapters: one per task, per skill, per persona. Across all tenants, the count is in the thousands.

The naive way to serve them is to pre-merge each adapter into the base and run a separate model server per artifact. This works at the scale of "ten" and falls over at "a thousand": the base model is duplicated once per artifact, VRAM grows linearly with artifact count, and a cold artifact pays the full warm-up cost.

What S-LoRA changed

Sheng et al. 2023 (S-LoRA) observed that the base model and the adapters live on different timescales. The base is hot all the time. Any given adapter is hot only for the seconds a request is in flight. So load the base once, and stream adapters in and out of GPU memory on demand.

The hard part is the kernel. A single batch may have requests bound to different adapters. You cannot run one matmul for the whole batch; you need a kernel that computes x · W_base + x · (A_i · B_i), where the adapter index i varies row by row. S-LoRA wrote that kernel. Punica (Chen et al. 2023) wrote a sibling kernel, SGMV, optimized for the heterogeneous case.
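
For concreteness, here is the unfused reference version of that computation in plain PyTorch. Tensor names are illustrative; the real kernels fuse the gather and the matmuls rather than materializing per-row adapter copies:

import torch

def batched_lora_forward(x, W_base, A_stack, B_stack, adapter_ids):
    # x           (batch, d_in)          one request per row
    # W_base      (d_in, d_out)          shared base weight
    # A_stack     (n_adapters, d_in, r)  stacked LoRA down-projections
    # B_stack     (n_adapters, r, d_out) stacked LoRA up-projections
    # adapter_ids (batch,) long          which adapter each row is bound to
    base_out = x @ W_base                    # one matmul for the whole batch
    A = A_stack[adapter_ids]                 # gather each row's adapter: (batch, d_in, r)
    B = B_stack[adapter_ids]                 # (batch, r, d_out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)
    return base_out + delta                  # x·W_base + x·(A_i·B_i), i per row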

The math

For a base of size B and adapters of size a_i (where a_i ≈ 0.01·B):

Strategy               | VRAM (N adapters)                            | Time-to-first-token
One server per adapter | N·B                                          | 0 (always warm)
Pre-merge on demand    | 1·B + warm merged copies                     | seconds (full merge)
S-LoRA / Punica        | 1·B + Σ a_i over active adapters (~1.01·B)   | milliseconds (adapter stream)

For 1,000 adapters on Qwen2.5-7B (14 GB base in fp16, ~80 MB per adapter), that is 14,000 GB versus 94 GB. The former is a cluster. The latter fits on one A100 plus host RAM, since only the handful of active adapters need to be resident in VRAM at any moment.
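
A back-of-envelope check of those numbers:

N = 1_000
base_gb = 14.0          # Qwen2.5-7B weights in fp16
adapter_gb = 0.08       # ~80 MB per LoRA

one_server_each = N * base_gb            # 14,000 GB: base duplicated per artifact
shared_base = base_gb + N * adapter_gb   # 94 GB total weight bytes (GPU + host);
                                         # only active adapters sit in VRAM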

The kolm pattern

POST /v1/run
{
  "artifact": "alice/refund-flagger@1.4.2",
  "input": "ticket text..."
}

The runtime looks up the artifact's base_model, finds a server that already has that base loaded, queues the request, and streams in the adapter weights so the kernel applies them at that request's rows in the batch. One base server handles requests for every artifact that shares its base.
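
A minimal sketch of that dispatch step, with hypothetical registry and pool objects (none of these names are kolm's actual API):

def route(request, registry, pools):
    # Resolve "alice/refund-flagger@1.4.2" to its manifest (base model + CID).
    manifest = registry.resolve(request["artifact"])
    # Pick the server group that already has this base resident in VRAM.
    pool = pools[manifest["base_model"]]
    # Enqueue; the pool streams the adapter in if it is not already active.
    return pool.enqueue(request["input"], adapter_cid=manifest["cid"])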

from apps.runtime.adapter_pool import AdapterPool

pool = AdapterPool(
    base="qwen2.5-7b-instruct",   # loaded once, shared by every adapter
    max_active_adapters=64,       # cap on adapters resident in GPU memory
    eviction="lru",               # drop least-recently-used on overflow
)
pool.serve(request, adapter_cid="cidv1:sha256:...")

max_active_adapters caps how many sit in GPU memory at once. eviction="lru" drops the least-recently-used when a new one comes in. Cold misses pay the adapter load time (~50 ms for an 80 MB LoRA on PCIe Gen4), not the base warm-up (~3 s).
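
The eviction policy itself is a few lines. A minimal sketch, illustrative rather than the production AdapterPool:

from collections import OrderedDict

class LruAdapterCache:
    def __init__(self, max_active):
        self.max_active = max_active
        self.active = OrderedDict()            # cid -> weights, oldest first

    def get(self, cid, load_fn):
        if cid in self.active:
            self.active.move_to_end(cid)       # warm hit: refresh recency
            return self.active[cid]
        if len(self.active) >= self.max_active:
            self.active.popitem(last=False)    # evict least-recently-used
        weights = load_fn(cid)                 # cold miss: pay the ~50 ms load
        self.active[cid] = weights
        return weights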

How adapters are addressed

An artifact's identity is its CID: cidv1:sha256:<hex> over the canonical-JSON of its manifest. The pool keys on CID, not on tenant slug. Two different tenants who happen to have identical artifacts share the same in-memory copy. This rarely happens with private adapters but matters for distill-from-base templates that ship pinned.
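
In code, the addressing step looks roughly like this (the exact canonicalization rules are an assumption; the point is that the key is content-derived):

import hashlib, json

def manifest_cid(manifest: dict) -> str:
    # Canonical JSON: sorted keys, no whitespace. (Assumed convention.)
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"cidv1:sha256:{digest}"

Identical manifests hash to identical CIDs, which is what lets two tenants share one in-memory copy.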

What the receipt records

{
  "serving": {
    "base_model": "qwen2.5-7b-instruct@a1b2c3d",
    "adapter_cid": "cidv1:sha256:7f3a...",
    "pool_id": "us-west-2/h100-pool-3",
    "cold_miss": false,
    "queue_depth": 4,
    "latency_ms": 142
  }
}

If a buyer's latency spikes, the receipt tells you why: cold miss, deep queue, or unrelated (model itself). This is the lever for an SLO conversation that does not devolve into telemetry archaeology.
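
A triage over those fields can be mechanical. A sketch, with a made-up queue-depth threshold:

def diagnose(receipt: dict) -> str:
    s = receipt["serving"]
    if s["cold_miss"]:
        return "adapter was streamed in on this request (cold miss)"
    if s["queue_depth"] > 8:                 # threshold is illustrative
        return f"request sat behind {s['queue_depth']} others in the queue"
    return "pool was warm and shallow: look at the model itself"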

Edge cases

Mixed ranks. Adapters trained with different LoRA ranks (8/16/64/128) can share a pool. The kernel pads to the max rank in the active batch. Wider ranks cost a hair more per token; narrower ones are free-riders.
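
A sketch of the padding step shows why narrower ranks are free: the padded columns and rows are zero, so they add nothing to the output (tensor layout as in the kernel sketch above):

import torch
import torch.nn.functional as F

def pad_to_max_rank(As, Bs):
    # As: list of (d_in, r_i); Bs: list of (r_i, d_out); ranks vary per adapter.
    r_max = max(A.shape[1] for A in As)
    A_stack = torch.stack([F.pad(A, (0, r_max - A.shape[1])) for A in As])
    B_stack = torch.stack([F.pad(B, (0, 0, 0, r_max - B.shape[0])) for B in Bs])
    return A_stack, B_stack   # zero-padded: extra rank slots contribute nothing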

Target-module differences. Some adapters touch q_proj, k_proj, v_proj, o_proj; others add gate_proj, up_proj, down_proj. The pool handles this by computing each target module independently. The mismatch is invisible to the caller.

Quantization. S-LoRA was written for fp16 bases. With AWQ int4 bases, the same idea works but the kernel changes (you are dequantizing the base on the fly, then adding the LoRA contribution). vLLM supports this in its Punica backend as of v0.6.
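
If you are reproducing this with off-the-shelf vLLM rather than a custom runtime, the multi-LoRA path looks like this (model id and adapter path are illustrative; check your vLLM version for quantized-base + LoRA support):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",   # int4 AWQ base
    quantization="awq",
    enable_lora=True,
    max_loras=64,            # adapters resident on GPU at once
    max_lora_rank=64,
)
out = llm.generate(
    "ticket text...",
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("refund-flagger", 1, "/adapters/refund-flagger"),
)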

Hot vs cold. Pre-warm the top 10 adapters per pool at startup. Use a small in-cluster cache (Redis or in-memory) keyed on CID. This is what kolm does in production.

Where this sits in the kolm loop

Capture, compile, evaluate, swap. The swap step writes a new artifact CID into the registry. The runtime watches the registry and, on the next request for that tenant slug, looks up the new CID and pulls the new adapter into the pool. Old adapters age out via LRU. No restarts, no warmup waits.
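
The runtime side of the swap is just a repointing. A sketch with hypothetical names:

class RegistryWatcher:
    def __init__(self):
        self.slug_to_cid = {}             # tenant slug -> current artifact CID

    def on_swap(self, slug, new_cid):
        self.slug_to_cid[slug] = new_cid  # repoint; nothing loads eagerly

    def resolve(self, slug):
        # The next request sees the new CID: one cold miss in the pool,
        # then warm. The old adapter ages out via LRU.
        return self.slug_to_cid[slug]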

Citations

Sheng et al. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285

Chen et al. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547

Hu et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
