Serving 1,000 LoRAs from one model
A LoRA adapter is roughly 1% the size of the base model. If serving one model costs you 80 GB of VRAM, serving a thousand adapters on top of it should not cost you another 80,000 GB. That is the entire premise.
The shape of the problem
Kolm artifacts are LoRA adapters. A buyer with twenty captures gets one adapter. A buyer with twenty thousand captures still gets one adapter. After a year, the same tenant has dozens of adapters: one per task, per skill, per persona. Across all tenants, the count is in the thousands.
The naive way to serve them is to pre-merge each adapter into the base and run a separate model server per artifact. This works at the scale of "ten" and falls over at "a thousand": the base model is duplicated thousands of times, VRAM goes linear with tenant count, and a cold artifact pays the full warm-up cost.
What S-LoRA changed
Sheng 2023 (S-LoRA) made the observation that the base model and the adapters live on different time-scales. The base is hot all the time. Any given adapter is hot for the seconds while a request is in flight. So load the base once, and stream adapters in and out of GPU memory on demand.
The hard part is the kernel. A single batch may have requests bound to different adapters. You cannot run one matmul for the whole batch; you need a kernel that applies x · W_base + x · (A_i · B_i) where i varies row by row. S-LoRA wrote that kernel. Punica (Chen 2023) wrote a sibling kernel optimized for the heterogeneous case.
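What that kernel computes can be written down in a few lines. This is a NumPy sketch for clarity only: the real S-LoRA and Punica kernels fuse the per-row gather into a single GPU launch, and the function name here is illustrative, not either paper's API.

```python
import numpy as np

def batched_lora_forward(x, W_base, adapters, adapter_ids):
    """x: (batch, d_in); W_base: (d_in, d_out);
    adapters: list of (A, B) pairs, A: (d_in, r), B: (r, d_out);
    adapter_ids: which adapter each row of x is bound to."""
    y = x @ W_base                      # one matmul for the whole batch
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        y[i] += (x[i] @ A) @ B          # per-row low-rank correction
    return y
```

The point of the fused kernel is exactly that the loop disappears: rows sharing an adapter are gathered so the low-rank term is also a batched matmul.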
The math
For a base of size B and adapter of size a (where a ≈ 0.01·B):
| Strategy | VRAM (N adapters) | Time-to-first-token |
|---|---|---|
| One server per adapter | N·B | 0 (always warm) |
| Pre-merge on demand | 1·B + warm copies | seconds (full merge) |
| S-LoRA / Punica | 1·B + k·a (k active adapters) | milliseconds (adapter stream) |
For 1,000 adapters on Qwen2.5-7B (14 GB base, ~80 MB per adapter), one server per adapter costs 14,000 GB. Pooled, keeping every adapter resident costs 94 GB, and keeping a 64-adapter working set resident (the rest in host memory, streamed in on demand) costs about 19 GB: one A100, with room to spare for KV cache. The naive strategy is a cluster.
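The arithmetic, in MB so it stays exact (the 64-adapter working set matches the `max_active_adapters=64` default used below):

```python
base_mb = 14_000            # Qwen2.5-7B base, ~14 GB
adapter_mb = 80             # one LoRA adapter, ~80 MB
n = 1000

per_server = n * base_mb                 # one merged server per adapter
pooled_all = base_mb + n * adapter_mb    # every adapter resident in VRAM
pooled_64 = base_mb + 64 * adapter_mb    # LRU working set of 64

print(per_server // 1000, pooled_all // 1000, pooled_64 / 1000)  # → 14000 94 19.12
```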
The kolm pattern
POST /v1/run

```json
{
  "artifact": "alice/refund-flagger@1.4.2",
  "input": "ticket text..."
}
```
The runtime looks up the artifact's base_model, finds the server that already has that base loaded, queues the request, and streams in the adapter weights at the position where the request is in the batch. The base server handles requests for any artifact that shares the same base.
```python
from apps.runtime.adapter_pool import AdapterPool

pool = AdapterPool(
    base="qwen2.5-7b-instruct",
    max_active_adapters=64,
    eviction="lru",
)
pool.serve(request, adapter_cid="cidv1:sha256:...")
```
max_active_adapters caps how many sit in GPU memory at once. eviction="lru" drops the least-recently-used when a new one comes in. Cold misses pay the adapter load time (~50 ms for an 80 MB LoRA on PCIe Gen4), not the base warm-up (~3 s).
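The eviction logic itself is small. A toy model of it, using an OrderedDict as the LRU (the real pool manages GPU buffers, not Python dicts, and `LruAdapterCache` is a name invented here):

```python
from collections import OrderedDict

class LruAdapterCache:
    """Toy model of max_active_adapters + eviction="lru"."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()   # cid -> adapter weights

    def get(self, cid, load_fn):
        """Return (weights, cold_miss)."""
        if cid in self._cache:
            self._cache.move_to_end(cid)     # warm hit: refresh recency
            return self._cache[cid], False
        weights = load_fn(cid)               # cold miss: pay the load cost
        self._cache[cid] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least-recently-used
        return weights, True
```

The second element of the return value is exactly the `cold_miss` bit the receipt records.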
How adapters are addressed
An artifact's identity is its CID: cidv1:sha256:<hex> over the canonical-JSON of its manifest. The pool keys on CID, not on tenant slug. Two different tenants who happen to have identical artifacts share the same in-memory copy. This rarely happens with private adapters but matters for distill-from-base templates that ship pinned.
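A sketch of that addressing, assuming "canonical JSON" means sorted keys and compact separators (the document does not pin the exact canonicalization, so treat this as illustrative):

```python
import hashlib
import json

def artifact_cid(manifest: dict) -> str:
    # Canonical JSON: sorted keys, no insignificant whitespace, UTF-8.
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return "cidv1:sha256:" + hashlib.sha256(canonical).hexdigest()
```

Because the hash is over canonical bytes, two manifests with the same content but different key order yield the same CID, which is what lets the pool deduplicate identical artifacts across tenants.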
What the receipt records
```json
{
  "serving": {
    "base_model": "qwen2.5-7b-instruct@a1b2c3d",
    "adapter_cid": "cidv1:sha256:7f3a...",
    "pool_id": "us-west-2/h100-pool-3",
    "cold_miss": false,
    "queue_depth": 4,
    "latency_ms": 142
  }
}
```
If a buyer's latency spikes, the receipt tells you why: cold miss, deep queue, or unrelated (model itself). This is the lever for an SLO conversation that does not devolve into telemetry archaeology.
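The receipt fields map directly onto a triage rule. A sketch, with an invented function name and an illustrative queue-depth threshold:

```python
def diagnose_latency(receipt: dict) -> str:
    """Classify a latency spike from the serving receipt alone."""
    s = receipt["serving"]
    if s["cold_miss"]:
        return "cold miss: adapter was streamed in for this request"
    if s["queue_depth"] > 8:   # illustrative threshold, not an SLO
        return "queueing: pool is saturated"
    return "model: latency is in the forward pass itself"
```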
Edge cases
Mixed ranks. Adapters trained with different LoRA ranks (8/16/64/128) can share a pool. The kernel pads to the max rank in the active batch. Wider ranks cost a hair more per token; narrower ones are free-riders.
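Padding is exact, not approximate: the extra columns of A and rows of B are zeros, so they contribute nothing to the output. A NumPy sketch:

```python
import numpy as np

def pad_to_rank(A, B, r_max):
    """Zero-pad a rank-r adapter (A: d_in x r, B: r x d_out) to r_max."""
    d_in, r = A.shape
    _, d_out = B.shape
    A_pad = np.zeros((d_in, r_max))
    A_pad[:, :r] = A
    B_pad = np.zeros((r_max, d_out))
    B_pad[:r, :] = B
    return A_pad, B_pad
```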
Target-module differences. Some adapters touch q_proj, k_proj, v_proj, o_proj; others add gate_proj, up_proj, down_proj. The pool handles this by computing each target module independently. The mismatch is invisible to the caller.
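Per-module dispatch reduces to a lookup: if the adapter defines no (A, B) pair for a module, the base output passes through untouched. A sketch with a hypothetical helper (`apply_adapter` is not kolm's API):

```python
import numpy as np

def apply_adapter(hidden, module_name, base_out, adapter):
    """Add a LoRA delta only if this adapter targets the module."""
    pair = adapter.get(module_name)   # e.g. "q_proj" -> (A, B), or absent
    if pair is None:
        return base_out               # module untouched by this adapter
    A, B = pair
    return base_out + (hidden @ A) @ B
```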
Quantization. S-LoRA was written for fp16 bases. With AWQ int4 bases, the same idea works but the kernel changes (you are dequantizing the base on the fly, then adding the LoRA contribution). vLLM supports this in the punica backend as of v0.6.
Hot vs cold. Pre-warm the top 10 adapters per pool at startup. Use a small in-cluster cache (Redis or in-memory) keyed on CID. This is what kolm does in production.
Where this sits in the kolm loop
Capture, compile, evaluate, swap. The swap step writes a new artifact CID into the registry. The runtime watches the registry and, on the next request for that tenant slug, looks up the new CID and pulls the new adapter into the pool. Old adapters age out via LRU. No restarts, no warm-up waits.
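The swap path, sketched with a hypothetical registry interface (`resolve` is a stand-in; `pool.serve` mirrors the AdapterPool call shown earlier):

```python
def handle_request(registry, pool, tenant_slug, request):
    # The registry maps a tenant slug to its current artifact CID;
    # a swap just rewrites this mapping.
    cid = registry.resolve(tenant_slug)
    # The pool streams the adapter in on a cold miss and LRU-evicts
    # stale ones, so a swapped artifact needs no restart or warm-up.
    return pool.serve(request, adapter_cid=cid)
```

Nothing in the request path is pinned to a specific adapter version: the slug resolves to whatever CID the registry currently holds.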
Citations
Sheng et al. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285
Chen et al. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547
Hu et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685