Research · Compute · 2026-05-14 · 11 min read

Compute backend selection: local vs cloud.

Fourteen backends, one picker, one scoring function. The local-first default holds for most compile jobs; cloud earns its slot when the artifact will not fit on your laptop. What the picker rewards, what it punishes, and the commands to override it.

By kolm
Tags: compute · dispatch · cost

The 14-backend registry.

The kolm compute layer dispatches every training and inference job to one of fourteen backends. Six are local. Seven are cloud. One is your own remote box over SSH. Every backend is declared in src/compute/registry.json with the same field set: kind, train, infer, airgap, cost_per_hour_usd, cold_start_seconds, vram_cap_gb, auth, tier. The CLI exposes the picker through kolm compute list|detect|pick|use|info|test|status.

# every backend the CLI knows about
$ kolm compute list
LOCAL
  local-cpu       tier 1  always available     0 $/hr   torch-cpu
  local-cuda      tier 1  NVIDIA               0 $/hr   torch-cu12 + unsloth + peft
  local-mps       tier 1  Apple Silicon        0 $/hr   torch-mps + peft
  local-mlx       tier 2  Apple Silicon        0 $/hr   mlx-lm
  local-rocm      tier 2  AMD                  0 $/hr   torch-rocm6 + peft
  local-directml  tier 2  Windows DX12         0 $/hr   torch-directml
CLOUD
  modal           tier 1  KOLM_MODAL_TOKEN     2.50 $/hr   serverless GPU
  runpod          tier 1  KOLM_RUNPOD_TOKEN    1.20 $/hr   pods / serverless
  vast            tier 1  KOLM_VAST_TOKEN      0.50 $/hr   marketplace SSH
  together        tier 1  KOLM_TOGETHER_TOKEN  per-token  managed LoRA
  lambda          tier 2  KOLM_LAMBDA_TOKEN    2.00 $/hr   on-demand SSH
  replicate       tier 2  KOLM_REPLICATE_TOKEN per-call   Cog containers
  fal             tier 2  KOLM_FAL_TOKEN       per-call   inference only
SELF-HOSTED
  remote-ssh      tier 1  SSH + KOLM_REMOTE_HOST  0 $/hr   your own GPU

The split is intentional. Local backends are co-located with the training data and the artifact bytes; they cost nothing per hour, they survive an offline week, and they reproduce bit-for-bit when the toolchain is pinned. Cloud backends rent silicon you cannot afford to keep idle, with cold starts that vary by an order of magnitude across vendors. The remote-ssh backend is a third class: you bring the GPU, kolm drives it from your laptop.
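In registry terms, each of the fourteen is one JSON object carrying the nine fields listed above. As a point of reference, a modal entry might look roughly like this; the cost, tier, auth variable, cold-start, and airgap values are taken from elsewhere in this post, while kind, train, infer, and vram_cap_gb are illustrative guesses rather than the shipped registry:

# src/compute/registry.json — one entry, sketched (some values illustrative)
"modal": {
  "kind": "cloud",
  "train": true,
  "infer": true,
  "airgap": false,
  "cost_per_hour_usd": 2.50,
  "cold_start_seconds": 5,
  "vram_cap_gb": 80,
  "auth": "KOLM_MODAL_TOKEN",
  "tier": 1
}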

The scoring formula.

When you run kolm compute pick, the picker computes a score in [0, 1] for every backend that passes your constraint filter, then returns the top result with up to three runners-up. The score function is intentionally short:

# src/compute/index.js, scoreBackend()
S = 0.35 * available
  + 0.20 * cost_inv
  + 0.15 * latency_inv
  + 0.15 * repro
  + 0.15 * perf

Each term lives on [0, 1]. The weights are not load-bearing; they encode a stance.

The tightest call the picker makes is local-cpu vs an idle modal token. CPU wins on availability, cost, latency, and reproducibility; modal wins decisively on performance. The 0.15 perf weight tips the call in modal's favor only when modal's cost_inv term stays above 0.5 (so any modal-class GPU under ~$2.50/hr). That arithmetic is the whole reason "CPU is not the move" became a default the registry could enforce.
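Here is that call spelled out as a sketch. The weights are the published ones; the per-term values and the plain-object input shape are illustrative assumptions, not the shipped scoreBackend():

// weighted sum from src/compute/index.js, with guessed term values
function scoreBackend(t) {
  return 0.35 * t.available
       + 0.20 * t.cost_inv
       + 0.15 * t.latency_inv
       + 0.15 * t.repro
       + 0.15 * t.perf;
}

// local-cpu: ahead on every axis except perf
const cpu = scoreBackend({ available: 1, cost_inv: 1.0, latency_inv: 1.0, repro: 1.0, perf: 0.1 });
// modal with a configured token: pays on cost, latency, repro; wins on perf
const modal = scoreBackend({ available: 1, cost_inv: 0.65, latency_inv: 0.9, repro: 0.6, perf: 1.0 });

console.log(cpu.toFixed(3), modal.toFixed(3)); // 0.865 0.855

With these guesses the CPU edges out modal by a hundredth; bump modal's cost_inv to 0.75 and the order flips, which is the threshold the paragraph above describes.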

When local beats cloud.

The four cases where the local default is right by construction.

Regulated data. Healthcare, finance, defense. Any training corpus that is subject to HIPAA, GDPR, SR 11-7, or an export-controlled boundary cannot leave the regulated network. A cloud backend implies a Business Associate Agreement or a Standard Contractual Clause exchange before the first byte ships. Local backends ship zero bytes. The constraint is binary; the picker's --airgap flag enforces it:

$ kolm compute pick --airgap
picked:  local-mps
device:  apple m2 max (38 cores, 64 GB unified)
score:   0.842
reason:  available; score 0.842 (local)
runners-up: local-cpu (0.701), local-mlx (0.832 not detected)

Repeated training. A compile pipeline that runs nightly to incorporate the last 24 hours of captured traffic is a fixed-cost workload. On modal at $2.50/hr times one hour per night, that is roughly $912/year of pure compute rent before the first user sees a benefit; the arithmetic is sketched after these four cases. The same M2 Max (64 GB of unified memory) already on the engineer's desk amortizes to zero.

Small models. 0.5B to 3B base, LoRA rank 8 or 16, an eval set under 200 cases. The whole compile fits in 12 GB of unified memory on Apple Silicon and finishes in under 20 minutes on a recent M-series chip. Cold-starting a cloud worker and shipping the corpus to it can take longer than the local run.

Dev iteration. The expensive part of training a fine-tune is the eighteenth attempt, not the first. Local backends keep the loop tight: the recipe pack lives next to the captured corpus, the K-score lands in stdout, the failed compile leaves a directory you can git diff. Pushing every iteration through a cloud queue compounds latency per attempt.
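The repeated-training arithmetic, spelled out. The rate and cadence come from that case above; the helper itself is just a convenience, not part of the CLI:

// annual rent for a nightly cloud compile at a fixed hourly rate
function annualCloudRent(usdPerHour, hoursPerNight, nightsPerYear = 365) {
  return usdPerHour * hoursPerNight * nightsPerYear;
}

console.log(annualCloudRent(2.50, 1)); // 912.5 — the ~$912/year figure; the local box amortizes to 0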

When cloud beats local.

The three cases where the cloud backend is the right default.

One-shot big training. 70B base, full fine-tune, an eleven-figure parameter count touched. Even with a top-end consumer GPU, the VRAM cap rules out the run. kolm compute pick --min-vram 80 filters down to backends with at least 80 GB of accelerator memory, which is the full cloud slate plus any local card a hyperscaler engineer happens to own.

70B+ models, occasionally. The pattern is: most artifacts are 3B-7B local compiles, but once a quarter there is a 70B variant for a specific high-stakes task. Provisioning a permanent H100 for one job a quarter is wrong; renting modal for the four-hour compile is right. The picker carries the same constraint set across both: the receipt chain, the K-score gate, and the artifact format are identical.

Occasional bursts. A demo, a customer evaluation, a one-time data migration. The amortization argument inverts: there is no second run. Cloud cost is a single line item, not a recurring obligation. The picker's --budget flag is the steering wheel:

$ kolm compute pick --budget 1.00
picked:  vast
device:  rtx 4090 (24 GB)
score:   0.766
reason:  available; score 0.766 (cloud-marketplace)
filtered out: modal ($2.50/hr > 1.00), runpod ($1.20/hr > 1.00), lambda ($2.00/hr > 1.00)
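All three flags seen so far (--airgap, --budget, --min-vram) are filter passes that run before scoring. A rough sketch of what that pass might look like, assuming registry entries shaped like the excerpt earlier; the function name and option shape are illustrative, not the shipped picker:

// constraint pass before scoring — names and shape are guesses
function applyConstraints(backends, { airgap, budget, minVram } = {}) {
  return backends.filter((b) => {
    if (airgap && !b.airgap) return false;                             // --airgap drops every cloud backend
    if (budget != null && b.cost_per_hour_usd > budget) return false;  // --budget 1.00 drops modal, runpod, lambda
    if (minVram != null && (b.vram_cap_gb ?? 0) < minVram) return false; // --min-vram 80 keeps 80 GB-class only
    return true;
  });
}

Everything that survives the filter gets scored; everything that does not shows up on the filtered out line, as in the --budget run above.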

Worked examples.

Three concrete invocations and what they teach.

Example one: airgap mode on a laptop.

$ kolm compute pick --airgap
picked:  local-cuda
device:  rtx 4090 laptop (16 GB)
score:   0.864
$ kolm compute use local-cuda
# default backend written to ~/.kolm/config.json
$ kolm compile -t support_ticket_router -d ./tickets.jsonl
[trainer]     using local-cuda (env: KOLM_BACKEND=local-cuda)
[trainer]     no outbound network calls; airgap mode active
[K-score]     0.917 (ships at 0.85 gate)
[artifact]    ./support_ticket_router-1.0.0.kolm  3.8 GB

The picker filtered out every cloud backend on the airgap: false flag and ranked local-cuda above local-cpu because the discrete GPU has a higher perf bias. The compile produced a signed artifact, the receipt chain was written, no byte left the laptop.

Example two: budget-constrained one-shot.

$ KOLM_VAST_TOKEN=vk_xx kolm compute pick --budget 5 --min-vram 80
picked:  vast
device:  h100 80gb
score:   0.778
reason:  available; score 0.778 (cloud-marketplace)
runners-up: runpod (0.752), lambda (0.720), modal (0.681)

Modal was eligible (under the $5 budget) but lost on the cost axis. Runpod is closer to vast on cost but eats a 60-second cold start. The picker logs every runner-up so you can audit the choice later.

Example three: smoke-test a backend before depending on it.

$ KOLM_MODAL_TOKEN=mk_xx kolm compute test modal
[1/4]     auth check                        ok
[2/4]     container provision (cold)        4.8s
[3/4]     trainer dry run (no weights)      ok
[4/4]     receipt round-trip                ok
PASS     modal is ready for kolm compile

The test verb exists because the worst failure mode is the one that surfaces inside kolm compile when you have already spent an hour preparing the corpus. kolm compute test exercises auth, provisioning, the trainer entry point, and a receipt round-trip in under thirty seconds.

Honest tradeoffs.

Three real costs the picker cannot abstract away.

Cold-start cost. Cloud cold starts are charged against your wallclock, not the backend's. Modal at 5 seconds is acceptable for a 20-minute compile; runpod at 60 seconds is acceptable for a 4-hour run. The score function encodes this, but the score is on [0, 1]; when you are watching a queue, the difference between 5 seconds and 90 seconds is not 0.1 of a number.

Reproducibility risk. A local compile on the same machine with the same pinned toolchain produces the same artifact bytes. A cloud compile on modal at 09:00 versus modal at 09:15 may use a different host kernel, a different CUDA driver, a different scheduler-assigned GPU. The receipt chain still verifies (the K-score and the eval scores are deterministic over the artifact bytes), but the artifact bytes themselves may differ, and with them the hash. This is why the picker scores managed cloud at 0.6 on the repro axis. If you need byte-identical artifacts across runs, the answer is a local backend or a pinned remote-ssh.

Cost predictability. Per-hour cloud is a budget. Per-token managed services (together, replicate, fal) are not. The same task description, run twice, can produce two compile costs that differ by 3x depending on how the recipe pack tokenizes the eval set. Per-hour is what the picker prefers; per-token comes with a warning in kolm compute info.

The local-first default is not a religious commitment. It is a stance: dispatch the closest accelerator that meets the constraint, log the alternative the picker would have chosen, and let the receipt chain prove the artifact came out the same either way.

Every compile records the backend that produced it inside metrics.compute on the receipt: backend name, device string, cost in USD, wall duration, and a provenance subfield naming the picker version that chose it. When you re-verify an artifact a year from now, the chain tells you what was built and where.
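A sketch of what that subfield might look like on a receipt. Only the field set comes from the paragraph above; the key spellings and values are illustrative rather than the shipped schema:

# receipt excerpt — metrics.compute, sketched (key names and values illustrative)
"metrics": {
  "compute": {
    "backend": "local-cuda",
    "device": "rtx 4090 laptop (16 GB)",
    "cost_usd": 0,
    "wall_seconds": 1140,
    "provenance": { "picker_version": "..." }
  }
}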