Pick the source artifact, the quantization tier, and the target box. Get an estimated size on disk, tokens per second, K-score loss, and the exact kolm export command. Throughput rows are forecasts based on public llama.cpp, MLX, and CoreML benchmarks; measure on your actual hardware before procurement signs off.
The four steps below update with your picker selection. Real commands, real toolchains. Run them in order on the source machine, then on the target device.
On the source machine, convert the .kolm artifact into a gguf file for the picked target. The export embeds the manifest, the quant tier, the K-score on the eval pack, and a SHA-256 of the resulting payload.
$ kolm export
Requires the llama.cpp toolchain.
Copy the exported file onto the target box.
$ scp ./out user@device:~/
Tip: signed transfer is optional but recommended. See /airgap for sneakernet patterns.
Invoke the runtime that matches the backend. Output goes to stdout. Bind to a local port to wire up an OpenAI-compat client.
$ run
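Once the runtime is bound to a local port, any OpenAI-compatible client can talk to it. A minimal sketch of that call, assuming the server listens on localhost:8080 and accepts a placeholder model name (both are assumptions about your runtime invocation, not kolm defaults):

```python
import json
from urllib import request

def build_chat_body(prompt, max_tokens=128):
    """Payload shape shared by OpenAI-compatible local servers."""
    return {
        "model": "local",  # most local servers echo or ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, host="http://localhost:8080"):
    # POST to the standard chat-completions route and pull out the reply.
    req = request.Request(
        host + "/v1/chat/completions",
        data=json.dumps(build_chat_body(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same payload works against llama.cpp's llama-server and mlx_lm.server, which both expose the /v1/chat/completions route.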
On the source machine, regenerate the binder and compare. Quantization can drop K-score by a fraction of a point. Confirm the drop is within tolerance for your task before procurement signs off.
$ kolm verify
Reviewer-grade evidence: /verify-prod accepts the same .kolm in the browser and runs the same six checks.
Size on disk. The packed artifact, including base weights at the chosen quantization tier. Add roughly 1 GB at runtime for kv-cache, tokenizer, and the runtime working set. The fit verdict accounts for that overhead.
Throughput. Tokens per second at the chosen quantization, taken from published llama.cpp, MLX, and CoreML community benchmarks for a 2025 build. These are point estimates, not measurements on your device. Quantization scaling assumes the standard memory-bandwidth-bound regime (int4 roughly 1.0x reference, int8 roughly 0.55x, fp16 roughly 0.30x, int3 roughly 1.15x).
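Those multipliers turn one rate into estimates for the other tiers. A minimal sketch of the scaling arithmetic, using the multipliers quoted above; the reference rate in the example is illustrative, not a measurement:

```python
# Quantization speedups relative to int4, memory-bandwidth-bound regime,
# as quoted in the text above.
QUANT_SPEEDUP = {"int3": 1.15, "int4": 1.00, "int8": 0.55, "fp16": 0.30}

def estimate_tok_s(int4_reference_tok_s, quant):
    """Scale a device's int4 reference decode rate to another tier."""
    return int4_reference_tok_s * QUANT_SPEEDUP[quant]

# A box that decodes 30 tok/s at int4 forecasts ~16.5 tok/s at int8
# and ~9 tok/s at fp16, memory permitting.
```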
K-score loss. The delta versus the fp16 reference on the artifact's embedded eval pack. Typical: int8 loses about 0.5 points, int4 about 2 points, int3 about 5 points. Task-dependent. The compile gate refuses K below 0.85 by default.
Fit verdict. Pass if (artifact size + 1 GB runtime overhead) is at most 80 percent of device RAM. Tight if 80 to 100 percent. Over if it exceeds device RAM. RTX 4090 row uses VRAM, not system RAM.
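The verdict rule is simple enough to sketch directly. The 1 GB overhead and 80 percent threshold come from the text; the function name is illustrative, not kolm's API:

```python
RUNTIME_OVERHEAD_GB = 1.0  # kv-cache, tokenizer, runtime working set

def fit_verdict(artifact_gb, device_ram_gb):
    """Pass / tight / over, per the picker's stated rule."""
    needed = artifact_gb + RUNTIME_OVERHEAD_GB
    if needed <= 0.8 * device_ram_gb:
        return "pass"
    if needed <= device_ram_gb:
        return "tight"
    return "over"

# Llama-3.2-3B int4 (1.7 GB) on a 4 GB Pi 5: 2.7 GB needed vs a
# 3.2 GB budget, so "pass". Llama-3.1-8B int4 (4.4 GB) needs 5.4 GB
# and is "over" on the same box.
```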
Recommended backend. GGUF for llama.cpp targets (Pi, Jetson, x86 laptops, Steam Deck, Snapdragon, RTX). MLX for Apple Silicon. CoreML for iPhone. ONNX for Android via ORT. TensorRT for NVIDIA serving rigs.
Reference numbers used by the picker. fp16 is the canonical export from the Hugging Face checkpoint; int4 uses the q4_k_m gguf variant.
| base | fp16 | int8 | int4 (q4_k_m) | int3 (q3_k_s) |
|---|---|---|---|---|
| Llama-3.1-8B | 16 GB | 8 GB | 4.4 GB | 3.3 GB |
| Llama-3.2-3B | 6 GB | 3 GB | 1.7 GB | 1.3 GB |
| Llama-3.2-1B | 2 GB | 1 GB | 0.58 GB | 0.44 GB |
| Phi-3-mini-3.8B | 7.6 GB | 3.8 GB | 2.1 GB | 1.6 GB |
| Mistral-7B | 14 GB | 7 GB | 3.9 GB | 2.9 GB |
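The table follows rough back-of-envelope arithmetic: size ≈ parameters × effective bits per weight ÷ 8. The effective-bit values for the k-quant tiers below are loose fits to the table above, not kolm constants:

```python
# Effective bits per weight. fp16 and int8 are exact; the k-quant
# values are fitted to the size table and carry per-model slack
# (embeddings, output head, metadata).
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4.5, "int3": 3.4}

def approx_size_gb(params_billion, quant):
    """Packed artifact size, before the ~1 GB runtime overhead."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

# 8B at fp16 -> 16.0 GB, matching the Llama-3.1-8B row; 7B at int4
# -> ~3.9 GB, matching the Mistral-7B row.
```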
Tokens per second sampled from 2025 community benchmarks. Read each row as: this artifact on this device decodes at this rate, memory permitting. SBC and phone rows assume a 3B int4 artifact because 7B does not realistically fit.
| device | typical artifact | tok/s | backend |
|---|---|---|---|
| Raspberry Pi 5 | 3B int4 | 3 to 5 | GGUF / llama.cpp |
| Jetson Orin Nano | 7B int4 | 10 to 15 | GGUF / llama.cpp CUDA |
| Jetson AGX Orin | 7B int4 | 25 to 40 | GGUF or TensorRT |
| M3 Pro | 7B int4 | 25 to 35 | MLX |
| M3 Max | 7B int4 | 50 to 70 | MLX |
| iPhone 15 Pro | 3B int4 | 10 to 15 | CoreML |
| Pixel 8 | 3B int4 | 8 to 12 | ONNX Runtime / Mobile |
| Steam Deck | 7B int4 | 12 to 18 | GGUF / llama.cpp Vulkan |
| Snapdragon X Elite | 7B int4 | 30 to 50 | GGUF or ONNX |
| RTX 4090 | 7B int4 | 150 to 200 | GGUF / llama.cpp CUDA |
Real throughput depends on context length, decode length, prompt cache, thermal state, and the build of the runtime. Long contexts hurt small bases faster than they hurt large ones. Cold-start (loading weights into RAM) is the dominant term for many edge boxes and is not in the rate. The picker rate is a steady-state decode estimate after warmup.
We have not measured every device row. Apple Silicon and NVIDIA GPU rows are the most-validated. SBC and phone rows are point estimates from the community. Treat the picker as a sizing sketch, not a benchmark report. Run `kolm bench <artifact.kolm> --device <name>` on the actual box to get a number you can sign off on.
The quantization tier is recorded in the manifest. Every export stamps quant, the base, the K-score on the embedded eval pack, and the fp16 reference. A reviewer can recompute everything from the artifact later.
The generated command runs end to end against the local Python toolchain (llama.cpp for gguf, mlx-lm for mlx, optimum-cli for onnx, trtllm-build for tensorrt). The --preview flag returns the same forecast as JSON without invoking the toolchain, which is useful in CI or on a box that does not have the converter installed.
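That makes the preview usable as a CI gate. A hedged sketch of consuming it; the JSON field names (fit, k_score) are assumptions about the preview payload, not a documented schema:

```python
import json

def check_preview(preview_json, k_floor=0.85):
    """Fail the build if the forecast says the export is not viable."""
    forecast = json.loads(preview_json)
    if forecast.get("fit") == "over":
        raise SystemExit("artifact will not fit the target device")
    if forecast.get("k_score", 0.0) < k_floor:
        raise SystemExit(f"K-score {forecast['k_score']} below {k_floor}")
    return forecast

# In CI, feed it the stdout of the preview invocation, e.g.:
#   check_preview(subprocess.run(
#       ["kolm", "export", "--preview"],
#       capture_output=True, text=True).stdout)
```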
Five devices have full install-and-run quickstarts that go beyond the live picker: hardware-specific toolchain setup, transfer mechanics, runtime sanity checks, and the inevitable Apple/Linux/iOS/NVIDIA/browser-specific footnotes.
quickstart Raspberry Pi 5 → apt + llama.cpp from source on ARM64, scp from the source Mac/Linux box, then llama-cli or llama-server. 4GB vs 8GB realism.
quickstart Apple Silicon Mac → venv + mlx-lm, no transfer (local run), and an mlx_lm.server on :8080. The Apple Silicon unified-memory ceiling explained.
quickstart iPhone → CoreML export + Xcode bundle + the personal-team vs paid Developer Program paths. Why there is no scp path to a stock iPhone.
quickstart Jetson Orin → TensorRT-LLM engine build (or ONNX Runtime CUDA fallback), scp the engine to the Jetson, run at about 38 tok/s on a 25W edge box. First-run JIT cost explained.
quickstart Browser (WASM) → Export a wasm bundle, host with kolm serve, the user clicks Run in the tab. No CLI required for the end user, weights cached in IndexedDB.
The full technical doc: the hardware ladder, the quantization tiers, the latency budgets, and a worked plant-floor example.
docs Edge deployment → Reference architectures for plant floor, retail shelf, vehicle, and other bounded-network edge cases. Same .kolm across ARM, x86, RISC-V.
read Air-gapped deployment → Pre-flight cache plan, offline switches, signed-image sneakernet. The artifact ships and stays.
spec RS-1 spec → On-disk shape of a .kolm. quant, base, and k_score live in the manifest.