Pick the source artifact, the quantization tier, and the target box. Get an estimated size on disk, tokens per second, K-score loss, and the exact kolm export command. Throughput rows are forecasts based on public llama.cpp, MLX, and CoreML benchmarks; measure on your actual hardware before procurement signs off.
The four steps below update with your picker selection. Real commands, real toolchains. Run them in order on the source machine, then on the target device.
On the source machine, convert the .kolm artifact into a gguf file for the picked target. The export embeds the manifest, the quant tier, the K-score on the eval pack, and a SHA-256 of the resulting payload.
$ kolm export
Requires the llama.cpp toolchain.
Copy the exported file onto the target box.
$ scp ./out user@device:~/
Tip: signed transfer is optional but recommended. See /airgap for sneakernet patterns.
Invoke the runtime that matches the backend. Output goes to stdout. Bind to a local port to wire up an OpenAI-compat client.
$ run
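Once the runtime is bound to a local port, any OpenAI-compatible client can talk to it. A minimal sketch of that call, assuming the server listens on localhost:8080 and accepts a placeholder model name (both are assumptions about your runtime invocation, not kolm defaults):

```python
import json
from urllib import request

def build_chat_body(prompt, max_tokens=128):
    """Payload shape shared by OpenAI-compatible local servers."""
    return {
        "model": "local",  # most local servers echo or ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, host="http://localhost:8080"):
    # POST to the standard chat-completions route and pull out the reply.
    req = request.Request(
        host + "/v1/chat/completions",
        data=json.dumps(build_chat_body(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same payload works against llama.cpp's llama-server and mlx_lm.server, which both expose the /v1/chat/completions route.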
On the source machine, regenerate the binder and compare. Quantization can drop K-score by a fraction of a point. Confirm the drop is within tolerance for your task before procurement signs off.
$ kolm verify
Reviewer-grade evidence: /verify-prod accepts the same .kolm in the browser and runs the same six checks.
Size on disk. The packed artifact, including base weights at the chosen quantization tier. Add roughly 1 GB at runtime for kv-cache, tokenizer, and the runtime working set. The fit verdict accounts for that overhead.
Throughput. Tokens per second at the chosen quantization, taken from published llama.cpp, MLX, and CoreML community benchmarks for a 2025 build. These are point estimates, not measurements on your device. Quantization scaling assumes the standard memory-bandwidth-bound regime (int4 roughly 1.0x reference, int8 roughly 0.55x, fp16 roughly 0.30x, int3 roughly 1.15x).
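Those multipliers turn one rate into estimates for the other tiers. A minimal sketch of the scaling arithmetic, using the multipliers quoted above; the reference rate in the example is illustrative, not a measurement:

```python
# Quantization speedups relative to int4, memory-bandwidth-bound regime,
# as quoted in the text above.
QUANT_SPEEDUP = {"int3": 1.15, "int4": 1.00, "int8": 0.55, "fp16": 0.30}

def estimate_tok_s(int4_reference_tok_s, quant):
    """Scale a device's int4 reference decode rate to another tier."""
    return int4_reference_tok_s * QUANT_SPEEDUP[quant]

# A box that decodes 30 tok/s at int4 forecasts ~16.5 tok/s at int8
# and ~9 tok/s at fp16, memory permitting.
```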
K-score loss. The delta versus the fp16 reference on the artifact's embedded eval pack. Typical: int8 loses about 0.5 points, int4 about 2 points, int3 about 5 points. Task-dependent. The compile gate refuses K below 0.85 by default.
Fit verdict. Pass if (artifact size + 1 GB runtime overhead) is at most 80 percent of device RAM. Tight if 80 to 100 percent. Over if it exceeds device RAM. RTX 4090 row uses VRAM, not system RAM.
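The verdict rule is simple enough to sketch directly. The 1 GB overhead and 80 percent threshold come from the text; the function name is illustrative, not kolm's API:

```python
RUNTIME_OVERHEAD_GB = 1.0  # kv-cache, tokenizer, runtime working set

def fit_verdict(artifact_gb, device_ram_gb):
    """Pass / tight / over, per the picker's stated rule."""
    needed = artifact_gb + RUNTIME_OVERHEAD_GB
    if needed <= 0.8 * device_ram_gb:
        return "pass"
    if needed <= device_ram_gb:
        return "tight"
    return "over"

# Llama-3.2-3B int4 (1.7 GB) on a 4 GB Pi 5: 2.7 GB needed vs a
# 3.2 GB budget, so "pass". Llama-3.1-8B int4 (4.4 GB) needs 5.4 GB
# and is "over" on the same box.
```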
Recommended backend. GGUF for llama.cpp targets (Pi, Jetson, x86 laptops, Steam Deck, Snapdragon, RTX). MLX for Apple Silicon. CoreML for iPhone. ONNX for Android via ORT. TensorRT for NVIDIA serving rigs.
Reference numbers used by the picker. fp16 is the canonical export from the Hugging Face checkpoint; int4 uses the q4_k_m gguf variant.
| base | fp16 | int8 | int4 (q4_k_m) | int3 (q3_k_s) |
|---|---|---|---|---|
| Llama-3.1-8B | 16 GB | 8 GB | 4.4 GB | 3.3 GB |
| Llama-3.2-3B | 6 GB | 3 GB | 1.7 GB | 1.3 GB |
| Llama-3.2-1B | 2 GB | 1 GB | 0.58 GB | 0.44 GB |
| Phi-3-mini-3.8B | 7.6 GB | 3.8 GB | 2.1 GB | 1.6 GB |
| Mistral-7B | 14 GB | 7 GB | 3.9 GB | 2.9 GB |
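The table follows rough back-of-envelope arithmetic: size ≈ parameters × effective bits per weight ÷ 8. The effective-bit values for the k-quant tiers below are loose fits to the table above, not kolm constants:

```python
# Effective bits per weight. fp16 and int8 are exact; the k-quant
# values are fitted to the size table and carry per-model slack
# (embeddings, output head, metadata).
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4.5, "int3": 3.4}

def approx_size_gb(params_billion, quant):
    """Packed artifact size, before the ~1 GB runtime overhead."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

# 8B at fp16 -> 16.0 GB, matching the Llama-3.1-8B row; 7B at int4
# -> ~3.9 GB, matching the Mistral-7B row.
```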
Tokens per second sampled from 2025 community benchmarks. Read each row as: this artifact on this device decodes at this rate, memory permitting. SBC and phone rows assume a 3B int4 artifact because 7B does not realistically fit.
| device | typical artifact | tok/s | backend |
|---|---|---|---|
| Raspberry Pi 5 | 3B int4 | 3 to 5 | GGUF / llama.cpp |
| Jetson Orin Nano | 7B int4 | 10 to 15 | GGUF / llama.cpp CUDA |
| Jetson AGX Orin | 7B int4 | 25 to 40 | GGUF or TensorRT |
| M3 Pro | 7B int4 | 25 to 35 | MLX |
| M3 Max | 7B int4 | 50 to 70 | MLX |
| iPhone 15 Pro | 3B int4 | 10 to 15 | CoreML |
| Pixel 8 | 3B int4 | 8 to 12 | ONNX Runtime / Mobile |
| Steam Deck | 7B int4 | 12 to 18 | GGUF / llama.cpp Vulkan |
| Snapdragon X Elite | 7B int4 | 30 to 50 | GGUF or ONNX |
| RTX 4090 | 7B int4 | 150 to 200 | GGUF / llama.cpp CUDA |
Real throughput depends on context length, decode length, prompt cache, thermal state, and the build of the runtime. Long contexts hurt small bases faster than they hurt large ones. Cold-start (loading weights into RAM) is the dominant term for many edge boxes and is not in the rate. The picker rate is a steady-state decode estimate after warmup.
We have not measured every device row. Apple Silicon and NVIDIA GPU rows are the most-validated. SBC and phone rows are point estimates from the community. Treat the picker as a sizing sketch, not a benchmark report. Run `kolm bench <artifact.kolm> --device <name>` on the actual box to get a number you can sign off on.
The quantization tier is recorded in the manifest. Every export stamps quant, the base, the K-score on the embedded eval pack, and the fp16 reference. A reviewer can recompute everything from the artifact later.
The generated command runs end to end against the local Python toolchain (llama.cpp for gguf, mlx-lm for mlx, optimum-cli for onnx, trtllm-build for tensorrt). The --preview flag returns the same forecast as JSON without invoking the toolchain, which is useful in CI or on a box that does not have the converter installed.
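That makes the preview usable as a CI gate. A hedged sketch of consuming it; the JSON field names (fit, k_score) are assumptions about the preview payload, not a documented schema:

```python
import json

def check_preview(preview_json, k_floor=0.85):
    """Fail the build if the forecast says the export is not viable."""
    forecast = json.loads(preview_json)
    if forecast.get("fit") == "over":
        raise SystemExit("artifact will not fit the target device")
    if forecast.get("k_score", 0.0) < k_floor:
        raise SystemExit(f"K-score {forecast['k_score']} below {k_floor}")
    return forecast

# In CI, feed it the stdout of the preview invocation, e.g.:
#   check_preview(subprocess.run(
#       ["kolm", "export", "--preview"],
#       capture_output=True, text=True).stdout)
```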
Five devices have full install-and-run quickstarts that go beyond the live picker: hardware-specific toolchain setup, transfer mechanics, runtime sanity checks, and the inevitable Apple/Linux/iOS/NVIDIA/browser-specific footnotes.
quickstart Raspberry Pi 5 → apt + llama.cpp from source on ARM64, scp from the source Mac/Linux box, then llama-cli or llama-server. 4GB vs 8GB realism.
quickstart Apple Silicon Mac → venv + mlx-lm, no transfer (local run), and an mlx_lm.server on :8080. The Apple Silicon unified-memory ceiling explained.
quickstart iPhone → CoreML export + Xcode bundle + the personal-team vs paid Developer Program paths. Why there is no scp path to a stock iPhone.
quickstart Jetson Orin → TensorRT-LLM engine build (or ONNX Runtime CUDA fallback), scp the engine to the Jetson, run at about 38 tok/s on a 25W edge box. First-run JIT cost explained.
quickstart Browser (WASM) → Export a wasm bundle, host with kolm serve, the user clicks Run in the tab. No CLI required for the end user, weights cached in IndexedDB.
The full technical doc: the hardware ladder, the quantization tiers, the latency budgets, and a worked plant-floor example.
docs Edge deployment → Reference architectures for plant floor, retail shelf, vehicle, and other bounded-network edge cases. Same .kolm across ARM, x86, RISC-V.
read Air-gapped deployment → Pre-flight cache plan, offline switches, signed-image sneakernet. The artifact ships and stays.
spec RS-1 spec → On-disk shape of a .kolm. quant, base, and k_score live in the manifest.