Device transfer · estimator · public benchmarks

Build it on a laptop. Ship it to a Pi.

Pick the source artifact, the quantization tier, and the target box. Get an estimated size on disk, tokens per second, K-score loss, and the exact kolm export command. Throughput rows are forecasts based on public llama.cpp, MLX, and CoreML benchmarks. Measure on your actual hardware before you sign procurement.

Spec RS-1 · Backends: GGUF / MLX / ONNX / CoreML / TensorRT · Updated 2026-05-15

Transfer flow.

The four steps below update with your picker selection. Real commands, real toolchains. Run them in order on the source machine, then on the target device.

1. Build the export from your .kolm (on the source machine)

Convert the .kolm artifact into a gguf file for the picked target. The export embeds the manifest, the quant tier, the K-score on the eval pack, and a SHA-256 of the resulting payload.

$ kolm export

Requires the llama.cpp toolchain.
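
As one filled-in illustration for a Raspberry Pi 5 / int4 selection, the command might look like the sketch below. The picker emits the real arguments; the flag names here are assumptions, not confirmed kolm syntax.

# illustrative only: flag names are assumptions, not confirmed kolm syntax
$ kolm export my-model.kolm --target pi5 --quant q4_k_m --out ./out/my-model-q4_k_m.gguf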

2. Move it to the device (via scp)

Copy the exported file onto the target box.

$ scp ./out user@device:~/

Tip: signed transfer is optional but recommended. See /airgap for sneakernet patterns.
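
A concrete transfer with a checksum on both ends, so the SHA-256 stamped at export time can be compared to what actually landed. Filenames, paths, and the host are placeholders; on macOS use shasum -a 256 instead of sha256sum.

# placeholder paths; note the digest on the source machine first
$ sha256sum ./out/my-model-q4_k_m.gguf
$ scp ./out/my-model-q4_k_m.gguf user@device:~/models/
$ ssh user@device 'sha256sum ~/models/my-model-q4_k_m.gguf'   # should print the same digest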

3. Run on the target device (runtime)

Invoke the runtime that matches the backend. Output goes to stdout. Bind to a local port to wire up an OpenAI-compat client.

$ run
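
For a GGUF selection, for example, llama.cpp's stock binaries cover both modes; the model path below is a placeholder.

# one-shot generation to stdout (llama.cpp)
$ llama-cli -m ~/models/my-model-q4_k_m.gguf -p "Summarize this log:" -n 128
# or bind an OpenAI-compatible endpoint to a local port
$ llama-server -m ~/models/my-model-q4_k_m.gguf --port 8080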

4. Verify the artifact stayed honest (optional)

On the source machine, regenerate the binder and compare. Quantization can drop K-score by a fraction of a point. Confirm the drop is within tolerance for your task before procurement signs off.

$ kolm verify
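
A filled-in call might look like the sketch below; only the kolm verify subcommand itself appears on this page, so the argument shape is an assumption.

# illustrative only: arguments are assumptions, not confirmed kolm syntax
$ kolm verify my-model.kolm --export ./out/my-model-q4_k_m.gguf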

Reviewer-grade evidence: /verify-prod accepts the same .kolm in the browser and runs the same six checks.

What the numbers mean.

Size on disk. The packed artifact, including base weights at the chosen quantization tier. Add roughly 1 GB at runtime for kv-cache, tokenizer, and the runtime working set. The fit verdict accounts for that overhead.

Throughput. Tokens per second at the chosen quantization, taken from published llama.cpp, MLX, and CoreML community benchmarks for a 2025 build. These are point estimates, not measurements on your device. Quantization scaling assumes the standard memory-bandwidth-bound regime (int4 roughly 1.0x reference, int8 roughly 0.55x, fp16 roughly 0.30x, int3 roughly 1.15x).
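
A quick way to apply that scaling to a published int4 rate. The 30 tok/s reference below is an assumed example, roughly the M3 Pro row.

# factors from the paragraph above; 30 tok/s is an assumed int4 reference
$ awk 'BEGIN { int4=30; printf "int8 ~%.0f, fp16 ~%.0f, int3 ~%.0f tok/s\n", int4*0.55, int4*0.30, int4*1.15 }'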

K-score loss. The delta versus the fp16 reference on the artifact's embedded eval pack. Typical: int8 loses about 0.5 points, int4 about 2 points, int3 about 5 points. Task-dependent. The compile gate refuses K below 0.85 by default.

Fit verdict. Pass if (artifact size + 1 GB runtime overhead) is at most 80 percent of device RAM. Tight if 80 to 100 percent. Over if it exceeds device RAM. RTX 4090 row uses VRAM, not system RAM.
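
The same rule as a one-liner, using this page's own numbers for Llama-3.2-3B at fp16 (6 GB) on an 8 GB box, which lands in the tight band.

# rule from above: pass if <= 80% of RAM, tight if <= 100%, over beyond that
$ awk 'BEGIN { need=6.0+1.0; ram=8; v=(need<=0.8*ram)?"pass":(need<=ram)?"tight":"over"; print need " GB needed / " ram " GB RAM -> " v }'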

Recommended backend. GGUF for llama.cpp targets (Pi, Jetson, x86 laptops, Steam Deck, Snapdragon, RTX). MLX for Apple Silicon. CoreML for iPhone. ONNX for Android via ORT. TensorRT for NVIDIA serving rigs.

Source artifact sizes.

Reference numbers used by the picker. fp16 is the canonical export from the Hugging Face checkpoint; int4 uses the q4_k_m gguf variant.

base | fp16 | int8 | int4 (q4_k_m) | int3 (q3_k_s)
Llama-3.1-8B | 16 GB | 8 GB | 4.4 GB | 3.3 GB
Llama-3.2-3B | 6 GB | 3 GB | 1.7 GB | 1.3 GB
Llama-3.2-1B | 2 GB | 1 GB | 0.58 GB | 0.44 GB
Phi-3-mini-3.8B | 7.6 GB | 3.8 GB | 2.1 GB | 1.6 GB
Mistral-7B | 14 GB | 7 GB | 3.9 GB | 2.9 GB

Reference throughput at int4.

Tokens per second sampled from 2025 community benchmarks. Read each row as: the typical artifact, on this device, at roughly this rate, when memory permits. SBC and phone rows assume a 3B int4 artifact because 7B does not realistically fit.

device | typical artifact | tok/s | backend
Raspberry Pi 5 | 3B int4 | 3 to 5 | GGUF / llama.cpp
Jetson Orin Nano | 7B int4 | 10 to 15 | GGUF / llama.cpp CUDA
Jetson AGX Orin | 7B int4 | 25 to 40 | GGUF or TensorRT
M3 Pro | 7B int4 | 25 to 35 | MLX
M3 Max | 7B int4 | 50 to 70 | MLX
iPhone 15 Pro | 3B int4 | 10 to 15 | CoreML
Pixel 8 | 3B int4 | 8 to 12 | ONNX Runtime Mobile
Steam Deck | 7B int4 | 12 to 18 | GGUF / llama.cpp Vulkan
Snapdragon X Elite | 7B int4 | 30 to 50 | GGUF or ONNX
RTX 4090 | 7B int4 | 150 to 200 | GGUF / llama.cpp CUDA

What the forecast does not promise.

Real throughput depends on context length, decode length, prompt cache, thermal state, and the build of the runtime. Long contexts hurt small bases faster than they hurt large ones. Cold-start (loading weights into RAM) is the dominant term for many edge boxes and is not in the rate. The picker rate is a steady-state decode estimate after warmup.
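
A rough way to see the cold-start term on a GGUF box: generate a single token so weight loading dominates the wall-clock time. The llama-cli flags are real; the model path is a placeholder.

# most of the elapsed time here is loading weights, not decoding
$ time llama-cli -m ~/models/my-model-q4_k_m.gguf -p "hi" -n 1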

We have not measured every device row. Apple Silicon and NVIDIA GPU rows are the most-validated. SBC and phone rows are point estimates from the community. Treat the picker as a sizing sketch, not a benchmark report. Run kolm bench <artifact.kolm> --device <name> on the actual box to get a number you can sign off on.
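
For example, with a placeholder device name; llama-bench is llama.cpp's stock benchmark and gives an independent steady-state cross-check on GGUF targets.

# device name is a placeholder
$ kolm bench my-model.kolm --device pi5
$ llama-bench -m ~/models/my-model-q4_k_m.gguf -n 128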

The quantization tier is recorded in the manifest. Every export stamps the quant tier, the base, the K-score on the embedded eval pack, and the fp16 reference. A reviewer can recompute everything from the artifact later.

One command after you pick.

The generated command runs end to end against the local Python toolchain (llama.cpp for gguf, mlx-lm for mlx, optimum-cli for onnx, trtllm-build for tensorrt). The --preview flag returns the same forecast as JSON without invoking the toolchain, which is useful in CI or on a box that does not have the converter installed.
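
A CI-style use of --preview, as a sketch: only --preview itself is documented above, so the other arguments and the JSON field name are assumptions.

# only --preview is documented; other arguments and the .fit field name are assumptions
$ kolm export my-model.kolm --target pi5 --quant q4_k_m --preview | jq -e '.fit == "pass"'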

Per-device quickstarts.

Five devices have full install-and-run quickstarts that go beyond the live picker: hardware-specific toolchain setup, transfer mechanics, runtime sanity checks, and the inevitable Apple/Linux/iOS/NVIDIA/browser-specific footnotes.