Compile to your hardware.
Every .kolm artifact is bound to a base model and a target device. The defaults pick themselves from what kolm detects on the box. Override either at the CLI; both ride the receipt chain so verification is reproducible.
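A minimal sketch of the binding, assuming a hypothetical `KolmBinding` shape: base_model, target_device, and train_device are the field names this page uses; the interface around them is illustrative, not the actual .kolm schema.

```ts
// Illustrative binding fields carried by a .kolm manifest.
// base_model / target_device / train_device are named on this page;
// the interface itself is a sketch, not the real schema.
interface KolmBinding {
  base_model: string;           // e.g. "Qwen/Qwen2.5-3B-Instruct"
  target_device: string | null; // device id from `kolm gpu detect`; null = untargeted
  train_device: string;         // device the adapter was trained on
}

const binding: KolmBinding = {
  base_model: "Qwen/Qwen2.5-7B-Instruct",
  target_device: "rtx-5090",
  train_device: "rtx-5090",
};
```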
Defaults by device
What kolm compile picks when you don't pass --base-model or --target-device. kolm gpu detect returns the device id; the picker walks the table below. A sketch of that walk follows the table.
| Device | Class | VRAM | Default train | Default infer |
|---|---|---|---|---|
| rtx-5090 | training | 32 GB | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| rtx-4090 | training | 24 GB | Qwen2.5-3B-Instruct | Qwen2.5-3B-Instruct |
| rtx-3090 | training | 24 GB | Qwen2.5-3B-Instruct | Qwen2.5-3B-Instruct |
| a100-40gb | training | 40 GB | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| a100-80gb | training | 80 GB | Qwen2.5-14B-Instruct | Qwen2.5-7B-Instruct |
| h100-80gb | training | 80 GB | Qwen2.5-14B-Instruct | Qwen2.5-7B-Instruct |
| h200-141gb | training | 141 GB | Qwen2.5-14B-Instruct | Qwen2.5-7B-Instruct |
| apple-m3-max | training | 64 GB | Qwen2.5-3B-Instruct | Qwen2.5-3B-Instruct |
| apple-m2-pro | inference | 16 GB | n/a | Qwen2.5-3B-Instruct (MLX) |
| iphone-15-pro | inference | 4 GB | n/a | Qwen2.5-1.5B-Instruct (4-bit) |
| pixel-8-pro | inference | 3 GB | n/a | gemma-3-1b-it (4-bit) |
| laptop-igpu | inference | 2 GB | n/a | Qwen2.5-1.5B-Instruct |
| cpu-x86_64 | inference | n/a | SmolLM2-1.7B-Instruct | Qwen2.5-0.5B-Instruct |
| wasm | inference | n/a | n/a | Qwen2.5-0.5B-Instruct |
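The walk, sketched under two assumptions: the table lives in a `DEVICE_DEFAULTS` map (abridged here) and the resolver is a plain function. The real picker is internal to kolm compile.

```ts
// Hypothetical mirror of the defaults table above (abridged).
const DEVICE_DEFAULTS: Record<string, { train: string | null; infer: string }> = {
  "rtx-5090":      { train: "Qwen/Qwen2.5-7B-Instruct", infer: "Qwen/Qwen2.5-7B-Instruct" },
  "rtx-4090":      { train: "Qwen/Qwen2.5-3B-Instruct", infer: "Qwen/Qwen2.5-3B-Instruct" },
  "iphone-15-pro": { train: null, infer: "Qwen/Qwen2.5-1.5B-Instruct" }, // inference-only class
  // ...remaining 11 rows elided
};

// Explicit --base-model wins; then the detected device's row; then the
// documented global fallback when nothing is detected.
function resolveBaseModel(
  flag: string | undefined,
  deviceId: string | null,
  mode: "train" | "infer",
): string {
  if (flag) return flag;
  const pick = deviceId ? DEVICE_DEFAULTS[deviceId]?.[mode] : null;
  return pick ?? "Qwen/Qwen2.5-3B-Instruct";
}
```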
The default pick: Qwen 2.5 3B Instruct
When no device is detected and no flag is passed, kolm compile resolves base_model to Qwen/Qwen2.5-3B-Instruct. Why:
- Apache 2.0. Commercial-redistributable, no MAU clause, no acceptable-use policy that has to be re-read per buyer.
- Native tool use. Distillation targets that emit JSON tool-calls work zero-shot.
- 32K context native, 128K with YaRN. Long enough for clinical notes, contracts, long support threads.
- 29 languages. Healthcare, finance, and legal callers frequently need non-English.
- 3.09B params. Fits a single 24 GB consumer GPU at bf16 + LoRA r=16 with room for optimizer state (rough arithmetic in the sketch after this list); the LoRA adapter itself is under 2 GB on disk.
- Beats Llama-3.2-3B on MMLU, GSM8K, MATH, HumanEval, IFEval per the Qwen 2.5 tech report and the public Hugging Face Open LLM Leaderboard.
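The 24 GB claim in rough numbers, as a back-of-envelope sketch; the 30M LoRA-parameter count is an order-of-magnitude assumption, not a measured figure.

```ts
// Back-of-envelope VRAM math for bf16 base + LoRA r=16 on a 24 GB card.
const baseParams = 3.09e9;            // Qwen2.5-3B-Instruct
const bf16 = 2;                       // bytes per parameter
const frozenBase = baseParams * bf16; // ~6.2 GB of frozen weights

const loraParams = 30e6;              // ASSUMPTION: order of magnitude for r=16
const adapter = loraParams * bf16;    // ~60 MB of trainable weights
const grads = loraParams * bf16;      // gradients exist only for the adapter
const optState = loraParams * 2;      // 8-bit paged AdamW: ~2 bytes/param of state

const totalGB = (frozenBase + adapter + grads + optState) / 1e9;
console.log(`${totalGB.toFixed(1)} GB before activations — well inside 24 GB`);
```

Activations and KV cache take the remainder; that headroom is why r=16 fits comfortably.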
Full base-model registry
16 models. kolm models list prints the same shape at the terminal; a sketch of the row shape follows the table.
| Model | License | Params | Ctx | Tool use | Multilingual | Use |
|---|---|---|---|---|---|---|
| Qwen/Qwen2.5-0.5B-Instruct | Apache 2.0 | 0.50 B | 32 K | yes | 29 langs | edge / wasm |
| Qwen/Qwen2.5-1.5B-Instruct | Apache 2.0 | 1.54 B | 32 K | yes | 29 langs | mobile |
| Qwen/Qwen2.5-3B-Instruct ← default | Apache 2.0 | 3.09 B | 32 K / 128 K YaRN | yes | 29 langs | laptop / 4090 |
| Qwen/Qwen2.5-7B-Instruct | Apache 2.0 | 7.62 B | 128 K | yes | 29 langs | 5090 / A100 |
| Qwen/Qwen2.5-Coder-7B-Instruct | Apache 2.0 | 7.62 B | 128 K | yes | code-first | code distill |
| Qwen/Qwen2.5-14B-Instruct | Apache 2.0 | 14.7 B | 128 K | yes | 29 langs | A100 80 / H100 |
| meta-llama/Llama-3.2-1B-Instruct | Llama 3.2 Community | 1.24 B | 128 K | yes | English-first | mobile alternate |
| meta-llama/Llama-3.2-3B-Instruct | Llama 3.2 Community | 3.21 B | 128 K | yes | English-first | 3B alternate |
| meta-llama/Llama-3.1-8B-Instruct | Llama 3.1 Community | 8.03 B | 128 K | yes | English-first | 8B alternate |
| microsoft/Phi-3.5-mini-instruct | MIT | 3.82 B | 128 K | no | multilingual | reasoning-first |
| google/gemma-3-1b-it | Gemma ToU | 1.00 B | 32 K | partial | 140 langs | mobile (Pixel) |
| google/gemma-3-4b-it | Gemma ToU | 4.30 B | 128 K | partial | 140 langs | vision target |
| google/gemma-3-12b-it | Gemma ToU | 12.2 B | 128 K | partial | 140 langs | vision alternate |
| google/gemma-2-2b-it | Gemma ToU | 2.61 B | 8 K | no | English-first | tiny alt |
| mistralai/Ministral-3B-Instruct-2410 | MRL (Mistral Research) | 3.00 B | 128 K | yes | multilingual | 3B alternate |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | Apache 2.0 | 1.71 B | 8 K | no | English-first | CPU fallback |
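The row shape those columns imply, as a hypothetical TypeScript interface; the name and exact field spellings are assumptions, not kolm's API.

```ts
// Hypothetical shape of one row printed by `kolm models list`.
interface RegistryModel {
  id: string;                        // e.g. "Qwen/Qwen2.5-3B-Instruct"
  license: string;                   // e.g. "apache-2.0", "gemma-tou"
  paramsB: number;                   // parameter count in billions, e.g. 3.09
  ctx: string;                       // e.g. "32K" or "32K / 128K YaRN"
  toolUse: "yes" | "partial" | "no";
  multilingual: string;              // e.g. "29 langs", "English-first"
  use: string;                       // recommended slot, e.g. "laptop / 4090"
}
```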
License posture. kolm defaults to Apache 2.0 because that's the license least likely to require legal review in regulated industries. Llama and Gemma stay in the registry as opt-in; pin them with kolm models pin meta-llama/Llama-3.2-3B-Instruct if your buyer accepts the terms.
Device registry
14 device profiles. kolm gpu detect picks a row by parsing nvidia-smi output (NVIDIA), the Metal device id (Apple), or falling back to cpu-x86_64 / wasm; a sketch of that fallback order follows the table.
| Device | Arch | VRAM | Attention | Min CUDA | Min torch | FP4 / FP8 / BF16 |
|---|---|---|---|---|---|---|
| rtx-5090 | Blackwell sm_120 | 32 GB | fa3 | 12.8 | 2.7 | yes / yes / yes |
| rtx-4090 | Ada sm_89 | 24 GB | fa2 | 12.1 | 2.4 | no / yes / yes |
| rtx-3090 | Ampere sm_86 | 24 GB | fa2 | 11.8 | 2.2 | no / no / yes |
| a100-40gb | Ampere sm_80 | 40 GB | fa2 | 11.8 | 2.2 | no / no / yes |
| a100-80gb | Ampere sm_80 | 80 GB | fa2 | 11.8 | 2.2 | no / no / yes |
| h100-80gb | Hopper sm_90 | 80 GB | fa3 | 12.4 | 2.4 | no / yes / yes |
| h200-141gb | Hopper sm_90 | 141 GB | fa3 | 12.4 | 2.4 | no / yes / yes |
| apple-m3-max | Apple Silicon | 64 GB | mlx | n/a | n/a | no / no / yes |
| apple-m2-pro | Apple Silicon | 16 GB | mlx | n/a | n/a | no / no / yes |
| iphone-15-pro | A17 Pro | 4 GB | coreml | n/a | n/a | no / no / partial |
| pixel-8-pro | Tensor G3 | 3 GB | mediapipe | n/a | n/a | no / no / partial |
| laptop-igpu | Intel Arc / Iris | 2 GB | directml | n/a | n/a | no / no / partial |
| cpu-x86_64 | any x86_64 | n/a | sdpa | n/a | n/a | no / no / no |
| wasm | wasm32 | n/a | sdpa | n/a | n/a | no / no / no |
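A sketch of that fallback order. The probe helpers below are stand-ins: the nvidia-smi flags are real, but the name matching and the sysctl probe for Apple Silicon are assumptions about how a detector could work, not kolm's implementation.

```ts
import { execSync } from "node:child_process";

// Stand-in NVIDIA probe: map `nvidia-smi` GPU names to device ids.
function probeNvidiaSmi(): string | null {
  try {
    const out = execSync("nvidia-smi --query-gpu=name --format=csv,noheader").toString();
    if (out.includes("5090")) return "rtx-5090";
    if (out.includes("4090")) return "rtx-4090";
    // ...remaining NVIDIA rows elided
    return null;
  } catch {
    return null; // no NVIDIA driver on this box
  }
}

// Stand-in Apple probe: read the chip brand string on macOS.
function probeApple(): string | null {
  if (process.platform !== "darwin") return null;
  try {
    const out = execSync("sysctl -n machdep.cpu.brand_string").toString();
    if (out.includes("M3 Max")) return "apple-m3-max";
    if (out.includes("M2 Pro")) return "apple-m2-pro";
    return null;
  } catch {
    return null;
  }
}

// NVIDIA first, then Apple, then the CPU profile. A wasm build would
// hardcode "wasm", since it cannot shell out at all.
function detectDevice(): string {
  return probeNvidiaSmi() ?? probeApple() ?? "cpu-x86_64";
}
```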
Device-fit contract
Every .kolm artifact carries target_device and train_device in the manifest. Before the runtime loads an adapter it calls verifyDeviceFit(manifest, hostDeviceId) and acts on the result:
| Compile target | Host device | Result | Behavior |
|---|---|---|---|
| rtx-5090 | rtx-5090 | ok:true | load and run |
| iphone-15-pro | rtx-5090 | ok:true, soft:true | load with warning (cross-class) |
| null (no target pinned) | rtx-5090 | ok:true, soft:true | load with warning (untargeted) |
| rtx-5090 | iphone-15-pro | ok:false | refuse (4 GB host can't hold 32 GB compile) |
The runtime never lies about a mismatch: it either loads cleanly, loads with a structured warning, or refuses. The smoke test at scripts/smoke-device-bind.mjs proves all four cases pass; a sketch of the check follows.
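A sketch of a check that produces those four results, assuming an abridged VRAM map as a stand-in for the real device registry; the result shape follows the table, the body is illustrative.

```ts
type FitResult = { ok: boolean; soft?: boolean; reason?: string };

// Abridged stand-in for the device registry above.
const VRAM_GB: Record<string, number> = { "rtx-5090": 32, "iphone-15-pro": 4 /* ... */ };

function verifyDeviceFit(
  manifest: { target_device: string | null },
  hostDeviceId: string,
): FitResult {
  const target = manifest.target_device;
  if (target === null) {
    return { ok: true, soft: true, reason: "untargeted" };          // load with warning
  }
  if (target === hostDeviceId) {
    return { ok: true };                                            // exact match: load and run
  }
  if ((VRAM_GB[hostDeviceId] ?? 0) < (VRAM_GB[target] ?? 0)) {
    return { ok: false, reason: "host VRAM below compile target" }; // refuse
  }
  return { ok: true, soft: true, reason: "cross-class" };           // e.g. phone artifact on a 5090
}
```

Running the four table rows through this sketch reproduces ok:true, ok:true/soft:true (cross-class), ok:true/soft:true (untargeted), and ok:false in that order.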
CLI
Three verbs cover the surface: kolm models for the catalog, kolm gpu for the box, kolm compile for the binding.
```
$ kolm gpu detect
rtx-5090 . Blackwell sm_120 . 32 GB . cuda 12.8 . torch 2.7+

$ kolm models recommend --target-device rtx-5090
Qwen/Qwen2.5-7B-Instruct          (apache-2.0, 7.6B, 128K ctx, tool-use)
Qwen/Qwen2.5-3B-Instruct          (apache-2.0, 3.1B, 32K ctx, tool-use)
meta-llama/Llama-3.1-8B-Instruct  (llama-community, 8B, 128K ctx)

$ kolm models pin Qwen/Qwen2.5-7B-Instruct
pinned base model: Qwen/Qwen2.5-7B-Instruct

$ kolm compile --task "classify support tickets" --target-device rtx-5090
. resolves base model: Qwen/Qwen2.5-7B-Instruct
. attention: fa3
. optimizer: paged_adamw_8bit
. liger: on
. compiling: 100%  K=0.917
. signing receipt: HMAC-SHA256
. done.
```
Full verb tables at /docs. The decision matrix that picks defaults is at /spec under "device-fit".
Why we will revisit
- Qwen3. When Qwen3-3B-Instruct ships under Apache 2.0 with comparable tool use, the default reranks. The bigger tokenizer is a win for code and a latency cost at tiny-model scale; we will measure both.
- Llama 4. If Meta drops the MAU clause, Llama moves up in the picker.
- Gemma 3 vision. Once `kolm capture <image>` ships, the mobile-inference default flips from text-only Qwen to Gemma 3 4B.
- NVFP4 training. Torch 2.8 + cuBLASLt 12.9 lands native NVFP4 training on Blackwell. When that wheel hits the index, the 5090 trainer flips bf16 LoRA → fp4 LoRA at the same VRAM.
Honesty notes
Two things we are not claiming.
We did not train these base models. The base-model field selects a foundation; .kolm is the LoRA adapter on top, plus the receipt chain. The base remains under its upstream license; you remain the licensee.
Benchmark numbers in the picker come from public scorecards. The MMLU / GSM8K / MATH / HumanEval / IFEval scores quoted on this page come from the Qwen 2.5 tech report and the public Hugging Face Open LLM Leaderboard; kolm has not independently rerun them. The /leaderboard page tracks reproductions we have run; everything else is cited.