
Deploy a .kolm to a phone.

Export the same artifact to iPhone (CoreML), Android (ONNX Mobile), or Apple Silicon (MLX). Forecast size, latency, and fit before shipping. Same K-score, smaller binary, fully offline on the device.

Golden Path 3 of 3 · v1.0 · assumes you already completed Golden Path 1

What you will have:

  • A device-native model file ready to drop into your iOS / Android / Mac app: CoreML .mlpackage for iPhone, ONNX .onnx for Android, an MLX bundle for M-series Macs.
  • Predicted size and latency on your target device.
  • The same task and K-score, with no cloud round-trip.

Targets supported

Device class            Backend    Output        Typical size (3B int4)
iPhone 15 Pro           coreml     .mlpackage    1.7 GB
iPhone 12 / 13          coreml     .mlpackage    1.7 GB
Pixel 8 / OnePlus 12    onnx       .onnx         1.7 GB
M-series Mac (M1+)      mlx        mlx bundle    1.7 GB
Raspberry Pi 5 (8GB)    gguf       .gguf         1.7 GB
Jetson Orin Nano        tensorrt   engine plan   2.1 GB
Snapdragon X Elite      onnx       .onnx         1.7 GB

The full picker, with predicted latency and memory headroom for every device, lives at /device-transfer.

step 1

Preview the fit (no toolchain).

30 seconds

Before installing coremltools or ONNX Runtime, kolm export --preview tells you in JSON whether the artifact will fit on the device, the predicted tokens/sec, and the K-score loss from quantization. It is a pure JS lookup, no Python required.

kolm export my-redactor.kolm --preview --device iphone-15-pro --quant int4
{
  "device":                "iPhone 15 Pro (8GB)",
  "quant":                 "int4",
  "size_mb":               1741,
  "estimated_latency_ms":  33.3,
  "tok_per_s":             30,
  "k_loss":                -0.02,
  "k_score_est":           0.910,
  "fits":                  true,
  "fit_verdict":           "fit",
  "backend":               "coreml"
}
checkpoint fits: true and a k_score_est ≥ 0.85 mean you can ship to that device. fit_verdict: tight means it will run but with little headroom; over means try a smaller quantization (int3) or a smaller base.

Same probe against a Pixel 8:

kolm export my-redactor.kolm --preview --device pixel-8 --quant int4
{
  "device":                "Pixel 8 (12GB)",
  "quant":                 "int4",
  "size_mb":               1741,
  "estimated_latency_ms":  41.2,
  "tok_per_s":             24,
  "k_score_est":           0.910,
  "fits":                  true,
  "backend":               "onnx"
}
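
In CI, the preview gates cleanly on its JSON. A minimal sketch, assuming jq is on your PATH and the output shape shown above:

# exit non-zero unless the artifact fits and the estimated K-score clears the 0.85 ship gate
kolm export my-redactor.kolm --preview --device pixel-8 --quant int4 \
  | jq -e '.fits and (.k_score_est >= 0.85)' > /dev/null \
  || { echo "preview gate failed"; exit 1; }
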
step 2

Run the export.

3 minutes

The actual export converts the LoRA-adapted base into the target binary format. CoreML uses Apple's coremltools; ONNX uses optimum; MLX uses mlx_lm; GGUF uses llama.cpp. kolm export shells out to whichever toolchain is on your PATH.
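
If you want to check the toolchain up front, a quick probe for the CoreML path (a sketch; kolm doctor --export, covered under Edge cases below, is the fuller diagnosis):

# confirm Apple's coremltools is importable before kolm shells out to it
python -c "import coremltools" 2>/dev/null || pip install coremltools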

kolm export my-redactor.kolm --backend coreml --quant int4 --out ./exports
resolving base from manifest . llama-3.2-3b
applying LoRA adapter . ok
quantizing to int4 . ok
running coremltools convert . ok
verifying export against held-out cases . K=0.910 (-0.026 from fp16)
wrote ./exports/my-redactor.mlpackage  (1.7 GB)

For Android (ONNX Mobile):

kolm export my-redactor.kolm --backend onnx --quant int4 --opset 17 --out ./exports

For Apple Silicon Macs (MLX, useful for desktop apps):

kolm export my-redactor.kolm --backend mlx --quantize --out ./exports
checkpoint the export verb re-runs the held-out evaluation against the converted binary. The K= printed at the end is measured, not estimated, and it is reproducible. If it drops below 0.85, do not ship.
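
The transcript is also easy to gate on in CI. A sketch that parses the K= line shown above (adjust the 0.85 threshold to your own gate):

kolm export my-redactor.kolm --backend coreml --quant int4 --out ./exports | tee export.log
# fail the job if the measured post-export K-score falls below 0.85
grep -o 'K=[0-9.]*' export.log | head -1 | cut -d= -f2 | awk '$1 < 0.85 { exit 1 }'
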
step 3

Bundle and ship.

90 seconds

iOS. Drag the .mlpackage into your Xcode target. Call from Swift:

import CoreML

// Xcode generates the MyRedactor class (and its Input/Output types) from the .mlpackage
let model = try MyRedactor(configuration: MLModelConfiguration())
let out = try model.prediction(input: MyRedactorInput(text: input))
print(out.redacted)

Android. Drop the .onnx into your assets/ folder. Call from Kotlin via ONNX Runtime Mobile:

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment

val env = OrtEnvironment.getEnvironment()
val session = env.createSession(assets.open("my-redactor.onnx").readBytes())
val out = session.run(mapOf("text" to OnnxTensor.createTensor(env, input)))
println(out.get("redacted").get().value)  // each output arrives as an Optional<OnnxValue>

macOS. MLX has Python and Swift bindings. For a quick test from the terminal:

python -m mlx_lm.generate --model ./exports/my-redactor.mlx --prompt "..."
checkpoint the model runs on-device with no network call. The K-score is the same as the laptop K-score you saw in Golden Path 1, minus the small quantization loss the preview predicted.

What just happened

You took one .kolm and produced a device-native model in three commands. The K-score is reproducible; the size and latency match the preview. The same artifact, ejected, would let an external auditor trace every layer of the conversion. No remote inference, no API keys, no per-call cost.

Edge cases

  • Toolchain missing. The export fails with a clear "coremltools not found; install with pip install coremltools" message. kolm doctor --export diagnoses your PATH.
  • Quantization loss exceeds gate. Drop to int3 only if your task tolerates the K-loss. For PII redaction the loss is tiny; for code generation it is not.
  • Custom recipes that call out to web APIs. kolm refuses to export a recipe that requires network. kolm inspect --net shows which recipes are air-gap-safe.
  • FedRAMP / classified environments. The export step itself runs offline; kolm export --offline-only hard-fails if any sub-step would hit the network (see the preflight sketch after this list).
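
For air-gapped pipelines, the last two checks chain into a single preflight. A sketch using only the flags shown above, and assuming each command signals failure through its exit code:

# verify the toolchain, confirm the recipe is air-gap-safe, then export with the network hard-disabled
kolm doctor --export \
  && kolm inspect my-redactor.kolm --net \
  && kolm export my-redactor.kolm --backend coreml --quant int4 --offline-only --out ./exports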

What to do next