A 2–3GB .kolm compiled from frontier weights, drafted by deterministic recipes, and fit to your task with a LoRA. Loads cold in 1.5s. Streams 34-90 tokens/sec on iPhone 13 and up. Offline. No per-token bill. Signed receipt for every output.
We don’t put 70B on a phone. We compile a 3B-class Specialist that behaves like 70B on your specific task, because it was distilled on that exact task, drafted by recipes that cover its structured outputs, and grounded in your data via Recall.
3B INT4 base + LoRA + recipe pack + sqlite-vec index. Fits below the 4GB iCloud restore cap; ships inside an App Store binary if you split via on-demand resources.
mmap the GGUF, page in the LoRA, warm the recipe LRU. NPU acceleration via ExecuTorch on iOS 17+ / Android 14+.
3B INT4 with recipe-drafted speculative decoding on a structured task. Sub-30ms per token, sustained on battery.
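What "recipe-drafted" means mechanically, as a minimal sketch: the recipe proposes the next k tokens deterministically, one batched forward pass verifies them, and every confirmed token costs a fraction of a full decode step. `draftFromRecipe` and `verifyBatch` below are hypothetical stand-ins for the native runtime's internals, not SDK calls.

```ts
// Minimal sketch of recipe-drafted speculative decoding (greedy variant).
// draftFromRecipe and verifyBatch are hypothetical stand-ins for the native
// runtime; they are NOT part of the public SDK surface.

type Token = number;

// Hypothetical: the recipe pack proposes the next k tokens deterministically
// (e.g. the braces, commas, and key names of a known JSON shape).
declare function draftFromRecipe(prefix: Token[], k: number): Token[];

// Hypothetical: one batched forward pass returns the model's own greedy
// choice at each drafted position.
declare function verifyBatch(prefix: Token[], draft: Token[]): Token[];

function speculativeStep(prefix: Token[], k = 8): Token[] {
  const draft = draftFromRecipe(prefix, k);
  const modelChoice = verifyBatch(prefix, draft);
  const accepted: Token[] = [];
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] !== modelChoice[i]) {
      accepted.push(modelChoice[i]); // model disagrees: take its token, stop accepting
      break;
    }
    accepted.push(draft[i]); // draft confirmed at ~1/k the cost of a full step
  }
  return accepted; // always >= 1 token per forward pass, up to k on structured spans
}
```

On structured spans the draft is usually right, which is where the 2-5× band on JSON and code comes from.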
We test against a fixed device matrix on every release. The numbers below are conservative bands; battery- and thermal-aware throttling is built into the runtime.
| Device | NPU / accelerator | Cold start | Tok/s (3B INT4) | Status |
|---|---|---|---|---|
| iPhone 15 Pro / 16 Pro | Neural Engine 17.0+ | 1.2s | 62-90 | supported |
| iPhone 13 / 14 / 14 Pro | Neural Engine 15.0+ | 1.8s | 34-48 | supported |
| Pixel 8 / 9 | Tensor G3 / G4 (TPU) | 1.4s | 52-78 | supported |
| Pixel 7 / Samsung S22+ | Tensor G2 / Snapdragon 8G1 | 2.1s | 28-42 | supported |
| iPhone 12 / older | Neural Engine A14 | 3.0s | 14-22 | degraded |
| Phones < 6GB RAM | any | n/a | n/a | not supported |
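A sketch of gating on this matrix at runtime. `Specialist.capabilities()` and its fields are assumptions, not a documented call; the shape of the check is the point.

```ts
import { Specialist } from "@kolmogorov/react-native";

// Hypothetical capability probe: the method name and fields are assumptions.
async function supportTier(): Promise<"supported" | "degraded" | "unsupported"> {
  const caps = await Specialist.capabilities();
  if (caps.totalRamGB < 6) return "unsupported"; // hard floor from the matrix above
  if (!caps.npuAvailable) return "degraded";     // A14-class: CPU path, 14-22 tok/s
  return "supported";
}
```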
The mobile runtime ships as a React Native / Swift / Kotlin SDK. Embedded llama.cpp + ExecuTorch handle inference; recipe drafts + sqlite-vec recall handle the structured-output and grounding layers. You ship features, not inference plumbing.
llama.cpp for the GGUF base + LoRA. ExecuTorch binding for NPU offload. Compiled per platform, signed, and distributed via the SDK.
Deterministic-token subset of the model’s behavior, served from a recipe pack inside the artifact. A 2-5× speedup in tokens/sec on structured outputs (JSON, code, lists).
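How that might surface in the API, as an illustration only: the `schema` option below is an assumption, not part of the locked surface. The idea is that scaffolding tokens come from the recipe pack as deterministic drafts, so only the values pay full per-token cost.

```ts
import { Specialist } from "@kolmogorov/react-native";

const triage = await Specialist.load("email-triage-1.0.0.kolm"); // hypothetical artifact
const result = await triage.complete({
  prompt: "classify this email and extract the action item",
  context: emailBody,             // assume a string already in scope
  schema: {                       // hypothetical option: the recipe pack drafts
    type: "object",               // the braces and key names, so only the two
    properties: {                 // values are decoded at full per-token cost
      label: { type: "string" },
      action: { type: "string" },
    },
  },
});
```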
Multimodal embeddings indexed locally on the device. Photos, voice memos, share-sheet captures auto-embed and ground every inference call.
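A grounding call might look like the sketch below. `indexItem` and `search` are assumptions about the recall API (only `indexFolder` appears in the quick-start), included to show the shape of the layer.

```ts
import { Specialist } from "@kolmogorov/react-native";

const notes = await Specialist.load("meeting-notes-1.0.0.kolm"); // hypothetical artifact

// Hypothetical calls: indexItem / search are assumptions, not documented API.
// capturedMemoUri: assume a file URI from your share-sheet handler.
await notes.recall.indexItem({ kind: "voice-memo", uri: capturedMemoUri });
const hits = await notes.recall.search("action items from tuesday", { k: 4 });
// hits feed the same grounding path that complete() uses implicitly
```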
Drop a .kolm in your bundle (or stream it on first launch via on-demand resources). Open it. Call it. The SDK handles model loading, NPU offload, receipt signing, recall. (The React Native module is in preview and ships with v7.0; the API below is the locked surface. Today it runs as browser WASM inside a WebView.)
```ts
// 1. install: npm i github:sneaky-hippo/kolmogorov-stack (RN module in preview, ships v7.0)
import { Specialist } from "@kolmogorov/react-native";

const writer = await Specialist.load("email-reply-1.0.0.kolm");

// 2. ground in user's data (auto-embeds on background thread)
await writer.recall.indexFolder("~/Documents/correspondence");

// 3. infer. offline. signed.
const reply = await writer.complete({
  prompt: "draft a polite decline to this meeting request",
  context: incomingEmail,
});

// reply.text — the draft
// reply.receipt — HMAC-chained signature you can verify offline
// reply.sources — recall chunks that grounded the draft
```
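What "HMAC-chained" buys you, sketched as a verifier. The link fields and the `hmacSha256` helper are assumptions about the receipt format; the SDK ships its own verifier, and this only shows the shape of the check.

```ts
// Sketch of verifying an HMAC chain offline. The ReceiptLink fields and
// hmacSha256 are assumptions; substitute the SDK's actual verifier.

interface ReceiptLink {
  prev: string;        // MAC of the previous link ("" for the genesis link)
  payloadHash: string; // hash of this output's text + sources
  mac: string;         // HMAC(key, prev || payloadHash)
}

declare function hmacSha256(key: string, msg: string): string; // any HMAC impl

function verifyChain(links: ReceiptLink[], key: string): boolean {
  let prev = "";
  for (const link of links) {
    if (link.prev !== prev) return false; // chain broken: a link was dropped or reordered
    if (hmacSha256(key, link.prev + link.payloadHash) !== link.mac) return false; // forged
    prev = link.mac; // the next link must point here
  }
  return true; // every output since genesis is accounted for, no network needed
}
```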
We refuse to oversell on-device. The hard truths.
Open-domain reasoning, novel research synthesis, or anything outside the compiled task’s shape will be worse than calling Sonnet/Opus over the wire. Compiled means specialized.
Sustained inference is 10-15W on iPhone, comparable to a video call. We expose throttle hooks; you decide when to gate.
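The hook surface might look like the sketch below, reusing `writer` from the quick-start; `onThermal` and the setters are assumptions about the exposed API, shown to make "you decide when to gate" concrete.

```ts
// Hypothetical hook names; the real surface may differ.
writer.onThermal((state) => {
  if (state.level === "serious") writer.setMaxTokensPerSec(20); // trade latency for heat
  if (state.level === "critical") writer.pause();               // stop until cooldown
});
```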
Recall scopes to your app’s sandbox by default. No system-wide harvesting. Crossing into Photos/Files requires an explicit user grant and a visible scope receipt.
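Crossing that boundary could look like this; `requestScope` and the grant fields are assumptions consistent with the receipt model, not a documented call.

```ts
// Hypothetical API, reusing `writer` from the quick-start above.
const grant = await writer.recall.requestScope("photos"); // triggers the OS consent prompt
if (grant.granted) {
  await writer.recall.indexFolder(grant.scopedPath); // only the granted subtree
  showScopeReceipt(grant.receipt);                   // the visible scope receipt (your UI)
}
```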
A pre-compiled Specialist your users own, on a device they own, with receipts they can verify offline. The opposite of a thin client.