use cases / UC-03 · mobile copilots

Hermes-class behavior on any phone shipped since 2021.

A 2-3GB .kolm compiled from frontier weights, drafted by deterministic recipes, fit to your task with a LoRA. Loads cold in 1.5s. Streams 34-90 tokens/sec on iPhone 13 and up. Offline. No per-token bill. Signed receipt for every output.

01 · the honest claim

Not a quantized 70B. A compiled 3B.

We don’t put 70B on a phone. We compile a 3B-class Specialist that behaves like 70B on your specific task, because it was distilled on those exact tasks, drafted by recipes that cover its structured outputs, and grounded in your data via Recall.

Artifact size
2.4GB typical

3B INT4 base + LoRA + recipe pack + sqlite-vec index. Fits below the 4GB iCloud restore cap; ships inside an App Store binary if you split via on-demand resources.
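
A back-of-envelope budget for that 2.4GB, in TypeScript for concreteness. The component split below is an illustrative assumption, not a published spec; only the ~2.4GB total and the 4GB cap come from this page.

// Hypothetical size budget for a 2.4GB .kolm. The split is illustrative.
const budgetGB = {
  baseInt4: 1.9,     // 3B params at ~4.5 bits/param in GGUF
  lora: 0.1,         // task adapter
  recipePack: 0.15,  // deterministic draft recipes
  recallIndex: 0.25, // sqlite-vec index; grows with user data
};
const totalGB = Object.values(budgetGB).reduce((a, b) => a + b, 0);
console.log(`${totalGB.toFixed(1)}GB`); // 2.4GB, under the 4GB restore cap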

Cold start
1.5s p50

mmap the GGUF, page in the LoRA, warm the recipe LRU. NPU acceleration via ExecuTorch on iOS 17+ / Android 14+.

Tok/s on iPhone 15 Pro
62-90 tok/s

3B INT4 with recipe-drafted speculative decoding on a structured task. Sub-30ms per token, sustained on battery.
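
The tok/s bands and the sub-30ms claim are the same number in two units; a quick conversion:

// tokens/sec to ms/token for the published bands
const msPerToken = (tokS: number) => 1000 / tokS;
console.log(msPerToken(62).toFixed(1)); // 16.1ms, iPhone 15 Pro band floor
console.log(msPerToken(90).toFixed(1)); // 11.1ms, band ceiling
console.log(msPerToken(34).toFixed(1)); // 29.4ms, iPhone 13 floor, still sub-30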

02 · device matrix

Where it runs.

We test against a fixed device matrix on every release. The numbers below are conservative bands; battery- and thermal-aware throttling is built into the runtime.

| Device                   | NPU / accelerator              | Cold start | Tok/s (3B INT4) | Status        |
|--------------------------|--------------------------------|------------|-----------------|---------------|
| iPhone 15 Pro / 16 Pro   | Neural Engine 17.0+            | 1.2s       | 62-90           | supported     |
| iPhone 13 / 14 / 14 Pro  | Neural Engine 15.0+            | 1.8s       | 34-48           | supported     |
| Pixel 8 / 9              | Tensor G3 / G4 (TPU)           | 1.4s       | 52-78           | supported     |
| Pixel 7 / Samsung S22+   | Tensor G2 / Snapdragon 8 Gen 1 | 2.1s       | 28-42           | supported     |
| iPhone 12 / older        | Neural Engine A14              | 3.0s       | 14-22           | degraded      |
| Phones < 6GB RAM         | any                            | n/a        | n/a             | not supported |
03 · runtime stack

Three layers. No native code you have to maintain.

The mobile runtime ships as a React Native / Swift / Kotlin SDK. Embedded llama.cpp + ExecuTorch handle inference; recipe drafts + sqlite-vec recall handle the structured-output and grounding layers. You ship features, not inference plumbing.

RT

Runtime: llama.cpp + ExecuTorch.

llama.cpp for the GGUF base + LoRA. ExecuTorch binding for NPU offload. Compiled per-platform, signed, distributed via the SDK.

DR

Recipe drafts.

Deterministic-token subset of the model’s behavior, served from a recipe pack inside the artifact. 2-5× throughput on structured outputs (JSON, code, lists).
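
A minimal sketch of the draft-and-verify loop behind that speedup. recipeDraft, modelVerify, and sample are hypothetical stand-ins for runtime internals: the recipe proposes the deterministic tokens of a structured output almost for free, and the model verifies the whole draft in one batched forward pass instead of generating token by token.

type Token = number;

function speculate(
  recipeDraft: (prefix: Token[]) => Token[],                // cheap, deterministic
  modelVerify: (prefix: Token[], draft: Token[]) => number, // count of accepted tokens
  sample: (prefix: Token[]) => Token,                       // full model step
  prefix: Token[],
  maxNew: number
): Token[] {
  const out = [...prefix];
  while (out.length - prefix.length < maxNew) {
    const draft = recipeDraft(out);   // e.g. JSON keys, brackets, commas
    if (draft.length === 0) {         // recipe has nothing to offer here
      out.push(sample(out));
      continue;
    }
    const accepted = modelVerify(out, draft); // one batched forward pass
    out.push(...draft.slice(0, accepted));
    if (accepted < draft.length) out.push(sample(out)); // model overrides the draft
  }
  return out;
}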

RC

Recall: sqlite-vec.

Multimodal embeddings indexed locally on the device. Photos, voice memos, share-sheet captures auto-embed and ground every inference call.
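
Under the hood this is a plain sqlite-vec KNN lookup. A sketch of the schema and query; the table name, dimension, and k are illustrative, not the SDK's published schema.

// Illustrative sqlite-vec schema and KNN query (table name, dim, k assumed).
const schema = `
  CREATE VIRTUAL TABLE chunks USING vec0(
    embedding float[512]  -- one row per photo, voice memo, or capture
  );
`;
const knn = `
  SELECT rowid, distance
  FROM chunks
  WHERE embedding MATCH :query  -- query embedding as a vector blob
    AND k = 8                   -- top-8 chunks ground the call
  ORDER BY distance;
`;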

04 · integration

Five lines of React Native.

Drop a .kolm in your bundle (or stream it on first launch via on-demand resources). Open it. Call it. The SDK handles model loading, NPU offload, receipt signing, and recall. (The React Native module is in preview and ships with v7.0; the API below is the locked surface. Today it runs as browser WASM inside a WebView.)

App.tsx
// 1. install: npm i github:sneaky-hippo/kolmogorov-stack  (RN module in preview, ships v7.0)
import { Specialist } from "@kolmogorov/react-native";

const writer = await Specialist.load("email-reply-1.0.0.kolm");

// 2. ground in user's data (auto-embeds on background thread)
await writer.recall.indexFolder("~/Documents/correspondence");

// 3. infer.  offline.  signed.
const reply = await writer.complete({
  prompt: "draft a polite decline to this meeting request",
  context: incomingEmail,
});
// reply.text       — the draft
// reply.receipt    — HMAC-chained signature you can verify offline
// reply.sources    — recall chunks that grounded the draft
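
Verifying reply.receipt offline. The chain shape below is an assumption (each receipt is an HMAC over the previous receipt plus a hash of the output), written against Node's crypto for brevity; the SDK's real verifier and key handling may differ.

import { createHash, createHmac } from "node:crypto";

// Assumed chain: receipt_n = HMAC(key, receipt_{n-1} || sha256(output_n)).
// Recomputing it needs no network access.
function verifyChain(
  key: Buffer,
  outputs: string[],
  receipts: string[],        // hex, one per output
  genesis = "kolm-genesis"   // assumed chain anchor
): boolean {
  let prev = genesis;
  return outputs.every((output, i) => {
    const digest = createHash("sha256").update(output).digest("hex");
    const expected = createHmac("sha256", key).update(prev + digest).digest("hex");
    prev = receipts[i];
    return expected === receipts[i];
  });
}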
05 · what we explicitly do not promise

The mobile runtime is bounded. Here’s where.

We refuse to oversell on-device inference. The hard truths.

It is not a frontier model.

Open-domain reasoning, novel research synthesis, anything outside the compiled task’s shape will be worse than calling Sonnet/Opus over the wire. Compiled means specialized.

Battery is real.

Sustained inference draws 10-15W on an iPhone, comparable to a video call. We expose throttle hooks; you decide when to gate.
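
What gating might look like, continuing the writer from section 04. The event name and methods are hypothetical; the pattern (listen, then pause or degrade) is the point.

// Hypothetical throttle hook; names are illustrative, not the shipped API.
writer.on("thermal", ({ state }: { state: "nominal" | "fair" | "serious" }) => {
  if (state === "serious") writer.pause();          // stop mid-stream, resume later
  else if (state === "fair") writer.setMaxTokS(30); // trade speed for watts
});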

Cross-app context is opt-in.

Recall scopes to your app’s sandbox by default. No system-wide harvesting. Crossing into Photos/Files requires an explicit user grant and a visible scope receipt.
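
The grant flow, sketched against the same writer with hypothetical method names: nothing outside the sandbox is readable until the user approves, and the approval comes back as a scope receipt you can show.

// Hypothetical scope-grant API; methods and fields are assumptions.
const grant = await writer.recall.requestScope("photos");
if (grant.approved) {
  await writer.recall.indexScope(grant); // embeddings stay on-device
  showScopeReceipt(grant.receipt);       // your UI; surfaces the visible scope receipt
}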

Ship the AI experience that survives airplane mode.

A pre-compiled Specialist your users own, on a device they own, with receipts they can verify offline. The opposite of a thin client.