Runs in any modern browser via WebAssembly. Chrome 121+, Firefox 122+, Safari 17+. The end user installs nothing: weights load over HTTP, decode runs inside the tab, the model is cached in IndexedDB after first load. Throughput is honest single-thread WASM: about 9 tok/s for 7B int4 on a modern laptop.
| Output artifact | phi-redactor.wasm-bundle.tar |
|---|---|
| Bundle size | about 1.2 GB (wasm runtime + ggml weights) |
| Throughput | 9 tok/s (7B INT4, single-thread WASM, modern laptop) |
| K-score (estimated) | 0.86 (4-bit drift) |
| Fits in tab | yes (1.2 GB weights, 4 GB tab budget) |
| First load (cold) | about 22 seconds (decode + warm) |
| Subsequent loads | near-instant (weights cached in IndexedDB) |
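To set expectations from the numbers above, a quick back-of-envelope sketch (plain arithmetic, nothing kolm-specific): at 9 tok/s, a 200-token output costs about 22 seconds of decode on top of any cold-load time.

```javascript
// Back-of-envelope decode-latency estimate from the table above.
// Illustrative only; real throughput varies with hardware and input length.
function decodeSeconds(outputTokens, tokPerSec = 9) {
  return outputTokens / tokPerSec;
}

// A 200-token redaction at 9 tok/s takes roughly 22 seconds to stream.
```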
The wasm bundle packs the wasm-llamacpp runtime, the ggml quantized weights, and the kolm manifest into a single .tar that kolm serve can host. There is no install step on the source machine beyond having the kolm CLI itself.
$ kolm export phi-redactor.kolm --device browser-wasm --quant int4
# output: phi-redactor.wasm-bundle.tar (about 1.2 GB)
The bundle ships with a tiny static frontend (HTML + JS) that talks to the wasm runtime in-tab. No backend round-trip after the initial download.
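The frontend's job is small: feed input to the in-tab runtime and append tokens as they stream back. A minimal sketch of that loop, with `fakeStream` standing in for the wasm runtime's token stream (the real kolm runtime API is not documented here, so all names are illustrative):

```javascript
// Stand-in for the wasm runtime's streaming decode; the real runtime yields
// tokens at ~9 tok/s. These names are illustrative, not a documented kolm API.
async function* fakeStream(tokens) {
  for (const t of tokens) yield t;
}

// Drive the stream and hand each token to the UI as it arrives, so the page
// renders output incrementally instead of waiting for the full completion.
async function runInTab(streamFn, input, onToken) {
  const parts = [];
  for await (const tok of streamFn(input.split(" "))) {
    parts.push(tok);
    onToken(tok); // e.g. append to the output <pre> in the page
  }
  return parts.join(" ");
}
```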
Use kolm serve to host the bundle on a local port. It is a static HTTP server with the right MIME types and IndexedDB-friendly Cache-Control headers baked in. It does not run inference; the browser does.
$ kolm serve phi-redactor.wasm-bundle.tar --port 8080
# static UI now live at http://localhost:8080
For internal demos, deploy the same bundle to any static host (Vercel, S3, Cloudflare Pages). The wasm runtime does not need WebGPU, WebGL, or any server-side compute.
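A custom static host has to reproduce the headers kolm serve bakes in. A minimal sketch of such a header set, with illustrative values (the cross-origin isolation pair only matters if you enable threaded WASM; the long Cache-Control keeps the big weight files cacheable):

```javascript
// Headers a custom static host should send for the bundle's files.
// Values here are reasonable defaults, not kolm's documented configuration.
const ISOLATION_AND_CACHE = {
  "Cross-Origin-Opener-Policy": "same-origin",    // cross-origin isolation,
  "Cross-Origin-Embedder-Policy": "require-corp", // required for threaded WASM
  "Cache-Control": "public, max-age=31536000, immutable", // weights rarely change
};

const MIME = {
  ".wasm": "application/wasm", // wrong MIME here breaks streaming compilation
  ".js": "text/javascript",
  ".html": "text/html",
};

function headersFor(path) {
  const ext = path.slice(path.lastIndexOf("."));
  return { "Content-Type": MIME[ext] || "application/octet-stream", ...ISOLATION_AND_CACHE };
}
```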
Open http://localhost:8080 in Chrome 121+. The page shows a textarea and a Run button. First click streams the weights from the server into IndexedDB, decodes, and warms the wasm runtime in about 22 seconds. Subsequent clicks are near-instant.
# end-user flow:
# 1. open http://localhost:8080
# 2. wait ~22s for first cold load (weights download, decode, warm)
# 3. type or paste input, click Run
# 4. stream tokens at ~9 tok/s
# 5. tab can be closed and reopened; weights stay cached in IndexedDB
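The cold-vs-warm behavior in step 5 can be sketched as plain logic. In the real frontend the store is IndexedDB; here it is abstracted to an async get/put interface so only the decision shows (the cache key and return shape are assumptions, not kolm internals):

```javascript
// Sketch of the frontend's cold/warm decision. The store argument abstracts
// IndexedDB; the key name "phi-redactor-int4" is an assumption for illustration.
async function loadWeights(store, fetchWeights, key = "phi-redactor-int4") {
  const cached = await store.get(key);   // IndexedDB lookup in the real frontend
  if (cached) return { weights: cached, cold: false }; // warm: near-instant
  const weights = await fetchWeights();  // cold: ~22 s download + decode + warm
  await store.put(key, weights);
  return { weights, cold: true };
}
```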
No CLI is required for the end user; the whole flow above is point-and-click. The only CLI usage is by whoever runs kolm serve (or deploys the bundle to a static host once).
Multi-threaded WASM needs cross-origin isolation headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy), which kolm serve sets correctly. Even with threading, browser WASM is 2 to 4x slower than native ggml: for a 70 tok/s native run, expect 15 to 25 tok/s in-browser at best. Treat the browser as the zero-install path, not the throughput path.

The frontend requests persistent storage via navigator.storage.persist() on first load, but the user has to accept the prompt. If declined, the cache is best-effort. Tell users to grant persistent storage in the browser's site-data settings.

The isolation headers matter when self-hosting: kolm serve sets them, but a custom static host (Vercel, S3, Cloudflare) needs them too. Without them WASM falls back to single-thread, which works but is slower.

The browser bundle ships the same kolm.manifest.json as the desktop export, with K-score, SHA-256 of the weights, quant tier, and signature. Drop the bundle into /verify-prod to recompute the six checks in another browser tab. No upload, no backend.
# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html