Browser quickstart · WebAssembly · no install on the client

Browser (WASM): export, serve, click Run.

Runs in any modern browser via WebAssembly. Chrome 121+, Firefox 122+, Safari 17+. The end user installs nothing: weights load over HTTP, decode runs inside the tab, the model is cached in IndexedDB after first load. Throughput is honest single-thread WASM: about 9 tok/s for 7B int4 on a modern laptop.

Runtime: WebAssembly + SIMD
Browsers: Chrome 121+, Firefox 122+, Safari 17+
Quant: int4 only on the web
Backend: wasm-llamacpp (ggml + WASM SIMD)
Throughput (7B int4): about 9 tok/s
Memory budget: about 4 GB per tab

Forecast for a 7B int4 artifact in the browser.

Output artifact: phi-redactor.wasm-bundle.tar
Bundle size: about 1.2 GB (wasm runtime + ggml weights)
Throughput: 9 tok/s (7B int4, single-thread WASM, modern laptop)
K-score (estimated): 0.86 (4-bit drift)
Fits in tab: yes (1.2 GB weights, 4 GB tab budget)
First load (cold): about 22 seconds (download + decode + warm)
Subsequent loads: near-instant (weights cached in IndexedDB)
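The forecast numbers above compose by simple arithmetic. A back-of-envelope sketch; the link speed and the decode-plus-warm split are assumed figures for illustration, not kolm measurements:

```python
# Rough cold-load and tab-fit arithmetic for the forecast above.
# The 1 Gbps link and the 12 s decode + warm figure are assumptions.
def cold_load_seconds(bundle_gb: float, link_gbps: float, decode_warm_s: float) -> float:
    """Download time at the given link speed, plus decode and warm."""
    download_s = bundle_gb * 8 / link_gbps  # GB -> gigabits, then / Gbps
    return download_s + decode_warm_s

def fits_in_tab(weights_gb: float, tab_budget_gb: float = 4.0) -> bool:
    """Does the weight payload fit the per-tab memory budget?"""
    return weights_gb <= tab_budget_gb

# A 1.2 GB bundle on a ~1 Gbps LAN with ~12 s of decode + warm
# lands near the ~22 s forecast.
print(round(cold_load_seconds(1.2, 1.0, 12.0), 1))  # 21.6
print(fits_in_tab(1.2))  # True
```

On a slower office link the download term dominates, so treat the 22 s figure as a local-network number.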

Step 1. Export from your .kolm on the source machine.

The wasm bundle packs the wasm-llamacpp runtime, the ggml quantized weights, and the kolm manifest into a single .tar that kolm serve can host. There is no install step on the source machine beyond having the kolm CLI itself.

$ kolm export phi-redactor.kolm --device browser-wasm --quant int4
# output: phi-redactor.wasm-bundle.tar (about 1.2 GB)

The bundle ships with a tiny static frontend (HTML + JS) that talks to the wasm runtime in-tab. No backend round-trip after the initial download.
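Because the bundle is a plain .tar, its contents can be inspected with the standard library before serving it. A small sketch; the member names you will see depend on the export, so none are assumed here:

```python
# List the members of an exported bundle (a plain tar archive).
# Useful for confirming the runtime, weights, and manifest made it in.
import tarfile

def bundle_members(path: str) -> list[str]:
    """Return the file names packed inside the bundle tar."""
    with tarfile.open(path) as tar:
        return [member.name for member in tar.getmembers()]
```

For example, `bundle_members("phi-redactor.wasm-bundle.tar")` should include the wasm runtime, the ggml weights, and kolm.manifest.json.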

Step 2. Serve the bundle.

Use kolm serve to host the bundle on a local port. This is a static HTTP server with the right MIME types and IndexedDB-friendly Cache-Control headers baked in. It does not run inference; the browser does.

$ kolm serve phi-redactor.wasm-bundle.tar --port 8080
# static UI now live at http://localhost:8080

For internal demos, deploy the same bundle to any static host (Vercel, S3, Cloudflare Pages). The wasm runtime does not need WebGPU, WebGL, or any server-side compute.

Step 3. Open the browser and click Run.

Open http://localhost:8080 in Chrome 121+. The page shows a textarea and a Run button. First click streams the weights from the server into IndexedDB, decodes, and warms the wasm runtime in about 22 seconds. Subsequent clicks are near-instant.

# end-user flow:
# 1. open http://localhost:8080
# 2. wait ~22s for first cold load (weights download, decode, warm)
# 3. type or paste input, click Run
# 4. stream tokens at ~9 tok/s
# 5. tab can be closed and reopened, weights stay cached in IndexedDB

No CLI required for the end user. The whole flow above is point-and-click. The only CLI usage is by whoever runs kolm serve (or deploys the bundle to a static host once).

Throughput envelope across browsers and devices.

WASM is single-thread by default. SharedArrayBuffer and threading are gated behind cross-origin isolation headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy) which kolm serve sets correctly. Even with threading, browser WASM is 2 to 4x slower than native ggml. For a 70 tok/s native run, expect 15 to 25 tok/s in-browser at best. Treat the browser as the zero-install path, not the throughput path.
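If you host the bundle on your own static server rather than through kolm serve, you need to add those two isolation headers yourself or the browser will not expose SharedArrayBuffer. A minimal sketch using Python's stdlib server; the header values are the standard ones for cross-origin isolation, and the serve entry point is left uncalled:

```python
# Minimal static file server that adds the cross-origin isolation
# headers (COOP/COEP) required for SharedArrayBuffer and WASM threads.
# A sketch of what kolm serve does for you; run from the unpacked
# bundle directory.
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}

class IsolatedHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Inject COOP/COEP on every response before the header block closes.
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()

def serve(port: int = 8080) -> None:
    """Serve the current directory with isolation headers set."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

With both headers present, `window.crossOriginIsolated` is true in the tab and threaded WASM becomes possible; without them the runtime silently falls back to single-thread.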

Troubleshooting.

Verify the artifact stayed honest.

The browser bundle ships the same kolm.manifest.json as the desktop export, with K-score, SHA-256 of the weights, quant tier, and signature. Drop the bundle into /verify-prod to recompute the six checks in another browser tab. No upload, no backend.

# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html
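One of the manifest checks can also be reproduced by hand: recompute the SHA-256 of the weights and compare it to the digest recorded in kolm.manifest.json. A sketch; the weights_sha256 field name is a guess at the manifest schema, used only for illustration:

```python
# Recompute the weights digest and compare against kolm.manifest.json.
# The "weights_sha256" key is an assumed schema field, not documented.
import hashlib
import json
from pathlib import Path

def weights_digest_ok(manifest_path: str, weights_path: str) -> bool:
    """True if the weights file hashes to the digest in the manifest."""
    expected = json.loads(Path(manifest_path).read_text())["weights_sha256"]
    digest = hashlib.sha256()
    with open(weights_path, "rb") as f:
        # Stream in 1 MiB chunks so multi-GB weight files stay cheap.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected
```

Any single flipped byte in the weights flips the result to False, which is the point of shipping the digest alongside the artifact.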
