Runs in any modern browser via WebAssembly. Chrome 121+, Firefox 122+, Safari 17+. The end user installs nothing: weights load over HTTP, decode runs inside the tab, the model is cached in IndexedDB after first load. Throughput is honest single-thread WASM: about 9 tok/s for 7B int4 on a modern laptop.
| Output artifact | phi-redactor.wasm-bundle.tar |
|---|---|
| Bundle size | about 1.2 GB (wasm runtime + ggml weights) |
| Throughput | 9 tok/s (7B INT4, single-thread WASM, modern laptop) |
| K-score (estimated) | 0.86 (4-bit drift) |
| Fits in tab | yes (1.2 GB weights, 4 GB tab budget) |
| First load (cold) | about 22 seconds (decode + warm) |
| Subsequent loads | near-instant (weights cached in IndexedDB) |
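To set expectations from the numbers above, a quick back-of-envelope sketch (plain arithmetic, nothing kolm-specific): at 9 tok/s, a 200-token output costs about 22 seconds of decode on top of any cold-load time.

```javascript
// Back-of-envelope decode-latency estimate from the table above.
// Illustrative only; real throughput varies with hardware and input length.
function decodeSeconds(outputTokens, tokPerSec = 9) {
  return outputTokens / tokPerSec;
}

// A 200-token redaction at 9 tok/s takes roughly 22 seconds to stream.
```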
The wasm bundle packs the wasm-llamacpp runtime, the ggml quantized weights, and the kolm manifest into a single .tar that kolm serve can host. There is no install step on the source machine beyond having the kolm CLI itself.
$ kolm export phi-redactor.kolm --device browser-wasm --quant int4
# output: phi-redactor.wasm-bundle.tar (about 1.2 GB)
The bundle ships with a tiny static frontend (HTML + JS) that talks to the wasm runtime in-tab. No backend round-trip after the initial download.
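The frontend's job is small: feed input to the in-tab runtime and append tokens as they stream back. A minimal sketch of that loop, with `fakeStream` standing in for the wasm runtime's token stream (the real kolm runtime API is not documented here, so all names are illustrative):

```javascript
// Stand-in for the wasm runtime's streaming decode; the real runtime yields
// tokens at ~9 tok/s. These names are illustrative, not a documented kolm API.
async function* fakeStream(tokens) {
  for (const t of tokens) yield t;
}

// Drive the stream and hand each token to the UI as it arrives, so the page
// renders output incrementally instead of waiting for the full completion.
async function runInTab(streamFn, input, onToken) {
  const parts = [];
  for await (const tok of streamFn(input.split(" "))) {
    parts.push(tok);
    onToken(tok); // e.g. append to the output <pre> in the page
  }
  return parts.join(" ");
}
```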
Use kolm serve to host the bundle on a local port. It is a static HTTP server with the right MIME types and IndexedDB-friendly Cache-Control headers baked in. It does not run inference; the browser does.
$ kolm serve phi-redactor.wasm-bundle.tar --port 8080
# static UI now live at http://localhost:8080
For internal demos, deploy the same bundle to any static host (Vercel, S3, Cloudflare Pages). The wasm runtime does not need WebGPU, WebGL, or any server-side compute.
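A custom static host has to reproduce the headers kolm serve bakes in. A minimal sketch of such a header set, with illustrative values (the cross-origin isolation pair only matters if you enable threaded WASM; the long Cache-Control keeps the big weight files cacheable):

```javascript
// Headers a custom static host should send for the bundle's files.
// Values here are reasonable defaults, not kolm's documented configuration.
const ISOLATION_AND_CACHE = {
  "Cross-Origin-Opener-Policy": "same-origin",    // cross-origin isolation,
  "Cross-Origin-Embedder-Policy": "require-corp", // required for threaded WASM
  "Cache-Control": "public, max-age=31536000, immutable", // weights rarely change
};

const MIME = {
  ".wasm": "application/wasm", // wrong MIME here breaks streaming compilation
  ".js": "text/javascript",
  ".html": "text/html",
};

function headersFor(path) {
  const ext = path.slice(path.lastIndexOf("."));
  return { "Content-Type": MIME[ext] || "application/octet-stream", ...ISOLATION_AND_CACHE };
}
```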
Open http://localhost:8080 in Chrome 121+. The page shows a textarea and a Run button. First click streams the weights from the server into IndexedDB, decodes, and warms the wasm runtime in about 22 seconds. Subsequent clicks are near-instant.
# end-user flow:
# 1. open http://localhost:8080
# 2. wait ~22s for first cold load (weights download, decode, warm)
# 3. type or paste input, click Run
# 4. stream tokens at ~9 tok/s
# 5. tab can be closed and reopened; weights stay cached in IndexedDB
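The cold-vs-warm behavior in step 5 can be sketched as plain logic. In the real frontend the store is IndexedDB; here it is abstracted to an async get/put interface so only the decision shows (the cache key and return shape are assumptions, not kolm internals):

```javascript
// Sketch of the frontend's cold/warm decision. The store argument abstracts
// IndexedDB; the key name "phi-redactor-int4" is an assumption for illustration.
async function loadWeights(store, fetchWeights, key = "phi-redactor-int4") {
  const cached = await store.get(key);   // IndexedDB lookup in the real frontend
  if (cached) return { weights: cached, cold: false }; // warm: near-instant
  const weights = await fetchWeights();  // cold: ~22 s download + decode + warm
  await store.put(key, weights);
  return { weights, cold: true };
}
```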
No CLI is required for the end user; the whole flow above is point-and-click. The only CLI usage is by whoever runs kolm serve (or deploys the bundle to a static host once).
Multi-threaded WASM needs cross-origin isolation headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy), which kolm serve sets correctly. Even with threading, browser WASM is 2 to 4x slower than native ggml: for a 70 tok/s native run, expect 15 to 25 tok/s in-browser at best. Treat the browser as the zero-install path, not the throughput path.

The frontend requests persistent storage via navigator.storage.persist() on first load, but the user has to accept the prompt. If declined, the cache is best-effort. Tell users to grant persistent storage in the browser's site-data settings.

The isolation headers matter when self-hosting: kolm serve sets them, but a custom static host (Vercel, S3, Cloudflare) needs them too. Without them WASM falls back to single-thread, which works but is slower.

The browser bundle ships the same kolm.manifest.json as the desktop export, with K-score, SHA-256 of the weights, quant tier, and signature. Drop the bundle into /verify-prod to recompute the six checks in another browser tab. No upload, no backend.
# on the source machine
$ kolm verify phi-redactor.kolm --binder report.html