Healthcare · 2026-05-07 · 12 min read

HIPAA-safe AI on a laptop, where PHI never leaves the device.

Every BAA you sign with a frontier model vendor is a ticking compliance bomb. Even when the legal language is clean, the data-flow story isn't: the prompt with PHI traverses your edge, the vendor's edge, the vendor's logging stack, and any sub-processor in their chain. The cleanest fix is also the simplest: don't send the PHI in the first place. Compile the model once, run it on the user's laptop forever.

By Kolmogorov · Tags: HIPAA · clinical · on-device

The shape of the HIPAA problem with frontier model APIs.

HIPAA does not say "you cannot use AI." It says you must control the disclosure of protected health information. Disclosure to a vendor is permissible only via a Business Associate Agreement that pushes obligations down the chain to that vendor. In theory this is workable: sign BAAs with OpenAI, Anthropic, AWS Bedrock, and the like, and you have legal cover.

In practice, three failure modes keep showing up in our customer conversations: PHI transiting and being logged on infrastructure the covered entity does not control, vendor-side retention windows it cannot shorten, and sub-processor chains that quietly expand the BAA surface beyond the vendor who signed it.

The cleanest fix is not to send the PHI at all. If the model lives on the user's device, there is no disclosure event. There is no third-party retention. There is no sub-processor surface. The BAA you would have signed with a vendor becomes unnecessary because the vendor is not in the pipe.

The inversion: compile in a clean room, run on the device.

Frontier-class behavior on PHI without sending PHI to a frontier vendor sounds contradictory until you separate training time from inference time.

At training (compile) time, the only data that touches a frontier API is de-identified or synthetic seed data the team has already cleared for outbound use. The compiler never sees the live patient corpus. It sees the task description, the de-identified examples, the verifier you wrote, and the budget. From these, it produces a portable artifact tuned to the task.

At inference (run) time, the artifact runs on the user's device — a clinician's laptop, a tablet at the bedside, an admin's workstation. The patient corpus lives in a local sqlite-vec index inside the artifact. The user's queries hit the local index, get retrieved chunks, run through the local LoRA-adapted base model, and produce an answer. No outbound traffic carries PHI. None. The artifact's manifest declares zero egress; the runtime enforces it.

The vendor never sees a real chart, real medication, or real patient name. The user's device sees all of it. The chain of custody for PHI is the user's device, full stop.
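The "manifest declares zero egress; runtime enforces it" idea is conceptually tiny. A toy sketch, in which `makeRuntime` and the `egress` field are illustrative stand-ins rather than the real runtime's mechanism:

```javascript
// Toy enforcement of a zero-egress manifest: if the manifest declares no
// network access, any outbound call throws before a socket is ever opened.
// The function and field names here are illustrative, not the shipped API.
function makeRuntime(manifest) {
  return {
    fetch(url) {
      if (manifest.egress === "none") {
        throw new Error("egress blocked by manifest: " + url);
      }
      // (networked path would go here for artifacts that declare egress)
    },
  };
}
```

The point is that the check sits below the model: a prompt-injected "call this URL" has nothing to call.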

A clinical compile pipeline, end to end.

Concrete steps. Numbers from a real customer pipeline (anonymized).

Step 1 — De-identify a small seed set.

Take 50-200 representative chart entries. Run them through a HIPAA Safe Harbor de-identifier (the 18 identifiers go; structural information stays). The compiler's seed corpus is now de-identified. This is the only data that ever touches a frontier vendor.
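To make the "identifiers go, structure stays" shape concrete, here is a minimal sketch covering a few of Safe Harbor's 18 identifier classes. A real pipeline would use a dedicated clinical de-identifier; the rules and tokens below are illustrative only:

```javascript
// Minimal Safe Harbor-style scrubbing for a handful of identifier classes
// (dates, phone numbers, emails, MRN-like record numbers). Illustrative:
// a production de-identifier handles all 18 classes, including names.
const RULES = [
  [/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g, "[DATE]"],            // dates
  [/\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b/g, "[PHONE]"],   // phone numbers
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]"],             // email addresses
  [/\bMRN[:# ]*\d+\b/gi, "[MRN]"],                         // record numbers
];

function deidentify(text) {
  return RULES.reduce((t, [pattern, token]) => t.replace(pattern, token), text);
}
```

Structural information (visit order, section headings, clinical vocabulary) survives, which is exactly what the compiler needs from the seed set.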

Step 2 — Write the verifier.

For each task — "summarize the chart in 50 words", "extract medications and dosages", "flag billing-code mismatches" — write 5-10 ideal output examples and a 30-line scoring function. The compiler synthesizes the rest. The verifier is reviewable, version-controlled, and content-hashed.
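A 30-line scoring function for the chart-summary task might look like the sketch below. The function name, weights, and the `ideal` object's fields are hypothetical, not the product's verifier API:

```javascript
// Hypothetical verifier for "summarize the chart in 50 words": scores a
// candidate against an ideal example on length, medication recall, and
// grounding in the source chart. Weights and field names are illustrative.
function scoreSummary(candidate, ideal) {
  const words = candidate.trim().split(/\s+/);
  // 1. Length gate: within 20% of the 50-word target.
  const lengthOk = words.length >= 40 && words.length <= 60 ? 1 : 0;
  // 2. Medication recall: every med named in the ideal must appear.
  const meds = ideal.medications ?? [];
  const found = meds.filter(m => candidate.toLowerCase().includes(m.toLowerCase()));
  const medRecall = meds.length ? found.length / meds.length : 1;
  // 3. Grounding: penalize vocabulary absent from the source chart.
  const chartVocab = new Set(ideal.chartText.toLowerCase().split(/\s+/));
  const novel = words.filter(w => !chartVocab.has(w.toLowerCase())).length;
  const grounded = 1 - Math.min(1, novel / words.length);
  return 0.3 * lengthOk + 0.4 * medRecall + 0.3 * grounded;
}
```

Because the function is plain code, it can be reviewed, version-controlled, and content-hashed like anything else in the repo.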

Step 3 — Compile.

Run kolm compile with the de-identified seed set, your verifier, and a frontier API key. The compiler k-samples the frontier model on each example, scores the candidates with the verifier, picks the winners, and trains a LoRA adapter on top of a 3B-7B clinical-friendly base (BioMedLM, Llama-Med, or a Qwen-2.5 base). 8-25 minutes, depending on corpus size.

Step 4 — Sign the artifact.

The compiler emits a .kolm artifact: the base model + LoRA + recipe pack + verifier + held-out tests + manifest + HMAC signature. The K-score is on the cover; below 0.70, the artifact does not ship.

Step 5 — Ship to the device.

The artifact is a single file. Distribute it via your MDM, your IT package manager, or a signed download. The user opens it with kolm run artifact.kolm or via the Kolm desktop runtime. On first run, the artifact attaches to a local on-device patient corpus (which the user — or your ops team — has already loaded into a local sqlite-vec via kolm recall ingest). All subsequent inference is local.

Step 6 — Recompile when the verifier or seed set changes.

Drift is recompile-driven. New tasks, new edge cases, new de-identified examples — they go into the next compile. The compiler caches what it can. The frontier vendor never sees a new piece of PHI; it only ever sees de-identified seeds and synthetic verifier challenges.

| property | value |
| --- | --- |
| vendor PHI exposure | 0 records |
| local PHI scope | user's device only |
| inference egress | 0 bytes |
| compile data | de-identified seeds |
| signature | HMAC-SHA256 chain |
| K-score gate | ≥ 0.70 |

Properties of a compiled clinical artifact, as enforced by manifest + runtime.

The audit story: what the OCR investigator actually wants.

If your covered entity is investigated by HHS Office for Civil Rights, the conversation will not be about the model's accuracy. It will be about the chain of custody of PHI: every disclosure, every safeguard, every access log. A clinician using a frontier API has a conversation that goes: "we send the chart text to the vendor, here is the BAA, here is the encryption-in-transit attestation, here is the data-retention contract addendum." Each link in that chain is defensible, but each one widens the surface area.

A clinician using a compiled .kolm has a much shorter conversation. "PHI never left the device. Here is the artifact's manifest declaring zero outbound egress. Here is the compile-time receipt chain proving the model was trained on de-identified data. Here is the K-score and the held-out test set. Here is the device-level audit log." Five sentences. Every one of them is verifiable.

| property | frontier API + BAA | compiled .kolm |
| --- | --- | --- |
| PHI leaves device | yes (every call) | never |
| BAA required | vendor + sub-processors | none |
| retention exposure | 30-90 day vendor window | device-controlled |
| offline mode | degraded or unavailable | primary mode |
| cost per call | ~$0.005-0.05 | $0.00 |
| vendor lock-in | high | artifact is portable |
| compile-time seed PHI | — | de-identified only |
| auditable artifact | no | HMAC-chained manifest |

Worked example: a chart-summary assistant.

An ambulatory clinic compiles "chart-summary in 50 words, hedge-aware, medications-listed". The team writes 12 ideal summaries against 12 de-identified charts. The verifier (auto-synthesized then hand-tightened) checks length, medication-name matching against a reference list, and absence of unsupported claims.

kolm compile "summarize chart in 50 words, hedge-aware" \
  --examples ./deidentified-charts/ \
  --verifier ./summary-verifier.js \
  --base qwen2.5-7b \
  --teacher claude-opus-4-7 \
  --k 8 \
  --out chart-summary.kolm

# 14 minutes later, the artifact lands.
# K-score: 0.82, size: 2.1 GB, base: qwen2.5-7b-int4, teacher: claude-opus-4-7

# Distribute via MDM. On the clinician's laptop:
kolm run chart-summary.kolm --recall ./local-charts/

The clinician's local-charts/ directory is the patient corpus. kolm recall ingest embeds and indexes it into a sqlite-vec inside the artifact's data dir. From that point on, the assistant answers questions like "summarize Mr. Jones's last visit" by retrieving from the local index, running the LoRA-adapted base, and emitting a 50-word summary — entirely on-device. Audit trail is the local ~/.kolm/log/audit.jsonl, signed and append-only.
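The retrieval step is ordinary nearest-neighbor search over embedded chunks. A stand-in sketch with plain arrays (the real runtime stores vectors in sqlite-vec; `topK` and the chunk shape here are illustrative):

```javascript
// Stand-in for the local retrieval step: rank embedded chart chunks by
// cosine similarity to the query embedding and keep the top k. sqlite-vec
// does the same thing at scale inside SQLite; this shows only the math.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(queryVec, chunks, k) {
  return chunks
    .map(c => ({ ...c, score: cosine(queryVec, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The retrieved chunks are then fed to the LoRA-adapted base as context; nothing in this path touches the network.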

What runs on the device, and what doesn't.

On the device: the base model (1.5-4 GB INT4 weights), the LoRA (10-100 MB), the recall index (sqlite-vec, sized by corpus), the recipe pack (5-20 MB), the verifier (small JS), and the audit log. Inference uses llama.cpp + sqlite-vec — both open-source, both portable across macOS, Linux, Windows, iOS, Android.

Not on the device: the frontier API key (it lived in the compile-time clean room and never reaches the device), the original training data (it stays on the compile environment's storage), and any vendor sub-processor relationship (there is no vendor in the runtime pipeline).

FAQ.

Is this the same as "running an open-source model locally"?

No. Off-the-shelf open-source models are not tuned to your task and will produce unreliable clinical output. The compile step is what closes the gap: a 7B-class base behaves like a frontier model on your task because the frontier model was its teacher during distillation.

What about HHS's emerging AI rule?

HHS's 2026 Health AI rule, like the equivalent state acts, emphasizes transparency, auditability, and minimum disclosure. A compiled .kolm directly satisfies the minimum-disclosure standard (zero PHI egress) and provides the auditable receipt chain that transparency requires. We track regulatory developments at /healthcare.

What if my clinical team needs a model bigger than the device can run?

This is the core insight: a 70B model on the patient's phone is a fantasy, but a 7B compiled artifact whose behavior on your specific clinical tasks matches a 70B is achievable, because the 70B was its teacher. Read the compile primer →

What about prompt-injection attacks against the device-local model?

Prompt-injection still exists, the same way SQL injection still exists in any software product that takes user input. The compile-time verifier catches a meaningful fraction (it gates outputs by structure, length, and reference-vocabulary match). For higher-assurance deployments, the recipe pack is restricted to approved prefix-shapes; out-of-distribution prompts trigger a fallback to a more conservative branch.
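A prefix-shape gate can be as simple as an allowlist of prompt patterns with a conservative fallback. The patterns and routing below are hypothetical, not the shipped recipe pack:

```javascript
// Illustrative prefix-shape gate: prompts matching an approved pattern go
// to the main recipe branch; anything out-of-distribution is routed to a
// conservative fallback. Patterns and branch names are hypothetical.
const APPROVED = [
  /^summarize (the )?chart/i,
  /^extract medications/i,
  /^flag billing/i,
];

function route(prompt) {
  return APPROVED.some(p => p.test(prompt.trim())) ? "main" : "conservative";
}
```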

Can I get a BAA from Kolmogorov?

Enterprise customers do, but the surface is much smaller: only the compile-time orchestrator handles your de-identified seeds. The runtime artifact has no relationship to Kolmogorov at all once it ships to the device. We are happy to walk you through the security architecture during procurement.