Same SDK. Local endpoint.
A .kolm artifact serves the OpenAI Chat Completions schema on a local port. Change the base_url in your existing client, leave the rest of your code alone. The receipt your code already discards comes back signed.
Two snippets. One difference.
Cloud Chat Completions
Your existing OpenAI client. Bytes cross the public internet to OpenAI's servers; every prompt and response is logged by a third party. Cost scales linearly with usage.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
```
Local Chat Completions
Same SDK. The base_url points at kolm serve --http phi-redactor.kolm running on localhost:8765. No keys, no quotas, no remote logs. The response carries a signed receipt header.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8765/v1",
    api_key="not-used",
)
resp = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
```
What works under the OpenAI schema, and what does not yet.
| Endpoint | Status | Notes |
|---|---|---|
| /v1/chat/completions | ok | full schema. system / user / assistant / tool roles. temperature, top_p, max_tokens, stop sequences, seed honored. |
| /v1/chat/completions (stream) | ok | SSE with the same data: {...} framing. finish_reason emitted on the last chunk. |
| tool_calls / function_call | ok | Constrained decoding via union-schema grammar. tool_choice maps to a hard constraint when set. |
| response_format = json_object | ok | JSON output guaranteed via grammar-constrained sampling, not retry-on-parse. |
| response_format = json_schema | ok | Structured outputs. The schema compiles to a regular expression that the sampler enforces token by token. |
| logprobs / top_logprobs | ok | Returned on every choice when the runtime exposes them (vLLM, transformers). |
| /v1/embeddings | ok | Served from a paired *.embed.kolm when present. Matryoshka truncation honored via dimensions. |
| /v1/models | ok | Lists every .kolm under ~/.kolm/artifacts/. Model id is the artifact filename. |
| /v1/audio/transcriptions | partial | Available when the artifact ships a Whisper LoRA adapter. response_format=verbose_json returns word timestamps. |
| /v1/images/generations | no | Out of scope. kolm is a fine-tuning + serving stack for language and audio; image generation is a different runtime. |
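To make the json_schema row concrete, here is a minimal sketch against the phi-redactor.kolm artifact from the snippets above. The schema itself (a redaction object with redacted_text and entities fields) is invented for illustration; any schema the standard OpenAI structured-outputs shape accepts would compile to the same token-level constraint.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# Standard OpenAI structured-outputs request shape; the schema is illustrative.
resp = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "redaction",
            "schema": {
                "type": "object",
                "properties": {
                    "redacted_text": {"type": "string"},
                    "entities": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["redacted_text", "entities"],
            },
        },
    },
)

# Grammar-constrained sampling means this parse cannot fail; no retry loop.
result = json.loads(resp.choices[0].message.content)
print(result["redacted_text"])
```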
Three guarantees that the OpenAI schema gives you for free.
The schema is a contract
OpenAI publishes the request/response shape. We implement it byte-for-byte against the latest schema (2026-04). If a client deserializes a cloud response, it deserializes ours.
Streaming framing is portable
SSE is just `data:` JSON deltas with a `[DONE]` sentinel. The wire format does not depend on the model behind it. Same library, same code path.
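A minimal sketch of the consuming side, pointed at the local endpoint from the snippets above; the same loop runs unchanged against the cloud, because the SDK owns the framing.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# stream=True turns the response into an iterator over SSE chunks; the SDK
# strips the `data:` framing and stops at the [DONE] sentinel for us.
stream = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```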
The receipt is additive
An x-kolm-receipt header rides every response. Existing clients ignore unknown headers; kolm-aware clients pin the artifact CID into their audit log.
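A sketch of what a kolm-aware client might do, using the OpenAI SDK's with_raw_response accessor (a stock SDK feature, not a kolm extension) to reach the header. What the receipt contains and how its CID is pinned is left opaque here.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# with_raw_response exposes the HTTP headers; .parse() still returns the
# ordinary completion object, so the happy path is unchanged.
raw = client.chat.completions.with_raw_response.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
receipt = raw.headers.get("x-kolm-receipt")  # absent from a cloud response
completion = raw.parse()
print(completion.choices[0].message.content)
print(receipt)  # opaque here; a kolm-aware client pins the CID it carries
```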
From OpenAI to a local endpoint in an afternoon.
1. Capture your traffic for a week
Point the OpenAI SDK at the kolm capture proxy. Real prompts get tagged, anonymized, and stored locally; the cloud call still goes through, so production behavior is unchanged.
```bash
export OPENAI_BASE_URL=http://localhost:8765/v1/capture
kolm capture --provider openai --as your-task
```
2. Compile a .kolm from the capture
Distill the captured pairs into a signed artifact. The default base is Qwen2.5-3B-Instruct; the compiler picks a smaller model if the task pattern allows it.
```bash
kolm distill --namespace your-task --out your-task.kolm
```
3. Serve it locally
One command boots an OpenAI-compatible HTTP server on the artifact. Speculative decoding runs through the artifact's declared draft model; the KV cache drops to FP8 on Hopper and Blackwell GPUs.

```bash
kolm serve --http your-task.kolm --port 8765
```
4. Repoint your SDK
One line in your client. The `api_key` field is required by the SDK but unused by the local endpoint; pass any non-empty string.

```python
client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
```
5. Diff against the cloud baseline
Run the kolm evaluator against your eval set with both endpoints. The local artifact ships only if the K-score meets the gate the cloud baseline cleared.
```bash
kolm eval your-task.kolm --baseline openai:gpt-4o-mini
```
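Before the full evaluator run, a single-prompt spot-check of the two endpoints costs a few lines. A sketch, assuming the your-task.kolm artifact from the steps above; temperature=0 plus a fixed seed narrows, but does not eliminate, nondeterminism on the cloud side, so the kolm eval K-score gate stays the real criterion.

```python
import os

from openai import OpenAI

PROMPT = [{"role": "user", "content": "redact PHI: John Doe MRN 8847"}]

cloud = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
local = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

def ask(client: OpenAI, model: str) -> str:
    # Same request against either endpoint; only the client differs.
    resp = client.chat.completions.create(
        model=model, messages=PROMPT, temperature=0, seed=7
    )
    return resp.choices[0].message.content

cloud_out = ask(cloud, "gpt-4o-mini")
local_out = ask(local, "your-task.kolm")
print("match" if cloud_out == local_out else f"cloud:\n{cloud_out}\nlocal:\n{local_out}")
```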
The endpoint your code already knows.
Migration is not a rewrite. It is a base_url change and an honest eval. Capture for a week, distill into a signed artifact, point your client at localhost:8765, and confirm the K-score holds before you cut over. If it does not, the cloud endpoint stays in place. The receipt is additive on every call either way.