Same SDK. Local endpoint.
A .kolm artifact serves the OpenAI Chat Completions schema on a local port. Change the base_url in your existing client, leave the rest of your code alone. The receipt your code already discards comes back signed.
Two snippets. One difference.
Cloud Chat Completions
Your existing OpenAI client. Bytes cross the public internet to OpenAI's servers; every prompt and response is logged by a third party. Cost scales linearly with usage.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
```
Local Chat Completions
Same SDK. The base_url points at kolm serve --http phi-redactor.kolm running on localhost:8765. No keys, no quotas, no remote logs. The response carries a signed receipt header.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8765/v1",
    api_key="not-used",
)
resp = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
```
What works under the OpenAI schema, and what does not yet.
| Endpoint | Status | Notes |
|---|---|---|
| /v1/chat/completions | ok | full schema. system / user / assistant / tool roles. temperature, top_p, max_tokens, stop sequences, seed honored. |
| /v1/chat/completions (stream) | ok | SSE with the same data: {...} framing. finish_reason emitted on the last chunk. |
| tool_calls / function_call | ok | Constrained decoding via union-schema grammar. tool_choice maps to a hard constraint when set. |
| response_format = json_object | ok | JSON output guaranteed via grammar-constrained sampling, not retry-on-parse. |
| response_format = json_schema | ok | Structured outputs. The schema compiles to a regular expression that the sampler enforces token by token. |
| logprobs / top_logprobs | ok | Returned on every choice when the runtime exposes them (vLLM, transformers). |
| /v1/embeddings | ok | Served from a paired *.embed.kolm when present. Matryoshka truncation honored via dimensions. |
| /v1/models | ok | Lists every .kolm under ~/.kolm/artifacts/. Model id is the artifact filename. |
| /v1/audio/transcriptions | partial | Available when the artifact ships a Whisper LoRA adapter. response_format=verbose_json returns word timestamps. |
| /v1/images/generations | no | Out of scope. kolm is a fine-tuning + serving stack for language and audio; image generation is a different runtime. |
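To make the json_schema row concrete, here is a minimal sketch against the phi-redactor.kolm artifact from the snippets above. The schema itself (a redaction object with redacted_text and entities fields) is invented for illustration; any schema the standard OpenAI structured-outputs shape accepts would compile to the same token-level constraint.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# Standard OpenAI structured-outputs request shape; the schema is illustrative.
resp = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "redaction",
            "schema": {
                "type": "object",
                "properties": {
                    "redacted_text": {"type": "string"},
                    "entities": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["redacted_text", "entities"],
            },
        },
    },
)

# Grammar-constrained sampling means this parse cannot fail; no retry loop.
result = json.loads(resp.choices[0].message.content)
print(result["redacted_text"])
```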
Three guarantees that the OpenAI schema gives you for free.
The schema is a contract
OpenAI publishes the request/response shape. We implement it byte-for-byte against the latest schema (2026-04). If a client deserializes a cloud response, it deserializes ours.
Streaming framing is portable
SSE is just `data:` JSON deltas with a `[DONE]` sentinel. The wire format does not depend on the model behind it. Same library, same code path.
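A minimal sketch of the consuming side, pointed at the local endpoint from the snippets above; the same loop runs unchanged against the cloud, because the SDK owns the framing.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# stream=True turns the response into an iterator over SSE chunks; the SDK
# strips the `data:` framing and stops at the [DONE] sentinel for us.
stream = client.chat.completions.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```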
The receipt is additive
An x-kolm-receipt header rides every response. Existing clients ignore unknown headers; kolm-aware clients pin the artifact CID into their audit log.
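A sketch of what a kolm-aware client might do, using the OpenAI SDK's with_raw_response accessor (a stock SDK feature, not a kolm extension) to reach the header. What the receipt contains and how its CID is pinned is left opaque here.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-used")

# with_raw_response exposes the HTTP headers; .parse() still returns the
# ordinary completion object, so the happy path is unchanged.
raw = client.chat.completions.with_raw_response.create(
    model="phi-redactor.kolm",
    messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
receipt = raw.headers.get("x-kolm-receipt")  # absent from a cloud response
completion = raw.parse()
print(completion.choices[0].message.content)
print(receipt)  # opaque here; a kolm-aware client pins the CID it carries
```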
From OpenAI to a local endpoint in an afternoon.
1. Capture your traffic for a week
Point the OpenAI SDK at the kolm capture proxy. Real prompts get tagged, anonymized, and stored locally; the cloud call still goes through, so production behavior is unchanged.
```bash
export OPENAI_BASE_URL=http://localhost:8765/v1/capture
kolm capture --provider openai --as your-task
```
2. Compile a .kolm from the capture
Distill the captured pairs into a signed artifact. The default base is Qwen2.5-3B-Instruct; the compiler picks a smaller model if the task pattern allows it.
```bash
kolm distill --namespace your-task --out your-task.kolm
```
3. Serve it locally
One command boots an OpenAI-compatible HTTP server on the artifact. Speculative decoding runs through the artifact's declared draft model; the KV cache drops to FP8 on Hopper and Blackwell GPUs.

```bash
kolm serve --http your-task.kolm --port 8765
```
4. Repoint your SDK
One line in your client. The `api_key` field is required by the SDK but unused by the local endpoint; pass any non-empty string.

```python
client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
```
5. Diff against the cloud baseline
Run the kolm evaluator against your eval set with both endpoints. The local artifact ships only if the K-score meets the gate the cloud baseline cleared.
```bash
kolm eval your-task.kolm --baseline openai:gpt-4o-mini
```
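Before the full evaluator run, a single-prompt spot-check of the two endpoints costs a few lines. A sketch, assuming the your-task.kolm artifact from the steps above; temperature=0 plus a fixed seed narrows, but does not eliminate, nondeterminism on the cloud side, so the kolm eval K-score gate stays the real criterion.

```python
import os

from openai import OpenAI

PROMPT = [{"role": "user", "content": "redact PHI: John Doe MRN 8847"}]

cloud = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
local = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

def ask(client: OpenAI, model: str) -> str:
    # Same request against either endpoint; only the client differs.
    resp = client.chat.completions.create(
        model=model, messages=PROMPT, temperature=0, seed=7
    )
    return resp.choices[0].message.content

cloud_out = ask(cloud, "gpt-4o-mini")
local_out = ask(local, "your-task.kolm")
print("match" if cloud_out == local_out else f"cloud:\n{cloud_out}\nlocal:\n{local_out}")
```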
The endpoint your code already knows.
Migration is not a rewrite. It is a base_url change and an honest eval. Capture for a week, distill into a signed artifact, point your client at localhost:8765, and confirm the K-score holds before you cut over. If it does not, the cloud endpoint stays in place. The receipt is additive on every call either way.