OpenAI migration · drop-in base_url swap

Same SDK. Local endpoint.

A .kolm artifact serves the OpenAI Chat Completions schema on a local port. Change the base_url in your existing client and leave the rest of your code alone. Every response also carries a signed receipt in a header your existing code already ignores.

Endpoint · /v1/chat/completions
Streaming · SSE, byte-for-byte
Tools · function_call + tool_calls
Bytes leave host · zero
The one-line change

Two snippets. One difference.

Before · api.openai.com

Cloud Chat Completions

Your existing OpenAI client. Bytes cross the public internet to OpenAI's servers; every prompt and response is logged by a third party. Cost scales linearly with usage.

import os

from openai import OpenAI

client = OpenAI(
  api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[{"role":"user","content":"redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
$0.0006 / 1K tokens · ~120 ms TTFB · logged remotely
After · local .kolm

Local Chat Completions

Same SDK. The base_url points at kolm serve --http phi-redactor.kolm running on localhost:8765. No keys, no quotas, no remote logs. The response carries a signed receipt header.

from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8765/v1",
  api_key="not-used",
)
resp = client.chat.completions.create(
  model="phi-redactor.kolm",
  messages=[{"role":"user","content":"redact PHI: John Doe MRN 8847"}],
)
print(resp.choices[0].message.content)
$0 / token · ~8 ms TTFB on M3 · receipt on every call
Endpoint compatibility

What works under the OpenAI schema, what does not yet.

Endpoint · Status · Notes
/v1/chat/completions · ok · Full schema. system / user / assistant / tool roles. temperature, top_p, max_tokens, stop sequences, seed honored.
/v1/chat/completions (stream) · ok · SSE with the same data: {...} framing. finish_reason emitted on the last chunk.
tool_calls / function_call · ok · Constrained decoding via union-schema grammar. tool_choice maps to a hard constraint when set.
response_format = json_object · ok · JSON output guaranteed via grammar-constrained sampling, not retry-on-parse.
response_format = json_schema · ok · Structured outputs. The schema compiles to a regular expression that the sampler enforces token by token. See the sketch after this table.
logprobs / top_logprobs · ok · Returned on every choice when the runtime exposes them (vLLM, transformers).
/v1/embeddings · ok · Served from a paired *.embed.kolm when present. Matryoshka truncation honored via dimensions.
/v1/models · ok · Lists every .kolm under ~/.kolm/artifacts/. Model id is the artifact filename.
/v1/audio/transcriptions · partial · Available when the artifact ships a Whisper LoRA adapter. response_format=verbose_json returns word timestamps.
/v1/images/generations · no · Out of scope. kolm is a fine-tuning + serving stack for language and audio; image generation is a different runtime.
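
A minimal sketch of the structured-output row above, reusing the local phi-redactor.kolm endpoint from the earlier snippets; the schema name and fields here are illustrative, not something the artifact defines.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

# response_format=json_schema: the server constrains sampling so the output
# parses against this schema. The schema itself is made up for illustration.
resp = client.chat.completions.create(
  model="phi-redactor.kolm",
  messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
  response_format={
    "type": "json_schema",
    "json_schema": {
      "name": "redaction",
      "schema": {
        "type": "object",
        "properties": {
          "redacted_text": {"type": "string"},
          "entities_removed": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["redacted_text", "entities_removed"],
      },
    },
  },
)
print(resp.choices[0].message.content)  # valid JSON by construction
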
Why this works

Three guarantees that the OpenAI schema gives you for free.

01

The schema is a contract

OpenAI publishes the request/response shape. We implement it byte-for-byte against the latest schema (2026-04). If a client deserializes a cloud response, it deserializes ours.
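
One way to see the contract directly is a sketch that bypasses the SDK entirely; the URL and model name are copied from the local example above, and no auth header is needed.

import requests

# The same JSON body the OpenAI SDK would serialize, posted by hand.
body = {
  "model": "phi-redactor.kolm",
  "messages": [{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
  "temperature": 0,
}
r = requests.post("http://localhost:8765/v1/chat/completions", json=body, timeout=30)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])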

02

Streaming framing is portable

SSE is just data: lines carrying JSON deltas, closed by a [DONE] sentinel. The wire format does not depend on the model behind it. Same library, same code path.
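
A minimal streaming sketch, assuming the same local endpoint and artifact as the snippets above; the only difference from the cloud path is the base_url.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

# stream=True yields SSE chunks; each chunk carries a content delta.
stream = client.chat.completions.create(
  model="phi-redactor.kolm",
  messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
  stream=True,
)
for chunk in stream:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="", flush=True)
print()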

03

The receipt is additive

An x-kolm-receipt header rides every response. Existing clients ignore unknown headers; kolm-aware clients pin the artifact CID into their audit log.
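
One way to pick the receipt up without changing the call site much is the SDK's raw-response hook; the header name comes from the text above, the rest of this sketch is an assumption.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

# with_raw_response exposes the HTTP headers alongside the parsed body.
raw = client.chat.completions.with_raw_response.create(
  model="phi-redactor.kolm",
  messages=[{"role": "user", "content": "redact PHI: John Doe MRN 8847"}],
)
receipt = raw.headers.get("x-kolm-receipt")  # None if the server sent no receipt
completion = raw.parse()  # the usual ChatCompletion object

print(completion.choices[0].message.content)
if receipt:
  print("receipt:", receipt)  # e.g. append it to an audit log next to the request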

Migrate in five steps

From OpenAI to a local endpoint in an afternoon.

  1. Capture your traffic for a week

    Point the OpenAI SDK at the kolm capture proxy. Real prompts get tagged, anonymized, and stored locally; the cloud call still goes through, so production behavior is unchanged.

    export OPENAI_BASE_URL=http://localhost:8765/v1/capture
    kolm capture --provider openai --as your-task
  2. Compile a .kolm from the capture

    Distill the captured pairs into a signed artifact. The default base is Qwen2.5-3B-Instruct; the compiler picks a smaller model if the task pattern allows it.

    kolm distill --namespace your-task --out your-task.kolm
  3. Serve it locally

    One command boots an OpenAI-compatible HTTP server on the artifact. Speculative decoding via the artifact's declared draft model; FP8 KV cache on Hopper or Blackwell.

    kolm serve --http your-task.kolm --port 8765
  4. Repoint your SDK

    One line in your client. The api_key field is required by the SDK but unused by the local endpoint; pass any non-empty string.

    client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
  5. Diff against the cloud baseline

    Run the kolm evaluator against your eval set with both endpoints. The local artifact ships only if the K-score meets the same gate the cloud baseline cleared. A quick manual spot-check is sketched after this list.

    kolm eval your-task.kolm --baseline openai:gpt-4o-mini
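
As a quick manual spot-check before (not instead of) kolm eval, a sketch that sends one prompt to both endpoints and prints the answers side by side; the model names and port are copied from the steps above, and your-task.kolm stands in for your artifact.

import os

from openai import OpenAI

PROMPT = "redact PHI: John Doe MRN 8847"

cloud = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
local = OpenAI(base_url="http://localhost:8765/v1", api_key="local")

def ask(client, model):
  # temperature=0 keeps both endpoints as deterministic as possible
  resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,
  )
  return resp.choices[0].message.content

print("cloud :", ask(cloud, "gpt-4o-mini"))
print("local :", ask(local, "your-task.kolm"))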

The endpoint your code already knows.

Migration is not a rewrite. It is a base_url change and an honest eval. Capture for a week, distill into a signed artifact, point your client at localhost:8765, and confirm the K-score holds before you cut over. If it does not, the cloud endpoint stays in place. The receipt is additive on every call either way.