kolm  /  tutorials  /  OpenAI drop-in

Drop-in replace OpenAI in 14 minutes.

Point chat.completions at a local kolm endpoint. Same request shape, same response shape. We measure the cost and latency delta as you go. By the end your code runs against a compiled, signed artifact, fully offline, and about 8× faster (78 ms → 9.3 ms median).

Runtime: 14 min · Endpoint: /v1/chat/completions · Cost delta: -95% · Latency delta: -88%

Step 1 · 90 seconds

Pick a task in your codebase that calls OpenAI.

Grep your repo for `chat.completions.create`. Pick the simplest one for the first compile: a classifier, a summarizer, a tagger.

$ grep -nR "chat.completions.create" src/

src/triage.py:34:    resp = client.chat.completions.create(
src/triage.py:35:        model="gpt-4o",
src/triage.py:36:        messages=[{"role":"system","content":"You triage support tickets..."}, ...])
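If it helps to see the shape you're hunting for, here is a minimal sketch of such a call site; the helper name, system prompt, and label set are illustrative, not from this tutorial:

```python
# Hypothetical call-site shape for step 1; the prompt and labels are
# illustrative, not part of the tutorial.
LABELS = ["low", "normal", "high", "urgent"]

def build_triage_request(ticket_text: str) -> dict:
    """Build an OpenAI-style chat.completions payload for ticket triage."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system",
             "content": "You triage support tickets. "
                        f"Reply with one of: {', '.join(LABELS)}."},
            {"role": "user", "content": ticket_text},
        ],
    }

req = build_triage_request("Site is down for all users!")
```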

Step 2 · 60 seconds

Describe the task to kolm.

$ kolm compile "triage support tickets by urgency; output low/normal/high/urgent" \
    --base qwen2.5-7b --target-k 0.95

Compile plan: SFT + DPO + constrained-decoder
  K target: 0.95   estimated cost: $1.20   time: 8 min

Step 3 · 8 minutes

Compile, then serve as OpenAI-compatible.

The CLI ships an OpenAI-compatible server. Same wire format.

$ kolm serve --http triage.kolm

  k o l m
  ─────── the private AI compiler
serving triage.kolm
  endpoint: http://127.0.0.1:8080/v1/chat/completions
  K-score: 0.961
  CID:     cidv1:sha256:8a3...
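Before touching your codebase, you can smoke-test the endpoint over plain HTTP with only the standard library. A sketch; the ticket text is made up, and the server from `kolm serve` must be running for `smoke_test` to succeed:

```python
import json
import urllib.request

# Endpoint printed by `kolm serve` above.
ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"

def chat_body(ticket: str) -> bytes:
    """Standard OpenAI chat-completions request body.
    The `model` field is ignored by kolm (see step 4)."""
    return json.dumps({
        "model": "ignored",
        "messages": [{"role": "user", "content": ticket}],
    }).encode()

def smoke_test(ticket: str) -> dict:
    """POST one request to the local endpoint; requires the server to be up."""
    req = urllib.request.Request(
        ENDPOINT,
        data=chat_body(ticket),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```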

Step 4 · 60 seconds

Change one line in your code.

The OpenAI SDK lets you point at any base_url. That's the whole switch.

# before
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# after
client = OpenAI(
    api_key="local",
    base_url="http://127.0.0.1:8080/v1",
)
Checkpoint
Your code now calls the local artifact. No other changes. The `model` field is ignored; the loaded artifact is the model.
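If you want to flip between the local artifact and the hosted API without editing code again, one common pattern is to drive `base_url` from the environment. A sketch; the `KOLM_BASE_URL` variable name is our own convention, not part of kolm:

```python
import os

def client_kwargs() -> dict:
    """Kwargs for OpenAI(...): the local kolm endpoint when KOLM_BASE_URL
    is set, the hosted default otherwise. The env var name is illustrative."""
    base_url = os.environ.get("KOLM_BASE_URL")
    if base_url:
        # The kolm server ignores the API key, but the SDK requires one.
        return {"api_key": "local", "base_url": base_url}
    return {"api_key": os.environ["OPENAI_API_KEY"]}

# client = OpenAI(**client_kwargs())
```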

Step 5 · 60 seconds

Measure the delta.

The CLI ships a `bench` subcommand that compares a kolm endpoint with any OpenAI-compatible endpoint over the same 200 inputs.

$ kolm bench triage.kolm \
    --against openai \
    --inputs ./fixtures/tickets.jsonl

                kolm        openai-gpt-4o    delta
  p50           9.3 ms      78 ms            -88%
  p99           14.1 ms     310 ms           -95%
  $ / 1M tok    $0.13       $2.50            -95%
  accuracy      0.961       0.954            +0.7%
  What changed                      Before     After
  Lines of code modified            —          2
  Token bill per month (10M tok)    $25.00     $1.30
  Median latency                    78 ms      9.3 ms
  Audit trail                       opaque     receipt JSON
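The delta column is plain relative change against the baseline; the arithmetic on the bench numbers above checks out:

```python
def delta_pct(kolm: float, baseline: float) -> int:
    """Relative change of kolm vs the baseline, as a rounded percent."""
    return round((kolm - baseline) / baseline * 100)

assert delta_pct(9.3, 78) == -88      # median latency
assert delta_pct(14.1, 310) == -95    # p99 latency
assert delta_pct(0.13, 2.50) == -95   # $ per 1M tokens
assert delta_pct(0.961, 0.954) == 1   # accuracy, ~ +0.7% before rounding
```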

Step 6 · optional

Add receipts to your audit log.

Every response from the kolm endpoint includes a receipt CID. Log it next to your existing request log and you can re-verify the model output months later.

resp = client.chat.completions.create(...)
log.info("triage", request_id=req_id, receipt_cid=resp.kolm["receipt_cid"])