The job each one does
LangSmith answers "what happened in this LLM call and how do I fix it?" It catches a regression in production, surfaces a slow span, lets you re-run a trace with a different prompt, scores outputs against a rubric, ships the better prompt. Every team running LLMs in production needs something in this lane, and LangSmith is the most polished answer.
kolm answers "given enough of these calls, can the smaller model take over?" The captured pairs aren't just for debug; they're labelled training data. The verifier catches the same regressions LangSmith catches but uses them to filter the training set, not just alert a human. The output isn't a fixed prompt; it's a smaller model that has learned the behavior.
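To make "the verifier becomes the quality gate" concrete, here is a minimal sketch of the filtering idea. Everything in it is illustrative: the `CapturedPair` shape, the `verify()` scorer, and the 0.8 cutoff are assumptions for this page, not kolm's actual internals.

```ts
// Illustrative only: hypothetical shapes, not kolm's real API.
interface CapturedPair {
  prompt: string;   // what the app sent to the frontier model
  response: string; // what the frontier model returned
}

// A verifier scores each captured pair. Where LangSmith would alert a human
// on a low score, kolm drops the pair from the training set instead.
async function buildTrainingSet(
  pairs: CapturedPair[],
  verify: (p: CapturedPair) => Promise<number>, // assumed 0..1 rubric score
  threshold = 0.8,                              // assumed cutoff
): Promise<CapturedPair[]> {
  const scored = await Promise.all(
    pairs.map(async (p) => ({ p, score: await verify(p) })),
  );
  return scored.filter(({ score }) => score >= threshold).map(({ p }) => p);
}
```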
You can run them at the same time on the same traffic. They consume the same trace stream and produce non-overlapping deliverables.
Where LangSmith wins
Honest concession: LangSmith is the better human-in-the-loop tool. The trace UI is mature, the eval harness handles complex multi-step agents, the cost analytics are real, the LangChain integration is native. If your team's bottleneck is "we can't see what the agent is doing in production," LangSmith is the right answer and we're not trying to compete on that surface.
kolm captures less detail per trace and never tries to be a debug surface. We expose /v1/bridges/observations for sanity checks and that's it. The capture is task-shaped, not engineer-shaped.
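A sanity check looks something like the sketch below. The `/v1/bridges/observations` path is the endpoint named above; the base URL, API-key header, `namespace` query parameter, and response shape are assumptions made for the example.

```ts
// Peek at what kolm has captured for a namespace (illustrative request shape).
const res = await fetch(
  "https://kolm.ai/v1/bridges/observations?namespace=support",
  { headers: { Authorization: `Bearer ${process.env.KOLM_API_KEY}` } },
);
const observations = await res.json();
console.log(observations.length, "captured pairs in namespace 'support'");
```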
Side-by-side
| | LangSmith | kolm |
|---|---|---|
| What it is | LLM observability + eval platform | Capture-then-compile to portable artifact |
| Capture surface | SDK trace ingestion (Python, JS) | Drop-in proxy for OpenAI + Anthropic |
| Trace UI | first-class - flame graph, span detail, replay | basic - observation list per namespace |
| Eval harness | yes - LLM judge + custom evaluators | yes - K-score on held-out test set |
| Output | A better prompt or chain config | A signed .kolm file (≤3 GB) |
| Trains a model | no - prompt + chain optimization only | yes - distillation + LoRA from captured pairs |
| Runs offline | no - hosted dashboard | yes - the artifact runs anywhere |
| Receipts / signing | no - traces, not artifacts | HMAC-SHA256 receipt chain on every output |
| Pricing model | Per trace ingested + retained | Flat per compile, then $0 marginal inference |
| Compose with the other | yes - dual-write traces | yes - dual-write captures |
When to use LangSmith
Use LangSmith when the question is about the calls themselves - debug, replay, prompt iteration, cost analytics, eval. Anything where a human needs to look at a trace and decide what to change.
```ts
// trace every call, browse them, score them
import { Client } from "langsmith";

const client = new Client();
// LangChain auto-traces; non-LangChain code wraps manually with traceable
// (in the Python SDK, the @traceable decorator)
```
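For the manual path, the JS SDK's `traceable` wrapper records a function's inputs, outputs, and latency as a run. A minimal sketch, assuming `LANGSMITH_API_KEY` and tracing are enabled in the environment; the `summarizeTicket` function and its body are placeholders, not part of either product:

```ts
import { traceable } from "langsmith/traceable";

// Wrap any async function; LangSmith records each call as a traced run.
const summarizeTicket = traceable(
  async (ticket: string): Promise<string> => {
    // ... call your LLM provider here and return its summary
    return `summary of: ${ticket.slice(0, 40)}`;
  },
  { name: "summarize-ticket" },
);

await summarizeTicket("Customer cannot reset their password ...");
```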
When to use kolm
Use kolm when the question is about the model - "can a smaller, signed, portable model do this task as well as the frontier?" The captured pairs become labelled training data; the verifier becomes the quality gate.
```bash
# point traffic at the kolm capture proxy
export ANTHROPIC_BASE_URL=https://kolm.ai/v1/capture/anthropic

# after enough pairs, compile
kolm compile "summarize support tickets" \
  --namespace support \
  --base qwen2.5-7b
# → ok  wrote support-summarize.kolm  k_score=0.89  signature=hmac-sha256
```
Can I use both?
Yes - and for most production teams the right move is to run both. LangSmith on the inbound side for visibility into the frontier calls; kolm on the same calls (or a sampled subset) to accumulate captures into a compilable namespace. When a namespace hits the threshold you compile it, sub in the .kolm for the cheap path, and let LangSmith keep watching the long-tail traffic that still goes to the frontier.
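The two hooks don't conflict because they sit at different layers: kolm captures at the HTTP layer via the base URL, LangSmith traces at the function layer. A sketch of the wiring, assuming the Anthropic SDK and the capture URL from the table above; the trace name, model string, and prompt are arbitrary examples, and API keys are read from the environment:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { traceable } from "langsmith/traceable";

// kolm side: same request/response pairs captured by routing through the proxy.
const anthropic = new Anthropic({
  baseURL: "https://kolm.ai/v1/capture/anthropic",
});

// LangSmith side: the same call recorded as a traced run.
const callFrontier = traceable(
  async (ticket: string) => {
    const msg = await anthropic.messages.create({
      model: "claude-3-5-sonnet-latest", // example model name
      max_tokens: 512,
      messages: [{ role: "user", content: `Summarize: ${ticket}` }],
    });
    return msg.content;
  },
  { name: "summarize-ticket" },
);
```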
Verdict
If your problem is "I can't see what the agent is doing," use LangSmith. It's the right tool. We don't try to compete on trace UI.
If your problem is "I'd like the model to do this without calling the frontier," use kolm. The capture loop is the same; the deliverable is a signed file instead of a dashboard.
Adjacent comparisons: vs OpenPipe · vs fine-tuning · vs RAG · full comparison table