The recall stack, fine-tuned.
A buyer who needs RAG over their own corpus does not need the OpenAI embedding API. They need an embedder that's good at their queries against their docs, plus a reranker tuned to their relevance signal. Both are fine-tunes on the buyer's captures; both land in the recall index alongside the LoRA adapter.
Two stages, two models, one stack
Modern retrieval is two-stage. A fast embedder converts queries and documents to dense vectors; an approximate nearest-neighbour index returns the top-K (typically 50-200) candidates per query in milliseconds. A slower, more accurate reranker then scores each (query, candidate) pair and returns the top-N (typically 5-20) for the LLM to read.
The embedder is a bi-encoder: query and document are encoded independently, and their similarity is the cosine of the two vectors. The reranker is a cross-encoder: query and document are concatenated and passed jointly through the model, which scores the pair in a single forward pass. Bi-encoders are O(N) at index time and O(1) at query time (plus the ANN lookup); cross-encoders are O(N) at query time over the whole corpus (which is why they only score the top-K candidates) but much more accurate per pair.
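A minimal sketch of the two scoring paths, with hypothetical `encode` and `model` callables (the point is where the pairing happens, not any particular API):

```python
import numpy as np

def bi_encoder_score(encode, query, doc):
    # Each side is encoded independently, so document vectors can be
    # computed once and stored in the ANN index.
    q, d = encode(query), encode(doc)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # cosine

def cross_encoder_score(model, query, doc):
    # The pair is scored jointly: one forward pass per (query, doc)
    # candidate, so nothing can be precomputed.
    return model(query, doc)
```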
The embedder: InfoNCE plus Matryoshka
The standard contrastive loss is InfoNCE: given a query, a positive document, and a batch of negatives, the embedding model should produce high similarity for (query, positive) and low similarity for (query, neg_i). With a large batch of in-batch negatives this is the cheapest way to get a competent embedder.
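In symbols, with sim denoting cosine similarity, $\tau$ the temperature, $d^{+}$ the positive, and the sum running over the batch of $B$ documents:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q, d^{+})/\tau)}{\sum_{i=1}^{B} \exp(\mathrm{sim}(q, d_i)/\tau)}$$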
```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature):
    # q, d: [batch, dim], pre-normalised; row i of d is query i's positive,
    # and every other row in the batch serves as an in-batch negative.
    logits = q @ d.T / temperature
    labels = torch.arange(len(q), device=q.device)  # diagonal: each query's positive is its own row
    return F.cross_entropy(logits, labels)
```
Matryoshka Representation Learning (Kusupati et al., 2022) layers on top: at every step the loss is computed not just at the full embedding dimension, but at a nested sequence of dimensions (64, 128, 256, 512, 768), and the total loss is the average across all dims (the paper's uniformly weighted sum, up to a constant factor). The trained model emits a 768-dim vector whose first 64 entries are themselves a competent 64-dim embedding, whose first 128 are a competent 128-dim embedding, and so on.
```python
import torch.nn.functional as F

def matryoshka_info_nce(q, d, dims, temperature):
    # Average the InfoNCE loss over every nested prefix dimension.
    total = 0.0
    for k in dims:
        q_k = F.normalize(q[:, :k], dim=-1)  # truncated prefixes must be re-normalised
        d_k = F.normalize(d[:, :k], dim=-1)
        total = total + info_nce(q_k, d_k, temperature)
    return total / len(dims)
```
The product: one model serves every latency tier in the nested sequence. A buyer who wants fast in-browser recall can truncate to 64 dims; a buyer who wants best quality uses the full 768. The recall index in the .kolm artifact ships the full vectors plus a marker for which dims are valid, so the runtime can pick the tier per query.
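Per-query truncation at the runtime might look like the following, with a hypothetical `embed` callable; the one fixed rule is that a truncated prefix must be re-normalised before use:

```python
import numpy as np

def query_vector(embed, query, dim=64):
    # Take the first `dim` entries of the full embedding, then re-normalise
    # so cosine similarity against same-dim truncated index vectors is valid.
    v = embed(query)[:dim]
    return v / np.linalg.norm(v)
```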
The reranker: cross-encoder, BCE or MSE
The reranker is a sequence-classification head on top of any encoder. Inputs are (query, document) concatenated; output is a single scalar score.
```python
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class Reranker(nn.Module):
    def __init__(self, base_model_id):
        super().__init__()
        self.encoder = AutoModelForSequenceClassification.from_pretrained(
            base_model_id, num_labels=1
        )
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id)

    def forward(self, query, doc):
        # The tokenizer encodes the pair jointly, inserting [SEP] between segments.
        inputs = self.tokenizer(query, doc, return_tensors="pt", truncation=True)
        return self.encoder(**inputs).logits.squeeze(-1)
```
Two label shapes are supported. Binary relevance (label in {0, 1}) uses BCE-with-logits loss; the model produces a logit you sigmoid at inference for a probability. Pointwise score (label in [0, 1] or any real) uses MSE; the model produces a raw score.
The kolm config picks BCE by default because most buyer captures are binary relevance signals (the user clicked / the answer was correct). Pointwise MSE is the right path when the buyer has explicit relevance scores (e.g., a 1-5 star rating).
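A minimal sketch of how the two shapes might map to losses; the `shape` flag mirrors the RerankConfig below and is an assumption, not the trainer's actual code:

```python
import torch.nn.functional as F

def rerank_loss(logits, labels, shape="binary"):
    if shape == "binary":
        # labels in {0, 1}; sigmoid(logit) gives a probability at inference
        return F.binary_cross_entropy_with_logits(logits, labels.float())
    # pointwise: labels in [0, 1] or any real; regress the raw score
    return F.mse_loss(logits, labels.float())
```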
The minimal calls
```python
from apps.trainer.embedding import EmbeddingTrainer, EmbedConfig
from apps.trainer.reranker import RerankerTrainer, RerankConfig

embedder = EmbeddingTrainer(
    model_id="BAAI/bge-base-en-v1.5",
    config=EmbedConfig(
        matryoshka_dims=(64, 128, 256, 512, 768),
        temperature=0.05,
        loss_shape="triplet",
    ),
)
embedder.train(triplets_dataset)  # (query, pos, neg) rows

reranker = RerankerTrainer(
    model_id="BAAI/bge-reranker-base",
    config=RerankConfig(shape="binary"),
)
reranker.train(pairs_dataset)  # (query, doc, label) rows
```
Both models save to the registry alongside the LoRA adapter. The .kolm manifest lists three artifacts: the LoRA, the embedder, and the reranker. The runtime picks the embedder dim per-query and the reranker top-N per-query, both budgeted against latency.
What lands in the receipt
"embedding": {
"base_model": "BAAI/bge-base-en-v1.5",
"papers": ["arXiv:1807.03748", "arXiv:2205.13147"],
"matryoshka_dims": [64, 128, 256, 512, 768],
"temperature": 0.05,
"loss_shape": "triplet",
"train_examples": 18432
},
"reranker": {
"base_model": "BAAI/bge-reranker-base",
"papers": ["arXiv:2402.03216"],
"shape": "binary",
"train_examples": 9216
}
The buyer's auditor can confirm which encoder family and which loss shape produced each model. Both blocks are covered by the artifact's CID and the canonical-JSON manifest hash; swapping either model invalidates the receipt chain.
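One plausible shape for that canonical-JSON hash, as an illustrative sketch (kolm's actual canonicalisation rules are not specified here):

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    # Canonical JSON: sorted keys, no insignificant whitespace, so the same
    # manifest always serialises to the same bytes and the same digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```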
How much capture data is enough
| Capture count | Embedder recipe | Reranker recipe |
|---|---|---|
| 0-200 | Stay with base BGE; do not fine-tune | Stay with base; do not fine-tune |
| 200-2k | Synth pairs with Magpie, then fine-tune | Use base; collect more before fine-tuning |
| 2k-20k | Fine-tune with InfoNCE + in-batch negatives | Fine-tune with BCE; sample hard negatives |
| 20k+ | Full Matryoshka recipe with hard-negative mining | Full BCE with PRM-style hard negatives |
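Reduced to code, the embedder column of the table might read as follows; the threshold values come from the rows above, the recipe names are illustrative:

```python
def embedder_recipe(n_captures: int) -> str:
    # Thresholds mirror the table; the returned names are labels, not trainer constants.
    if n_captures < 200:
        return "base"                          # stay with base BGE
    if n_captures < 2_000:
        return "magpie-synth + fine-tune"
    if n_captures < 20_000:
        return "infonce + in-batch negatives"
    return "matryoshka + hard-negative mining"
```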
Edge cases worth naming
Hard-negative mining matters. In-batch negatives only get you so far. The published recipe for the BGE family mines hard negatives from the index itself: for each (query, positive) pair, retrieve the top-50 candidates with the current embedder and label the lowest-ranked candidates that still made the top-50 as hard negatives. The kolm trainer re-mines per epoch.
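A sketch of that mining step, with an assumed ANN-index API (the per-epoch wiring lives in the trainer and is not shown):

```python
def mine_hard_negatives(index, query_vec, positive_id, k=50, n_hard=5):
    # Retrieve top-k with the current embedder; the lowest-ranked candidates
    # that still made the top-k (excluding the positive) are near-misses,
    # which makes them informative negatives.
    ids = index.search(query_vec, k)       # assumed: returns doc ids, best first
    candidates = [i for i in ids if i != positive_id]
    return candidates[-n_hard:]            # bottom of the top-k
```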
The reranker should see the embedder's misses. Reranker training data is most useful when it covers the cases the embedder gets wrong. The capture-loop bridge pre-filters to those: for each ranked retrieval, the bridge logs the queries where the user clicked a result that was not in the top-3 (the embedder ranked it lower). Those pairs go into reranker fine-tuning.
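A sketch of that pre-filter, over a hypothetical capture-log schema (field names are assumptions, not the bridge's actual format):

```python
def embedder_misses(capture_log, cutoff=3):
    # Assumed record shape: the query, the embedder's ranked doc ids, and the
    # doc id the user actually clicked.
    for rec in capture_log:
        if rec["clicked_id"] not in rec["ranked_ids"][:cutoff]:
            # The clicked doc fell below the top-3: an embedder miss, and a
            # positive (query, doc) pair for reranker fine-tuning.
            yield rec["query"], rec["clicked_id"], 1
```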
Matryoshka does not always pay. If the buyer has only one latency budget and never plans to truncate, Matryoshka is a small accuracy hit (~1-2% NDCG at full dim) for no real gain. The trainer config defaults Matryoshka on because the buyer rarely knows their latency profile a priori; turning it off via spec is a one-line override.
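With the EmbedConfig above, that override could be as small as:

```python
EmbedConfig(matryoshka_dims=(768,))  # single tier: plain InfoNCE at full dim
```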
Where this fits in the kolm compile loop
Embedder and reranker fine-tunes are sibling stages to the LoRA fine-tune. The buyer's spec can name any combination: LoRA-only (the default), LoRA + embedder, LoRA + embedder + reranker, embedder-only, reranker-only. Each named stage runs against its own captures (or the same captures with different label shapes), each lands in the receipt, each gets its own K-score evaluation against a vertical pack (MTEB-style for the embedder, BEIR-style for the reranker).
The combined artifact ships all three. The runtime loads them together; the inference path is embedder → ANN index → reranker → LLM with retrieved context. The 0.85 K-score gate covers the overall accuracy of this pipeline on the buyer's eval pack, not the individual components.
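End to end, the inference path might look like this sketch; the stage order is from the text, the method names are assumptions:

```python
def answer(query, embedder, index, reranker, llm, dim=768, top_k=100, top_n=8):
    # embedder -> ANN index -> reranker -> LLM with retrieved context
    q = embedder.encode(query, dim=dim)        # pick the Matryoshka tier
    candidates = index.search(q, top_k)        # fast, approximate recall
    ranked = sorted(candidates,
                    key=lambda c: reranker.score(query, c.text),
                    reverse=True)
    context = "\n\n".join(c.text for c in ranked[:top_n])
    return llm.generate(query, context=context)
```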
Citations
van den Oord, A., Li, Y. & Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018 (InfoNCE).
Kusupati, A. et al. Matryoshka Representation Learning. arXiv:2205.13147, 2022.
Chen, J. et al. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216, 2024.