research · training · 10 min read

Voice adapters.

A radiology dictation system fails on "hypoechoic". A bank's call-center transcript mangles every account number read out colloquially. The general-purpose Whisper model is excellent and not enough. The fix is not a new ASR architecture; it is a 40 MB LoRA adapter on Whisper's decoder cross-attention, trained on 5-50 hours of paired domain audio, gated by held-out WER, and shipped as a signed artifact next to the policy.

May 14, 2026 · Kolmogorov research · apps/trainer/audio.py

Why Whisper is the base

Radford et al. (2022) trained Whisper on 680k hours of multilingual paired audio/text scraped from the open web. The model is an encoder-decoder transformer: the encoder consumes log-mel spectrograms (80 channels through large-v2, 128 for large-v3) in 30-second windows; the decoder is a standard causal transformer with cross-attention to encoder outputs and generates the transcript token by token. The released checkpoints (tiny/base/small/medium/large-v3) cover the full quality/cost curve. MIT license. The weights are stable; the tokenizer is standardized; the inference path has been hardware-tuned across vLLM/MLX/whisper.cpp/CTranslate2.

For a buyer who needs a domain-tuned voice model, the question is not whether to train an ASR from scratch; it is which adapter shape to put on Whisper.

The narrow change: LoRA on decoder cross-attention only

Whisper's encoder learns acoustic features (formants, prosody, spectro-temporal structure). Domain-tuning rarely needs new acoustic features; the audio still sounds like speech. What it needs is to redirect the decoder's attention to encoder activations in the way the domain vocabulary requires (medical terms map to specific spectral templates, account-number readings need to refuse hallucinated digits). The narrow change is therefore to keep the encoder frozen and only LoRA-adapt the decoder's cross-attention projections.

from peft import LoraConfig

# PEFT treats a single string as a full-match regex (a list would be
# matched by suffix, where wildcards silently fail). Nothing under
# encoder.* matches this pattern => the encoder stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=r".*decoder\.layers\.\d+\.encoder_attn\.(q_proj|k_proj|v_proj|out_proj)",
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

The trainable parameter count for Whisper-large-v3 (1.55B total) under this configuration is roughly 11 M. The adapter on disk is 40 MB. Training fits on a single 24 GB consumer GPU at bf16 with gradient checkpointing and batch size 8.
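Before a long run, it is worth confirming the freeze programmatically. A minimal sketch: print_trainable_parameters reports the exact counts for the audit log, and the assertion verifies no encoder weight takes a gradient.

import torch
from transformers import WhisperForConditionalGeneration
from peft import get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable vs. total parameter counts

# Sanity check: nothing under the encoder should require grad.
# (".encoder." does not match the decoder's "encoder_attn" modules.)
assert not any(
    p.requires_grad for n, p in model.named_parameters() if ".encoder." in n
)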

We have tested the alternative of LoRA-ing the encoder as well; the WER improvement was 0.4 percentage points on radiology dictation and the adapter doubled in size. The narrow change is the right tradeoff.

The 30-second mel-spectrogram window

Whisper expects exactly 30 seconds of audio per forward pass, resampled to 16 kHz and converted to a log-mel spectrogram (80 channels through large-v2, 128 for large-v3) with a 25-ms window and 10-ms hop. Audio shorter than 30 seconds is right-padded with zeros (and the attention mask covers the pad). Audio longer than 30 seconds is split into 30-second chunks with a stride; the chunk transcripts are stitched at decode time.

The data path that matters: do not retrain the mel front-end. Use WhisperFeatureExtractor as-is. If the buyer's audio is 8 kHz telephony or 22 kHz studio, resample to 16 kHz before extraction. Mismatched sample rates produce subtle but persistent WER degradation that is not obvious in the loss curve.
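A minimal sketch of that path, assuming torchaudio for I/O (the file name and source rate are illustrative):

import torchaudio
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")

waveform, sr = torchaudio.load("call_0001.wav")  # e.g. 8 kHz telephony
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

features = feature_extractor(
    waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
print(features.input_features.shape)  # (1, n_mels, 3000): 30 s at a 10-ms hop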

The audio collator

The non-obvious engineering detail is the labels-side padding. The Whisper tokenizer pads labels with the pad-token id (50257 for multilingual checkpoints), which the cross-entropy loss would otherwise count as real tokens. The collator replaces pad-token ids with -100 after padding; -100 is the default ignore_index of nn.CrossEntropyLoss, so the padded positions never contribute to the loss.

from dataclasses import dataclass
from typing import Any

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features):
        # Pad the log-mel features and the label token ids separately.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace pad-token ids with -100 so the loss ignores them.
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        # Strip the leading <|startoftranscript|>; the model re-adds it
        # as the decoder-start token during training.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

The other detail: strip the leading <|startoftranscript|> token from the labels before the loss is computed. The model adds it on its own as the decoder-start token; including it in the targets shifts the loss by one position and produces silently degraded WER.
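Wiring it in is one line; a sketch, with model from the snippet above (for Whisper, decoder_start_token_id is the <|startoftranscript|> token):

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
# Passed straight to Seq2SeqTrainer(..., data_collator=collator).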

The WER metric

WER. (substitutions + insertions + deletions) / reference-word-count. Does not capture: whether the errors hit semantically critical tokens (a missed "not" vs a missed "the").

CER. Character-level edit distance / reference-character-count. Does not capture: whether the misspelling is plausible to a downstream NLP system.

WER on focus terms. WER restricted to the buyer's domain vocabulary list. Does not capture: general-purpose readability.

Held-out WER variance. Standard deviation of WER across speakers. Does not capture: catastrophic failure on a single speaker while the average looks fine.

kolm reports overall WER, focus-term WER, and held-out per-speaker variance on every save_steps. The audit log carries all three; the K-score's accuracy axis A is computed against the focus-term WER, not the overall, because the focus terms are why the buyer commissioned the adapter.
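A sketch of those three numbers, using the evaluate package's WER; the focus-term restriction is approximated per-utterance here, which may differ from kolm's exact gating:

import statistics
import evaluate

wer = evaluate.load("wer")

def report(preds, refs, speakers, focus_terms):
    overall = wer.compute(predictions=preds, references=refs)
    # Focus-term WER: score only utterances whose reference contains a term.
    pairs = [(p, r) for p, r in zip(preds, refs)
             if any(t in r.lower() for t in focus_terms)]
    focus = wer.compute(predictions=[p for p, _ in pairs],
                        references=[r for _, r in pairs]) if pairs else None
    # Per-speaker variance: stdev of per-speaker WERs (needs >= 2 speakers).
    by_spk = {}
    for p, r, s in zip(preds, refs, speakers):
        by_spk.setdefault(s, ([], []))[0].append(p)
        by_spk[s][1].append(r)
    spk_wers = [wer.compute(predictions=ps, references=rs)
                for ps, rs in by_spk.values()]
    return overall, focus, statistics.stdev(spk_wers)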

Edge cases worth naming

Code-switching. A Mandarin/English bilingual call center transcript benefits from Whisper-large-v3's multilingual decoder out of the box. Switching the language token mid-utterance is harder; the practical fix is to chunk by silence and re-detect the language token per chunk. The LoRA adapter alone does not solve this.
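A sketch of the per-chunk re-detection with the HF generation API: passing language=None makes Whisper re-run language detection on each call, so every silence-delimited chunk picks its own language token.

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

def transcribe_chunk(audio_16k):
    # audio_16k: one silence-delimited chunk as a 16 kHz float array.
    inputs = processor(audio_16k, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        # language=None => the language token is re-detected for this chunk.
        ids = model.generate(inputs.input_features, task="transcribe", language=None)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]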

The encoder is the bottleneck for audio degradation. If the audio is noisy (background music, lossy compression, low-bit-rate telephony), no decoder-side adapter can recover what the frozen encoder never resolved. The right intervention is upstream: a denoising preprocessing step, or a Whisper variant whose encoder was trained on noisy audio (such as the noisy-student-distilled forks).

Hallucinated speakers and timestamps. Whisper-large-v3 will sometimes generate plausible-sounding text during long silences. The standard mitigation is voice-activity detection (Silero VAD) ahead of feature extraction; kolm's reference recipe runs Silero with a 500-ms minimum-speech window and skips chunks below that threshold.
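A sketch of that gate, using the torch.hub distribution of Silero VAD (the 500-ms floor matches the reference recipe):

import torch

vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def speech_segments(waveform_16k):
    # Returns [{'start': sample, 'end': sample}, ...]; segments with less
    # than 500 ms of detected speech are dropped before feature extraction.
    return get_speech_timestamps(
        waveform_16k,
        vad_model,
        sampling_rate=16_000,
        min_speech_duration_ms=500,
    )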

The 30-second boundary. A medical sentence that spans the 30-second chunk boundary gets two truncated transcripts. The stitcher concatenates them, but a domain term that crossed the boundary may be transcribed wrong on both sides. The reference recipe uses a 2-second overlap between chunks and aligns by greedy longest-common-subsequence.
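A sketch of the chunker and a simplified stitcher; exact suffix/prefix matching stands in here for the recipe's greedy LCS alignment:

def chunk_with_overlap(samples, sr=16_000, window_s=30.0, overlap_s=2.0):
    # Overlapping 30 s windows; the 2 s overlap gives the stitcher shared
    # context on both sides of every boundary.
    size = int(window_s * sr)
    step = size - int(overlap_s * sr)
    return [samples[i : i + size]
            for i in range(0, max(len(samples) - int(overlap_s * sr), 1), step)]

def stitch(left_words, right_words, max_overlap=20):
    # Merge two chunk transcripts on the longest exact suffix/prefix run.
    best = 0
    for k in range(1, min(max_overlap, len(left_words), len(right_words)) + 1):
        if left_words[-k:] == right_words[:k]:
            best = k
    return left_words + right_words[best:]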

What the receipt records

"audio": {
  "method": "whisper_lora_cross_attn",
  "base_model": "openai/whisper-large-v3",
  "lora_r": 16,
  "lora_alpha": 32,
  "target_modules": "decoder.encoder_attn.{q,k,v,out}_proj",
  "encoder_frozen": true,
  "n_hours_train": 28.4,
  "n_hours_eval": 3.1,
  "sample_rate_hz": 16000,
  "wer_overall": 0.041,
  "wer_focus": 0.018,
  "wer_speaker_variance": 0.007,
  "papers": [
    "arXiv:2212.04356",
    "arXiv:2106.09685",
    "arXiv:2311.00430"
  ]
}

The canonical-JSON manifest hash covers the block, so a tampered receipt invalidates the artifact signature. A reviewer can confirm the encoder was frozen, the focus-term WER landed below the buyer's stated threshold, and the per-speaker variance is bounded.
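A sketch of the verification step, assuming canonical JSON means sorted keys and compact separators (the binder's exact canonicalization may differ):

import hashlib
import json

def manifest_hash(audio_block: dict) -> str:
    # Sorted keys + compact separators: any change to the block changes
    # the hash, which in turn invalidates the artifact signature.
    canonical = json.dumps(audio_block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()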

Where this fits in the kolm compile loop

Voice adapters are an independent artifact from text adapters; they sit alongside the text LoRA in the buyer's deployment and are loaded by the runtime when the request hits /v1/audio/transcriptions. The capture loop is the same shape (recordings + reference transcripts in, signed adapter out, K-score gate on the way through); the only domain-specific surface is the WER metric in place of accuracy. A buyer who needs domain text and domain voice ships both adapters from one binder.

Citations

Radford, A. et al. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356, 2022. The Whisper paper.

Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021.

Gandhi, S. et al. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv:2311.00430, 2023.

Vaswani, A. et al. Attention Is All You Need. arXiv:1706.03762, 2017. Encoder-decoder cross-attention.