VLM fine-tuning on one A100.
A vision-language model has two towers and a projector. The trick to fine-tuning one cheaply is to keep the vision tower frozen, attach LoRA to the language tower, and let the projector ride along. The trick to fine-tuning one correctly is to remember that the image tokens are real tokens and the collator has to pad them like anything else.
The three parts of a VLM
A modern VLM (Qwen2.5-VL, LLaVA-1.6, Cambrian-1, Idefics-3, Phi-3.5-vision) has three components: a vision encoder (typically a ViT or SigLIP, 300M-1B params), a projector (an MLP or Q-Former that maps image features into the language model's embedding space, 5-50M params), and a language tower (an autoregressive transformer, 1.5B-72B params). The language tower is the expensive part. The vision encoder is rarely retrained after pretraining, and modern VLMs often share the same encoder across families.
For the typical fine-tuning task (a chart-QA distill, a document-extraction adapter, a screenshot-grounded agent), the work is almost entirely in the language tower. The vision encoder already sees the buyer's images well enough; the projector already maps them to the right subspace. What changes is how the language tower reasons over the result, and that lives in the attention and MLP projections.
What the kolm trainer freezes by default
VLMTrainConfig.freeze_vision=True walks the named modules and marks any whose name contains vision, visual, or image_encoder as requires_grad=False. This catches the encoder reliably across Qwen2.5-VL (visual), LLaVA (vision_tower), and Idefics (vision_model). The projector stays trainable; LoRA attaches to the language tower's projection matrices only.
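In sketch form (illustrative, not the exact kolm implementation; it walks named parameters rather than modules, and `model` is the loaded VLM):

VISION_MARKERS = ("vision", "visual", "image_encoder")

def freeze_vision_tower(model):
    # Mark every parameter whose qualified name matches a vision marker as non-trainable
    # and return the frozen names so they can land in the receipt.
    frozen = []
    for name, param in model.named_parameters():
        if any(marker in name for marker in VISION_MARKERS):
            param.requires_grad = False
            frozen.append(name)
    return frozen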
The LoRA target modules default to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, which is the standard set for Qwen and Llama architectures. PEFT's auto-discovery handles cases where the model uses different names; you pass target_modules="all-linear" for that route, and PEFT figures out which leaves are linear layers in the language tower.
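The equivalent PEFT configuration, for reference (a sketch using the peft library; the commented-out all-linear line is the auto-discovery route):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # target_modules="all-linear",  # let PEFT discover the linear leaves instead
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` is the loaded VLM with the vision tower frozen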
The processor is the load-bearing piece
The Hugging Face AutoProcessor for a VLM bundles a tokenizer and an image processor. It exposes apply_chat_template(messages, tokenize=False, add_generation_prompt=False) which knows how to insert image-token placeholders for each image in the messages list, in the format the model expects. Qwen2.5-VL uses <|vision_start|>...<|vision_end|>; LLaVA uses <image>. The trainer does not need to know which.
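A minimal sketch of that call, assuming the Qwen2.5-VL processor and an illustrative prompt-response pair (the same code runs against other families; only the emitted placeholder differs):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the peak value in this chart?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The peak is 42, in March."},
    ]},
]

rendered = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# `rendered` contains the model-specific placeholder (<|vision_start|>...<|vision_end|> here);
# the same call on a LLaVA processor emits <image> instead.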
The collator builds the batch by rendering each example's chat template, gathering the image list, and calling the processor with both arguments. The processor returns a dict containing input_ids, attention_mask, and one of pixel_values (LLaVA-style) or image_grid_thw + pixel_values (Qwen2.5-VL-style). All of these get passed to the model's forward as kwargs.
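In sketch form (illustrative, not the kolm collator), assuming one rendered chat string and one image list per example:

def build_batch(processor, rendered_texts, image_lists):
    batch = processor(
        text=rendered_texts,     # one rendered chat template per example
        images=image_lists,      # one list of PIL images per example
        padding=True,
        return_tensors="pt",
    )
    # Depending on the family, `batch` holds input_ids, attention_mask, and pixel_values
    # (LLaVA-style) or pixel_values + image_grid_thw (Qwen2.5-VL-style).
    return batch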
The minimal call
from apps.trainer.vlm import vlm_trainer, VLMTrainConfig

trainer = vlm_trainer(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    train_dataset=chart_qa_dataset,  # rows of (image, prompt, response)
    config=VLMTrainConfig(
        lora_r=16,
        lora_alpha=32,
        freeze_vision=True,
        learning_rate=1e-4,
        max_seq_length=2048,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
    ),
)
trainer.train()
The dataset shape is a list of dicts with keys image (a PIL.Image, a path, or a base64 string), prompt (the user turn), and response (the assistant turn). The collator builds the chat template, hands the rendered text and the image list to the processor, and masks the labels so the loss only flows over the response tokens.
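For concreteness, one row in that shape plus a sketch of the response-only masking (values are illustrative; prompt_lengths is assumed to come from tokenizing the rendered prompt separately):

row = {
    "image": "charts/q3_revenue.png",         # PIL.Image, filesystem path, or base64 string
    "prompt": "What was revenue in Q3?",      # user turn
    "response": "Revenue in Q3 was $4.2M.",   # assistant turn; only these tokens take loss
}

def mask_labels(input_ids, attention_mask, prompt_lengths, ignore_index=-100):
    # Padding and everything before the assistant turn get -100 so cross-entropy ignores them.
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index
    for i, prompt_len in enumerate(prompt_lengths):
        labels[i, :prompt_len] = ignore_index
    return labels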
What lands in the receipt
"vlm_train": {
"model_id": "Qwen/Qwen2.5-VL-7B-Instruct",
"papers": ["arXiv:2502.13923", "arXiv:2304.08485"],
"lora_r": 16,
"lora_alpha": 32,
"lora_targets": ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
"freeze_vision": true,
"frozen_modules": ["visual.patch_embed", "visual.blocks.*", "visual.merger"],
"trainable_modules": ["language_model.layers.*.self_attn.q_proj.lora_A", "..."],
"learning_rate": 1e-4,
"max_seq_length": 2048,
"train_examples": 4096
}
The buyer's auditor can confirm which sub-modules saw a gradient and which did not. The trainable-modules list is glob-summarised in the receipt and exact-listed in a sidecar manifest because the verbose form is several hundred entries for a 7B model.
Sizing on real hardware
| Model | VRAM (bf16 + LoRA r=16) | Step time (A100-80GB) |
|---|---|---|
| Qwen2.5-VL-3B | ~14 GB | ~0.6 s/step at seq_len=2048 |
| Qwen2.5-VL-7B | ~32 GB | ~1.4 s/step at seq_len=2048 |
| Qwen2.5-VL-72B | ~196 GB | multi-GPU, FSDP+QLoRA territory |
| LLaVA-Next-7B | ~28 GB | ~1.2 s/step at seq_len=2048 |
The 7B fits with room to spare on a single A100-80GB at batch size 1, gradient accumulation 8, with the vision tower frozen and bf16 throughout. The 3B is the sweet spot for buyers iterating on a chart-QA or screenshot-grounded distill: cheap enough to retrain hourly, capable enough that the final K-score on the distill exceeds GPT-4-vision on the buyer's narrow distribution.
Edge cases worth naming
The image token count is variable. Qwen2.5-VL emits a different number of tokens per image depending on image_grid_thw (typically 64-1280 tokens at 224-1024 pixel sides). The trainer's max_seq_length covers text plus image tokens; if the buyer's images are large, the text budget shrinks. The collator truncates the text first and warns when it had to.
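A rough budget check, assuming Qwen2.5-VL's 2x2 spatial merge (LLM tokens per image ≈ t*h*w / merge_size², with the grid values taken from the processor's image_grid_thw output):

def image_token_count(grid_thw, merge_size=2):
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)

def text_budget(max_seq_length, grids):
    return max_seq_length - sum(image_token_count(g) for g in grids)

print(text_budget(2048, [(1, 36, 52)]))  # one ~504x728 image -> 468 image tokens, 1580 left for text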
Multi-image prompts. The processor handles them, but the collator's pad-to-longest path can produce empty image-token slots for examples with fewer images in a batch. The fix is to pad image lists to a fixed max-images-per-example (default 4) at dataset prep time; the documentation is explicit about this in the Qwen2.5-VL README.
Vision-tower fine-tuning is rarely worth it. If the buyer's K-score is missing on a chart-QA task, the cause is almost always the language tower failing to ground; unfreezing the vision encoder doubles VRAM and rarely moves the needle. The only honest case for vision fine-tuning is a deeply out-of-distribution image domain (X-rays, satellite imagery, microscopy), and even there the right answer is usually a domain-specific vision encoder swap, not fine-tuning the original.
The projector is fragile. The trainer leaves the projector trainable by default because freezing it can degrade the alignment between the vision and language towers. If the buyer's K-score is unstable across runs at the same hyperparameters, the projector is the first suspect; a lower learning rate on the projector group (or freezing it after warm-up) usually fixes it.
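A minimal sketch of a lower learning rate on the projector group (the projector module is visual.merger on Qwen2.5-VL and multi_modal_projector on LLaVA; the 5x ratio here is an assumption, not a kolm default):

import torch

PROJECTOR_MARKERS = ("merger", "multi_modal_projector", "mm_projector")

projector_params, lora_params = [], []
for name, param in model.named_parameters():   # `model` is the PEFT-wrapped VLM
    if not param.requires_grad:
        continue
    if any(marker in name for marker in PROJECTOR_MARKERS):
        projector_params.append(param)
    else:
        lora_params.append(param)

optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 1e-4},        # LoRA adapters on the language tower
    {"params": projector_params, "lr": 2e-5},   # ~5x lower to keep the alignment stable
])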
Where this fits in the kolm compile loop
VLM fine-tuning lives in the same pipeline as text-only fine-tuning. The trainer plumbing is shared; only the model loader, the collator, and the dataset shape differ. The K-score evaluator runs the same A/S/L/C/V formula against the buyer's eval pack; the only difference is that the eval examples carry images alongside the prompt-response pair. The receipt chain, the CID, the verifier, and the registry surface are identical.
For the buyer who has a vision-grounded task (chart-QA on financial reports, layout extraction on legal documents, screenshot agents that read web UIs), the workflow is: capture pairs through kolm capture with the image attached, write a spec that names the base VLM, run kolm compile, ship the .kolm. The trainer detects the VLM family from the model id and routes through apps/trainer/vlm.py automatically.
Citations
Wang, P. et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025.
Liu, H. et al. Visual Instruction Tuning. arXiv:2304.08485, 2023.
Tong, S. et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv:2406.16860, 2024.