How we run a fine-tuning project
Five stages, each with a gate: we don't move to the next stage until the current gate passes. This is how fine-tunes actually ship instead of getting stuck on a single bad eval.
Discovery & data audit
We look at what data you actually have: volume, format, quality, PII exposure. In our experience, most failed fine-tunes trace back to a dataset that wasn't ready. We tell you straight what's needed.
Data prep & synthesis
Cleaning, deduplication, PII scrubbing, structured-output normalisation. Where data is thin, we generate high-quality synthetic pairs and validate them against a held-out set.
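As an illustration of the cleaning step, here is a minimal sketch of PII scrubbing followed by exact-duplicate removal. The two regexes, the `[EMAIL]` / `[PHONE]` placeholders, and the `prompt` / `response` field names are assumptions made for this example; real PII coverage needs much more than two patterns, and near-duplicate detection is out of scope here.

```python
import hashlib
import re

# Illustrative patterns only (assumption): production scrubbing needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def prepare(records):
    """Scrub each prompt/response pair, then drop exact duplicates after normalisation."""
    seen = set()
    cleaned = []
    for rec in records:
        prompt = scrub_pii(rec["prompt"])
        response = scrub_pii(rec["response"])
        key_text = normalise(prompt) + "\x00" + normalise(response)
        key = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same pair already kept
        seen.add(key)
        cleaned.append({"prompt": prompt, "response": response})
    return cleaned
```

Scrubbing before hashing means two records that differ only in the PII they contain collapse into one, which is usually what you want for training data.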
Technique selection
LoRA, QLoRA, full fine-tuning, DPO, instruction tuning, or continued pretraining, chosen to fit the goal. Most projects do not need full fine-tuning, and we'll tell you when LoRA is the better answer.
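To see why LoRA is so much cheaper than a full fine-tune, it helps to count the trainable parameters. A rank-`r` adapter on a `d_out × d_in` weight matrix adds `r × (d_in + d_out)` parameters. The sketch below applies that formula to a hypothetical 7B-parameter model (hidden size 4096, 32 layers, two adapted square matrices per layer); those shape numbers are assumptions for illustration, not a description of any specific project.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_fraction(hidden: int, n_layers: int, rank: int,
                     targets_per_layer: int, total_params: float) -> float:
    """Fraction of the model that is trainable when adapting square hidden x hidden matrices."""
    per_layer = targets_per_layer * lora_params(hidden, hidden, rank)
    return n_layers * per_layer / total_params


# Hypothetical 7B-class shape (assumption): rank 8 on two matrices per layer.
fraction = adapter_fraction(hidden=4096, n_layers=32, rank=8,
                            targets_per_layer=2, total_params=7e9)
print(f"{fraction:.4%} of parameters are trainable")
```

With these assumed numbers the adapters come to about 4.2 million parameters, roughly 0.06% of the model. Raising the rank or adapting more matrices per layer scales that figure linearly.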
Training & evaluation
Training runs on H100 or A100 GPUs with DeepSpeed ZeRO and PEFT. Every checkpoint is gated against domain benchmarks, holdouts, and adversarial probes before promotion.
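A promotion gate of this kind can be sketched as a plain threshold check. The metric names (`domain_score`, `holdout_score`, `probe_failure_rate`) and the `GateThresholds` class are hypothetical, chosen for this example; actual thresholds depend on the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical thresholds (assumption): set per project, not universal values."""
    min_domain_score: float
    min_holdout_score: float
    max_probe_failure_rate: float


def gate_failures(metrics: dict, thresholds: GateThresholds) -> list:
    """Return a human-readable reason for every gate the checkpoint fails."""
    failures = []
    if metrics["domain_score"] < thresholds.min_domain_score:
        failures.append("domain benchmark below threshold")
    if metrics["holdout_score"] < thresholds.min_holdout_score:
        failures.append("holdout score below threshold")
    if metrics["probe_failure_rate"] > thresholds.max_probe_failure_rate:
        failures.append("too many adversarial probes succeeded")
    return failures


def should_promote(metrics: dict, thresholds: GateThresholds) -> bool:
    """A checkpoint is promoted only when no gate fails."""
    return not gate_failures(metrics, thresholds)
```

Returning the list of reasons rather than a bare boolean keeps the gate auditable: a rejected checkpoint says which check it failed.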
Quantization & deployment
AWQ or GPTQ quantization, or a quantized GGUF export, to fit the target hardware. Served via vLLM, TensorRT-LLM, or llama.cpp depending on the deployment.
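A rough way to judge whether a quantized model fits the target hardware is to estimate weight memory as parameters × bits per weight ÷ 8. The sketch below does only that; it ignores the KV cache, activations, and per-group quantization metadata, and the 20% headroom is an assumed safety margin rather than a rule.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params: float, bits_per_weight: float, device_gib: float,
         headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` of device memory free (assumed margin)."""
    return weight_memory_gib(n_params, bits_per_weight) <= device_gib * (1 - headroom)


# Example: a 7B-parameter model on a 16 GiB device.
print(round(weight_memory_gib(7e9, 16), 2))  # 16-bit weights
print(round(weight_memory_gib(7e9, 4), 2))   # 4-bit weights
print(fits(7e9, 16, 16), fits(7e9, 4, 16))
```

For a 7B-parameter model this gives about 13 GiB at 16 bits and a little over 3 GiB at 4 bits, which is why 4-bit quantization is often what makes a single smaller GPU viable.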
Picking the right approach
The most common fine-tuning mistake we see is over-engineering. Most projects need LoRA, not a full fine-tune. Here's the decision table we work from.
| Technique | What it changes | Compute | When we use it |
|---|---|---|---|
| LoRA | Adapter weights only (often ~0.1–1% of params, depending on rank) | Modest (1× A100) | Most projects. Fast, cheap, adapter-swappable. |
| QLoRA | LoRA adapters on a 4-bit quantized base model | Small (1× consumer GPU) | Tight memory or budget. Quality is usually close to standard LoRA. |
| Full fine-tune | All model weights | Heavy (8× H100+) | Deep behaviour shift or new vocabulary. |
| DPO / RLHF | Preference-aligned weights | Medium (4× A100) | Tone, safety, or output-format alignment. |
| Instruction tuning | Behavioural format | Medium | Bringing a base model to chat or tool-use behaviour. |
| Continued pretraining | Base weights on new domain corpus | Heavy | Massive vocabulary shift (legal, biomed, code). |
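The table above can be read as a small decision function. The sketch below is one possible encoding: the flag names and the order in which the checks run are assumptions made for this example, and a real decision weighs more factors than five booleans.

```python
def pick_technique(*,
                   massive_vocabulary_shift: bool = False,
                   deep_behaviour_shift: bool = False,
                   preference_alignment: bool = False,
                   base_model_needs_chat_format: bool = False,
                   memory_constrained: bool = False) -> str:
    """Map project needs to a row of the decision table (check order is an assumption)."""
    if massive_vocabulary_shift:
        return "Continued pretraining"
    if deep_behaviour_shift:
        return "Full fine-tune"
    if preference_alignment:
        return "DPO / RLHF"
    if base_model_needs_chat_format:
        return "Instruction tuning"
    if memory_constrained:
        return "QLoRA"
    return "LoRA"  # the default for most projects


print(pick_technique())                          # LoRA
print(pick_technique(memory_constrained=True))   # QLoRA
```

The heavy options are checked first and LoRA is the fall-through, which mirrors the point of the table: reach for the expensive techniques only when a specific need demands them.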
What we've fine-tuned
Fine-tuning isn't just an LLM thing. We've shipped customised checkpoints across every major modality.
LLMs
Domain Q&A, code, structured extraction, vertical chat. Llama / Qwen / Mistral / DeepSeek.
VLMs
Document AI, screen understanding, visual QA. Qwen-VL, LLaVA, InternVL, Pixtral.
Speech (STT)
Domain-vocabulary Whisper / Parakeet for call centres, healthcare, finance.
Speech (TTS / voice cloning)
XTTS-v2, F5-TTS, Kokoro voice cloning from 5–10 min of samples.
Image (LoRA)
Brand-style Flux LoRAs, product-shot LoRAs, character LoRAs.
Video (LoRA)
LTX-Video brand-style LoRAs for cinematic consistency across generated shots.
What we measure before we ship
Domain benchmark
Customer-supplied held-out evaluation set. The metric the business actually cares about.
Perplexity & loss curves
Standard training signals. Useful for sanity, not for promotion.
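For reference, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming the losses are per-token values in nats:

```python
import math


def perplexity(token_nll: list) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), with losses in nats."""
    if not token_nll:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nll) / len(token_nll))


# A model that assigns probability 1/4 to every token has perplexity 4.
print(perplexity([math.log(4)] * 3))
```

A lower value means the model is less surprised by the evaluation text, which is a useful sanity signal but says little about whether the outputs are correct for the task.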
Adversarial probes
Jailbreaks, prompt injections, edge-case inputs. Run on every checkpoint.
Regression against base
Did we lose general capability? Score against MMLU / GSM8K / HumanEval.
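One simple way to frame this check is to compare per-benchmark scores for the base and tuned models and flag any drop beyond a tolerance. The `max_drop` default of 0.02 is an assumed tolerance for illustration, and scores are taken to be fractions between 0 and 1.

```python
def capability_regressions(base_scores: dict, tuned_scores: dict,
                           max_drop: float = 0.02) -> dict:
    """Return {benchmark: drop} for every benchmark where the tuned model lost more than max_drop.

    Raises KeyError if the tuned model is missing a benchmark the base model was scored on.
    """
    regressions = {}
    for name, base in base_scores.items():
        drop = base - tuned_scores[name]
        if drop > max_drop:
            regressions[name] = drop
    return regressions
```

An empty result means no general benchmark regressed beyond the tolerance; any entry is a reason to look at the checkpoint before promoting it.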
Latency + memory
Production constraint. A model that breaks the latency budget doesn't ship.
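A latency budget is usually stated against a tail percentile rather than the mean. The sketch below uses the nearest-rank method and a 95th-percentile default; both are assumptions for this example, and the budget itself comes from the deployment.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(latencies_ms: list, budget_ms: float, pct: int = 95) -> bool:
    """True if the chosen tail percentile of measured latencies is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms
```

Checking the tail catches a model that is fast on average but occasionally stalls, which is the case that tends to break a production budget.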
Cost per request
End-to-end inference cost vs the hosted alternative. Always reported.
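The comparison comes down to two small formulas: self-hosted cost is the GPU's hourly price divided by the requests it serves per hour, and hosted cost is token counts multiplied by per-token prices. A minimal sketch follows; every parameter name is an assumption for illustration, and the inputs must come from measured throughput and real price lists.

```python
def self_hosted_cost(gpu_hourly_usd: float, requests_per_second: float,
                     utilisation: float = 1.0) -> float:
    """USD per request for a self-hosted model at a sustained request rate."""
    served_per_hour = requests_per_second * utilisation * 3600
    if served_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return gpu_hourly_usd / served_per_hour


def hosted_cost(input_tokens: int, output_tokens: int,
                usd_per_million_input: float, usd_per_million_output: float) -> float:
    """USD per request for a hosted API billed per million tokens."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000
```

The `utilisation` factor matters in practice: a GPU that sits idle half the time doubles the effective cost per request, which can erase the advantage over a hosted API.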