Fine-Tuning · From base model to your domain

From a Hugging Face checkpoint to your production domain model.

LoRA, QLoRA, full fine-tuning, DPO. We've shipped fine-tunes across LLMs, VLMs, STT, TTS, and image / video LoRAs - primarily on the Hugging Face ecosystem. Methodology that actually generalises, not a one-off notebook.

40+ open-source models fine-tuned to date
6 modalities (LLM, VLM, STT, TTS, image, video)
97% of GPT-4 accuracy on customer benchmarks
Weeks from kickoff to deployed checkpoint

Methodology

How we run a fine-tuning project

Five stages. Each one has a gate - we don't move forward if the gate fails. This is how fine-tunes actually ship instead of getting stuck on a single bad eval.

STEP 01

Discovery & data audit

We look at what data you actually have - volume, format, quality, PII exposure. Most fine-tunes fail because the dataset wasn't ready. We tell you straight what's needed.

STEP 02

Data prep & synthesis

Cleaning, deduplication, PII scrubbing, structured-output normalisation. Where data is thin, we generate high-quality synthetic pairs and validate them against a held-out set.
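
As a rough illustration of the cleaning pass described above, the sketch below scrubs two kinds of PII and drops exact duplicates from prompt/response pairs. The regular expressions, the placeholder tokens, and the function names are assumptions made for this example; a production scrubber would need a much broader, audited pattern set and near-duplicate detection.

```python
import hashlib
import re

# Illustrative patterns only (assumption): real PII coverage is far wider.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def clean_pairs(pairs):
    """Scrub PII, then drop exact duplicates (after normalisation), keeping order."""
    seen = set()
    cleaned = []
    for prompt, response in pairs:
        prompt, response = scrub_pii(prompt), scrub_pii(response)
        key = hashlib.sha256(
            (normalise(prompt) + "\x00" + normalise(response)).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            cleaned.append((prompt, response))
    return cleaned
```

Scrubbing before hashing means two records that differ only in the personal data they contained collapse into one, which is usually what you want for training data.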

STEP 03

Technique selection

LoRA, QLoRA, full fine-tuning, DPO, or instruction tuning, chosen to fit the goal. Most projects do not need full fine-tuning, and we'll tell you when LoRA is the better answer.
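
One way to picture "picked from the goal" is a small rule function that defaults to LoRA and only escalates when a specific need is present. The `Goal` fields and the order in which the rules are checked are assumptions made for this sketch; the technique names and their triggers follow the decision table in the Techniques section.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """Hypothetical project description; field names are illustrative."""
    needs_preference_alignment: bool = False  # tone, safety, output format
    base_model_lacks_chat: bool = False       # raw base model, no chat or tool use
    large_domain_corpus: bool = False         # e.g. legal, biomed, code
    deep_behaviour_shift: bool = False        # or new vocabulary
    tight_memory: bool = False                # small GPU or small budget


def pick_technique(goal: Goal) -> str:
    """Return the first matching technique; LoRA is the default."""
    # Rule order is an assumption: heaviest requirements are checked first.
    if goal.large_domain_corpus:
        return "Continued pretraining"
    if goal.deep_behaviour_shift:
        return "Full fine-tune"
    if goal.needs_preference_alignment:
        return "DPO / RLHF"
    if goal.base_model_lacks_chat:
        return "Instruction tuning"
    if goal.tight_memory:
        return "QLoRA"
    return "LoRA"
```

In practice techniques are often combined (for example a LoRA stage followed by DPO), so a single return value is a simplification.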

STEP 04

Training & evaluation

Training runs on H100 / A100 GPUs with DeepSpeed ZeRO and PEFT. Every checkpoint is gated against domain benchmarks, held-out sets, and adversarial probes before promotion.
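
A promotion gate of this kind can be sketched as a comparison of per-checkpoint scores against minimum thresholds. The metric names and threshold values below are invented for illustration, and the sketch assumes every metric is "higher is better".

```python
def passes_gate(scores: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (promote?, failed metric names) for one checkpoint.

    A metric missing from `scores` counts as a failure, so a skipped
    evaluation can never promote a checkpoint by accident.
    """
    failures = [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, float("-inf")) < minimum
    ]
    return (not failures, failures)


# Hypothetical thresholds; real values come from the project's own targets.
THRESHOLDS = {
    "domain_benchmark": 0.85,
    "holdout_accuracy": 0.80,
    "adversarial_pass_rate": 0.95,
}
```

For example, `passes_gate({"domain_benchmark": 0.90, "holdout_accuracy": 0.82}, THRESHOLDS)` fails because the adversarial probes were never run.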

STEP 05

Quantization & deployment

AWQ / GPTQ / GGUF quantization to fit the target hardware. Served via vLLM, TensorRT-LLM, or llama.cpp depending on the deployment.
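
To see why quantization decides what hardware a model fits on, a back-of-the-envelope estimate of weight memory is parameters × bits per weight ÷ 8. The sketch below applies that arithmetic; the `headroom` fraction and the function names are assumptions, and the estimate deliberately ignores the KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, vram_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction of VRAM.

    `headroom` (an assumed 80%) leaves room for the KV cache and activations.
    """
    return weight_memory_gb(n_params, bits_per_weight) <= vram_gb * headroom


# A 7B-parameter model: about 14 GB at 16-bit, about 3.5 GB at 4-bit.
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))
```

Under these assumptions a 70B-parameter model does not fit on a single 80 GB GPU at 16-bit (140 GB of weights) but does at 4-bit (35 GB).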

Techniques

Picking the right approach

The single most common fine-tuning mistake is over-engineering. Most projects want LoRA - not full FT. Here's our actual decision table.

Technique | What it changes | Compute | When we use it
LoRA | Adapter weights only (~0.1% of params) | Modest (1× A100) | Most projects. Fast, cheap, adapter-swappable.
QLoRA | LoRA on a quantized base model | Small (1× consumer GPU) | Tight memory or budget. Surprisingly good results.
Full fine-tune | All model weights | Heavy (8× H100+) | Deep behaviour shift or new vocabulary.
DPO / RLHF | Preference-aligned weights | Medium (4× A100) | Tone, safety, or output-format alignment.
Instruction tuning | Behavioural format | Medium | Bringing a base model to chat or tool-use behaviour.
Continued pretraining | Base weights on new domain corpus | Heavy | Massive vocabulary shift (legal, biomed, code).
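
The "~0.1% of params" figure for LoRA can be sanity-checked with arithmetic: a rank-r adapter on a weight matrix of shape d_out × d_in adds r × (d_in + d_out) trainable parameters. The sketch below works through one configuration. The model shape (hidden size 4096, 32 layers, adapters on two 4096 × 4096 projections per layer, roughly 7 billion parameters in total) is a hypothetical example, not a description of any particular project.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return rank * (d_in + d_out)


# Hypothetical 7B-class model: values chosen for illustration only.
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2   # e.g. two attention projections
RANK = 8
TOTAL_BASE_PARAMS = 7e9

trainable = LAYERS * ADAPTED_MATRICES_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = trainable / TOTAL_BASE_PARAMS

print(f"{trainable:,} trainable params, {fraction:.2%} of the base model")
```

Raising the rank or adapting more matrices increases the fraction proportionally, which is how real configurations range from well under 0.1% to around 1%.
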
Modalities

What we've fine-tuned

Fine-tuning isn't just an LLM thing. We've shipped customised checkpoints across every major modality.

LLMs

Domain Q&A, code, structured extraction, vertical chat. Llama / Qwen / Mistral / DeepSeek.

VLMs

Document AI, screen understanding, visual QA. Qwen-VL, LLaVA, InternVL, Pixtral.

Speech (STT)

Domain-vocabulary Whisper / Parakeet for call centres, healthcare, finance.

Speech (TTS / voice cloning)

Voice cloning with XTTS-v2, F5-TTS, and Kokoro from 5–10 minutes of samples.

Image (LoRA)

Brand-style Flux LoRAs, product-shot LoRAs, character LoRAs.

Video (LoRA)

LTX-Video brand-style LoRAs for cinematic consistency across generated shots.

Evaluation

What we measure before we ship

Domain benchmark

Customer-supplied held-out evaluation set. The metric the business actually cares about.

Perplexity & loss curves

Standard training signals. Useful as a sanity check, not as a promotion criterion.

Adversarial probes

Jailbreaks, prompt injections, edge-case inputs. Run on every checkpoint.

Regression against base

Did we lose general capability? Score against MMLU / GSM8K / HumanEval.
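
A regression check against the base model can be as simple as flagging any general benchmark where the tuned checkpoint drops by more than a tolerance. In the sketch below the 2-point tolerance and all scores are made-up values for illustration, and both dictionaries are assumed to contain the same benchmarks.

```python
def capability_regressions(base: dict, tuned: dict, max_drop: float = 0.02) -> dict:
    """Return {benchmark: score delta} where tuned fell more than `max_drop` below base."""
    return {
        name: round(tuned[name] - base[name], 4)
        for name in base
        if base[name] - tuned[name] > max_drop
    }


# Invented scores, not measured results.
base_scores = {"MMLU": 0.65, "GSM8K": 0.50, "HumanEval": 0.30}
tuned_scores = {"MMLU": 0.64, "GSM8K": 0.44, "HumanEval": 0.31}

print(capability_regressions(base_scores, tuned_scores))
```

With these example numbers only GSM8K is flagged; the one-point MMLU dip stays inside the tolerance.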

Latency + memory

Production constraint. A model that breaks the latency budget doesn't ship.
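
A latency budget is usually checked against a tail percentile rather than the mean. The sketch below uses the nearest-rank method on recorded request latencies; the choice of the 95th percentile and the function names are assumptions for this example.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile of a non-empty collection of numbers."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one latency sample")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(latencies_ms, budget_ms: float, pct: float = 95) -> bool:
    """True if the chosen percentile latency is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms
```

A single slow request among a handful of samples is enough to fail the check, so the sample should be large enough to make the percentile meaningful.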

Cost per request

End-to-end inference cost vs the hosted alternative. Always reported.
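
The comparison behind this number is simple arithmetic: self-hosted cost per request is GPU cost per hour divided by requests served per hour, while a hosted API is typically priced per token. The sketch below shows both sides. All prices, token counts, and throughput figures are placeholders, and the self-hosted figure assumes the GPUs are fully utilised, which flatters it when traffic is bursty.

```python
def self_hosted_cost_per_request(gpu_hourly_usd: float, n_gpus: int,
                                 requests_per_second: float) -> float:
    """Cost of one request when the GPUs are busy the whole hour."""
    return gpu_hourly_usd * n_gpus / (requests_per_second * 3600)


def hosted_cost_per_request(input_tokens: int, output_tokens: int,
                            usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request on a per-million-token price list."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6


# Placeholder numbers for illustration only.
ours = self_hosted_cost_per_request(gpu_hourly_usd=3.6, n_gpus=1, requests_per_second=10)
theirs = hosted_cost_per_request(1000, 500, usd_per_m_input=2.0, usd_per_m_output=6.0)
print(f"self-hosted ${ours:.4f} vs hosted ${theirs:.4f} per request")
```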

Have a dataset and a use case?

Bring sample data and a metric you care about. We'll come back with a technique recommendation and a realistic timeline.
