AI Development & Enterprise AI Solutions

Methodology

How we run a fine-tuning project

Five stages. Each one has a gate - we don't move forward if the gate fails. This is how fine-tunes actually ship instead of getting stuck on a single bad eval.

STEP 01

Discovery & data audit

We look at what data you actually have - volume, format, quality, PII exposure. Most fine-tunes fail because the dataset wasn't ready. We tell you straight what's needed.

STEP 02

Data prep & synthesis

Cleaning, deduplication, PII scrubbing, structured-output normalisation. Where data is thin, we generate high-quality synthetic pairs and validate them against a held-out set.
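A cleaning pass of this kind can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the record layout (`prompt` / `completion`), the two regex patterns, and the placeholder tokens are all assumptions, and real PII scrubbing needs far broader coverage (names, addresses, IDs) than two patterns.

```python
import hashlib
import re

# Illustrative patterns only (assumption): real PII coverage must be much wider.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so near-identical rows hash the same."""
    return " ".join(text.lower().split())


def prepare(records):
    """Scrub PII, drop empty rows, and remove duplicates after normalisation."""
    seen = set()
    cleaned = []
    for rec in records:
        prompt = scrub_pii(rec["prompt"].strip())
        completion = scrub_pii(rec["completion"].strip())
        if not prompt or not completion:
            continue  # an empty side makes the pair unusable for training
        key = hashlib.sha256(
            (normalise(prompt) + "\x1f" + normalise(completion)).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # duplicate once case and whitespace are ignored
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned
```

Scrubbing runs before hashing so that two rows differing only in the PII they contain collapse into one.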

STEP 03

Technique selection

LoRA, QLoRA, full FT, DPO, or instruction-tuning - picked from the goal. Most projects do not need full fine-tuning, and we'll tell you when LoRA is the better answer.

STEP 04

Training & evaluation

Runs on H100 / A100 with DeepSpeed ZeRO + PEFT. Every checkpoint is gated against domain benchmarks, holdouts, and adversarial probes before promotion.
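The gating idea can be expressed as a small check that returns every reason a checkpoint should not be promoted. This is a sketch under stated assumptions: the field names, the four checks, and every threshold value below are hypothetical placeholders, and a real gate would take its limits from the project's requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointReport:
    """Evaluation results for one checkpoint (all field names are hypothetical)."""
    domain_score: float           # customer held-out set, higher is better
    base_domain_score: float      # same metric for the untuned base model
    general_score: float          # general-capability benchmark average
    base_general_score: float
    adversarial_pass_rate: float  # share of adversarial probes handled correctly
    p95_latency_ms: float


@dataclass(frozen=True)
class GateThresholds:
    """Placeholder thresholds (assumptions), not recommended values."""
    min_domain_gain: float = 0.02
    max_general_drop: float = 0.01
    min_adversarial_pass_rate: float = 0.99
    max_p95_latency_ms: float = 800.0


def gate_failures(report, thresholds=GateThresholds()):
    """Return the failed checks; an empty list means the checkpoint may be promoted."""
    failures = []
    if report.domain_score - report.base_domain_score < thresholds.min_domain_gain:
        failures.append("domain benchmark gain too small")
    if report.base_general_score - report.general_score > thresholds.max_general_drop:
        failures.append("general capability regressed")
    if report.adversarial_pass_rate < thresholds.min_adversarial_pass_rate:
        failures.append("adversarial probes failed")
    if report.p95_latency_ms > thresholds.max_p95_latency_ms:
        failures.append("latency budget exceeded")
    return failures
```

Returning the full list of failures, rather than stopping at the first, makes it easier to see whether a checkpoint is close to passing or failing across the board.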

STEP 05

Quantization & deployment

AWQ / GPTQ / GGUF quantization to fit the target hardware. Served via vLLM, TensorRT-LLM, or llama.cpp depending on the deployment.
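Whether a quantized model fits the target hardware starts with simple arithmetic: parameters times bits per weight. The sketch below covers weights only and is a rough first estimate; it ignores KV cache, activations, and runtime overhead. The `headroom` fraction and the function names are assumptions, and `bits_per_weight` is a parameter because real 4-bit formats store scales and zero-points that push the effective figure above 4.

```python
GIB = 2**30  # bytes per GiB


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params: float, bits_per_weight: float, device_gib: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of device memory.

    The remaining share is left for KV cache and activations (assumed split).
    """
    return weight_memory_gib(n_params, bits_per_weight) <= device_gib * headroom
```

For an 8-billion-parameter model, this gives roughly 14.9 GiB of weights at 16 bits and roughly 3.7 GiB at 4 bits, which is the difference between needing a data-centre GPU and fitting on a consumer card.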

Techniques

Picking the right approach

The single most common fine-tuning mistake is over-engineering. Most projects want LoRA - not full FT. Here's our actual decision table.

Technique	What it changes	Compute	When we use it
LoRA	Adapter weights only (~0.1% of params)	Modest (1× A100)	Most projects. Fast, cheap, adapter-swappable.
QLoRA	LoRA on a quantized base model	Small (1× consumer GPU)	Tight memory or budget. Quality is usually close to standard LoRA.
Full fine-tune	All model weights	Heavy (8× H100+)	Deep behaviour shift or new vocabulary.
DPO / RLHF	Preference-aligned weights	Medium (4× A100)	Tone, safety, or output-format alignment.
Instruction tuning	Behavioural format	Medium	Bringing a base model to chat or tool-use behaviour.
Continued pretraining	Base weights on new domain corpus	Heavy	Massive vocabulary shift (legal, biomed, code).
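The "~0.1% of params" figure for LoRA follows from how an adapter is built: a rank-r update to a d_in × d_out weight matrix adds r × (d_in + d_out) trainable parameters. A minimal sketch of that arithmetic, assuming a hypothetical 8-billion-parameter model with 32 layers and adapters on two 4096 × 4096 projection matrices per layer (these shapes are illustrative, not those of any specific checkpoint):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one rank-`rank` adapter (matrices A and B)."""
    return rank * (d_in + d_out)


def lora_fraction(total_params: float, adapted_shapes, rank: int):
    """Return (trainable adapter params, share of the full model)."""
    trainable = sum(lora_params(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return trainable, trainable / total_params


# Hypothetical model: 32 layers, two adapted 4096 x 4096 projections per layer.
shapes = [(4096, 4096)] * (32 * 2)
trainable, share = lora_fraction(8e9, shapes, rank=16)
```

With these assumed shapes and rank 16, the adapters hold about 8.4 million parameters, or roughly 0.1% of the model. Raising the rank or adapting more matrices increases that share proportionally.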

Modalities

What we've fine-tuned

Fine-tuning isn't just an LLM thing. We've shipped customised checkpoints across every major modality.

LLMs

Domain Q&A, code, structured extraction, vertical chat. Llama / Qwen / Mistral / DeepSeek.

VLMs

Document AI, screen understanding, visual QA. Qwen-VL, LLaVA, InternVL, Pixtral.

Speech (STT)

Domain-vocabulary Whisper / Parakeet for call centres, healthcare, finance.

Speech (TTS / voice cloning)

XTTS-v2, F5-TTS, Kokoro voice cloning from 5–10 min of samples.

Image (LoRA)

Brand-style Flux LoRAs, product-shot LoRAs, character LoRAs.

Video (LoRA)

LTX-Video brand-style LoRAs for cinematic consistency across generated shots.

Evaluation

What we measure before we ship

Domain benchmark

Customer-supplied held-out evaluation set. The metric the business actually cares about.

Perplexity & loss curves

Standard training signals. Useful as a sanity check, not as a promotion criterion.

Adversarial probes

Jailbreaks, prompt injections, edge-case inputs. Run on every checkpoint.

Regression against base

Did we lose general capability? We score the fine-tune and the base model on MMLU / GSM8K / HumanEval and compare.

Latency + memory

Production constraint. A model that breaks the latency budget doesn't ship.

Cost per request

End-to-end inference cost vs the hosted alternative. Always reported.
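The cost comparison reduces to two small formulas. This is a sketch under stated assumptions: it counts GPU rental only for the self-hosted side (no engineering, storage, or idle time), prices the hosted side per 1,000 tokens, and every number passed in below is a placeholder, not a real price or measured throughput.

```python
def self_hosted_cost_per_request(gpu_hourly_usd: float, n_gpus: int,
                                 requests_per_hour: float) -> float:
    """GPU cost per request at a sustained throughput (GPU rental only)."""
    return gpu_hourly_usd * n_gpus / requests_per_hour


def hosted_cost_per_request(input_tokens: int, output_tokens: int,
                            usd_per_1k_input: float,
                            usd_per_1k_output: float) -> float:
    """Per-request cost of a token-priced hosted API."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)


def saving(self_hosted: float, hosted: float) -> float:
    """Fractional saving of self-hosting relative to the hosted alternative."""
    return 1 - self_hosted / hosted


# Placeholder inputs (assumptions), chosen only to show the calculation.
own = self_hosted_cost_per_request(gpu_hourly_usd=2.0, n_gpus=1, requests_per_hour=1000)
api = hosted_cost_per_request(1000, 500, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
```

The self-hosted figure depends heavily on utilisation: the same GPU at a tenth of the throughput costs ten times as much per request, so the throughput used should be measured, not assumed.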

Proof

A case study from the rotation

Legal Tech & Compliance

Domain-adapted 8B LLM cuts inference costs by 60% vs hosted GPT-4

A legal services portal needed contract-analysis AI inside a private VPC. We fine-tuned Llama-3-8B on 12M lines of contract data and shipped it on their own A100 cluster.


From a Hugging Face checkpoint to your production domain model.