A legal services portal needed contract-analysis AI inside a private VPC. We fine-tuned Llama-3-8B on 12M lines of contract data and shipped it on their own A100 cluster.
The client was sending sensitive contract data through a hosted GPT-4 endpoint. Cost was climbing past $80k/month, latency was unpredictable, and their enterprise customers were starting to ask hard questions about where the data lived. They needed a model that matched GPT-4 on contract Q&A, ran inside their VPC, and cost less than half as much to operate.
We benchmarked Llama-3-8B, Mistral-7B, and Qwen-2-7B on a held-out set of 8,400 real contract Q&A pairs. Llama-3-8B won on extraction accuracy after LoRA tuning.
12M lines of contract clauses normalised, deduplicated, and synthetically expanded into a 240k-pair instruction dataset. PII scrubbed at the token level.
QLoRA on 4×A100, rank 64, 3 epochs. ~14 hours per training run. Eval gate of 95% of GPT-4 accuracy before promoting any checkpoint.
AWQ 4-bit quantization, served via vLLM with PagedAttention. Sub-25ms first-token on the client's existing A100 hardware. JSON-mode output enforced via grammar constraints.
Containerised inference API behind their VPC, monitored via Prometheus + Grafana. Rolling-update deploys, A/B traffic split for safe promotion.
“The fine-tuned model matches what we were getting from GPT-4 on every benchmark we care about - and it lives inside our network.”
- VP Engineering, legal-tech client (name withheld)
A 30-minute call. We'll tell you whether we can help - and if not, who can.