Legal Tech & Compliance · Case Study

Domain-adapted 8B LLM cuts inference costs by 60% vs hosted GPT-4

A legal services portal needed contract-analysis AI inside a private VPC. We fine-tuned Llama-3-8B on 12M lines of contract data and shipped it on their own A100 cluster.

60% lower cost per request vs hosted GPT-4
97% of GPT-4 accuracy on contract Q&A benchmark
25ms first-token latency on A100
0 data leaves the customer VPC

Problem

What they were stuck on

The client was sending sensitive contract data through a hosted GPT-4 endpoint. Cost was climbing past $80k/month, latency was unpredictable, and their enterprise customers were starting to ask hard questions about where the data lived. They needed a model that matched GPT-4 on contract Q&A, ran inside their VPC, and cost less than half as much to operate.

Approach

How we built it

STEP 01

Model selection

We benchmarked Llama-3-8B, Mistral-7B, and Qwen-2-7B on a held-out set of 8,400 real contract Q&A pairs. Llama-3-8B won on extraction accuracy after LoRA tuning.
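A comparison like this can be sketched as a small harness that scores every candidate on the same held-out pairs. This is a minimal illustration, not the harness used on the project: the exact-match metric, the `predict` callables, and all function names are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (question including its contract context, gold answer); hypothetical shape
QAPair = Tuple[str, str]


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predict: Callable[[str], str], pairs: List[QAPair]) -> float:
    """Fraction of held-out pairs where the model's answer equals the gold answer."""
    if not pairs:
        raise ValueError("held-out set is empty")
    hits = sum(normalise(predict(question)) == normalise(gold) for question, gold in pairs)
    return hits / len(pairs)


def rank_models(
    candidates: Dict[str, Callable[[str], str]], pairs: List[QAPair]
) -> List[Tuple[str, float]]:
    """Score every candidate on the same held-out set and return them best first."""
    scores = [(name, exact_match_accuracy(fn, pairs)) for name, fn in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Each entry in `candidates` would wrap one tuned model behind a plain `str -> str` function, which keeps the scoring code independent of how each model is served.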

STEP 02

Data preparation

We normalised and deduplicated 12M lines of contract clauses, then synthetically expanded them into a 240k-pair instruction dataset. PII was scrubbed at the token level.
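The normalise, scrub, and deduplicate stages can be sketched roughly as below. Note the simplification: this example scrubs PII with two regular expressions rather than at the token level as the project did, and the patterns, placeholder tokens, and function names are all assumptions chosen for illustration.

```python
import hashlib
import re
from typing import Iterable, List

# Illustrative patterns only; a real scrubber would cover many more PII types
# and would need tuning so clause numbers and dates are not mistaken for phones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def normalise_clause(text: str) -> str:
    """Collapse runs of whitespace and trim the clause."""
    return re.sub(r"\s+", " ", text).strip()


def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def prepare_clauses(lines: Iterable[str]) -> List[str]:
    """Normalise, scrub, and exact-deduplicate clauses, keeping first-seen order."""
    seen = set()
    out: List[str] = []
    for line in lines:
        clause = scrub_pii(normalise_clause(line))
        if not clause:
            continue
        # Hash the case-folded clause so duplicates differing only in case collapse.
        digest = hashlib.sha256(clause.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(clause)
    return out
```

Scrubbing before hashing means two clauses that differ only in the PII they contain collapse to one entry, which is usually what a training set wants.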

STEP 03

Fine-tuning

QLoRA on 4×A100, rank 64, 3 epochs. ~14 hours per training run. Eval gate: a checkpoint had to reach at least 95% of GPT-4 accuracy before it could be promoted.
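The promotion gate reduces to a ratio check against the GPT-4 baseline. A minimal sketch, assuming accuracies are fractions in (0, 1] measured on the same benchmark; the function names and checkpoint labels are hypothetical.

```python
from typing import Dict, Optional


def relative_accuracy(checkpoint_acc: float, baseline_acc: float) -> float:
    """Checkpoint accuracy expressed as a fraction of the baseline (GPT-4) accuracy."""
    if not 0.0 < baseline_acc <= 1.0:
        raise ValueError("baseline accuracy must be in (0, 1]")
    return checkpoint_acc / baseline_acc


def select_checkpoint(
    scores: Dict[str, float], baseline_acc: float, min_ratio: float = 0.95
) -> Optional[str]:
    """Return the best checkpoint that clears the gate, or None if none qualifies."""
    eligible = {
        name: acc
        for name, acc in scores.items()
        if relative_accuracy(acc, baseline_acc) >= min_ratio
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)
```

Returning `None` rather than the least-bad checkpoint keeps the gate strict: a run in which nothing qualifies promotes nothing.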

STEP 04

Quantization & serving

AWQ 4-bit quantization, served via vLLM with PagedAttention. Sub-25ms first-token latency on the client's existing A100 hardware. JSON-mode output enforced via grammar constraints.
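First-token latency can be measured against any streaming endpoint by timing the gap between issuing the request and receiving the first token. The sketch below assumes a hypothetical `start_request` callable that sends the request and returns an iterable of tokens; it does not use the vLLM client API, and the nearest-rank percentile is one of several reasonable choices.

```python
import math
import time
from typing import Callable, Iterable, List


def first_token_latency_ms(start_request: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from issuing the request until the first token arrives."""
    start = time.perf_counter()
    stream = iter(start_request())
    try:
        next(stream)
    except StopIteration:
        raise ValueError("stream produced no tokens") from None
    return (time.perf_counter() - start) * 1000.0


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not samples:
        raise ValueError("no samples")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Reporting a high percentile across many requests, rather than a single measurement, gives a fairer picture of whether a latency target holds under load.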

STEP 05

Deployment

Containerised inference API inside their VPC, monitored via Prometheus + Grafana. Rolling-update deploys, A/B traffic split for safe promotion.
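One common way to implement an A/B traffic split is deterministic hashing of a request or session ID, so the same caller always reaches the same model version. The sketch below illustrates that idea under stated assumptions; the routing labels, the 10,000-bucket granularity, and the function name are hypothetical, and the actual deployment may split traffic at the load balancer instead.

```python
import hashlib

BUCKETS = 10_000  # routing granularity: 0.01% of traffic per bucket


def route(request_id: str, candidate_share: float) -> str:
    """Send a stable fraction of traffic to the candidate model, the rest to the current one."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the ID to a bucket in [0, BUCKETS); the same ID always lands in the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = round(candidate_share * BUCKETS)
    return "candidate" if bucket < threshold else "stable"
```

Raising `candidate_share` in small steps while watching the dashboards lets a new checkpoint take traffic gradually, and setting it back to 0 acts as an immediate rollback.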

Stack

What we used

Meta Llama-3-8B (base) · Hugging Face PEFT / QLoRA · vLLM · AWQ quantization · Nvidia A100 · Private VPC

Outcomes

What changed

60% lower cost per request vs hosted GPT-4
97% of GPT-4 accuracy on contract Q&A benchmark
25ms first-token latency on A100
0 data leaves the customer VPC

The fine-tuned model matches what we were getting from GPT-4 on every benchmark we care about - and it lives inside our network.

- VP Engineering, legal-tech client (name withheld)
