Legal Tech & Compliance · Case Study

Domain-adapted 8B LLM cuts inference costs by 60% vs hosted GPT-4

A legal services portal needed contract-analysis AI inside a private VPC. We fine-tuned Llama-3-8B on 12M lines of contract data and shipped it on their own A100 cluster.

60%

lower cost per request vs hosted GPT-4

97%

of GPT-4 accuracy on contract Q&A benchmark

25ms

first-token latency on A100

0

data leaves the customer VPC

Problem

What they were stuck on

The client was sending sensitive contract data through a hosted GPT-4 endpoint. Cost was climbing past $80k/month, latency was unpredictable, and their enterprise customers were starting to ask hard questions about where the data lived. They needed a model that matched GPT-4 on contract Q&A, ran inside their VPC, and cost less than half as much to operate.

Approach

How we built it

STEP 01

Model selection

We benchmarked Llama-3-8B, Mistral-7B, and Qwen-2-7B on a held-out set of 8,400 real contract Q&A pairs. Llama-3-8B won on extraction accuracy after LoRA tuning.
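A head-to-head comparison like this one can be scored with a small harness. The sketch below is illustrative only: it assumes exact-match accuracy after light normalisation, and the `predict` callables and sample pairs are hypothetical stand-ins, not the metric or harness used on the project.

```python
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer)


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predict: Callable[[str], str], pairs: List[QAPair]) -> float:
    """Fraction of held-out questions whose prediction matches the reference."""
    if not pairs:
        raise ValueError("held-out set is empty")
    hits = sum(normalise(predict(q)) == normalise(a) for q, a in pairs)
    return hits / len(pairs)


def rank_models(
    models: Dict[str, Callable[[str], str]], pairs: List[QAPair]
) -> List[Tuple[str, float]]:
    """Score every candidate on the same held-out set and sort best-first."""
    scores = [(name, exact_match_accuracy(fn, pairs)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Scoring every candidate on the same frozen held-out set keeps the comparison fair; a real extraction benchmark would likely need a more forgiving metric than exact match.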

STEP 02

Data preparation

We normalised and deduplicated 12M lines of contract clauses, then synthetically expanded them into a 240k-pair instruction dataset. PII was scrubbed at the token level.
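The normalise-and-deduplicate pass can be sketched in a few lines. This is a minimal illustration, assuming exact-duplicate removal on case-folded, whitespace-collapsed text; the function names are hypothetical, and near-duplicate detection, synthetic expansion, and PII scrubbing are not shown.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, List


def normalise_clause(line: str) -> str:
    """Apply Unicode NFKC normalisation, collapse whitespace, and strip the ends."""
    text = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_clauses(lines: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each clause, compared case-insensitively."""
    seen = set()
    unique = []
    for line in lines:
        clause = normalise_clause(line)
        if not clause:
            continue  # skip blank lines
        # Hash the case-folded text so the 'seen' set stays small at 12M-line scale.
        digest = hashlib.sha256(clause.casefold().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(clause)
    return unique
```

Storing digests rather than full clauses keeps memory bounded per entry, which matters once the input runs to millions of lines.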

STEP 03

Fine-tuning

QLoRA on 4×A100, rank 64, 3 epochs, at roughly 14 hours per training run. Eval gate: a checkpoint had to reach 95% of GPT-4 accuracy before it could be promoted.
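The eval gate reduces to a ratio check against the GPT-4 reference score. The sketch below assumes both accuracies are measured on the same benchmark and expressed as fractions between 0 and 1; the function names and checkpoint labels are hypothetical.

```python
from typing import Dict, List


def passes_eval_gate(
    candidate_accuracy: float, reference_accuracy: float, threshold: float = 0.95
) -> bool:
    """True when the candidate reaches at least `threshold` of the reference accuracy."""
    if not 0.0 < reference_accuracy <= 1.0:
        raise ValueError("reference_accuracy must be in (0, 1]")
    if not 0.0 <= candidate_accuracy <= 1.0:
        raise ValueError("candidate_accuracy must be in [0, 1]")
    return candidate_accuracy / reference_accuracy >= threshold


def promotable(
    checkpoints: Dict[str, float], reference_accuracy: float, threshold: float = 0.95
) -> List[str]:
    """Names of the checkpoints that clear the gate, in their original order."""
    return [
        name
        for name, accuracy in checkpoints.items()
        if passes_eval_gate(accuracy, reference_accuracy, threshold)
    ]
```

For example, with a reference accuracy of 0.90, a checkpoint at 0.88 clears the gate (about 98% of reference) while one at 0.80 does not (about 89%).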

STEP 04

Quantization & serving

AWQ 4-bit quantization, served via vLLM with PagedAttention. First-token latency came in under 25ms on the client's existing A100 hardware. JSON output was enforced via grammar-constrained decoding.
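First-token latency can be measured against any streaming response. The helper below is a generic sketch, not the project's benchmark: it assumes `stream` is a lazy iterator that sends the request when first iterated, and the fake stream in the usage note stands in for a real vLLM streaming client.

```python
import time
from typing import Iterable, Tuple


def measure_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Consume a token stream and return (seconds to first token, full text)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            # Record the arrival time of the very first token only.
            first_token_at = time.perf_counter()
        parts.append(token)
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, "".join(parts)
```

In practice the figure is usually reported as a percentile over many requests at a realistic concurrency level, since a single measurement says little about tail latency.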

STEP 05

Deployment

Containerised inference API inside their VPC, monitored via Prometheus + Grafana. Rolling-update deploys, with an A/B traffic split for safe promotion.
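One common way to implement an A/B traffic split is deterministic hashing of a request or tenant ID, so the same caller always lands on the same variant. The sketch below illustrates that technique under stated assumptions; the `assign_variant` name and the variant labels are hypothetical, and the source does not say how the split was actually routed.

```python
import hashlib


def assign_variant(request_id: str, candidate_share: float) -> str:
    """Deterministically route a request to 'candidate' or 'stable' by hashing its ID."""
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_share else "stable"
```

Raising `candidate_share` step by step (for example 0.05, 0.25, 1.0) promotes the new checkpoint gradually, and because assignment depends only on the ID, callers already on the candidate stay there as the share grows.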

Stack

What we used

Llama-3-8B (base) · PEFT / QLoRA · vLLM · AWQ quantization · NVIDIA A100 · Private VPC

Outcomes

What changed

60% lower cost per request vs hosted GPT-4

97% of GPT-4 accuracy on contract Q&A benchmark

25ms first-token latency on A100

0 data leaves the customer VPC

“The fine-tuned model matches what we were getting from GPT-4 on every benchmark we care about - and it lives inside our network.”

- VP Engineering, legal-tech client (name withheld)

Have a similar problem? Let's scope it.

A 30-minute call. We'll tell you whether we can help - and if not, who can.

Talk to us

More work

Conversational Voice AI

Sub-300ms voice assistant on edge hardware

Brand-styled video b-roll using LTX-Video

On-prem clinical transcription, 99.1% accuracy

Flux + brand LoRA at SKU scale

Edge VLM defect detection on Jetson

Air-gapped policy RAG on DGX Spark

Whisper fine-tune for Uzbek STT

TRELLIS image-to-3D for AR catalogues

Seed-VC voice conversion for localisation

PersonaPlex full-duplex voice concierge