Fintech · Air-gapped · Case Study

Air-gapped RAG over 4M policy documents using Qwen 3 32B on DGX Spark

A regulated financial institution needed a RAG assistant over millions of policy and compliance documents, and it had to run inside a fully air-gapped environment. We deployed a quantized Qwen 3 32B on Nvidia DGX Spark with a custom hybrid retrieval stack.

4M docs: indexed with hybrid retrieval, sub-1s query

14ms: first-token on DGX Spark (32B AWQ)

100%: air-gapped · zero outbound network from production

60%: reduction in compliance team time per inquiry

Problem

What they were stuck on

The compliance team was answering hundreds of internal “is this allowed?” questions a week by manually searching 4 million documents spread across SharePoint, internal wikis, and regulator filings. A hosted LLM was not an option: internal policy and agreements with regulators forbid any external network path. They needed a chat-style assistant with accurate citations, running entirely inside an air-gapped network.

Approach

How we built it

STEP 01

Air-gapped infrastructure design

Two-tier setup: an audited build environment with internet access, used only to download models and dependencies, and an air-gapped production tier with no outbound network. Model weights and dependency bundles move between the tiers as signed offline artefacts.
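
The transfer step can be illustrated with a minimal sketch. The case study does not say which signing scheme was used, so this example assumes a manifest of SHA-256 digests authenticated with an HMAC over a shared key; a real deployment would more likely use asymmetric signatures. The function names `build_manifest` and `verify_bundle` are hypothetical.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes], key: bytes) -> dict:
    """Build tier: record one digest per file, then sign the digest list.

    HMAC with a shared key is an assumption made to keep the sketch
    stdlib-only; it stands in for whatever signing scheme is really used.
    """
    digests = {name: sha256_hex(blob) for name, blob in sorted(files.items())}
    payload = json.dumps(digests, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"digests": digests, "signature": signature}


def verify_bundle(files: dict[str, bytes], manifest: dict, key: bytes) -> bool:
    """Production tier: accept the bundle only if the signature and all digests match."""
    payload = json.dumps(manifest["digests"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    # Reject bundles with missing or unexpected files.
    if set(files) != set(manifest["digests"]):
        return False
    return all(
        sha256_hex(blob) == manifest["digests"][name]
        for name, blob in files.items()
    )
```

Checking the file set as well as the digests means a bundle with an extra, unsigned file is rejected, not silently imported.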

STEP 02

Hybrid retrieval stack

A hybrid index over the 4M-doc corpus: BM25 (Tantivy) for lexical matching plus dense embeddings (BGE-M3). A reranker (BGE-reranker-large) rescores the top-100 candidates. Chunking is tuned to the structure of policy and regulatory text.
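
How the two result lists are merged is not stated in the case study. One common option is reciprocal rank fusion (RRF), sketched below as an assumption and not as the method actually deployed. The inputs are ranked lists of document IDs from the BM25 and dense retrievers; `candidates_for_reranker` is a hypothetical helper that cuts the fused list to the top 100 before reranking.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def candidates_for_reranker(
    bm25_ids: list[str], dense_ids: list[str], top_n: int = 100
) -> list[str]:
    """Hypothetical helper: fuse both retrievers and keep the top_n for reranking."""
    return reciprocal_rank_fusion([bm25_ids, dense_ids])[:top_n]


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF works on ranks only, so it avoids having to normalise BM25 scores against embedding similarities, which live on different scales.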

STEP 03

Generation model

Qwen 3 32B, AWQ 4-bit quantized, fine-tuned on 18k internal Q&A pairs to learn the bank's policy language and citation format. Strict JSON-mode output with mandatory citation IDs.
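
A mandatory-citation contract is easier to enforce when the output is checked after generation as well as constrained during it. The sketch below assumes a hypothetical schema with an `answer` string and a `citations` list of chunk IDs; the real field names are not given in the case study. It rejects any answer that cites nothing or cites a chunk that was not retrieved for the query.

```python
import json


def validate_answer(raw: str, retrieved_ids: set[str]) -> dict:
    """Parse model output and enforce the citation contract.

    Raises ValueError on any violation. The field names "answer" and
    "citations" are assumptions made for this sketch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    answer = data.get("answer")
    citations = data.get("citations")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing or empty 'answer'")
    if not isinstance(citations, list) or not citations:
        raise ValueError("at least one citation ID is required")
    if not all(isinstance(c, str) for c in citations):
        raise ValueError("citation IDs must be strings")

    # A citation is only acceptable if it points at a chunk that was
    # actually retrieved and shown to the model for this query.
    unknown = [c for c in citations if c not in retrieved_ids]
    if unknown:
        raise ValueError(f"citations not in retrieved set: {unknown}")
    return data
```

A caller might retry generation or return a "no grounded answer" response when `validate_answer` raises, so an uncited answer never reaches the user.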

STEP 04

DGX Spark deployment

Two DGX Spark units in the air-gapped tier: one primary, one failover. ~14ms first-token for the 32B model with quantized weights. Wall-outlet power, no datacenter dependency.

STEP 05

Audit + traceability

Every query, its retrieved chunks, and the generated answer are logged to an immutable append-only store. Months later, compliance can still replay any answer down to the source chunk.
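
One way to make an append-only log tamper-evident is to chain entries by hash, so that editing any past record invalidates every hash after it. The sketch below is an in-memory illustration under that assumption; the case study does not describe the actual store, and the `AuditLog` class and its record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the record together with the previous entry's hash."""
    body = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


class AuditLog:
    """Hypothetical in-memory hash-chained log; a real store would persist entries."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, query: str, chunk_ids: list[str], answer: str) -> str:
        record = {"query": query, "chunk_ids": list(chunk_ids), "answer": answer}
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        digest = _entry_hash(prev, record)
        self._entries.append({"record": record, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the whole chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != _entry_hash(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True
```

Storing the chunk IDs alongside the answer is what allows a later replay to resolve each citation back to the exact source text that was retrieved.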

Stack

What we used

Qwen 3 32B (AWQ) · BGE-M3 dense retrieval · Tantivy BM25 · BGE-reranker · Nvidia DGX Spark (2×) · Air-gapped VPC

Outcomes

What changed

4M docs: indexed with hybrid retrieval, sub-1s query

14ms: first-token on DGX Spark (32B AWQ)

100%: air-gapped · zero outbound network from production

60%: reduction in compliance team time per inquiry

“Our regulators were the first to ask if we were sending data to a hosted LLM. The answer is no - the whole thing fits on two desktops in our own network.”

- Head of Compliance Engineering, financial institution (name withheld)

Have a similar problem? Let's scope it.

A 30-minute call. We'll tell you whether we can help - and if not, who can.

More work

Domain-adapted 7B LLM cuts inference costs by 60%

Sub-300ms voice assistant on edge hardware

Brand-styled video b-roll using LTX-Video

On-prem clinical transcription, 99.1% accuracy

Flux + brand LoRA at SKU scale

Edge VLM defect detection on Jetson

Whisper fine-tune for Uzbek STT

TRELLIS image-to-3D for AR catalogues

Seed-VC voice conversion for localisation

PersonaPlex full-duplex voice concierge