Fintech · Air-gapped · Case Study

Air-gapped RAG over 4M policy documents using Qwen 3 32B on DGX Spark

A regulated financial institution needed a RAG assistant over millions of policy and compliance documents - and it had to run inside a fully air-gapped environment. We deployed Qwen 3 32B (quantized) on Nvidia DGX Spark with a custom hybrid retrieval stack.

4M docs
indexed with hybrid retrieval, sub-1s query
14ms
first-token on DGX Spark (32B AWQ)
100%
air-gapped · zero outbound network from production
60%
reduction in compliance team time per inquiry
Problem

What they were stuck on

The compliance team was answering hundreds of internal “is this allowed?” questions a week by manually searching 4 million documents across SharePoint, internal wikis, and regulator filings. They couldn't use any hosted LLM: internal policy and regulator agreements forbid external network paths. They needed a chat-style assistant that cites its sources accurately, with the entire system running inside an air-gapped network.

Approach

How we built it

STEP 01

Air-gapped infrastructure design

Two-tier setup: a build environment with internet (for model + dependency downloads, audited), and an air-gapped production tier with no outbound network. Models, weights, and dependency bundles are transferred via signed offline artefacts.
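
As an illustration of the transfer check on the production side, here is a minimal sketch that verifies a bundle against a manifest of SHA-256 digests. The manifest format, the function names, and the digest-only check are assumptions for this example, not the deployed tooling. Verifying the signature on the manifest itself is assumed to happen in a separate step before this one.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB weight files need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing, altered, or not listed.

    `manifest` maps relative path -> expected SHA-256 hex digest
    (a hypothetical format chosen for this sketch). An empty list means
    the bundle on disk matches the manifest exactly.
    """
    problems = []
    for rel_path, expected in manifest.items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(rel_path)
            continue
        if not hmac.compare_digest(sha256_file(target), expected.lower()):
            problems.append(rel_path)
    # Also flag files that arrived in the bundle but are not in the manifest.
    listed = set(manifest)
    for path in bundle_dir.rglob("*"):
        if path.is_file():
            rel = path.relative_to(bundle_dir).as_posix()
            if rel not in listed:
                problems.append(rel)
    return sorted(problems)
```

Rejecting unlisted files as well as altered ones means nothing crosses into the air-gapped tier unless the build environment explicitly declared it.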

STEP 02

Hybrid retrieval stack

BM25 (Tantivy) + dense (BGE-M3) hybrid index over the 4M-doc corpus. Reranker (BGE-reranker-large) on top-100 candidates. Domain-tuned chunking for policy and regulatory text structure.
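
One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), followed by reranking the fused candidates. The sketch below assumes RRF as the fusion method. The case study does not state which fusion method was used, and `bm25_search`, `dense_search`, and `rerank_score` are hypothetical callables standing in for Tantivy, BGE-M3, and the reranker.

```python
from collections import defaultdict
from typing import Callable, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    query: str,
    bm25_search: Callable[[str], Sequence[str]],    # hypothetical: lexical ranking
    dense_search: Callable[[str], Sequence[str]],   # hypothetical: embedding ranking
    rerank_score: Callable[[str, str], float],      # hypothetical: cross-encoder score
    candidates: int = 100,
    top_n: int = 5,
) -> list[str]:
    """Fuse both rankings, keep the top `candidates`, then rerank and keep `top_n`."""
    fused = reciprocal_rank_fusion([bm25_search(query), dense_search(query)])
    shortlist = fused[:candidates]
    return sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]
```

RRF needs only ranks, not comparable scores, which is why it is a frequent choice when mixing BM25 scores with cosine similarities.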

STEP 03

Generation model

Qwen 3 32B, AWQ 4-bit quantized, fine-tuned on 18k internal Q&A pairs to learn the bank's policy language and citation format. Strict JSON-mode output with mandatory citation IDs.
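
To make "mandatory citation IDs" concrete, here is a minimal sketch of a post-generation check that rejects any answer whose citations are absent or point outside the retrieved set. The `answer` and `citations` field names and the `CitationError` type are assumptions for illustration, since the actual output schema is not described here.

```python
import json


class CitationError(ValueError):
    """Raised when model output is malformed or cites unknown chunks."""


def parse_answer(raw: str, retrieved_ids: set[str]) -> dict:
    """Parse model output and enforce that every citation was actually retrieved.

    Assumed (hypothetical) schema: {"answer": str, "citations": [str, ...]}.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise CitationError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise CitationError("top-level value must be an object")

    answer = payload.get("answer")
    citations = payload.get("citations")
    if not isinstance(answer, str) or not answer.strip():
        raise CitationError("missing or empty 'answer'")
    if not isinstance(citations, list) or not citations:
        raise CitationError("at least one citation ID is required")
    if not all(isinstance(c, str) for c in citations):
        raise CitationError("citation IDs must be strings")

    # A citation the retriever never returned cannot be traced to a source chunk.
    unknown = [c for c in citations if c not in retrieved_ids]
    if unknown:
        raise CitationError(f"citations not in retrieved set: {unknown}")
    return {"answer": answer, "citations": citations}
```

A caller could retry generation or fall back to a "no answer" response when `CitationError` is raised, rather than showing an uncited answer.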

STEP 04

DGX Spark deployment

Two DGX Spark units in the air-gapped tier: one primary, one failover. ~14ms first-token latency for the 32B model with AWQ 4-bit weights. Both units run on wall-outlet power, with no datacenter dependency.
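
A primary/failover pair can be handled with a very small client-side wrapper. The sketch below is one possible shape, assuming each unit is exposed as a callable that takes a prompt and returns text. The function name and the try-in-order policy are illustrative assumptions, not the production routing.

```python
from typing import Callable, Sequence


def generate_with_failover(prompt: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order (primary first) and return the first success.

    `backends` are hypothetical callables wrapping each inference endpoint.
    """
    errors: list[Exception] = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors!r}")
```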

STEP 05

Audit + traceability

Every query, its retrieved chunks, and the generated answer are logged to an immutable append-only store. Compliance can replay any answer down to the source chunk months later.
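
One way to make an append-only log tamper-evident is to chain entries by hash, so that changing any past record invalidates every later one. The sketch below assumes that technique. The storage mechanism actually used is not specified here, and the `AuditLog` class and its field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory hash-chained log; a real store would persist each entry."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, query: str, chunk_ids: list[str], answer: str) -> str:
        """Record one interaction and return its chain hash."""
        record = {"query": query, "chunk_ids": list(chunk_ids), "answer": answer}
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"record": record, "prev": prev, "hash": _entry_hash(prev, record)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; False if any record or link was altered."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _entry_hash(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True
```

Because each record stores the chunk IDs alongside the answer, replaying an old answer amounts to looking up those IDs in the index snapshot that was live at the time.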

Stack

What we used

Qwen 3 32B (AWQ) · BGE-M3 dense retrieval · Tantivy BM25 · BGE-reranker · Nvidia DGX Spark (2×) · Air-gapped VPC

Outcomes

What changed

4M docs: indexed with hybrid retrieval, sub-1s query
14ms: first-token on DGX Spark (32B AWQ)
100%: air-gapped · zero outbound network from production
60%: reduction in compliance team time per inquiry

Our regulators were the first to ask if we were sending data to a hosted LLM. The answer is no - the whole thing fits on two desktops in our own network.

- Head of Compliance Engineering, financial institution (name withheld)
