A regulated financial institution needed a RAG assistant over millions of policy and compliance documents - and it had to run inside a fully air-gapped environment. We deployed Qwen 3 32B (quantized) on Nvidia DGX Spark with a custom hybrid retrieval stack.
The compliance team was answering hundreds of internal “is this allowed?” questions a week, manually grepping through 4 million documents across SharePoint, internal wikis, and regulator filings. They couldn't use any hosted LLM - internal policy and regulator agreements forbid external network paths. They needed a chat-style assistant, accurate citations, and the entire system inside an air-gapped network.
Two-tier setup: a build environment with internet (for model + dependency downloads, audited), and an air-gapped production tier with no outbound network. Models, weights, and dependency bundles are transferred via signed offline artefacts.
BM25 (Tantivy) + dense (BGE-M3) hybrid index over the 4M-doc corpus. Reranker (BGE-reranker-large) on top-100 candidates. Domain-tuned chunking for policy and regulatory text structure.
Qwen 3 32B, AWQ 4-bit quantized, fine-tuned on 18k internal Q&A pairs to learn the bank's policy language and citation format. Strict JSON-mode output with mandatory citation IDs.
Two DGX Spark units in the air-gapped tier - one primary, one failover. ~14ms first-token for the 32B model at quantized weights. Wall-outlet power, zero datacenter dependency.
Every query + retrieved chunks + generated answer logged to an immutable append-only store. Compliance can replay any answer down to the source chunk months later.
“Our regulators were the first to ask if we were sending data to a hosted LLM. The answer is no - the whole thing fits on two desktops in our own network.”
- Head of Compliance Engineering, financial institution (name withheld)
A 30-minute call. We'll tell you whether we can help - and if not, who can.