Three reasons customers cut over
We don't push on-prem for every workload. We push it when one of these three pressures is real.
Privacy
Data never leaves your network. No third-party API, no shared tenancy, no surprise retention policy. Run inside your VPC, on-prem rack, or sealed edge device.
Latency
Predictable sub-300ms responses for voice agents. Sub-25ms first token for 7B LLMs on workstation-class hardware and up. No internet round-trip, no noisy-neighbour throttling.
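First-token latency is easy to check on your own hardware. Below is a minimal sketch of one way to time it, assuming the model exposes its output as a Python iterator of tokens. `time_to_first_token` and `fake_stream` are hypothetical names used for illustration; `fake_stream` is only a stand-in for a real local model.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, Iterator[str]]:
    """Return (seconds until the first token arrived, iterator over all tokens)."""
    start = time.perf_counter()
    it = iter(stream)
    try:
        first = next(it)  # blocks until the model emits its first token
    except StopIteration:
        raise ValueError("stream produced no tokens") from None
    elapsed = time.perf_counter() - start

    def replay() -> Iterator[str]:
        # Hand back the first token, then the rest of the stream, unchanged.
        yield first
        yield from it

    return elapsed, replay()


def fake_stream(delay_s: float = 0.02) -> Iterator[str]:
    # Hypothetical stand-in for a local model's streaming output.
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft, tokens = time_to_first_token(fake_stream())
    print(f"first token after {ttft * 1000:.1f} ms")
    print("".join(tokens))
```

Swap `fake_stream()` for your own streaming call and run it many times. For a voice agent, the tail of the distribution matters more than a single sample.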
Cost
Once the hardware is paid for, the marginal cost of inference is mostly power and maintenance. We routinely cut customer LLM spend 50–70% within 6 months of cutover.
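The cost argument comes down to a break-even calculation: how many months of avoided API spend it takes to pay off the hardware. Here is a rough sketch, with every figure in the example purely hypothetical. `months_to_break_even` is an illustrative helper, not a pricing model.

```python
import math
from typing import Optional


def months_to_break_even(
    hardware_cost: float,
    monthly_api_spend: float,
    monthly_onprem_opex: float,
) -> Optional[int]:
    """Whole months until cumulative savings cover the hardware purchase.

    Returns None when on-prem running costs meet or exceed the API bill,
    in which case the hardware never pays for itself.
    """
    monthly_saving = monthly_api_spend - monthly_onprem_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


if __name__ == "__main__":
    # Hypothetical inputs: 12,000 hardware, 4,000/month API bill,
    # 1,000/month for power, maintenance, and support.
    print(months_to_break_even(12_000, 4_000, 1_000))
```

The opex term is the one to get right: power, support, and the engineering time to keep the box running all belong in it.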
The Nvidia DGX Spark changes the math on on-prem AI.
128 GB of unified memory in a desktop chassis means a quantized 70B model fits, and runs, on a single machine drawing wall-outlet power. We are one of the early deployment partners shipping production workloads on Spark today: local RAG, fine-tuning, and inference for teams that couldn't justify a datacenter rack.
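The "fits" claim is simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. Here is a back-of-envelope sketch. The 20% headroom for KV cache, activations, and the OS is an assumption chosen for illustration, and both function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(
    params_billion: float,
    bits_per_weight: float,
    memory_gb: float = 128.0,
    headroom_fraction: float = 0.2,  # assumed allowance for KV cache, activations, OS
) -> bool:
    """True if the weights plus the assumed headroom fit in memory_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + headroom_fraction)
    return needed <= memory_gb


if __name__ == "__main__":
    for bits in (4, 8, 16):
        gb = weight_memory_gb(70, bits)
        print(f"70B at {bits}-bit: ~{gb:.0f} GB weights, fits in 128 GB: {fits_in_memory(70, bits)}")
```

At 4-bit, a 70B model needs about 35 GB for weights, leaving plenty of room for context. At 16-bit the weights alone come to about 140 GB, which is why quantization is what makes the single-machine case work.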
What we ship on Spark
- Private RAG over internal docs (Llama 3.3 70B or Qwen 3, quantized)
- Local LoRA fine-tuning for 7B–13B domain models
- VLM document AI (Qwen-VL, InternVL) on confidential PDFs
- Multi-modal pipelines without per-token API fees
Spec sheets, not slogans
Each tier solves a different shape of problem. We size the hardware to the workload, not the other way around.
Nvidia DGX Spark
- Unified memory: Up to 128 GB
- Best for: Local 70B quantized · LoRA fine-tune
- First token (7B): Sub-20ms
- Power: Wall outlet · <1500W
- Form factor: Desktop · silent
- 70B models on your desk
- Fine-tune locally without a datacenter
- Ship in days, not quarters
Nvidia Jetson AGX Orin
- Unified memory: 64 GB
- Best for: Voice agents · vision pipelines
- First token (7B): ~45ms (INT4)
- Power: 15W – 60W
- Form factor: Embedded board
- Disconnected operation
- Silent · fanless variants
- Industrial form factor
RTX Workstation (4090 / 6000 Ada)
- VRAM: 24–48 GB (single / dual)
- Best for: Prototyping · LoRA training · Flux gen
- First token (7B): Sub-25ms
- Power: 800W – 1200W
- Form factor: Tower workstation
- Most cost-effective per GPU-hour
- Standard wall power
- Easy hardware service
On-Prem H100 / A100 Cluster
- VRAM: 640 GB+ SXM5 (per 8x node)
- Best for: 70B+ full FT · high-throughput APIs
- First token (7B): Sub-10ms
- Power: Datacenter racks
- Form factor: Rack-mounted
- Multi-tenant inference at scale
- Full parameter fine-tuning
- Highest absolute throughput
What runs where
Skim this before our call. ✓✓ = recommended · ✓ = feasible with tradeoffs · - = not recommended.
| Workload | Jetson | RTX WS | DGX Spark | H100 cluster |
|---|---|---|---|---|
| 7B LLM inference | ✓ | ✓✓ | ✓✓ | ✓✓ |
| 70B LLM (quantized) | - | ✓ | ✓✓ | ✓✓ |
| 70B LLM (FP16) | - | - | ✓ | ✓✓ |
| LoRA fine-tune 7B–13B | - | ✓✓ | ✓✓ | ✓✓ |
| Full FT 70B | - | - | ✓ | ✓✓ |
| LTX-Video / Wan 2.1 | - | ✓ | ✓✓ | ✓✓ |
| Flux.1 image gen | - | ✓✓ | ✓✓ | ✓✓ |
| Whisper STT (real-time) | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| TTS voice agent <300ms | ✓✓ | ✓✓ | ✓✓ | ✓✓ |
| 1k+ rps inference API | - | - | ✓ | ✓✓ |