A voice-agent startup needed real-time TTS + STT on Jetson Orin boards with no internet dependency. We compiled Whisper + Kokoro onto the device and hit sub-300ms roundtrip.
The client deploys voice assistants in environments with poor or no connectivity - industrial floors, retail back-of-house, in-vehicle. Cloud STT/TTS round-trips were averaging 900ms+ and dropping entire conversations during network blips. They needed everything on-device, under 300ms, with natural-sounding speech.
Nvidia Jetson Orin AGX, 64GB unified RAM. Selected for thermal envelope, GPU acceleration, and silent operation in customer-facing settings.
Distil-Whisper (medium) compiled with TensorRT, INT8 quantization. Streaming decoder with 80ms chunks for low first-word latency.
Kokoro TTS, quantized and exported to ONNX, with a custom phoneme cache for the client's common product vocabulary. Voice cloned from 8 minutes of brand-voice samples.
Custom Rust-based audio runtime managing VAD, streaming STT, intent routing, and TTS playback. End-to-end pipeline runs in a single process with shared GPU context.
OTA update channel for model rollouts, on-device telemetry sent in batches when network is available. Zero per-device cloud cost in steady state.
“We stopped worrying about network drops the day this shipped. The agent just works.”
- CTO, voice-agent startup (name withheld)
A 30-minute call. We'll tell you whether we can help - and if not, who can.