Conversational Voice AI · Case Study

Sub-300ms voice assistant running fully offline on edge hardware

A voice-agent startup needed real-time STT and TTS on Jetson Orin boards with no internet dependency. We deployed Distil-Whisper and Kokoro on the device and reached a sub-300ms average roundtrip.

280ms
average glass-to-glass speech roundtrip
100%
offline operation - no cloud dependency
60+
languages supported by the deployed STT
$0
per-conversation cloud cost in steady state
Problem

What they were stuck on

The client deploys voice assistants in environments with poor or no connectivity - industrial floors, retail back-of-house, in-vehicle. Cloud STT/TTS round-trips were averaging 900ms+ and dropping entire conversations during network blips. They needed everything on-device, under 300ms, with natural-sounding speech.

Approach

How we built it

STEP 01

Hardware target

NVIDIA Jetson AGX Orin, 64GB unified memory. Selected for thermal envelope, GPU acceleration, and silent operation in customer-facing settings.

STEP 02

STT pipeline

Distil-Whisper (medium) compiled with TensorRT, INT8 quantization. Streaming decoder with 80ms chunks for low first-word latency.
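
To make the 80ms chunk size concrete: Whisper-family models take 16 kHz mono audio, so one 80ms chunk is 1,280 samples. The sketch below is illustrative only and is not the client's decoder. The names `chunk_size` and `iter_chunks` are hypothetical, and zero-padding the final partial chunk is an assumed policy.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # Whisper-family models expect 16 kHz mono input
CHUNK_MS = 80            # chunk duration quoted in the case study


def chunk_size(sample_rate_hz: int = SAMPLE_RATE_HZ, chunk_ms: int = CHUNK_MS) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: Sequence[float], size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks; zero-pad the last partial chunk (assumed policy)."""
    for start in range(0, len(samples), size):
        chunk = list(samples[start:start + size])
        if len(chunk) < size:
            chunk.extend([0.0] * (size - len(chunk)))
        yield chunk


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ
    chunks = list(iter_chunks(one_second, chunk_size()))
    print(chunk_size(), len(chunks))  # prints "1280 13"
```

Smaller chunks let the decoder start emitting words sooner, at the cost of more inference calls per second of audio.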

STEP 03

TTS pipeline

Kokoro TTS, quantized and exported to ONNX, with a custom phoneme cache for the client's common product vocabulary. Voice cloned from 8 minutes of brand-voice samples.
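
One way to picture the phoneme cache: a word-to-phoneme table pre-warmed with the product vocabulary, falling back to a slower grapheme-to-phoneme (G2P) step on a miss. This is a minimal sketch under that assumption, not the cache that shipped. `PhonemeCache` and `toy_g2p` are hypothetical names, and `toy_g2p` is a placeholder for a real G2P model.

```python
from typing import Callable, Dict, Iterable, List


class PhonemeCache:
    """Word -> phoneme lookup with a slow-path fallback (hypothetical design)."""

    def __init__(self, g2p: Callable[[str], str]) -> None:
        self._g2p = g2p                   # slow grapheme-to-phoneme function
        self._cache: Dict[str, str] = {}
        self.misses = 0                   # number of slow-path lookups

    def warm(self, vocabulary: Iterable[str]) -> None:
        """Precompute phonemes for known vocabulary, e.g. product names."""
        for word in vocabulary:
            key = word.lower()
            self._cache[key] = self._g2p(key)

    def lookup(self, word: str) -> str:
        key = word.lower()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._g2p(key)
        return self._cache[key]

    def phonemize(self, text: str) -> List[str]:
        return [self.lookup(word) for word in text.split()]


def toy_g2p(word: str) -> str:
    """Placeholder for a real G2P model: just spaces out the letters."""
    return " ".join(word)


if __name__ == "__main__":
    cache = PhonemeCache(toy_g2p)
    cache.warm(["widget"])
    print(cache.phonemize("Widget pro"), cache.misses)
```

Warming the cache at startup moves the G2P cost for frequent terms out of the per-utterance latency path.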

STEP 04

Orchestration

Custom Rust-based audio runtime managing VAD, streaming STT, intent routing, and TTS playback. End-to-end pipeline runs in a single process with shared GPU context.
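
The stage ordering can be sketched as follows. The production runtime is written in Rust; this Python version is only an illustration of the control flow, with a toy energy-threshold VAD swapped in for whatever detector the runtime uses. `VoicePipeline`, `is_speech`, and the `fake_*` stage functions are all hypothetical stand-ins.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence


def is_speech(chunk: Sequence[float], threshold: float = 0.01) -> bool:
    """Toy energy-based VAD: mean squared amplitude above a threshold."""
    if not chunk:
        return False
    energy = sum(s * s for s in chunk) / len(chunk)
    return energy > threshold


class VoicePipeline:
    """Runs VAD -> STT -> intent routing -> TTS in one process, timing each stage."""

    def __init__(self, stt: Callable, route: Callable, tts: Callable) -> None:
        self.stt, self.route, self.tts = stt, route, tts
        self.timings_ms: Dict[str, float] = {}

    def _timed(self, name: str, fn: Callable, arg):
        start = time.perf_counter()
        result = fn(arg)
        self.timings_ms[name] = (time.perf_counter() - start) * 1000.0
        return result

    def handle(self, chunk: Sequence[float]) -> Optional[List[float]]:
        if not is_speech(chunk):
            return None  # silence: skip the expensive stages entirely
        text = self._timed("stt", self.stt, chunk)
        reply = self._timed("route", self.route, text)
        return self._timed("tts", self.tts, reply)


# Stand-in stages; real ones would call the STT and TTS models.
def fake_stt(chunk: Sequence[float]) -> str:
    return "status"


def fake_route(text: str) -> str:
    return {"status": "All systems normal."}.get(text, "Sorry, say that again?")


def fake_tts(reply: str) -> List[float]:
    return [0.0] * len(reply)  # one dummy sample per character


if __name__ == "__main__":
    pipeline = VoicePipeline(fake_stt, fake_route, fake_tts)
    audio = pipeline.handle([0.5] * 1280)
    print(len(audio), sorted(pipeline.timings_ms))
```

Recording per-stage timings makes it possible to see which stage is consuming the latency budget when the end-to-end figure drifts.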

STEP 05

Deployment

OTA update channel for model rollouts. On-device telemetry is buffered and sent in batches when a network is available. Zero per-device cloud cost in steady state.
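
Batched telemetry of this kind can be sketched as a bounded buffer that flushes only while a connectivity check passes. This is an assumed design, not the deployed one. `TelemetryBuffer`, `is_online`, and `send` are hypothetical names, and the JSON payload format and drop-oldest policy are assumptions.

```python
import json
from collections import deque
from typing import Callable, Deque, Dict


class TelemetryBuffer:
    """Queue events on-device; upload in batches when a network is available."""

    def __init__(
        self,
        is_online: Callable[[], bool],
        send: Callable[[str], bool],
        max_events: int = 1000,
        batch_size: int = 50,
    ) -> None:
        self._is_online = is_online
        self._send = send
        self._batch_size = batch_size
        # Bounded queue: once full, the oldest events are dropped (assumed policy).
        self._events: Deque[Dict] = deque(maxlen=max_events)

    def __len__(self) -> int:
        return len(self._events)

    def record(self, event: Dict) -> None:
        self._events.append(event)

    def flush(self) -> int:
        """Send queued events in batches while online; return the number sent."""
        sent = 0
        while self._events and self._is_online():
            n = min(self._batch_size, len(self._events))
            batch = [self._events.popleft() for _ in range(n)]
            if not self._send(json.dumps(batch)):
                # Upload failed: requeue the batch in its original order and stop.
                self._events.extendleft(reversed(batch))
                break
            sent += len(batch)
        return sent
```

Calling `flush()` on a timer keeps the device fully functional offline, since a failed or skipped upload only delays telemetry and never blocks the voice pipeline.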

Stack

What we used

NVIDIA Jetson AGX Orin
Distil-Whisper + TensorRT
Kokoro TTS (ONNX, quantized)
Custom Rust runtime
INT8 quantization

Outcomes

What changed

280ms average glass-to-glass speech roundtrip
100% offline operation - no cloud dependency
60+ languages supported by the deployed STT
$0 per-conversation cloud cost in steady state

We stopped worrying about network drops the day this shipped. The agent just works.

- CTO, voice-agent startup (name withheld)
