Conversational Voice AI · Case Study

Sub-300ms voice assistant running fully offline on edge hardware

A voice-agent startup needed real-time STT and TTS on Jetson Orin boards with no internet dependency. We compiled Distil-Whisper and Kokoro to run on the device and reached a sub-300ms roundtrip.

280ms

average glass-to-glass speech roundtrip

100%

offline operation - no cloud dependency

60+

languages supported by the deployed STT

$0

per-conversation cloud cost in steady state

Problem

What they were stuck on

The client deploys voice assistants in environments with poor or no connectivity - industrial floors, retail back-of-house, in-vehicle. Cloud STT/TTS round-trips were averaging 900ms+ and dropping entire conversations during network blips. They needed everything on-device, under 300ms, with natural-sounding speech.

Approach

How we built it

STEP 01

Hardware target

NVIDIA Jetson AGX Orin, 64GB unified memory. Selected for thermal envelope, GPU acceleration, and silent operation in customer-facing settings.

STEP 02

STT pipeline

Distil-Whisper (medium), compiled with TensorRT and quantized to INT8. The streaming decoder consumes audio in 80ms chunks to keep first-word latency low.
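
To make the 80ms chunk size concrete, here is a minimal sketch of how captured audio might be cut into fixed-size chunks before reaching a streaming decoder. It assumes 16 kHz mono input, the rate Whisper-family models expect. The function names and the zero-padding of the last chunk are illustrative choices, not details of the deployed pipeline.

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono, as Whisper-family models expect
CHUNK_MS = 80            # chunk length quoted in the STT step


def chunk_samples(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: List[float], chunk_len: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks, zero-padding the final partial chunk."""
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Pad with silence so every chunk has the same length.
            chunk = chunk + [0.0] * (chunk_len - len(chunk))
        yield chunk
```

At 16 kHz an 80ms chunk holds 1,280 samples, so the decoder can start work after 80ms of audio rather than waiting for the end of the utterance.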

STEP 03

TTS pipeline

Kokoro TTS, quantized and exported to ONNX, with a custom phoneme cache for the client's common product vocabulary. Voice cloned from 8 minutes of brand-voice samples.
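
A phoneme cache of this kind can be pictured as a lookup table sitting in front of the grapheme-to-phoneme step. The sketch below is hypothetical. `PhonemeCache`, `spell_out`, and the sample lexicon entry are illustrative stand-ins, not the client's implementation or any Kokoro API.

```python
from typing import Callable, Dict, List


class PhonemeCache:
    """Memoize phoneme sequences for frequently spoken vocabulary."""

    def __init__(self, g2p: Callable[[str], List[str]]) -> None:
        self._g2p = g2p  # fallback grapheme-to-phoneme function (hypothetical)
        self._cache: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(word: str) -> str:
        # Normalize so "Orin", " orin " and "ORIN" share one entry.
        return word.strip().lower()

    def preload(self, lexicon: Dict[str, List[str]]) -> None:
        """Seed the cache with hand-checked pronunciations."""
        for word, phonemes in lexicon.items():
            self._cache[self._key(word)] = list(phonemes)

    def lookup(self, word: str) -> List[str]:
        key = self._key(word)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        phonemes = self._g2p(key)
        self._cache[key] = phonemes
        return phonemes


def spell_out(word: str) -> List[str]:
    """Toy stand-in for a real G2P model: one symbol per letter."""
    return list(word.upper())
```

Preloading product names lets the most common words skip the G2P step entirely, while unseen words take the slower path once and are then cached.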

STEP 04

Orchestration

Custom Rust-based audio runtime managing VAD, streaming STT, intent routing, and TTS playback. End-to-end pipeline runs in a single process with shared GPU context.
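
The single-process flow can be sketched as a chain of stages with per-stage timing checked against the 300ms target. This is an illustrative Python outline, not the Rust runtime itself. The stage functions are hypothetical stubs, and the sketch runs stages one after another where a real streaming runtime would overlap them.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_turn(
    stages: List[Stage], payload: Any, budget_ms: float = 300.0
) -> Tuple[Any, Dict[str, float], bool]:
    """Run one conversational turn through every stage, timing each one.

    Returns the final payload, per-stage latency in milliseconds, and
    whether the summed latency stayed within the budget.
    """
    timings: Dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        payload = stage(payload)
        timings[name] = (time.perf_counter() - started) * 1000.0
    within_budget = sum(timings.values()) <= budget_ms
    return payload, timings, within_budget


# Hypothetical stubs standing in for the real VAD, STT, intent, and TTS stages.
def vad(audio: List[float]) -> List[float]:
    return [s for s in audio if abs(s) > 0.01]  # drop near-silent samples


def stt(speech: List[float]) -> str:
    return "lights on" if speech else ""


def route_intent(text: str) -> str:
    return "LIGHTS_ON" if "lights on" in text else "UNKNOWN"


def tts(intent: str) -> bytes:
    return f"ack:{intent}".encode("utf-8")


PIPELINE: List[Stage] = [
    ("vad", vad), ("stt", stt), ("intent", route_intent), ("tts", tts),
]
```

Keeping every stage in one process means handoffs are plain function calls rather than network or IPC hops, which is where much of a cloud round trip's latency typically goes.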

STEP 05

Deployment

OTA update channel for model rollouts. On-device telemetry is buffered and sent in batches when the network is available. Zero per-device cloud cost in steady state.
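
Batched, opportunistic telemetry can be sketched as a bounded buffer that only drains while the device is online. The class below is a hypothetical illustration. `TelemetryBuffer`, its parameters, and the `send` and `is_online` callbacks are assumptions, not the deployed code.

```python
from collections import deque
from typing import Callable, Deque, List


class TelemetryBuffer:
    """Hold events on-device and upload them in batches when online."""

    def __init__(
        self,
        send: Callable[[List[dict]], bool],  # returns True on successful upload
        is_online: Callable[[], bool],
        batch_size: int = 50,
        max_events: int = 10_000,
    ) -> None:
        self._send = send
        self._is_online = is_online
        self._batch_size = batch_size
        # Bounded queue: when full, the oldest events are dropped first.
        self._events: Deque[dict] = deque(maxlen=max_events)

    def record(self, event: dict) -> None:
        self._events.append(event)

    def flush(self) -> int:
        """Upload as many batches as possible; return the events sent."""
        if not self._is_online():
            return 0
        sent = 0
        while self._events:
            count = min(self._batch_size, len(self._events))
            batch = [self._events.popleft() for _ in range(count)]
            if not self._send(batch):
                # Upload failed: restore the batch in order and retry later.
                self._events.extendleft(reversed(batch))
                break
            sent += len(batch)
        return sent
```

Because `flush` is a no-op while offline, the voice pipeline never waits on the network; telemetry simply accumulates until connectivity returns.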

Stack

What we used

NVIDIA Jetson AGX Orin · Distil-Whisper + TensorRT · Kokoro TTS (ONNX, quantized) · Custom Rust runtime · INT8 quantization

Outcomes

What changed

280ms

average glass-to-glass speech roundtrip

100%

offline operation - no cloud dependency

60+

languages supported by the deployed STT

$0

per-conversation cloud cost in steady state

“We stopped worrying about network drops the day this shipped. The agent just works.”

- CTO, voice-agent startup (name withheld)

Have a similar problem? Let's scope it.

A 30-minute call. We'll tell you whether we can help - and if not, who can.

Talk to us

More work

Domain-adapted 7B LLM cuts inference costs by 60%

Brand-styled video b-roll using LTX-Video

On-prem clinical transcription, 99.1% accuracy

Flux + brand LoRA at SKU scale

Edge VLM defect detection on Jetson

Air-gapped policy RAG on DGX Spark

Whisper fine-tune for Uzbek STT

TRELLIS image-to-3D for AR catalogues

Seed-VC voice conversion for localisation

PersonaPlex full-duplex voice concierge