Speech & Audio · Open-source models in production

Voice, transcription, and audio generation - deployed locally or in your cloud.

Whisper, Parakeet, XTTS-v2, F5-TTS, Kokoro, MusicGen, Stable Audio Open. Real-time voice agents, branded voice cloning, batch transcription pipelines, and generative audio - running on the hardware you already own.

280ms
end-to-end voice agent latency on edge
60+
languages supported across STT + TTS
6s
of reference audio for voice clone
0
audio leaves your network
Speech-to-text

Transcription that ships

Real-time, batch, or edge - different STT models win different battles. We pick the model to match the use case.

Whisper large-v3

OpenAI

Reference-quality multilingual STT. Default for offline batch transcription.

99+ languages · Batch

Distil-Whisper

Hugging Face

About 6× faster than Whisper for a WER penalty of roughly 1%. Default for real-time / streaming.

Real-time · Distilled
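
For readers new to the metric: WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch of how it could be computed is below. The function name `word_error_rate` is ours for illustration, and the sketch assumes plain whitespace tokenisation with no text normalisation, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Comparing two models on the same test set with a function like this is how a "1% WER" gap would typically be measured.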

Nvidia Parakeet

Nvidia

Best-in-class English ASR. Excellent for call-centre and contact-centre workflows.

English · Low WER
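
The defaults in this section can be read as a simple selection rule. The sketch below is only an illustration of that rule, not our production logic: the `SttRequest` type, its field names, and `pick_stt_model` are hypothetical, and the conditions simply restate the three cards above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttRequest:
    """Hypothetical description of a transcription workload."""
    mode: str                # "batch" or "realtime"
    language: str            # ISO 639-1 code, e.g. "en"
    domain: str = "general"  # e.g. "call-centre"


def pick_stt_model(req: SttRequest) -> str:
    """Return the default STT model name for a workload."""
    # English call-centre and contact-centre audio goes to Parakeet.
    if req.language == "en" and req.domain in {"call-centre", "contact-centre"}:
        return "Nvidia Parakeet"
    # Real-time / streaming workloads default to Distil-Whisper.
    if req.mode == "realtime":
        return "Distil-Whisper"
    # Everything else is treated as offline batch transcription.
    return "Whisper large-v3"


if __name__ == "__main__":
    print(pick_stt_model(SttRequest(mode="batch", language="de")))
```

In practice the choice also depends on the latency, quality, and language targets of the deployment, so a rule like this is a starting point rather than a final answer.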
Text-to-speech

Voices that sound human, including yours

From sub-100ms edge TTS to multilingual cloned voices. Every TTS deployment includes a rights-handling step - voice cloning without consent is not something we ship.

XTTS-v2

Coqui

Multilingual voice cloning from 6s of reference audio. Default for branded voices.

Voice clone · 16 languages

F5-TTS

SWivid

High-quality flow-matching TTS. Excellent expressivity and rhythm.

Expressive · OS

Kokoro

Hexgrad

Tiny, fast, surprisingly natural. Our default for edge / on-device voice agents.

Edge · Tiny

ChatTTS

2noise

Conversational TTS that reproduces natural speech cues such as pauses and laughter. Great for dialogue.

Conversational
Music & SFX

Generative audio for content workflows

MusicGen

Meta

Text-to-music with melody conditioning. Default for stems and background scoring.

Text-to-music · Stems

Stable Audio Open

Stability

Open-weight text-to-audio model. Strong on sound effects and short loops.

SFX · Loops
Use cases

Where this lands

Voice agents

Sub-300ms combined STT + TTS latency on Jetson or DGX Spark. Fully offline if needed.

Voice cloning

Brand-voice cloning from 5–10 min of reference audio with full rights handling.

Transcription pipelines

Batch + real-time transcription, diarisation, speaker ID, language detection.

Audio content

Background music, intros, SFX, branded jingles via MusicGen and Stable Audio.
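
For the voice agent case, a latency target like sub-300ms is easiest to reason about as a budget split across pipeline stages. The sketch below shows one way to check measured stage latencies against such a budget. The function name `check_latency_budget` and the stage names are hypothetical, and the numbers in the usage line are placeholders for illustration, not measurements from any deployment.

```python
def check_latency_budget(
    stage_ms: dict[str, float], budget_ms: float = 300.0
) -> tuple[bool, float]:
    """Sum per-stage latencies and compare them against a budget.

    Returns (within_budget, headroom_ms). Headroom is negative when the
    pipeline is over budget.
    """
    for name, value in stage_ms.items():
        if value < 0:
            raise ValueError(f"latency for stage {name!r} must be non-negative")
    total = sum(stage_ms.values())
    # "Sub-300ms" is read here as strictly less than the budget.
    return total < budget_ms, budget_ms - total


if __name__ == "__main__":
    # Placeholder values only; replace with latencies measured on your hardware.
    ok, headroom = check_latency_budget({"stt": 120.0, "tts": 90.0})
    print(ok, headroom)
```

Any stage you measure (audio capture, network hops, a language model between STT and TTS) can be added to the dictionary so that the whole path is counted against the same budget.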

Have a voice or audio workload?

Bring the use case and the latency / quality / language targets. We'll come back with a model + deployment plan.

Talk to us about speech & audio
See case studies