Transcription that ships
Real-time, batch, or edge: different STT models win different battles. We pick the model to match the use case.
Whisper large-v3 (OpenAI)
Reference-quality multilingual STT. Default for offline batch transcription.
Distil-Whisper (Hugging Face)
Roughly 6× faster than Whisper, at a cost of about 1% WER. Default for real-time / streaming.
Parakeet (Nvidia)
Best-in-class English ASR. Excellent for call-centre and contact-centre workflows.
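The Distil-Whisper trade-off above is stated in WER (word error rate). As a reminder of what that metric measures, here is a minimal sketch in plain Python: the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The function name and the lowercase/whitespace normalisation are assumptions made for illustration; production evaluations usually apply a more careful text normaliser.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalisation: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Read this way, a "1% WER cost" would mean about one extra word error per hundred reference words, assuming the figure is an absolute difference in WER.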
Voices that sound human, including yours
From sub-100ms edge TTS to multilingual cloned voices. Every TTS deployment includes a rights-handling step - voice cloning without consent is not something we ship.
XTTS-v2 (Coqui)
Multilingual voice cloning from 6s of reference audio. Default for branded voices.
F5-TTS (SWivid)
High-quality flow-matching TTS. Excellent expressivity and rhythm.
Kokoro (Hexgrad)
Tiny, fast, surprisingly natural. Our default for edge / on-device voice agents.
ChatTTS (2noise)
Conversational TTS with natural turn-taking artefacts. Great for dialogue.
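The rights-handling step mentioned above can be pictured as a gate that runs before any cloning job starts. The sketch below is purely illustrative: `CloneRequest`, `check_clone_request`, and the `consent_document_id` field are hypothetical names, not part of any TTS library, and the 6-second floor is taken from the XTTS-v2 card as a minimum rather than a quality guarantee.

```python
from dataclasses import dataclass
from typing import List, Optional

# Floor taken from the XTTS-v2 card; an assumption, not a quality guarantee.
MIN_REFERENCE_SECONDS = 6.0


@dataclass(frozen=True)
class CloneRequest:
    """Hypothetical description of a voice-cloning job."""
    speaker: str
    reference_seconds: float
    consent_document_id: Optional[str]  # hypothetical link to a signed release


def check_clone_request(request: CloneRequest) -> List[str]:
    """Return the reasons a request must be rejected (empty list means OK)."""
    problems = []
    if not request.consent_document_id:
        problems.append(f"no consent record on file for {request.speaker}")
    if request.reference_seconds < MIN_REFERENCE_SECONDS:
        problems.append(
            f"reference audio is {request.reference_seconds:.1f}s, "
            f"need at least {MIN_REFERENCE_SECONDS:.1f}s"
        )
    return problems


if __name__ == "__main__":
    request = CloneRequest("brand-voice", reference_seconds=4.0,
                           consent_document_id=None)
    for problem in check_clone_request(request):
        print("rejected:", problem)
```

Returning a list of reasons rather than a boolean keeps the rejection auditable, which matters when the check exists for rights reasons.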
Generative audio for content workflows
MusicGen (Meta)
Text-to-music with melody conditioning. Default for stems and background scoring.
Stable Audio Open (Stability AI)
Open-weight text-to-audio model. Strong on sound effects and short loops.
Where this lands
Voice agents
Sub-300ms TTS + STT latency on Jetson or DGX Spark. Fully offline if needed.
Voice cloning
Brand-voice cloning from 5–10 min of reference audio with full rights handling.
Transcription pipelines
Batch + real-time transcription, diarisation, speaker ID, language detection.
Audio content
Background music, intros, SFX, branded jingles via MusicGen and Stable Audio Open.
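The defaults named on this page can be summarised as a small lookup from use case to model. This is a sketch of the selection idea only: the use-case keys and the `default_model` helper are hypothetical names chosen for illustration, and the model names are the ones the cards above mark as defaults.

```python
# Use-case keys are hypothetical labels; the values are the defaults
# stated in the model cards on this page.
DEFAULT_MODELS = {
    "batch_transcription": "Whisper large-v3",
    "streaming_transcription": "Distil-Whisper",
    "branded_voice": "XTTS-v2",
    "edge_voice_agent": "Kokoro",
    "background_scoring": "MusicGen",
}


def default_model(use_case: str) -> str:
    """Return the default model for a use case, or raise with the known keys."""
    try:
        return DEFAULT_MODELS[use_case]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_MODELS))
        raise ValueError(
            f"no default for {use_case!r}; known use cases: {known}"
        ) from None


if __name__ == "__main__":
    print(default_model("edge_voice_agent"))
```

A default is a starting point: models without a stated default, such as Parakeet for English call-centre audio, are chosen case by case.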