Media · Voice localisation · Case Study

Voice-preserving localisation with Seed-VC - same actor, every language

A creator-economy platform wanted to localise its English-language hero creators into 9 languages without losing their vocal identity. We built a Seed-VC + Whisper + Qwen3-TTS pipeline that keeps the original voice in every language.

9 target languages shipped (EN → ES, FR, DE, IT, PT, JA, KO, ZH, HI)

~12 min to localise a 5-min video end-to-end on DGX Spark

$0 per-token cloud cost · everything runs locally

100% creators consented · voice embeddings auditable + revocable

Problem

What they were stuck on

Conventional dubbing replaces the original voice with a voice actor's, which destroys the creator's brand. Existing AI dubbing tools produce generic synthetic voices that sound nothing like the original. The platform wanted the same creator voice, native pronunciation in 9 target languages, and lip-sync-ready output, at a cost that scales to thousands of videos a month.

Approach

How we built it

STEP 01

Voice fingerprint

Extracted a high-quality voice embedding for each enrolled creator from 2–4 min of clean reference audio. Each creator's consent and identity were verified before any cloning ran.
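
To illustrate how a consent gate with an audit trail and revocation might sit in front of the stored embedding, here is a minimal sketch. The `VoiceEnrollment` class, its field names, and the `emb://` reference format are hypothetical and are not taken from the production system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VoiceEnrollment:
    """Hypothetical record guarding one creator's voice embedding."""
    creator_id: str
    embedding_ref: str              # pointer to the stored embedding, not the vector
    consent_verified: bool = False
    identity_verified: bool = False
    revoked: bool = False
    audit_log: list = field(default_factory=list)

    def _log(self, event: str) -> None:
        # Every decision is timestamped so the record stays auditable.
        self.audit_log.append((datetime.now(timezone.utc).isoformat(), event))

    def revoke(self) -> None:
        self.revoked = True
        self._log("revoked")

    def authorise_cloning(self, job_id: str) -> str:
        """Return the embedding reference only if every check passes."""
        if self.revoked:
            self._log(f"denied:{job_id}:revoked")
            raise PermissionError("enrollment revoked")
        if not (self.consent_verified and self.identity_verified):
            self._log(f"denied:{job_id}:unverified")
            raise PermissionError("consent or identity not verified")
        self._log(f"granted:{job_id}")
        return self.embedding_ref
```

In this sketch a render job would call `authorise_cloning` before touching the embedding, so a revocation takes effect on the next job without any change to the pipeline code.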

STEP 02

Source transcription + translation

Whisper large-v3 transcribed the source audio with word-level timestamps. A domain-tuned LLM then translated the transcript into each target language, favouring phrasing that fits the original timing.
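
Word-level timestamps are what make timing-aware translation possible, because they let the transcript be cut into segments with known start and end times. The sketch below groups words into segments at pauses. It assumes each word is a dict with `word`, `start`, and `end` keys (the shape the open-source Whisper package emits when word timestamps are enabled); the `group_words` helper and its thresholds are illustrative assumptions, not the values used in the project.

```python
def group_words(words, max_gap=0.35, max_duration=6.0):
    """Group word dicts ({"word", "start", "end"}) into timed segments.

    A new segment starts when the pause before a word exceeds max_gap seconds
    or when adding the word would push the segment past max_duration seconds.
    Both thresholds are illustrative defaults.
    """
    segments = []
    current = []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            too_long = w["end"] - current[0]["start"] > max_duration
            if gap > max_gap or too_long:
                segments.append(_close(current))
                current = []
        current.append(w)
    if current:
        segments.append(_close(current))
    return segments


def _close(ws):
    # Whisper-style words carry their own leading space, so a plain join works.
    text = "".join(w["word"] for w in ws).strip()
    return {"text": text, "start": ws[0]["start"], "end": ws[-1]["end"]}


words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " everyone", "start": 0.45, "end": 1.0},
    {"word": " Welcome", "start": 1.8, "end": 2.3},
    {"word": " back", "start": 2.35, "end": 2.6},
]
segments = group_words(words)  # two segments, split at the 0.8 s pause
```

Each segment's `start` and `end` then give the translator a time budget for the translated line.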

STEP 03

Native-pronunciation TTS

Qwen3-TTS in CustomVoice mode synthesised the translated text with native phoneme accuracy in the target language - but in a generic voice.

STEP 04

Seed-VC voice transfer

Seed-VC Standard converted the generic-voice synthesis to the original creator's vocal identity, preserving pitch contour, prosody, and tone. The output sounds like the creator speaking the target language fluently.
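
The split between Steps 03 and 04 can be pictured as two chained stages: the TTS stage never sees the creator's voice, and the voice-conversion stage never sees the text. The sketch below shows only that composition. The `tts` and `convert_voice` callables are hypothetical stand-ins and do not reflect the actual Qwen3-TTS or Seed-VC interfaces.

```python
from typing import Callable

Audio = bytes  # stand-in type for real audio data


def synthesise_in_creator_voice(
    text: str,
    language: str,
    creator_ref: str,
    tts: Callable[[str, str], Audio],
    convert_voice: Callable[[Audio, str], Audio],
) -> Audio:
    """Run TTS in a generic voice, then convert it to the creator's voice."""
    generic_audio = tts(text, language)               # native pronunciation, generic voice
    return convert_voice(generic_audio, creator_ref)  # creator identity applied last


# Toy stand-ins that make the data flow visible without any model.
def fake_tts(text: str, language: str) -> Audio:
    return f"{language}:{text}".encode()


def fake_convert(audio: Audio, creator_ref: str) -> Audio:
    return audio + b"|" + creator_ref.encode()


result = synthesise_in_creator_voice("hola", "es", "emb://c1", fake_tts, fake_convert)
```

Keeping the stages separate like this means either model can be swapped without touching the other, and the creator embedding is only needed at the final stage.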

STEP 05

Timing align + delivery

A final pass aligned segment boundaries to the source video for lip-sync compatibility and exported WAV/MP4 stems for the editing pipeline.
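
One common way to align a dubbed segment to its source slot is to apply a small, bounded tempo change and pad or flag whatever remains. The sketch below computes those numbers for one segment. The `fit_to_slot` helper and its rate limits are assumptions for illustration; the case study does not state which alignment method or limits were used.

```python
def fit_to_slot(dub_duration, slot_start, slot_end, min_rate=0.9, max_rate=1.15):
    """Work out how to fit a dubbed clip into its source time slot.

    rate > 1 speeds the clip up, rate < 1 slows it down. The rate is clamped
    so speech stays natural; any remaining gap becomes trailing silence
    (padding) and any remaining excess is reported as overflow, all in seconds.
    """
    slot = slot_end - slot_start
    if slot <= 0 or dub_duration <= 0:
        raise ValueError("durations must be positive")
    rate = dub_duration / slot                   # rate that would fit exactly
    rate = max(min_rate, min(max_rate, rate))    # keep the tempo change subtle
    fitted = dub_duration / rate
    return {
        "rate": rate,
        "padding": max(0.0, slot - fitted),
        "overflow": max(0.0, fitted - slot),
    }


exact = fit_to_slot(4.0, 10.0, 14.0)     # already fits: rate 1.0, no padding
too_long = fit_to_slot(5.0, 10.0, 14.0)  # clamped at max_rate, overflow remains
too_short = fit_to_slot(2.0, 10.0, 14.0) # clamped at min_rate, padded with silence
```

A non-zero `overflow` is a useful signal in a design like this: rather than speeding speech up further, the segment could be sent back for a shorter translation.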

Stack

What we used

Seed-VC Standard · Whisper large-v3 · Qwen3-TTS (CustomVoice) · Custom translator (Qwen 3 fine-tune) · NVIDIA DGX Spark · Async render queue

Outcomes

What changed

9 target languages shipped (EN → ES, FR, DE, IT, PT, JA, KO, ZH, HI)

~12 min to localise a 5-min video end-to-end on DGX Spark

$0 per-token cloud cost · everything runs locally

100% creators consented · voice embeddings auditable + revocable

“Our creators' voices are their brand. This is the first dubbing pipeline that doesn't erase them.”

- VP Content, creator-economy platform (name withheld)

Have a similar problem? Let's scope it.

A 30-minute call. We'll tell you whether we can help - and if not, who can.

Talk to us
