Govtech · Low-resource language · Case Study

Whisper fine-tuned on Uzbek - usable transcription for a language vendors ignore

A government digitisation programme needed Uzbek-language speech transcription for citizen-service call recordings. No commercial vendor offered usable accuracy. We fine-tuned Whisper-medium on a curated Uzbek corpus and shipped it as a Dockerised on-prem service.

11.4%

WER (base Whisper: 38%; commercial vendors: 22–29%)

236

epochs trained on the companion Piper TTS voice

3 GB

VRAM footprint per inference instance

100%

on-prem · zero audio leaves the agency

Problem

What they were stuck on

Most STT vendors treat Uzbek as “close enough to Turkish”, and the accuracy reflects that: commercial vendors landed at 22–29% WER, and base Whisper sat at ~38% WER on Tashkent-dialect citizen calls. Accuracy aside, the agency could not hand audio to a foreign cloud. They needed a model adapted specifically to Uzbek, deployable inside their own datacenter, with full ownership of the model.
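For readers less familiar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, dependency-free illustration; the function name and the sample sentence are assumptions for this example, not the evaluation harness used on the project, and it skips the text normalisation (casing, punctuation) a real evaluation would apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence only: one dropped word out of five reference words.
reference = "men pasport olish uchun keldim"
hypothesis = "men pasport olish keldim"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.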

Approach

How we built it

STEP 01

Corpus curation

Assembled a 420-hour Uzbek speech corpus from public radio archives, parliamentary recordings, and consented citizen-service calls. Native speakers verified every transcript, and each recording was tagged by dialect (Tashkent / Ferghana / Khorezm).
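A corpus like this is usually tracked in a manifest so that verification, consent, and dialect balance can be checked before training. The sketch below shows one possible shape for such a manifest; the `Clip` fields, source labels, and file paths are hypothetical names chosen for illustration, not the project's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    path: str
    source: str        # assumed labels: "radio", "parliament", "citizen_call"
    dialect: str       # "tashkent", "ferghana" or "khorezm"
    duration_s: float
    verified: bool     # transcript checked by a native speaker
    consented: bool    # only meaningful for citizen calls


def is_usable(clip: Clip) -> bool:
    """Keep verified clips; citizen calls additionally require consent."""
    if not clip.verified:
        return False
    return clip.source != "citizen_call" or clip.consented


def hours_by_dialect(clips: list[Clip]) -> dict[str, float]:
    """Total usable hours per dialect tag."""
    totals: dict[str, float] = defaultdict(float)
    for clip in clips:
        if is_usable(clip):
            totals[clip.dialect] += clip.duration_s / 3600
    return dict(totals)


# Hypothetical entries, for illustration only.
manifest = [
    Clip("radio/0001.wav", "radio", "tashkent", 1800, True, False),
    Clip("calls/0001.wav", "citizen_call", "tashkent", 3600, True, True),
    Clip("calls/0002.wav", "citizen_call", "ferghana", 3600, True, False),
    Clip("parl/0001.wav", "parliament", "khorezm", 7200, False, False),
]
print(hours_by_dialect(manifest))
```

Reporting hours per dialect, rather than a single total, makes it easier to spot when one dialect dominates the training mix.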

STEP 02

Tokenizer extension

Extended Whisper's BPE vocabulary with 3,800 Uzbek tokens - primarily Cyrillic- and Latin-script forms of common morphological suffixes that the base tokenizer had been fragmenting into character-level pieces.
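To see why suffix tokens matter for an agglutinative language, the toy example below compares how many pieces a word breaks into before and after suffixes are added to the vocabulary. It uses a greedy longest-match tokenizer as a stand-in, which is an assumption made for readability: Whisper's real tokenizer is byte-level BPE, and the vocabulary here is a three-entry illustration, not the 3,800-token extension. In Hugging Face Transformers, a vocabulary is commonly extended with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; this sketch does not reproduce that step.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Toy tokenizer: longest vocabulary match, falling back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always succeeds.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# "kitoblarning" = kitob (book) + lar (plural) + ning (genitive)
base_vocab = {"kitob"}
extended_vocab = base_vocab | {"lar", "ning"}

before = greedy_tokenize("kitoblarning", base_vocab)
after = greedy_tokenize("kitoblarning", extended_vocab)
print(len(before), before)
print(len(after), after)
```

Shorter token sequences mean the decoder predicts whole suffixes instead of spelling them out one character at a time, which is the effect the vocabulary extension was after.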

STEP 03

Fine-tuning

Ran a full fine-tune of Whisper-medium (not LoRA, because we needed deeper acoustic adaptation) on 2× A100 for 6 epochs, with dynamic noise augmentation matched to the phone-line audio profile.
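One common way to implement dynamic noise augmentation is to mix a noise recording into each training clip at a signal-to-noise ratio drawn at random every time the clip is loaded. The sketch below shows that idea on plain Python lists; the function names and the 5-20 dB range are assumptions for illustration, and the actual pipeline may have used different noise sources, SNR ranges, or band-limiting to imitate a phone line.

```python
import math
import random


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise, rng, snr_range_db=(5.0, 20.0)):
    """Draw a fresh SNR per call, so each epoch hears a different mix."""
    snr_db = rng.uniform(*snr_range_db)
    return mix_at_snr(speech, noise, snr_db), snr_db


# Synthetic one-second example at an assumed 8 kHz telephone sample rate.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(8000)]
noisy, used_snr = augment(speech, noise, rng)
print(f"mixed at {used_snr:.1f} dB SNR")
```

Sampling the SNR at load time, instead of pre-rendering noisy files, keeps the stored corpus clean and gives the model a different mix on every pass.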

STEP 04

Companion TTS - Piper Uzbek

Trained a Piper-based Tashkent-dialect TTS voice (236 epochs) so the agency could pair transcription with synthesised replies in the same dialect.

STEP 05

Deployment

Dockerised inference behind the agency's internal API gateway, with a 3 GB VRAM footprint per instance and transcription running faster than real time (real-time factor below 1) on a single GPU. OTA model updates are handled via signed registry pushes.
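Two quick calculations are useful when sizing a deployment like this: the real-time factor (processing time divided by audio duration, where below 1 means faster than real time) and how many 3 GB instances fit on a given card. The sketch below shows both; the 24 GB card, the 1 GB reserve, and the timing figures are hypothetical inputs for illustration, not measurements from this deployment.

```python
import math


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF < 1.0 means the audio is transcribed faster than it plays back."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def instances_per_gpu(gpu_vram_gb: float,
                      per_instance_gb: float = 3.0,
                      reserve_gb: float = 1.0) -> int:
    """How many inference instances fit, keeping some VRAM in reserve."""
    usable_gb = gpu_vram_gb - reserve_gb
    return max(0, math.floor(usable_gb / per_instance_gb))


# Hypothetical figures: a 5-minute call transcribed in 90 seconds,
# and a 24 GB card with 1 GB held back for the driver and other processes.
print(real_time_factor(processing_s=90, audio_s=300))
print(instances_per_gpu(gpu_vram_gb=24))
```

VRAM is only one constraint. Compute throughput usually limits how many concurrent instances stay faster than real time, so a count like this is an upper bound, not a capacity plan.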

Stack

What we used

OpenAI Whisper medium · Custom Uzbek corpus (420h) · Piper Uzbek TTS · Hugging Face Transformers · Docker + on-prem GPU · LiteLLM gateway

Outcomes

What changed

11.4% WER (base Whisper: 38%; commercial vendors: 22–29%)

236 epochs trained on the companion Piper TTS voice

3 GB VRAM footprint per inference instance

100% on-prem · zero audio leaves the agency

“It is the first time Uzbek speech transcription has worked well enough that we can actually use the transcripts in downstream systems.”

- Director of Digital Services, government agency (name withheld)

