Govtech · Low-resource language · Case Study

Whisper fine-tuned on Uzbek - usable transcription for a language vendors ignore

A government digitisation programme needed Uzbek-language speech transcription for citizen-service call recordings. No commercial vendor offered usable accuracy. We fine-tuned Whisper-medium on a curated Uzbek corpus and shipped it as a Dockerised on-prem service.

11.4%
WER (base Whisper: 38%; commercial vendors: 22–29%)
236
epochs trained on the companion Piper TTS voice
3 GB
VRAM footprint per inference instance
100%
on-prem · zero audio leaves the agency
Problem

What they were stuck on

Most speech-to-text (STT) vendors treat Uzbek as “close enough to Turkish”, and the accuracy reflects that. Base Whisper sat at ~38% word error rate (WER) on Tashkent-dialect citizen calls. The agency couldn't hand audio to a foreign cloud anyway. They needed accuracy tuned to real Uzbek speech, deployable inside their own data centre, with full ownership of the model.

Approach

How we built it

STEP 01

Corpus curation

Assembled a 420-hour Uzbek speech corpus from public radio archives, parliamentary recordings, and consented citizen-service calls. Native speakers verified every transcript and tagged each recording by dialect (Tashkent / Ferghana / Khorezm).
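
A corpus like this is usually tracked through a manifest so that verification status and dialect balance can be checked before training. The sketch below is illustrative only: the `Utterance` fields, file paths, and sample rows are assumptions, not the project's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass

# Dialect tags named in the case study; the lowercase spelling is an assumption.
DIALECTS = {"tashkent", "ferghana", "khorezm"}


@dataclass(frozen=True)
class Utterance:
    audio_path: str    # hypothetical path layout
    transcript: str
    dialect: str
    duration_s: float
    verified: bool     # True once a native speaker has checked the transcript


def hours_by_dialect(utterances):
    """Sum verified audio hours per dialect tag, skipping unverified clips."""
    totals = defaultdict(float)
    for utt in utterances:
        if utt.dialect not in DIALECTS:
            raise ValueError(f"unknown dialect tag: {utt.dialect!r}")
        if utt.verified:
            totals[utt.dialect] += utt.duration_s / 3600.0
    return dict(totals)


# Made-up sample rows, for illustration only.
sample = [
    Utterance("calls/0001.wav", "assalomu alaykum", "tashkent", 1800.0, True),
    Utterance("radio/0042.wav", "xush kelibsiz", "ferghana", 3600.0, True),
    Utterance("calls/0002.wav", "rahmat", "tashkent", 900.0, False),
]
print(hours_by_dialect(sample))
```

Only verified clips count towards the totals, so the unverified 15-minute call is excluded until a reviewer signs it off.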

STEP 02

Tokenizer extension

Extended Whisper's BPE vocabulary with 3,800 Uzbek tokens, primarily Cyrillic and Latin transliterations of common morphological suffixes that the base tokenizer fragmented into character-level pieces.
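
To show why suffix tokens matter for an agglutinative language, here is a toy sketch. It uses a greedy longest-match segmenter as a stand-in, not Whisper's actual BPE, and the vocabularies and example word are assumptions chosen for illustration. In Hugging Face Transformers, one common route for the real extension is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the case study does not state the exact procedure used.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation.

    Any character that starts no known vocabulary entry is emitted as a
    single-character piece, mimicking character-level fragmentation.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Toy vocabularies (assumptions): a stem the base tokenizer knows,
# plus three common Uzbek suffixes added in the extension.
base_vocab = {"kitob"}
extended_vocab = base_vocab | {"lar", "imiz", "ning"}

word = "kitoblarimizning"  # kitob + lar + imiz + ning, "of our books"
print(len(greedy_tokenize(word, base_vocab)))   # 12 pieces
print(greedy_tokenize(word, extended_vocab))    # ['kitob', 'lar', 'imiz', 'ning']
```

Fewer pieces per word means shorter decoder sequences and less pressure on the model to reassemble suffixes one character at a time.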

STEP 03

Fine-tuning

Full fine-tune of Whisper-medium on 2× A100 for 6 epochs. We chose this over LoRA because we needed deeper acoustic adaptation. Dynamic noise augmentation matched the phone-line audio profile.
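
As a sketch of what "dynamic" noise augmentation can look like, the snippet below mixes noise into a clip at a signal-to-noise ratio drawn fresh on every call, so each epoch sees a different mix. The SNR range, the 8 kHz sample rate, the Gaussian noise source, and the function names are all assumptions; the case study does not specify the actual augmentation pipeline.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech power / noise power equals snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        return list(speech)  # nothing meaningful to scale against
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise, rng, snr_range_db=(5.0, 20.0)):
    """Draw a new SNR per call; the range here is an illustrative assumption."""
    snr_db = rng.uniform(*snr_range_db)
    return mix_at_snr(speech, noise, snr_db), snr_db


# One second of a synthetic tone at 8 kHz (narrowband telephone rate, assumed),
# mixed with Gaussian noise standing in for recorded line noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
noisy, snr_db = augment(speech, noise, rng)
```

In practice the noise would come from recordings of real line conditions rather than a random generator, and the mix would typically be followed by band-limiting to the telephone passband.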

STEP 04

Companion TTS - Piper Uzbek

Trained a Piper-based Tashkent-dialect TTS voice (236 epochs) so the agency could pair transcription with synthesised replies in the same dialect.

STEP 05

Deployment

Dockerised inference behind the agency's internal API gateway, with a 3 GB VRAM footprint per instance and faster-than-real-time transcription on a single GPU. OTA model updates are handled via signed registry pushes.
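
One way to make the real-time claim concrete is the real-time factor (RTF): processing time divided by audio duration, where a value below 1 means transcription finishes before the recording would have finished playing. The timing figures below are hypothetical; the case study does not publish measured latencies.

```python
def real_time_factor(processing_s, audio_s):
    """Return processing time divided by audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def is_faster_than_real_time(processing_s, audio_s):
    """True when the clip is transcribed in less time than it takes to play."""
    return real_time_factor(processing_s, audio_s) < 1.0


# Hypothetical example: a 60-second call transcribed in 15 seconds.
rtf = real_time_factor(15.0, 60.0)
print(rtf, is_faster_than_real_time(15.0, 60.0))
```

Tracking RTF per instance alongside VRAM use gives a simple capacity check when deciding how many containers a single GPU can host.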

Stack

What we used

OpenAI Whisper medium · Custom Uzbek corpus (420h) · Piper Uzbek TTS · Hugging Face Transformers · Docker + on-prem GPU · LiteLLM gateway

Outcomes

What changed

11.4% WER (base Whisper: 38%; commercial vendors: 22–29%)
236 epochs trained on the companion Piper TTS voice
3 GB VRAM footprint per inference instance
100% on-prem · zero audio leaves the agency
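
For readers less familiar with the headline metric: WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a standard dynamic-programming implementation, not the project's evaluation harness, and the sample sentence is an illustrative assumption. It also shows the relative reduction implied by the published figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# One dropped word out of four reference words gives a WER of 0.25.
print(word_error_rate("men pasport olmoqchi edim", "men pasport olmoqchi"))

# Published figures: 38% for base Whisper, 11.4% after fine-tuning,
# which works out to roughly a 70% relative reduction.
print(round(relative_reduction(0.38, 0.114), 3))
```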

It is the first time Uzbek speech transcription has worked well enough that we can actually use the transcripts in downstream systems.

- Director of Digital Services, government agency (name withheld)
