A government digitisation programme needed Uzbek-language speech transcription for citizen-service call recordings. No commercial vendor offered usable accuracy. We fine-tuned Whisper-medium on a curated Uzbek corpus and shipped it as a Dockerised on-prem service.
Most STT vendors treat Uzbek as “close enough to Turkish” - and the accuracy reflects that. Base Whisper sat at ~38% word error rate (WER) on Tashkent-dialect citizen calls. The agency couldn't hand audio to a foreign cloud anyway. They needed accuracy good enough for production use on Uzbek speech, deployable inside their own datacenter, with full ownership of the model.
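For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation harness used on this project; the function name and the sample strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))
```

At ~38% WER, roughly 38 word-level errors occur per 100 reference words, which is generally too noisy for transcripts to feed downstream systems.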
Assembled a 420-hour Uzbek speech corpus from public radio archives, parliamentary recordings, and consented citizen-service calls. Transcripts were verified by native speakers and tagged by dialect (Tashkent / Ferghana / Khorezm).
Extended Whisper's BPE vocabulary with 3,800 Uzbek tokens - primarily Cyrillic and Latin transliterations of common morphological suffixes the base tokenizer fragmented into character-level pieces.
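One way to shortlist candidate tokens for a vocabulary extension like this is to count how often each suffix appears in the corpus and check how many pieces the existing tokenizer splits it into. The sketch below is an assumption about how such a selection could be done, not the project's actual procedure: `fragmented_suffixes`, `min_pieces`, and the character-level stand-in tokenizer are all hypothetical, and a real run would pass the base tokenizer's own tokenize function instead.

```python
from collections import Counter
from typing import Callable, Iterable, List


def fragmented_suffixes(
    words: Iterable[str],
    suffixes: Iterable[str],
    tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
) -> List[str]:
    """Return suffixes seen in `words` that `tokenize` splits into at
    least `min_pieces` pieces, most frequent first."""
    suffix_list = list(suffixes)
    counts: Counter = Counter()
    for word in words:
        for suffix in suffix_list:
            # Count only true suffixes, not whole-word matches.
            if word.endswith(suffix) and len(word) > len(suffix):
                counts[suffix] += 1
    return [s for s, _ in counts.most_common() if len(tokenize(s)) >= min_pieces]


# Stand-in tokenizer (assumption): splits every string into characters,
# mimicking the worst-case fragmentation described above.
char_tokenize = list

words = ["kitoblar", "uylar", "kitobning", "uyda"]
candidates = fragmented_suffixes(words, ["lar", "ning", "da"], char_tokenize)
print(candidates)  # ['lar', 'ning']
```

Scoring a suffix in isolation is a simplification, since BPE merges can differ when the suffix is attached to a stem. If the model is handled through Hugging Face Transformers (the source does not say which toolkit was used), the selected tokens would typically be added with the tokenizer's `add_tokens` method, followed by `resize_token_embeddings` on the model so the new embedding rows can be learned during fine-tuning.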
Full fine-tune (not LoRA - we needed deeper acoustic adaptation) of Whisper-medium on 2× A100 for 6 epochs. Dynamic noise augmentation, matched to the phone-line audio profile, was applied during training.
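Dynamic noise augmentation usually means mixing noise into each training clip at a freshly sampled signal-to-noise ratio every time the clip is drawn, rather than pre-computing a fixed noisy dataset. The sketch below illustrates only that mixing step under stated assumptions: the function names and the 5-20 dB SNR range are hypothetical, and the project's actual phone-line profile (band-limiting, codec artefacts, and so on) is not described in the source and is not modelled here.

```python
import math
import random
from typing import List, Sequence, Tuple


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    if not speech:
        raise ValueError("speech must not be empty")
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        return list(speech)  # nothing meaningful to scale
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(speech, noise)]


def augment(
    speech: Sequence[float],
    noise: Sequence[float],
    rng: random.Random,
    snr_range_db: Tuple[float, float] = (5.0, 20.0),  # assumed range
) -> List[float]:
    """Sample a new SNR on every call, so each epoch sees a different mix."""
    return mix_at_snr(speech, noise, rng.uniform(*snr_range_db))


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
noisy = augment(speech, noise, rng)
```

Calling `augment` inside the data loader, rather than once up front, is what makes the augmentation dynamic: the model never sees exactly the same noisy clip twice.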
Trained a Piper-based Tashkent-dialect TTS voice (236 epochs) so the agency could pair transcription with synthesised replies in the same dialect.
Dockerised inference behind the agency's internal API gateway: 3 GB VRAM footprint, faster than real time on a single GPU. OTA model updates handled via signed registry pushes.
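Throughput claims like the one above are commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means transcription finishes faster than the audio plays. The sketch below shows that arithmetic; `transcribe` is a hypothetical stand-in for a call to the inference service, and the numbers in the example are illustrative rather than measured.

```python
import time
from typing import Any, Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[Any], Any], audio: Any, audio_seconds: float) -> float:
    """Time one call to `transcribe` (hypothetical) and return its RTF."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# Illustrative only: a 30-minute call transcribed in 7.5 minutes.
print(real_time_factor(450.0, 1800.0))  # 0.25
```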
“It is the first time Uzbek speech transcription has worked well enough that we can actually use the transcripts in downstream systems.”
- Director of Digital Services, government agency (name withheld)
A 30-minute call. We'll tell you whether we can help - and if not, who can.