Speech for Nepali — ASR and TTS

About a third of adult Nepalis cannot read well enough to use a text interface comfortably. A larger fraction can read but find typing on a phone slow and error-prone, especially in Devanagari. For these users, speech is not a feature — it is the primary way they will use software. This section is about what works today for Nepali speech, and what is missing.

Two halves of the speech problem

Speech AI has two sides: ASR (automatic speech recognition) turns audio into text; TTS (text-to-speech) turns text into audio. Both are needed for a conversational app — ASR to take the user’s voice question, TTS to read the system’s reply back. They have different difficulty curves in Nepali.

ASR for Nepali is improving fast. OpenAI’s Whisper model can transcribe Nepali tolerably out of the box, especially in its larger sizes. Smaller multilingual checkpoints and community fine-tuned variants run on phones with acceptable latency (the official Distil-Whisper checkpoints are English-only, so Nepali needs a multilingual or Nepali-tuned model). For clean read speech, such as a journalist reading a script or a teacher reading a lesson, accuracy is now genuinely useful.

ASR for conversational Nepali is much harder. The moment you introduce code-mixing, regional accents (Far Western Nepali, Tarai-influenced Hindi-Nepali, Newar-influenced Kathmandu Nepali), background noise, multiple speakers, or fast informal speech, error rates climb. A health worker dictating a clinical note on a bus is a hard problem; the same health worker reading from a clipboard in a quiet room is close to a solved one.
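
"Error rate" here usually means word error rate (WER): the word-level edit distance between a human reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration in plain Python; the function names and the example sentence are assumptions made for this example, not part of any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # word missing from the output
                            curr[j - 1] + 1,       # extra word in the output
                            prev[j - 1] + cost))   # substitution or exact match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Illustrative pair: the recogniser dropped the final word of a four-word note.
reference = "बिरामीलाई ज्वरो आएको छ"
hypothesis = "बिरामीलाई ज्वरो आएको"
print(word_error_rate(reference, hypothesis))  # 0.25
```

The same `edit_distance` function also works on characters instead of words, although it would then count Unicode code points rather than visible Devanagari syllables.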

TTS for Nepali is patchier. Google’s Text-to-Speech and a few open-source projects produce intelligible Nepali, but the voice tends to be flat: neutral pitch, neutral honorific register, and a slight South Asian English accent leaking through on loan words. There is no widely deployed Nepali voice that sounds like the average village radio announcer. The market is not yet large enough to attract the investment that would close this gap.

What a phone can actually do

Here is what useful speech looks like on the phone you actually have. Whisper-small (around 244 million parameters) runs on-device on a mid-range Android phone, transcribing short utterances in roughly 1–2 seconds. A distilled TTS engine can read short replies aloud without round-tripping to a server.
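
A latency figure like this is worth checking on the actual target device. One common way is to time the transcriber and compute the real-time factor: processing time divided by audio duration. The sketch below assumes `transcribe` is any callable that takes an audio path and returns text; `fake_transcribe` is a hypothetical stand-in, not a real model call.

```python
import time
from typing import Callable, Tuple


def measure_latency(transcribe: Callable[[str], str],
                    audio_path: str,
                    audio_seconds: float) -> Tuple[str, float, float]:
    """Return (text, elapsed_seconds, real_time_factor) for one utterance."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed, elapsed / audio_seconds


def fake_transcribe(path: str) -> str:
    # Hypothetical placeholder for an on-device model; it only sleeps briefly.
    time.sleep(0.05)
    return "नमस्ते"


text, elapsed, rtf = measure_latency(fake_transcribe, "utterance.wav", audio_seconds=4.0)
print(f"{elapsed:.2f} s to process 4.0 s of audio, real-time factor {rtf:.2f}")
```

A real-time factor below 1 means the device transcribes faster than the user speaks, which is roughly what a dictation interface needs.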

The implication of on-device speech is striking. A health worker in a rural posting could, today, dictate visit notes to her phone in Nepali, have them transcribed locally and stored, and sync them when she reaches a tower. No cloud dependency for transcription, no monthly fee, no patient audio leaving the phone. The technology exists. What is missing is the product.
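
The store-locally, sync-later pattern described here can be sketched with nothing more than SQLite. This is a minimal illustration under stated assumptions: the table layout, the function names, and the `upload` callback are all hypothetical, and a real product would add encryption, retries, and conflict handling.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local note store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS notes (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               recorded_at TEXT NOT NULL,
               transcript TEXT NOT NULL,
               synced INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def save_note(conn: sqlite3.Connection, recorded_at: str, transcript: str) -> int:
    """Store a locally transcribed note; no network access is needed."""
    cur = conn.execute(
        "INSERT INTO notes (recorded_at, transcript) VALUES (?, ?)",
        (recorded_at, transcript),
    )
    conn.commit()
    return cur.lastrowid


def sync_pending(conn: sqlite3.Connection,
                 upload: Callable[[int, str, str], bool]) -> int:
    """Try to upload every unsynced note; mark only the accepted ones as synced."""
    rows = conn.execute(
        "SELECT id, recorded_at, transcript FROM notes WHERE synced = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for note_id, recorded_at, transcript in rows:
        if upload(note_id, recorded_at, transcript):
            conn.execute("UPDATE notes SET synced = 1 WHERE id = ?", (note_id,))
            sent += 1
    conn.commit()
    return sent  # notes that failed stay pending for the next attempt


conn = open_store()
save_note(conn, "2026-01-15T09:30", "बिरामीलाई ज्वरो आएको छ")
# Stand-in upload that always succeeds; a real one would call the clinic's server.
print(sync_pending(conn, upload=lambda note_id, recorded_at, transcript: True))
```

Keeping the `synced` flag in the same table means a note is never lost if the connection drops mid-sync: it simply stays pending until an upload succeeds.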

The dataset gap

If you wanted to noticeably improve Nepali speech AI, the single highest-leverage thing you could do is build a good public Nepali speech corpus. The Mozilla Common Voice project has a Nepali portion, but it is much smaller than its Hindi or Bengali equivalents, and skewed toward read speech rather than conversational speech.

A serious corpus would include:

Multiple regional dialects (Eastern, Western, Far Western, Madhesh, Newar-influenced Kathmandu), plus the major non-Nepali languages of the country (Maithili, Bhojpuri, Tamang, Tharu), which speech AI serves even more poorly.
Both genders, all adult age bands. Voice models built mostly on younger male voices recognise everyone else worse — a known and consistent finding worldwide.
Both clean read speech and noisy conversational speech, in roughly equal proportion.
Per-utterance metadata — district, age band, gender, recording device, ambient noise condition — that lets researchers measure where the model fails.
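
To make the metadata point concrete, here is a hedged sketch of what one corpus record might look like and how it lets a researcher break the word error rate (word-level errors divided by reference words) down by any field. The field names, category labels, and the three sample records are made up for illustration; they are not drawn from an existing corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Utterance:
    utterance_id: str
    district: str
    age_band: str       # e.g. "18-29"
    gender: str
    device: str
    noise: str          # e.g. "quiet", "street", "vehicle"
    speech_style: str   # "read" or "conversational"
    ref_words: int      # words in the human reference transcript
    word_errors: int    # word-level edit errors in the ASR output


def error_rate_by(utterances: Iterable[Utterance], field: str) -> Dict[str, float]:
    """Pooled word error rate for each value of one metadata field."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for u in utterances:
        key = getattr(u, field)
        errors[key] += u.word_errors
        words[key] += u.ref_words
    return {key: errors[key] / words[key] for key in words if words[key] > 0}


# Made-up records, only to show the shape of the analysis.
sample = [
    Utterance("u1", "Kathmandu", "18-29", "female", "phone-a", "quiet", "read", 10, 1),
    Utterance("u2", "Doti", "50-64", "male", "phone-b", "quiet", "read", 10, 1),
    Utterance("u3", "Dhanusha", "30-49", "female", "phone-a", "vehicle", "conversational", 10, 4),
]
print(error_rate_by(sample, "noise"))
```

Grouping by `district`, `age_band`, or `speech_style` instead uses the same call, which is the practical payoff of recording metadata per utterance rather than per collection session.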

Building such a corpus is months to years of focused work. It is also exactly the kind of thing an NGO–university partnership can do well, and is probably the most useful AI-adjacent project that is not yet being run at scale in the country.

Check your understanding

A team is choosing between investing engineering effort in (i) better Nepali ASR for clean read speech, (ii) better Nepali ASR for noisy conversational speech, or (iii) better Nepali TTS that sounds local. Given the state of the technology in 2026, which has the most room for impactful new work?

ASR for clean read speech — it is the easiest to improve.
ASR for noisy conversational and code-mixed speech, plus TTS that sounds genuinely Nepali — both are weak today and unblock real product use cases.
Neither — both are essentially solved problems for Nepali in 2026.
TTS only — ASR for Nepali is hopeless.

What comes next

Language and writing close here. Chapter 3 changes domain entirely — to money. The country runs on remittances, mobile payments, and informal credit; AI is already deeply inside the first two and almost entirely absent from the third. We look at where AI already sits, and where it could.