Chapter 02 · Section II · 16 min read
Reading Devanagari — OCR and digitisation
Why Nepali OCR is harder than English OCR, what works in 2026, and the paper archives that turn into useful datasets once they stop being paper.
If language is the spoken and digital substance of Nepal, writing is the archival substance — and most of it is still paper. Walk into any malpot (land revenue) office, any government ministry, any jilla court, and you will find filing cabinets full of forms, deeds, judgments, registers. To turn this country’s accumulated written knowledge into something a model can read, the first technology you need is good Devanagari OCR.
Why Devanagari is harder than Latin
Latin OCR is essentially a solved problem for printed text. Tesseract, the open-source workhorse, handles printed English with near-perfect accuracy on a clean scan. Modern transformer-based engines extend that to harder inputs: handwriting, multi-column layouts, and rotated images.
Devanagari is harder, and for specific reasons.
The script joins. In English, the letters in book sit independently next to one another. In Nepali, the letters in पुस्तक are bound together by a continuous top line (the shirorekha), and some letters fuse into conjuncts (स्त in पुस्तक is one). A segmentation step that relies on gaps between characters, as classic Latin-script pipelines do, finds no gaps to cut at.
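To make the joining concrete, here is a minimal Python sketch (standard library only) that lists the Unicode code points behind पुस्तक. The helper name describe_codepoints is illustrative and does not come from any OCR library. It shows that the conjunct स्त is stored as three code points (SA, VIRAMA, TA) even though it renders as one fused shape.

```python
import unicodedata


def describe_codepoints(word):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in word
    ]


if __name__ == "__main__":
    # Prints one line per stored code point, not per visible shape.
    for code, name, category in describe_codepoints("पुस्तक"):
        print(code, category, name)
```

The word is six code points long, but a reader sees three units. An OCR model has to output the six from an image that only shows the three.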
Mātrā signs sit above and below. The dependent vowel signs (ि, ी, ु, ू, े, ै, ो, ौ), along with the anusvara (ं) and visarga (ः), attach above, below, before, or after the consonant they modify. A model has to detect not just the base character but the small mark hanging off it, which on a faded photocopy is often only a few pixels.
Compound characters multiply the alphabet. English has 26 letters. Devanagari has roughly 50 base symbols, but once conjuncts and mātrās are combined, the number of distinct visual units a recognition model must learn runs into the thousands. The model is solving a much larger classification problem than its English counterpart.
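A rough way to see the gap between stored characters and visual units is to group code points into orthographic syllables (aksharas). The sketch below is a simplification under stated assumptions: it attaches every Unicode combining mark to the previous cluster and lets a virama pull the next letter into the same cluster. It ignores zero-width joiners and other edge cases, and split_aksharas is a hypothetical helper, not a library function.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, which glues consonants into conjuncts


def split_aksharas(text):
    """Group code points into approximate visual units (simplified rules)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")  # vowel signs, anusvara, virama, ...
        after_virama = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category == "Lo"  # only a letter can continue a conjunct
        )
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    word = "पुस्तक"
    print(len(word), "code points ->", split_aksharas(word))
```

Under these rules पुस्तक splits into three clusters (पु, स्त, क). Counting the distinct clusters in a real corpus is one way to estimate how many classes a recognition model actually faces.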
What works in 2026
Three families of tool are usable today:
- Google’s Cloud Vision API with Devanagari support handles clean modern print well, for example a printed government circular or a published book. It struggles with handwriting, faded photocopies, and complex multi-column government forms.
- Open-source models from the IndicOCR / Bharat OCR family, often fine-tuned from transformer backbones, perform respectably on printed Nepali and have the advantage of running on your own hardware. They are the default starting point for a serious in-house effort.
- Custom fine-tuned models, built by taking an open backbone and training it on a few thousand labelled samples of your specific document type (court judgment, land deed, citizenship certificate). For a high-volume vertical application this approach beats the generic options decisively. For a small one-off project it is overkill.
For handwritten Nepali, which covers most of the country’s older paper, none of these tools is reliable yet. Field-level handwritten amounts on a lalpurja (land deed), notes in the margin of a school register, signatures on a court filing: error rates remain high enough that a human in the loop is still mandatory.
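One common way to keep a human in the loop is to route fields by type and confidence. The sketch below assumes the OCR engine reports a per-field confidence between 0 and 1 and that an earlier step has flagged handwritten fields. The OcrField structure, the field names, and the 0.90 threshold are all hypothetical and would need tuning on real documents.

```python
from dataclasses import dataclass


@dataclass
class OcrField:
    name: str
    text: str
    confidence: float  # assumed: 0.0 to 1.0, as reported by the OCR engine
    handwritten: bool  # assumed: set by an earlier layout or classifier step


def needs_human_review(field, threshold=0.90):
    """Handwritten fields always go to a person; printed ones only if unsure."""
    return field.handwritten or field.confidence < threshold


def route(fields, threshold=0.90):
    """Split fields into (auto_accepted, sent_to_review), preserving order."""
    auto, review = [], []
    for field in fields:
        target = review if needs_human_review(field, threshold) else auto
        target.append(field)
    return auto, review
```

Sending every handwritten field to review regardless of confidence is a deliberately conservative choice: a model can be confidently wrong on handwriting.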
The paper archives that matter
If you wanted to set Nepali public-good AI back by a decade, you would lose access to certain paper archives. Conversely, digitising them well is one of the highest-leverage things the country could do. A non-exhaustive list:
- The Supreme Court and high court judgment archives — fifty-plus years of case law, mostly scanned PDFs of varying quality. A searchable, structured corpus would transform Nepali legal research overnight.
- Land revenue records (malpot offices) — every district holds its own paper moth registers, plus newer scanned blueprints. Some districts are partially digitised; many are not.
- The National Archives, Ministry of Culture — manuscripts, royal records, historical documents.
- School and university transcripts — the country’s educational history sits in tens of thousands of paper certificates verified individually by mail and stamp.
- Health post registers — the front-line data of the public health system, kept on paper at facility level.
- Old newspapers — Gorkhapatra has been printed continuously since 1901. A scanned, OCR’d, searchable corpus of Gorkhapatra would be one of the most valuable Nepali-language text resources in existence.
A practical workflow
A working Devanagari OCR pipeline, in 2026, usually looks like this:
- Pre-processing: deskew, denoise, increase contrast. This is where a large share of the accuracy battle (roughly 30%) is fought, and where researchers more interested in models often under-invest.
- Layout detection: split the page into text blocks, tables, stamps, signatures, marginalia.
- Recognition: run a Devanagari-aware OCR model on each text block.
- Post-processing: apply a Nepali language model to correct obvious mis-recognitions (“रामलाल” misread as “रागलाल” is the kind of error such a model cleans up).
- Human verification: at least for any high-stakes use, sample-check outputs against original scans.
The last step is unsexy and the one most often skipped. It is also what separates a working pipeline from a credible one.
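The last two steps can be sketched with the Python standard library alone. This is a stand-in, not the approach described above: instead of a Nepali language model it uses a plain word-list lookup via difflib, and for verification it computes a character error rate (CER) between OCR output and a human transcription. The function names, the toy lexicon, and the 0.8 cutoff are assumptions made for illustration.

```python
import difflib


def correct_token(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry, or the token unchanged if none is close."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def edit_distance(a, b):
    """Levenshtein distance over code points, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edits needed to turn the OCR output into the reference, per reference char."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


if __name__ == "__main__":
    lexicon = ["रामलाल", "श्यामलाल"]  # toy word list, assumed for illustration
    print(correct_token("रागलाल", lexicon))
    print(character_error_rate("रागलाल", "रामलाल"))
```

A word list cannot use context the way a language model can, so it will miss errors that produce another valid word. Tracking CER on a hand-checked sample is what tells you whether either approach is actually helping.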
What comes next
If writing is one mode of the language, speech is the other — and for much of rural Nepal, the more important one. The next section is about Nepali ASR (automatic speech recognition) and TTS (text-to-speech): what works on a phone today, and what a public-good Nepali voice corpus could unlock.