Advanced neural network techniques · Introduction to AI

Two architectural ideas are responsible for almost every famous AI system you have ever interacted with: convolutions, which date to the late 1980s and came to dominate image recognition around 2012, and transformers, introduced in 2017. Convolutions made image recognition work, and through it, Devanagari OCR on your phone. Transformers made language work, and through them, ChatGPT, Claude, Gemini, Google Translate, and the underlying engines behind virtually every modern chatbot.

You do not need to read the original research papers to understand them. You need the one-paragraph intuition for each, and to know when each one is the right tool. That is what this section delivers.

Convolutions, in one idea

A naive fully-connected neural network that reads an image does a strange thing: it treats the image as a flat list of numbers, with no built-in notion of which pixels are neighbours. Nothing in the network knows that pixel (5, 7) is next to pixel (5, 8). A more sensible architecture would use the spatial structure.

That is what a convolutional neural network (CNN) does. Instead of having each neuron look at the entire image, a CNN slides a small filter across the image. The filter is just a tiny patch of weights — say 3×3 — that produces a single number for each position it covers. After applying the filter at every position, you have a new “image” of feature responses.
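The sliding-filter operation can be sketched in a few lines of plain Python. This is a minimal illustration, not how production libraries implement it: the function name conv2d, the 4×4 image, and the all-ones filter are assumptions chosen for readability, and there is no padding, stride, or bias. (Strictly, deep-learning libraries compute a cross-correlation and call it a convolution; the sketch does the same.)

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # One output number: multiply each weight by the pixel under it, then sum.
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 4x4 "image" and a 3x3 filter of ones (it simply sums each patch).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[54.0, 63.0], [90.0, 99.0]]
```

A 4×4 image and a 3×3 filter give a 2×2 feature map, because the filter fits in only two positions along each axis.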

What does the filter detect? Whatever its weights have been trained to detect. After training:

Some filters become edge detectors (high response where there is a vertical line; low elsewhere).
Some become corner detectors.
Some become detectors for specific textures.
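To see what an edge detector looks like as numbers, here is a small sketch with hand-set weights. In a trained CNN the weights are learned rather than written by hand; the filter values, the helper name apply_filter, and the toy image below are all assumptions for illustration.

```python
def apply_filter(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    size = len(kernel)
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Dark left half (0), bright right half (1): one vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-set weights: negative on the left, positive on the right.
vertical_edge_filter = [[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]]

response = apply_filter(image, vertical_edge_filter)
for row in response:
    print(row)  # [0, 3, 3, 0] for each of the two rows
```

The response is high only where the filter straddles the dark-to-bright boundary and zero over the flat regions, which is the "high response where there is a vertical line; low elsewhere" behaviour described above.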

The next layer of the CNN applies more filters to the output of the first layer. Those see combinations of edges — curves, junctions. The layer after that sees combinations of those — small shapes. By the time you reach the final layers, the network is responding to parts of objects — the head of the letter क, the loop of ध, the stroke of ज्ञ.
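One way to see why deeper layers respond to larger structures is to count how many input pixels a single output value depends on (its receptive field). The sketch below assumes stacked 3×3 filters with stride 1 and no pooling; real CNNs usually add pooling or strides, so their receptive fields grow faster than this. The function name receptive_field is an assumption.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width (in input pixels) seen by one output value after stacking
    `num_layers` convolutions, assuming stride 1 and no pooling."""
    size = 1
    for _ in range(num_layers):
        # Each extra layer widens the view by (kernel_size - 1) pixels.
        size += kernel_size - 1
    return size


sizes = [receptive_field(n) for n in range(1, 5)]
print(sizes)  # [3, 5, 7, 9]
```

So a value in the first layer sees a 3×3 patch, while a value four layers up already depends on a 9×9 patch of the original image.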

This is why CNNs are good at images: the architecture itself encodes the prior knowledge that images have spatial structure and that the same visual pattern can appear at any position.
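The claim that the same pattern is recognised at any position can be checked with a one-dimensional sketch. The signals, the filter weights, and the helper name slide_1d are assumptions for illustration; the point is only that one set of weights, reused at every position, gives the same peak response wherever the pattern sits.

```python
def slide_1d(signal, weights):
    """Apply the same weights at every position of a 1-D signal."""
    width = len(weights)
    return [
        sum(signal[start + k] * weights[k] for k in range(width))
        for start in range(len(signal) - width + 1)
    ]


pattern_filter = [1, 2, 1]

# The same bump [1, 2, 1] placed early and late in the signal.
early = [0, 1, 2, 1, 0, 0, 0, 0]
late = [0, 0, 0, 0, 1, 2, 1, 0]

early_response = slide_1d(early, pattern_filter)
late_response = slide_1d(late, pattern_filter)

print(early_response)  # [4, 6, 4, 1, 0, 0]
print(late_response)   # [0, 0, 1, 4, 6, 4]
```

The peak value is 6 in both cases; only its position moves, by the same three steps the pattern moved. A fully-connected layer would need separate weights to learn each position.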

What CNNs power in Nepal

Devanagari OCR. The character classifier on your phone is a CNN. It learned, from tens of thousands of labelled Devanagari character images, what each letter looks like across handwriting styles, fonts, and lighting.
Crop disease detection. A photo of a paddy leaf can be classified — by a small CNN trained on agricultural data — into healthy, blast-infected, bacterial-blight, etc. Small enough to run on a phone.
License plate reading. Nepali license plates, with their mix of Devanagari letters and numerals, are commonly read by a CNN inside automated parking and traffic systems.

None of these needs to be a large model. A CNN with a few million parameters, trained in hours on a laptop GPU, is enough.
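To get a feel for the sizes involved, the sketch below counts the parameters of a hypothetical small character classifier. The architecture is an assumption made up for this example (32×32 grayscale input, three 3×3 convolution layers each followed by 2×2 pooling, two dense layers, 46 output classes); it is not a description of any particular deployed system.

```python
def conv_params(kernel_size, channels_in, channels_out):
    """Weights plus one bias per output channel."""
    return kernel_size * kernel_size * channels_in * channels_out + channels_out


def dense_params(units_in, units_out):
    """Weights plus one bias per output unit."""
    return units_in * units_out + units_out


# Three rounds of 2x2 pooling shrink 32x32 to 4x4, so the flattened
# input to the first dense layer has 4 * 4 * 128 = 2048 values.
layers = {
    "conv1 (3x3, 1 -> 32)": conv_params(3, 1, 32),
    "conv2 (3x3, 32 -> 64)": conv_params(3, 32, 64),
    "conv3 (3x3, 64 -> 128)": conv_params(3, 64, 128),
    "dense1 (2048 -> 512)": dense_params(4 * 4 * 128, 512),
    "dense2 (512 -> 46)": dense_params(512, 46),
}

for name, count in layers.items():
    print(f"{name}: {count:,}")

total = sum(layers.values())
print(f"total: {total:,}")  # total: 1,165,358
```

Under these assumptions the whole network has a little over a million parameters, and most of them sit in the first dense layer rather than in the convolution filters.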

Transformers, in one idea

Language is sequential. A sentence is a sequence of words; each word’s meaning depends on the others around it. Before 2017, neural networks for language processed words one at a time, accumulating state — a class of model called recurrent neural networks (RNNs). They worked, but they were slow to train and had trouble with long-range dependencies (a word at the start of a sentence influencing a word at the end).

The transformer, introduced in 2017 by Google researchers, replaced sequential processing with attention. The core idea: when processing any one word, the network is allowed to attend — look at — every other word in the input simultaneously, weighting them by how relevant each is. Mathematically, this is a lot of weighted sums, layered the way a CNN layers convolutions.
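The "weighted sums" can be made concrete with a minimal sketch of single-head scaled dot-product attention in plain Python. Several things here are simplifying assumptions: the two-dimensional word vectors are invented, and the same vectors are used as queries, keys, and values, whereas a real transformer derives each of the three from learned projections and runs many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a relevance-weighted sum of all values."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every word to this one: scaled dot product.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one dimension at a time.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three invented word vectors standing in for a three-word sentence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(embeddings, embeddings, embeddings)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and says how much one word draws on every word in the input, itself included. Nothing in the loop over queries depends on the previous iteration, which is why the computation can run for all words at once.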

This sounds technical, but the practical consequence is enormous. Transformers can be trained in parallel (one word’s processing doesn’t have to wait for the previous one). They handle long-range dependencies far better than RNNs, because any word can attend directly to any other. And, most importantly, they scale: making a transformer larger, given enough data and compute, has so far reliably produced a better model. CNNs scale; transformers scale better.

Nearly every famous language model in 2026 is a transformer: ChatGPT, Claude, Gemini, and the underlying engine of Google Translate’s Nepali model. Modern Devanagari speech-to-text and Nepali text-to-speech systems are increasingly built on transformers too. The architecture is broadly the same in shape; what differs is size, training data, and post-training refinement.

What transformers do, what they don’t

Transformers are extraordinary at:

Sequence-to-sequence tasks: translation, summarisation, code generation.
Language modelling: predicting the next word given context.
Question answering when the answer lives in the input.

They are not magical at:

Truth. A transformer predicts plausible text. Whether that text is true is a function of its training data and post-training, not of the architecture.
Reasoning accurately over long chains. Transformers can mimic reasoning patterns, but they can produce confident wrong answers when the chain is long.
Knowing what they don’t know. Transformers do not come with reliably calibrated uncertainty. They may answer a question about Newari history they have never seen with the same confidence as one they have memorised.

This is why a well-built deployed system using a transformer also has guardrails: retrieval (look up an answer in a database rather than relying on the model’s memory), confidence thresholds, human-in-the-loop review. The architecture is powerful; it is not a finished AI system.
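The three guardrails can be sketched as a simple decision order: try retrieval first, accept the model's own answer only above a confidence threshold, and otherwise hand the question to a person. Everything named below is hypothetical: the knowledge base, the threshold of 0.8, and the function answer_with_guardrails. The confidence score is assumed to come from some separate scoring step, since the model itself does not supply a trustworthy one.

```python
# Hypothetical trusted facts; a real system would query a database.
KNOWLEDGE_BASE = {
    "visiting hours": "Visiting hours are 4 pm to 6 pm.",
}

# Assumed cut-off; in practice this would be tuned on real data.
CONFIDENCE_THRESHOLD = 0.8


def answer_with_guardrails(question, model_answer, model_confidence):
    """Return (source, answer), preferring trusted data over model memory."""
    # 1. Retrieval: use a stored fact if the question mentions a known topic.
    for topic, fact in KNOWLEDGE_BASE.items():
        if topic in question.lower():
            return ("retrieved", fact)

    # 2. Confidence threshold: accept the model's answer only if it scores high.
    if model_confidence >= CONFIDENCE_THRESHOLD:
        return ("model", model_answer)

    # 3. Human-in-the-loop: otherwise escalate instead of guessing.
    return ("human_review", None)


print(answer_with_guardrails("What are the visiting hours?", "Around 3 pm", 0.95))
print(answer_with_guardrails("Is parking available?", "Yes, behind block B", 0.90))
print(answer_with_guardrails("Can I take this with food?", "Probably", 0.40))
```

Note that the first call returns the stored fact even though the model was confident in a different answer: retrieval takes priority over the model's memory.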

A quick anatomy of a modern Nepali AI system

To make this concrete: imagine a Nepali-language chatbot deployed by a hospital in Pokhara to answer patient questions about appointments and medications.

Speech input. A small CNN-and-RNN combination, trained on Nepali audio data, converts spoken Nepali to text.
Language understanding. A transformer (small or large depending on budget) reads the patient’s text question and produces a structured representation of what they want.
Retrieval. The system looks up the answer in the hospital’s actual database — not in the transformer’s weights.
Language generation. A transformer turns the retrieved answer into a polite Nepali sentence.
Speech output. A separate model converts the sentence back to spoken Nepali.
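The five stages above can be sketched as five small functions chained together. Every function here is a stub standing in for a real model or database, and all names, the patient ID format, and the sample data are hypothetical. The strings are in English for readability; a real system would work in Nepali.

```python
# Hypothetical stand-in for the hospital's real appointment database.
APPOINTMENTS = {"P-102": "10:30 on Tuesday"}


def speech_to_text(audio):
    # Stub: a real system would run a speech-recognition model here.
    return audio["spoken_words"]


def understand(text):
    # Stub for the language-understanding model: returns a structured request.
    patient_id = next(
        (word for word in text.split() if word.startswith("P-")), None
    )
    return {"intent": "appointment_time", "patient_id": patient_id}


def retrieve(request, database):
    # The answer comes from the database, not from any model's weights.
    return database.get(request["patient_id"])


def generate(fact):
    # Stub for the language-generation model.
    if fact is None:
        return "Sorry, I could not find that appointment."
    return f"Your appointment is at {fact}."


def text_to_speech(sentence):
    # Stub: a real system would synthesise audio here.
    return {"audio_of": sentence}


def handle(audio):
    """Run one patient question through all five stages in order."""
    text = speech_to_text(audio)
    request = understand(text)
    fact = retrieve(request, APPOINTMENTS)
    sentence = generate(fact)
    return text_to_speech(sentence)


reply = handle({"spoken_words": "When is the appointment for P-102"})
print(reply)  # {'audio_of': 'Your appointment is at 10:30 on Tuesday.'}
```

Because each stage only passes plain data to the next, any one of them can be replaced or tested in isolation without touching the others.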

The whole system is five small components cooperating: four models and a database lookup. None of them is GPT-class. Each can be built, audited, and updated on its own. This is what real, working AI looks like: not one giant brain, but a pipeline of specialised parts.

Check your understanding

Quick check

Which architecture is the dominant choice in 2026 for language models like ChatGPT, Claude, and Google Translate's Nepali model?

Convolutional neural networks (CNNs)
Transformers
Decision trees
Naive Bayes

What comes next

We close the neural-networks chapter here. We have travelled from a single neuron to the architectures behind every famous AI system. The final chapter — Implications — steps back from the engineering and asks the questions that matter for Nepal: who benefits, who is harmed, who decides, and what AI literacy demands of a country trying to use these tools well.