ailiteracynepal 🇳🇵

Chapter 01 · Section II · 14 min read

LLMs and foundation models

A short, jargon-light look at what's actually inside ChatGPT, Claude, and Gemini — and why the same model can do so many different things.

You do not need to be an engineer to use generative AI well. You do need a mental model of what’s behind the chat box, because the model’s behaviour starts making much more sense once you have one.

This section is that mental model — short, deliberately imprecise where precision would not help you, and grounded in things you can verify yourself.

A large language model is a next-word predictor

Underneath ChatGPT, Claude, Gemini, and every chatbot of their class is a large language model (LLM): a neural network that has been trained to do one thing — predict the next word (technically: the next token) given everything that came before.

It sounds laughably simple. But if a model can predict the next word reliably enough, it can keep predicting and produce arbitrarily long, coherent text. That is the whole reason these systems can write essays, summarise contracts, or draft Nepali emails.

The model never “writes a paragraph.” It writes one word. Then it reads what’s been written so far — including the word it just produced — and writes the next word. Then the next. Then the next. The illusion of intent, paragraph structure, even reasoning, all emerges from this loop.
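The loop described above can be sketched in a few lines of Python. This is a toy under loud assumptions: the hand-written table NEXT_WORD and the functions predict_next and generate are hypothetical names invented for this illustration, and the table stands in for a neural network that would weigh the entire text so far, not just the last word.

```python
# Toy illustration only: a hand-written lookup table stands in for the
# neural network. NEXT_WORD, predict_next and generate are made-up names.
NEXT_WORD = {
    "<start>": "the",
    "the": "model",
    "model": "writes",
    "writes": "one",
    "one": "word",
    "word": "<end>",
}


def predict_next(words):
    """Return the next word given everything written so far.

    This toy looks only at the last word; a real LLM considers the
    whole preceding text.
    """
    return NEXT_WORD.get(words[-1], "<end>")


def generate(max_words=20):
    """Produce text one word at a time until <end> or the word limit."""
    words = ["<start>"]
    while len(words) <= max_words:
        next_word = predict_next(words)
        if next_word == "<end>":
            break
        # The new word becomes part of the input for the next prediction.
        words.append(next_word)
    return " ".join(words[1:])


print(generate())  # the model writes one word
```

Nothing in generate plans a sentence in advance. The sentence exists only because each single-word prediction was fed back in as input for the next one.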

What “large” actually means

The “large” in LLM is about two numbers: parameters and training data.

  • A model has a number of parameters: the learned weights inside the neural network. GPT-2 in 2019 had roughly 1.5 billion. Parameter counts for GPT-4 class models have not been officially published, but outside estimates put them in the hundreds of billions or more.
  • A model is trained on tokens — chunks of text. Modern models train on trillions of tokens.
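To get a feel for what a parameter count means physically, here is a rough back-of-the-envelope sketch. It assumes each parameter is stored as a 16-bit number (2 bytes), which is a common choice but not something this section states about any particular model. The function name weights_size_gb and the 300 billion figure are hypothetical, chosen only for the arithmetic.

```python
# Rough arithmetic only. Assumption: each parameter is stored in 2 bytes
# (16-bit numbers). Real systems may use more or fewer bytes per parameter.
BYTES_PER_PARAMETER = 2


def weights_size_gb(num_parameters, bytes_per_parameter=BYTES_PER_PARAMETER):
    """Approximate storage needed just to hold the weights, in gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


# GPT-2 (2019): roughly 1.5 billion parameters.
print(f"{weights_size_gb(1.5e9):.1f} GB")  # 3.0 GB

# A hypothetical 300-billion-parameter model (illustrative number only).
print(f"{weights_size_gb(300e9):.1f} GB")  # 600.0 GB
```

Under this assumption, the weights of GPT-2 would fit on an ordinary laptop, while the hypothetical larger model would need hundreds of gigabytes before it has processed a single word.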

More parameters, more training data, and more compute together tend to produce a model that handles a wider range of tasks more competently. This pattern, known as scaling, has been the central insight of the last five years of AI research. It is also the reason these models are expensive to build: training a frontier model is estimated to cost tens of millions to hundreds of millions of US dollars.

The practical implication: you cannot train one of these. Almost no one can. What you can do — and what this course teaches — is use one well.

Foundation models: one model, many uses

The term foundation model describes a different idea. The same base LLM, after training, can be adapted to many specific uses without being retrained from scratch.

  • Versions of GPT-4 have powered ChatGPT, Microsoft Copilot, the GitHub Copilot coding assistant, dozens of customer support chatbots, and Copilot features inside Word.
  • The same Claude model powers a research tool, a code editor, a customer-service deployment, and an internal knowledge base for several companies.

This “one model, many uses” property is what “foundation” refers to. It is also why the same model that helps a Nepali high-school student write a history essay can help a Khalti engineer write a SQL query. The foundation is broad enough to support both.

Why models “know” so much

The training data for a modern LLM includes a large fraction of the public web, books, academic papers, code repositories, and miscellaneous reference material. Over the course of training, the model has read more text than any human ever could.

This is not the same as knowing facts. The model has not stored a structured database of facts. It has stored statistical patterns — what word usually follows what, what kind of paragraph usually follows what kind of question, what tone is appropriate when. When you ask “who was Bhanubhakta Acharya?”, the model produces a plausible answer because it has seen many discussions of Bhanubhakta in its training data, and the patterns suffice for a useful summary.
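The difference between stored facts and stored patterns can be made concrete with a very small sketch. It assumes a three-sentence made-up corpus and simply counts which word follows which. The names CORPUS, learn_patterns and most_likely_next are hypothetical, and a real LLM learns far richer patterns with a neural network, not a count table.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for trillions of tokens of training text.
CORPUS = [
    "bhanubhakta acharya was a nepali poet",
    "bhanubhakta acharya translated the ramayana",
    "bhanubhakta acharya was a poet",
]


def learn_patterns(sentences):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            follows[current][following] += 1
    return follows


def most_likely_next(follows, word):
    """Return the most frequent follower of `word`, or None if never seen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


patterns = learn_patterns(CORPUS)
print(most_likely_next(patterns, "acharya"))  # was
print(most_likely_next(patterns, "karnali"))  # None
```

The table contains no entry saying who Bhanubhakta was. It only records that "was" followed "acharya" more often than "translated" did. This toy returns None for a word it has never seen, whereas a real LLM will still produce fluent text when its patterns on a topic are thin.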

The corollary: where the training data on a topic was thin — say, the genealogy of a specific Newari family, or events in a small village in Karnali — the model has no foundation to draw on. It will still try to answer, because that is what it is trained to do. The answer may be partly or wholly invented. We return to this in Chapter 5.

Why models are fluent in many languages

One question Nepali learners often ask: how is ChatGPT good at Nepali at all, given that it must have been trained mostly on English?

The answer is two-fold. First, some Nepali text exists in the training data — Wikipedia, news sites, government documents, blog posts. It is much less than English text, but it is not zero. Second, the model learns abstract structure — grammar, semantics, the relationships between concepts — partly from one language, partly from another. The structure transfers.

This is why a model can write a credible Nepali email even if most of its training text was English. It is also why a model is worse at low-resource languages like Limbu or Tharu: the structural patterns learned from other languages transfer, but the specific vocabulary does not, and there is too little Limbu or Tharu text to learn it from.

What this means for using these tools

A few things you should now expect:

  • The model is much better in well-resourced languages (English, then a handful of others including Nepali). It degrades for low-resource languages.
  • The model is much better on widely-discussed topics than niche ones. Bhanubhakta, yes; the specific history of a small Tamang village, no.
  • The model is much better at imitating a style or task than at being factually correct about specifics. Drafting an email — easy. Citing the exact section of the Muluki Civil Code — risky without verification.

If you keep those three patterns in mind, most of what is to come in this course will feel like extension rather than surprise.

What comes next

We’ve looked at the engine. The next section is about what you can put into the engine and what comes out — the modalities: text, image, audio, video, code. Each has its own strengths, weaknesses, and best-known tools.