Chapter 01 · Section III
Text, image, audio, video
The four main "modalities" generative AI handles today — what each is good for, what the best-known tools are, and where each one breaks.
“Generative AI” is not one tool. It is a category, organised loosely around the kind of output the model produces. The output type is called the modality, and almost every practical question about which tool to use comes down to: what modality do I need?
This section is a tour. Four modalities, what each one is good for, and the rough state of the art in 2026.
Text
The most mature and most useful modality. Tools you will recognise:
- ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google) — the big general-purpose chat models. All three handle Nepali competently and English exceptionally.
- Microsoft Copilot — a Microsoft-branded wrapper around GPT, integrated into Office.
- Smaller and open-source: Llama, Mistral, Qwen — usable but require more technical setup; relevant for organisations that need data to stay on their own servers.
What text models are good at:
- Drafting (emails, reports, posts, summaries).
- Editing (tightening, restructuring, changing tone).
- Translation, especially between major languages.
- Extracting structured data from messy text.
- Answering general-knowledge questions where errors are tolerable.
- Code generation for common tasks.
Where they break:
- Specific facts about people, places, and events outside the training data. They confidently invent details.
- Math beyond a few steps. They will often produce confidently wrong arithmetic.
- Anything that requires up-to-date information unless the tool has live web access.
- Sustained reasoning over long chains.
The text models are the workhorses. Most of what this course teaches applies to them.
Image
The most spectacular modality. Three major players in 2026:
- Midjourney — strongest aesthetic; subscription-based; runs in its own app.
- DALL·E 3 — built into ChatGPT Plus; weaker than Midjourney aesthetically but tightly integrated with text models.
- Stable Diffusion (and many open-source variants) — runs on your own hardware if you want, infinitely customisable, requires more setup.
There are dozens of others — Ideogram, Flux, Imagen, the image generators inside Adobe and Canva — but the pattern is the same. You write a prompt, the model produces an image.
What image models are good at:
- Stylised art, illustrations, mood boards.
- Concept design and visual exploration.
- Stock-photo-style images where details don’t need to be exact.
Where they break:
- Text inside images. Most still misspell Devanagari (and even English) when asked to render text on a poster or sign.
- Hands and fingers. A famous failure mode: five fingers become six, then four, with eerie inconsistency.
- Specific people. Asking for “Prime Minister of Nepal in 2024” produces someone vaguely Nepali-looking but not the actual person.
- Cultural specifics. Asking for “a traditional Newari home” produces an image with Newari vibes but architectural details that are often wrong.
Audio: speech and voice
Audio splits into a few sub-modalities, each with its own tools.
Speech-to-text (transcription). Recording in, text out. The dominant tool is OpenAI Whisper, which handles Nepali surprisingly well. Used widely now for journalism, meeting transcription, podcast subtitles, and accessibility.
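For readers comfortable with a little Python, the transcription workflow can be sketched roughly as follows. This is a minimal sketch, assuming the open-source `openai-whisper` package is installed. The `small` model size, the file name `meeting.mp3` and the helper names are illustrative choices made for this example, not recommendations from this course.

```python
def format_timestamp(seconds: float) -> str:
    """Turn a number of seconds into MM:SS for a readable transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments: list[dict]) -> str:
    """Render Whisper-style segments (dicts with 'start' and 'text') as lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_nepali(audio_path: str, model_size: str = "small") -> str:
    """Transcribe a recording, telling Whisper the speech is Nepali ('ne')."""
    import whisper  # imported here so the helpers above work without it

    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path, language="ne")
    return format_segments(result["segments"])


# Usage (hypothetical file name; needs the whisper package and a real recording):
# print(transcribe_nepali("meeting.mp3"))
```

Passing the language explicitly skips automatic language detection. Whatever comes out should still be read against the recording before it is quoted or published.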
Text-to-speech (voice synthesis). Text in, audio out. ElevenLabs is the best-known voice cloning service; OpenAI, Google, and many smaller services also offer this. Nepali voice synthesis has been catching up rapidly — multilingual models from 2024 onwards produce passably natural Nepali, though clearly synthetic to a careful ear.
Music generation. Suno and Udio generate full-length songs from a text prompt. Quality is striking, copyright is messy, and the cultural implications for working musicians are real.
Where audio breaks:
- Speakers with strong regional accents are transcribed less accurately.
- Voice synthesis of Nepali still has subtle prosody and intonation errors that betray it as AI.
- Music generation cannot match the cultural depth of, say, classical raga or specific Nepali traditional forms.
Video
The newest and most rapidly improving modality, and the one most prone to overhype.
In 2026, the leading tools are Sora (OpenAI), Veo (Google), Runway, and Pika. They produce short clips — typically 5 to 60 seconds — from a text prompt. Quality has improved dramatically year over year. Cost is still high (cents to dollars per second of video).
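To make that cost range concrete, the arithmetic is simply clip length multiplied by price per second. The sketch below uses two assumed prices, $0.05 and $2.00 per second, chosen only to stand in for "cents" and "dollars". They are not quotes from any provider.

```python
def clip_cost(seconds: float, price_per_second: float) -> float:
    """Estimated cost in dollars of generating one clip."""
    if seconds < 0 or price_per_second < 0:
        raise ValueError("seconds and price_per_second must be non-negative")
    return round(seconds * price_per_second, 2)


# Assumed, illustrative prices in dollars per second of generated video.
CHEAP_RATE = 0.05
EXPENSIVE_RATE = 2.00

# The two ends of the clip-length range mentioned above.
for length in (5, 60):
    low = clip_cost(length, CHEAP_RATE)
    high = clip_cost(length, EXPENSIVE_RATE)
    print(f"{length:>2}-second clip: ${low:.2f} to ${high:.2f}")
```

Under these assumed prices, a single 60-second clip lands somewhere between $3 and $120, which is why it pays to check a tool's actual pricing before planning a project around it.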
What they’re good at:
- Short stylistic clips, B-roll, dreamlike sequences.
- Adapting existing video to different styles.
Where they break:
- Anything longer than about a minute; continuity breaks down.
- Anything that needs an actor’s specific identity or a specific real place.
- Realistic physics over long sequences — objects appear and disappear, or move strangely.
The honest assessment in 2026: video generation is impressive but not yet useful for most production work outside short-form social content and prototyping. That will change. Watch this modality especially closely.
Code
A special case. Code is technically text, but specialised tools — GitHub Copilot, Cursor, Windsurf, Claude Code — make code generation feel like its own modality.
Useful capabilities in 2026:
- Autocompleting boilerplate.
- Translating between programming languages.
- Writing tests and documentation.
- Explaining existing code.
- Debugging error messages.
Where they break:
- Anything that requires understanding the larger system the code lives in.
- Subtle correctness — generated code that compiles and looks right but is wrong in a non-obvious way.
- Security-sensitive code, where confident-but-wrong is dangerous.
Most professional software developers in 2026 use one of these tools daily. Most also have war stories about code the tool produced confidently and incorrectly.
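As a constructed illustration of the "compiles and looks right but is wrong" failure described above, consider a leap-year check. The first function below is hypothetical, written for this example rather than taken from any particular tool. It runs without errors and passes casual testing, but it ignores the century rule of the Gregorian calendar.

```python
def is_leap_year_plausible(year: int) -> bool:
    """Looks right and works for 2024, but is wrong for years like 1900."""
    return year % 4 == 0


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# A quick spot check with recent years hides the bug entirely.
for year in (2023, 2024):
    assert is_leap_year_plausible(year) == is_leap_year(year)

# Only an edge case exposes it: 1900 was not a leap year.
print(is_leap_year_plausible(1900), is_leap_year(1900))  # prints: True False
```

Both versions agree on every year a person is likely to try first. That is what makes this kind of error hard to catch by eye and why generated code deserves tests on edge cases, not just a read-through.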
Choosing which modality you need
A simple question to start with: what output type am I trying to produce?
- A draft email or report → text model.
- An illustration for a presentation → image model.
- A transcript of a Nepali meeting → speech-to-text.
- A voiceover for a video → text-to-speech.
- A short clip → video model (with realistic expectations).
- A few lines of code → coding assistant.
The mistake to avoid is using the wrong tool. A chat model such as ChatGPT does not draw images itself: when it returns one, it has passed your request to an image model (DALL·E), and that model's strengths and limits are what you get. A coding assistant will not draft a polite Nepali email well. Use the tool that was designed for the modality you need.
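The checklist above amounts to a lookup from the output you want to the kind of tool to reach for. A minimal sketch follows. The labels and the `pick_tool` name are made up for this illustration and are not an official taxonomy.

```python
# Maps the kind of output you want to the kind of tool to reach for.
# Keys and wording are illustrative labels taken from the checklist.
TOOL_FOR_OUTPUT = {
    "draft email or report": "text model",
    "illustration": "image model",
    "meeting transcript": "speech-to-text",
    "voiceover": "text-to-speech",
    "short clip": "video model",
    "few lines of code": "coding assistant",
}


def pick_tool(output_type: str) -> str:
    """Return the tool category for an output type, or raise if unknown."""
    key = output_type.strip().lower()
    if key not in TOOL_FOR_OUTPUT:
        known = ", ".join(sorted(TOOL_FOR_OUTPUT))
        raise KeyError(f"Unknown output type {output_type!r}; known types: {known}")
    return TOOL_FOR_OUTPUT[key]


print(pick_tool("Voiceover"))  # prints: text-to-speech
```

The lookup fails loudly on anything it does not recognise. That mirrors the advice in the text: if the output you need does not match a modality, stop and rethink before picking a tool.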
Check your understanding
A journalist wants to publish accurate Devanagari text inside an AI-generated poster for a story. Which is the most realistic expectation in 2026?
What comes next
We’ve named the modalities. We move now from what these tools are to how you talk to them. Chapter 2 is about the single skill that decides whether you get useful output: the prompt.