Chapter 04 · Section I · 14 min read
Image generation
How modern image generators work, how to prompt them, and where the cracks still show in 2026.
In 2020, generating an image from a text prompt was a research curiosity. By 2026, it is a feature inside Word and PowerPoint and a Rs. 1,500-a-month subscription that anyone with a phone can use. Image generation is the most visually striking application of generative AI, and the one most likely to be both used and misused.
This section is about using it well. We cover what these tools actually do, how to prompt them, and where the cracks still show.
How image models actually work
A modern image generator is a different beast from a language model, but the family resemblance is close. The dominant architecture is the diffusion model, which works roughly like this:
- Training. Show the model billions of (image, caption) pairs. Let it learn what kinds of images correspond to what kinds of descriptions.
- Generation. Start with pure noise — a TV-static image. Then, guided by your text prompt, iteratively denoise the image, step by step, until a coherent picture emerges that matches the prompt.
That description is technical but the consequence is intuitive: each generation is a new image. The model is not retrieving a stored picture; it is composing one from learned patterns, in the direction your prompt pulls it.
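The generation loop described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not a real diffusion model: `prompt_target` is a hypothetical stand-in for the direction a trained network would infer from your prompt, and the names `toy_generate`, `steps`, and `strength` are invented for this example.

```python
import random


def toy_generate(prompt_target, steps=20, strength=0.2, seed=None):
    """Toy stand-in for diffusion sampling (illustrative only).

    prompt_target: pixel values the prompt "pulls" toward. A real model
    estimates this direction with a neural network; here it is simply given.
    """
    rng = random.Random(seed)
    # Start from pure noise: one random value per pixel.
    image = [rng.gauss(0.0, 1.0) for _ in prompt_target]
    for _ in range(steps):
        # Each step removes a fraction of the remaining gap to the target.
        image = [px + strength * (goal - px) for px, goal in zip(image, prompt_target)]
    return image


target = [0.2, 0.8, 0.5]  # a three-"pixel" picture
first = toy_generate(target, seed=1)
second = toy_generate(target, seed=2)
# Both runs end close to the target, but they start from different noise,
# so the two results are not identical.
```

Changing the seed changes the starting noise and therefore the result, which mirrors the point that each generation is composed afresh rather than retrieved.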
The tools you will use:
- Midjourney — strongest aesthetic; runs in its own app and on Discord; subscription.
- DALL·E 3 — integrated into ChatGPT Plus; weaker aesthetically but easy to access.
- Stable Diffusion (including SDXL) and Flux — open-weight models; you can run them yourself on a decent GPU; extensive customisation.
- Ideogram — known specifically for handling text inside images, which most others botch.
- Adobe Firefly, Canva, Imagen — image generation inside design tools you already use.
For most casual use in 2026, whatever is in the tool you already pay for is the right answer. The aesthetic differences between models are real but small.
The structure of a good image prompt
Image prompts have a different shape from text prompts. Five elements, in roughly this order:
- Subject. What is in the picture?
- Setting/context. Where? What’s around the subject?
- Style. Photo? Painting? Illustration? Which style of painting?
- Mood/atmosphere. Bright? Moody? Cinematic? Dreamlike?
- Composition/camera. Close-up? Wide shot? From above? Bokeh?
A weak prompt:
A Nepali woman
This will get you something, but generic. A focused prompt:
A 30-year-old Nepali woman in a colourful kurtha-suruwal, standing in the courtyard of a Newari brick home in Patan at golden hour, soft natural light, photographic style, shallow depth of field, warm tones, looking off-camera
Each clause adds specificity. The model has more to anchor on.
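If you write prompts often, the five-element structure can be kept as a small template. The sketch below is one possible way to do it; the class name `ImagePrompt` and its `build` method are hypothetical, and joining the elements with commas is an assumption about what most tools accept, not a rule of any particular one.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """One field per element; only the subject is required."""
    subject: str
    setting: str = ""
    style: str = ""
    mood: str = ""
    composition: str = ""

    def build(self) -> str:
        # Keep the recommended order and skip any element left empty.
        parts = [self.subject, self.setting, self.style, self.mood, self.composition]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ImagePrompt(
    subject="A 30-year-old Nepali woman in a colourful kurtha-suruwal",
    setting="standing in the courtyard of a Newari brick home in Patan at golden hour",
    style="photographic style, soft natural light",
    mood="warm tones",
    composition="shallow depth of field, looking off-camera",
).build()
```

Leaving a field empty makes the gap visible: a prompt with only `subject` filled in is the weak prompt from the example above.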
Style anchors that work
Some style cues that consistently affect output across most models:
Photographic styles: “photograph, 35mm film, golden hour, shallow depth of field, bokeh, cinematic lighting” — produces realistic photos.
Illustration styles: “watercolour illustration, ink and wash, pen drawing, vector flat, isometric, cartoon” — produces artistic outputs.
Era/movement: “Studio Ghibli style, 1960s travel poster, Soviet propaganda poster, Mughal miniature painting, ukiyo-e woodblock” — borrows the aesthetic language of recognisable styles. Useful but risks blandness if overused.
Mood: “soft, dreamy, ominous, austere, joyful, melancholy, sun-soaked, rain-washed” — sets the emotional register.
Camera language: “close-up, wide shot, low angle, drone shot, over-the-shoulder, fisheye, double exposure” — affects composition.
A working habit: keep a small text file of phrases that worked well for you. Paste from it for related projects. The phrase library is one of the largest practical productivity gains you’ll make in image generation.
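A plain text file is enough for the phrase library, but if you prefer something you can query, a minimal sketch might look like the following. It assumes a JSON file grouped by category; the function names `save_phrase` and `load_phrases` and the file layout are illustrative choices, not part of any tool.

```python
import json
from pathlib import Path


def save_phrase(library_path, category, phrase):
    """Add a phrase under a category, creating the file if it does not exist."""
    path = Path(library_path)
    library = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    phrases = library.setdefault(category, [])
    if phrase not in phrases:  # avoid storing the same phrase twice
        phrases.append(phrase)
    path.write_text(json.dumps(library, indent=2, ensure_ascii=False), encoding="utf-8")
    return library


def load_phrases(library_path, category):
    """Return the saved phrases for a category, or an empty list."""
    path = Path(library_path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8")).get(category, [])
```

For example, `save_phrase("phrases.json", "photo", "35mm film, golden hour")` stores a phrase, and `load_phrases("phrases.json", "photo")` returns it for the next project.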
Negative prompts
Many tools accept a negative prompt: a list of things to avoid, entered in a separate field or, in Midjourney, with the --no parameter. It is useful when the model keeps adding something you don’t want.
Common examples:
- “no text, no watermarks, no logos” — for stock-photo-style imagery.
- “no extra fingers, no deformed hands” — for portraits.
- “no Hollywood stereotypes of Nepal” — when the model defaults to mountains-and-monks for any Nepali subject.
Negative prompts are like guardrails. Use them sparingly; over-using them can suppress useful details along with the unwanted ones.
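How you pass a negative prompt depends on the tool. The sketch below shows two common shapes under stated assumptions: a separate `negative_prompt` field (the dictionary layout here is hypothetical, not any specific API) and an inline `--no` suffix in the style Midjourney documents. In a dedicated field, tools generally expect the unwanted items themselves, so the helper strips a leading "no".

```python
def split_negatives(avoid):
    """Turn phrases like "no text" into bare terms like "text"."""
    terms = []
    for phrase in avoid:
        term = phrase.strip()
        if term.lower().startswith("no "):
            term = term[3:].strip()
        if term and term not in terms:
            terms.append(term)
    return terms


def as_request(prompt, avoid):
    """Hypothetical request shape for a tool with a separate negative-prompt field."""
    return {"prompt": prompt, "negative_prompt": ", ".join(split_negatives(avoid))}


def as_inline(prompt, avoid):
    """Inline form with a --no suffix; check your tool's own syntax before relying on it."""
    terms = split_negatives(avoid)
    return f"{prompt} --no {', '.join(terms)}" if terms else prompt


avoid = ["no text", "no watermarks", "no logos"]
request = as_request("stock photo of a village tap stand", avoid)
inline = as_inline("stock photo of a village tap stand", avoid)
```

Keeping the list short follows the advice above: every extra term is another thing the model may over-correct for.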
Where the cracks still show
As of 2026, these are the failure modes you should still expect.
Text inside images. Most models still write garbled letters. “Welcome to Pokhara” on a sign comes out as gibberish that looks like English but isn’t. Ideogram and Flux handle text better; for important text overlays, generate the image without text and add text in a graphics editor.
Hands. Hands have improved but still misbehave — six fingers, three knuckles in the wrong place, a thumb attached oddly. For close-ups of hands or interactions, expect retries.
Specific real people. Even with names in the prompt, models produce someone vaguely resembling the named person, not the actual person. The output is aesthetically right and factually wrong.
Specific cultural artefacts. Asking for “a traditional Newari woman’s mhecha (silver belt)” might produce something that has the vibe of Newari jewellery but isn’t a mhecha. Models are imprecise about culturally specific details outside the dominant Western training data.
Counts. “Five chickens in a courtyard” often produces three or seven chickens. Counting is unreliable in image generation.
Reflections, shadows, and symmetry. Mirrors that show impossible reflections; shadows that fall in the wrong direction; paired features such as eyes or earrings that come out subtly mismatched. The model has learned what these look like, not how they work physically.
A worked example: NGO hero image
You’re making a website for a Nepali NGO that works on rural water access. You want a hero image: real-feeling but inviting, conveying the work without being cliché.
Iterative session:
Prompt 1:
A photograph of rural water access in Nepal
Result: a generic mountain village. Cliché. Stock-feeling.
Prompt 2 — add subject specificity:
A 9-year-old Nepali girl filling a steel pot from a tap stand in a rural village in Sindhupalchok, mid-morning, soft sunlight, photographic, slightly faded colours, candid moment, photojournalistic style
Better. Specific, grounded.
Prompt 3 — refine after seeing output:
Same as above, but add: clear blue plastic pipe visible in the background, simple home with corrugated tin roof, no posed expression, looking down at the water flowing
The output is now usable. Three iterations, five minutes total. Without the iteration, you would have either accepted a worse image or given up.
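Some tools do not remember your previous prompt, so "same as above, but add" has to be sent as one full prompt. A minimal sketch of keeping that history explicit, assuming comma-joined prompts; the helper name `refine` is made up for this example:

```python
def refine(base_prompt, *additions):
    """Return the full prompt for the next attempt: the base plus new details."""
    extras = [a.strip() for a in additions if a.strip()]
    return ", ".join([base_prompt.strip(), *extras])


v1 = "A photograph of rural water access in Nepal"
v2 = (
    "A 9-year-old Nepali girl filling a steel pot from a tap stand in a rural "
    "village in Sindhupalchok, mid-morning, soft sunlight, photographic, "
    "slightly faded colours, candid moment, photojournalistic style"
)
v3 = refine(
    v2,
    "clear blue plastic pipe visible in the background",
    "simple home with corrugated tin roof",
    "no posed expression",
    "looking down at the water flowing",
)
history = [v1, v2, v3]  # keep every attempt so you can go back to an earlier one
```

Saving each version also makes it easy to see which added detail actually changed the output.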
Quick rules for working with image generators
- Iterate. First output is rarely the best. Try 3–4 variations.
- Be specific. Vague prompts produce vague output.
- Generate without text, add text separately. Don’t fight the tools on what they can’t do.
- Save your phrases. Build a small personal library of what works.
- Verify cultural specifics. Don’t trust the model on details a local would catch.
- Match the tool to the job. Need photoreal? Pick a tool known for photoreal. Need illustration? Pick one known for that.
Check your understanding
- A designer wants a poster with the Nepali text “स्वच्छ पानी, स्वस्थ जीवन” prominently visible. In 2026, what is the most reliable workflow?
- A weak image prompt like “a Nepali woman” tends to produce generic output. What is the most reliable single fix?
What comes next
We’ve covered generation from scratch. The next section is about control: image-to-image editing, masking, and modifying an existing image rather than generating a new one. This is where image AI starts replacing actual design work.