Editing and control · ailiteracy.nepal

Generating from a blank slate is the most visible use of image AI. Editing, meaning taking an image you already have and modifying it, is the more useful one for most practical work. This section covers the techniques that move you from “I asked the model for an image” to “I asked the model to modify my image.”

Three modes of edit

Modern image tools support three different kinds of editing, with very different uses.

Image-to-image (img2img). You provide a starting image and a prompt. The model produces a new image that takes inspiration from the input. You control how closely the output stays to the input via a “strength” slider — low strength stays close, high strength drifts further. Useful for: trying variations of a sketch, restyling a photo, applying a different mood.
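How the strength slider works varies by tool. The sketch below assumes the convention used in many open-source diffusion pipelines, where strength sets what fraction of the denoising steps are re-run on a noised copy of your image. The function name and step count are illustrative, not taken from any particular product.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps run for a given strength.

    Assumption: strength in [0, 1] decides how far the input image is pushed
    into the noise schedule before denoising starts. 0 leaves the input
    untouched; 1 re-draws it over the full schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.0, 0.5, 1.0):
    steps = img2img_steps(50, strength)
    print(f"strength={strength}: {steps} of 50 steps re-drawn")
```

Under this assumption, a low strength leaves the model few steps in which to change anything, which is why the output stays close to the input.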

Inpainting. You provide an image and mask an area you want changed. The model regenerates only the masked region, keeping the rest. Useful for: removing an object, changing one element (a tree to a building), fixing a hand, swapping a face. For practical work, this is the most useful image-editing technique.
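The "keeping the rest" part can be pictured as a simple compositing step. The sketch below is a minimal illustration rather than how any specific tool is implemented: it uses plain nested lists of RGB tuples, and the names `composite`, `original`, `generated`, and `mask` are hypothetical.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]
Mask = list[list[bool]]


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Take generated pixels where the mask is True, original pixels elsewhere."""
    return [
        [gen if m else orig for orig, gen, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# One row of three pixels: sky, power line, sky.
original = [[(90, 140, 220), (20, 20, 20), (90, 140, 220)]]
# Whatever the model produced; only the masked pixel is used.
generated = [[(0, 0, 0), (92, 142, 221), (0, 0, 0)]]
mask = [[False, True, False]]

print(composite(original, generated, mask))
```

Real editors typically also soften the mask edge so the seam is not visible; that blending is left out here to keep the idea clear.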

Outpainting. The opposite of inpainting — extend an existing image beyond its borders. Useful for: changing aspect ratios, adding sky to a tightly cropped photo, making panoramas. Less commonly needed but striking when you need it.
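For the aspect-ratio case, the first step is working out how large the new canvas must be and where the original sits on it. The sketch below is one way to do that arithmetic, assuming the original is centred and never cropped; `outpaint_canvas` is a hypothetical helper, not a function from any editing tool.

```python
import math


def outpaint_canvas(width: int, height: int, target_ratio: float):
    """Return (new_width, new_height, offset_x, offset_y).

    target_ratio is width divided by height. The canvas only ever grows,
    and the offsets say where to paste the original so it sits centred.
    Everything outside the pasted area is what the model has to fill.
    """
    if width <= 0 or height <= 0 or target_ratio <= 0:
        raise ValueError("dimensions and ratio must be positive")
    if width / height < target_ratio:
        # Image is too narrow for the target: widen it.
        new_w, new_h = math.ceil(height * target_ratio), height
    else:
        # Image is too wide (or already matches): make it taller.
        new_w, new_h = width, math.ceil(width / target_ratio)
    return new_w, new_h, (new_w - width) // 2, (new_h - height) // 2


# A square 1000x1000 photo extended to a 2:1 banner.
print(outpaint_canvas(1000, 1000, 2.0))
```

In this example the model has to invent a 500-pixel strip on each side, which is half of the final image. The larger that share, the more the result depends on the model rather than on your photo.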

A worked example: rescuing a photo

You took a photograph of your team at a workshop in Pokhara. The composition is great, but there’s a power line cutting across the sky, and one team member’s face is half-obscured by the camera’s lens flare.

Without inpainting, your options are: live with it, hire a photo editor, or struggle with Photoshop’s clone-stamp tool. With inpainting:

Open the image in the tool (Photoshop’s Generative Fill, Adobe Firefly, or any modern editor with built-in AI).
Mask the power line. Prompt: “clear blue sky”. Generate. Power line is gone, sky filled naturally.
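Mask the lens flare. Prompt: “clear face, sharp focus, matching skin tone”. Generate. Flare gone, face restored convincingly. Compare it against the original to confirm the person still looks like themselves, since inpainting a face can subtly change it.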
Done. Five minutes.

The photograph is now usable. Nothing of the original was reshot. The intent of the original — a team photo at a workshop — is preserved. This is image editing at its most useful: not making new art but finishing the work you actually have.

Reference images and ControlNet

For more advanced control, some tools let you provide reference images that constrain different aspects of the output.

Composition reference — generate an image with the same poses/composition as a reference.
Style reference — match the colour palette and brushwork of a reference image.
Depth reference — match the 3D structure of a scene.
Pose reference — generate a person in the exact pose shown in a reference.

These features are powerful and a bit fiddly. In the open-source world, the structural ones (composition, depth, pose) are usually provided by ControlNet, while style matching often relies on separate add-on models. For most users, they are overkill. For designers and serious users, they are the bridge from “AI as toy” to “AI as production tool.”

A practical entry point: many tools now let you simply upload a reference image and write “match the style of this image” or “match the composition of this image.” The complexity is hidden behind simple UI.

Iteration as a craft

Image work is fundamentally iterative. A typical professional workflow might look like:

Generate 4–8 initial variations.
Pick the most promising 1–2.
Inpaint to fix the broken bits (hands, text, lighting).
Upscale to higher resolution.
Touch up in a regular editor for final polish.

Each step is fast — minutes, not hours — but each step matters. The end product is rarely the first generation. It is the result of small, deliberate refinements.

Where editing tools still struggle

Three honest limits:

Preserving identity exactly. Inpainting a face usually slightly changes it. For photos of specific people, this is a problem — the edited photo doesn’t look like the same person. Tools are improving fast; in 2026 the issue is reduced but not gone.
Maintaining consistency across many edits. Sequential edits accumulate drift: each pass works on the previous output, so by the fifth edit the image has moved noticeably further from the original than it had after the first. For high-fidelity preservation, do all your edits in one pass when possible.
Edits that require understanding 3D structure. Removing one chair from a row of chairs (where the next chair was hidden behind it) requires inferring what was occluded. Models sometimes get this wonderfully right and sometimes hilariously wrong.

Practical advice

Three habits that compound:

Mask precisely. A loose mask gives the model freedom to change things you wanted preserved. A tight mask constrains it to the area you actually want changed. Time spent on the mask is the largest single factor in edit quality.
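To make the difference concrete, the sketch below compares a tight mask that follows a diagonal power line with a loose rectangular selection around it. It is a toy illustration on a 5×5 grid; `coverage` and `bounding_box_mask` are hypothetical helpers.

```python
Mask = list[list[bool]]


def coverage(mask: Mask) -> float:
    """Fraction of pixels the model is allowed to change."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(1 for row in mask for m in row if m) / total


def bounding_box_mask(mask: Mask) -> Mask:
    """A loose mask: the full rectangle enclosing every masked pixel."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, m in enumerate(row) if m]
    if not rows:
        return [[False] * len(row) for row in mask]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    return [
        [top <= y <= bottom and left <= x <= right for x in range(len(row))]
        for y, row in enumerate(mask)
    ]


size = 5
tight = [[x == y for x in range(size)] for y in range(size)]  # diagonal line
loose = bounding_box_mask(tight)

print(f"tight mask exposes {coverage(tight):.0%} of the image")
print(f"loose mask exposes {coverage(loose):.0%} of the image")
```

A rectangle drawn around a diagonal line covers the whole frame, so the model would be free to repaint everything, including the parts you wanted preserved.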

Use the right tool for the right job. Photoshop’s Generative Fill is great for casual touch-ups. Adobe Firefly is good for design-driven work. Open-source tools (with ControlNet) are best for fine control. Don’t try to do everything in one tool.

Save versions. Each round of edits should be saved as a new file. Sometimes the third version was better and you only realised later. Disk space is cheap.
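If you script any part of your workflow, versioned saving is easy to automate. The sketch below shows one possible naming scheme, using only the Python standard library; the `_v001` suffix and the `next_version_path` helper are assumptions chosen for illustration, not a convention of any tool.

```python
import re
from pathlib import Path


def next_version_path(original: Path) -> Path:
    """Return the first unused sibling path, e.g. team_v001.jpg, team_v002.jpg.

    Looks in the original's folder for files named <stem>_v<number><suffix>
    and returns the next number, so earlier rounds are never overwritten.
    """
    stem, suffix = original.stem, original.suffix
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(m.group(1))
        for p in original.parent.iterdir()
        if (m := pattern.match(p.name))
    ]
    return original.with_name(f"{stem}_v{max(versions, default=0) + 1:03d}{suffix}")
```

Calling `next_version_path(Path("photos/team.jpg"))` before each save gives every round of edits its own file, so the third version is still there if you later decide it was the best one.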

Check your understanding

Quick check

You have a good photograph but there is an ugly power line crossing the sky. Which technique is most appropriate?

Generate a brand new image from scratch with a text prompt
Use inpainting — mask the power line, prompt "clear blue sky," and let the model fill in only that region
Take a new photograph
There is no way to fix this

What comes next

We close image work here. The next section is about audio and video — speech-to-text, voice cloning, music, and the rapidly improving video models. These modalities are evolving fastest, and the practical and ethical considerations around them are sharper than for text or images.