ailiteracynepal 🇳🇵

Chapter 04 · Section III · 14 min read

Voice and video

Speech-to-text, voice cloning, music, and video generation — the fastest-moving modalities, and the ones with the sharpest practical and ethical edges.

If text is the most useful modality and images are the most striking, audio and video are the fastest moving. The state of the art in any of these sub-fields in 2024 looked quaint by 2026. The state of the art in 2026 will look quaint by 2028.

This section gives you a working map for 2026 — what tools to use, what they’re good for, and the ethical edges that are sharper here than in text or images.

Speech-to-text: transcription

This is the most clearly beneficial application of audio AI. You record audio. You get back text. The dominant tool, by a large margin, is OpenAI Whisper, released as open source in 2022 and refined and widely deployed since.

What Whisper does well:

  • Nepali transcription — surprisingly accurate, given how little Nepali training data was historically available. Quality on clear-voiced speakers approaches that of trained transcriptionists.
  • Multilingual code-switching — Nepali sentences mixed with English words (extremely common in real Nepali speech) are handled naturally.
  • Long-form audio — meeting recordings, podcasts, interviews, lectures.

Where it struggles:

  • Heavy regional accents transcribe with noticeable errors.
  • Multiple overlapping speakers — the transcript merges them.
  • Music or noise in the background degrades accuracy fast.
  • Speaker identification: Whisper produces a single text stream, not “Speaker A said X, Speaker B said Y.” Working out who spoke when is a separate task, called diarisation, and it needs a separate tool.
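
One way to combine the two outputs is to match them by time. The sketch below is a minimal illustration under stated assumptions, not any tool's actual method: it assumes the transcript arrives as timestamped segments (Whisper's output does include start and end times per segment) and that a separate diarisation tool has produced speaker turns with times. The `Segment` and `Turn` classes, the `label_segments` function, and the sample data are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk of speech (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One speaker turn from a diarisation tool (times in seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Return how many seconds two time ranges share (0.0 if none)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="Speaker ?"):
    """Give each segment the speaker whose turn overlaps it the most."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, seg.text))
    return labelled


# Hypothetical sample data, for illustration only.
segments = [
    Segment(0.0, 4.0, "Namaste, thank you for joining."),
    Segment(4.2, 9.0, "Thank you for inviting me."),
    Segment(9.5, 12.0, "Let's begin."),
]
turns = [
    Turn(0.0, 4.1, "Speaker A"),
    Turn(4.1, 9.2, "Speaker B"),
    Turn(9.2, 12.5, "Speaker A"),
]

for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

A segment that overlaps no turn keeps the placeholder label, which is a useful flag for manual review. Overlapping speech remains hard: a segment in which two people talk at once still gets only one label here.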

A practical workflow that works well in 2026:

  1. Record clean audio (close-mic if possible).
  2. Transcribe with Whisper.
  3. Paste the transcript into a language model and ask for: cleaned formatting, speaker labels (if guessable from content), and a summary.
  4. Verify by spot-checking three minutes of audio against the transcript.

Used this way, a one-hour interview goes from an afternoon’s work to about twenty minutes, most of which is verification.
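
The spot-check in step 4 can be turned into a number. A common measure is word error rate: the count of word-level insertions, deletions, and substitutions needed to turn the machine transcript into a hand-checked reference, divided by the length of the reference. The sketch below is a minimal version, assuming plain whitespace word splitting and no normalisation of case or punctuation. The function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: what you heard versus what the tool wrote.
heard = "we will meet in Pokhara on Friday morning"
transcribed = "we will meet in Pokhara Friday mourning"

print(f"Word error rate: {word_error_rate(heard, transcribed):.2f}")
```

In this example one word is dropped and one is wrong out of eight, so the rate is 0.25. Because the function splits only on whitespace, it works the same way on Devanagari text. What rate counts as acceptable depends on what the transcript is for, so no threshold is built in.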

Text-to-speech: voice synthesis

The inverse direction: you give the model text, you get back audio of someone saying the text. The leading tools in 2026:

  • ElevenLabs — best-known, strong voice quality, voice cloning available, subscription.
  • OpenAI TTS — integrated into ChatGPT, good quality, fewer voices.
  • Google Cloud Text-to-Speech, Azure Speech — enterprise tools, broad language support including Nepali.
  • Open-source options (XTTS, OpenVoice) — usable for offline or sovereign deployments.

Nepali voice synthesis has improved dramatically. Multilingual frontier voices from 2024 onwards produce passably natural Nepali, though a careful listener can still detect the synthetic origin from prosody and intonation.

Practical uses:

  • Accessibility — making written content available to people who can’t or don’t read text.
  • Drafts of narration for videos, podcasts, audiobooks.
  • Localising content — generating Nepali audio versions of English videos, and vice versa.
  • Interactive systems — chatbots, voice assistants, customer-service IVRs.

The reality check: for published audio (a podcast, a film, public messaging) most professionals still record human voices. AI voices are good but recognisably synthetic to a careful listener, and credibility drops when the voice sounds wrong. For internal audio (drafts, accessibility, prototypes) AI voices are immediately useful.

Voice cloning: a sharper edge

A specific capability worth singling out. ElevenLabs and a few other tools let you upload a few minutes of a person’s recorded voice, and then generate new audio in that voice saying anything you type.

Legitimate uses:

  • A podcaster generating a quick correction in their own voice without re-recording.
  • An accessibility tool letting someone preserve their voice before losing the ability to speak.
  • A content creator generating quick alternative-language versions of their own content.

Illegitimate uses:

  • Cloning a politician’s voice to fake a speech.
  • Cloning a family member’s voice for a phone-scam impersonation.
  • Generating fake audio “evidence.”

The same tool does both. The ethical and legal frameworks have not caught up with the technical capability. For an AI-literate citizen, the most important habit is skepticism toward audio recordings of public figures or family members in unexpected contexts. In 2026, “I heard their voice” is no longer sufficient evidence of anything.

Music generation

A modality we will mention briefly. Suno and Udio generate full-length songs from a text prompt — “upbeat Nepali folk song about migration to the Gulf, with sarangi and acoustic guitar, 2 minutes” — and produce something genuinely listenable.

Legitimate uses: background music for videos, jingle prototypes, demo tracks, exploration. Illegitimate or contested uses: passing AI-generated music off as your own creation and, on the developers’ side, training the models on copyrighted music (the major legal question of the moment).

For most users, music generation is a fun curiosity. For working musicians, it is a serious economic question. Nepali traditional and folk music has not been the focus of these tools’ training data — the cultural depth that distinguishes a real Newari Dapha song from an AI imitation is, for now, beyond what these systems can produce.

Video generation

The newest modality, and the one with the biggest gap between hype and practical usefulness.

Tools in 2026:

  • OpenAI Sora — strikingly good short clips, expensive.
  • Google Veo — similar tier, integrated with Google’s ecosystem.
  • Runway — popular with creators for shorter clips and stylistic effects.
  • Pika — accessible, good for short fun clips.

What they do well:

  • 5–30 second stylistic clips.
  • Adapting existing video to different styles.
  • B-roll, dreamlike sequences, social-media short content.

Where they break:

  • Longer than ~1 minute without losing continuity.
  • Anything that needs a specific real person or place.
  • Realistic physics over long sequences.
  • Lip-syncing dialogue convincingly (still an open problem).

The honest summary: video generation in 2026 is impressive enough to share clips of, but not yet useful enough to replace most production video work. Watch this modality closely; the trajectory of the last two years suggests the gap will close fast.

A unifying skepticism

The common thread across audio and video: any single piece of audio or video that you encounter, especially of a public figure or in an emotionally charged context, might be synthetic. The cost of producing convincing fake audio or video has dropped to near-zero.

This does not mean trusting nothing. It means treating audio and video with the same critical instinct you already apply to written claims: who is the source, what is the chain of custody, why is this surfacing now. The visual evidence ladder has lost a rung, and the missing rung will not come back.

In the Nepali context, rumour propagates fastest through audio (voice notes shared on Viber, WhatsApp, Messenger) and video (TikTok, Facebook reels). Cheap synthesis will make the channels that already amplify misinformation more potent still. AI literacy here is, increasingly, media literacy.

Check your understanding

A WhatsApp voice note appears to be from your relative asking you to send money urgently to a new phone number. In 2026, the most prudent response is:

What comes next

We’ve covered all the modalities. Chapter 5 steps back to the question every careful user has to answer: where do these systems fail, and when should you not reach for them? We start with the most famous failure mode — hallucination.