Chapter 03 · Section II · 14 min read
Analysis and extraction
Pulling structure out of messy text: classification, JSON output, summaries with citations, table-from-paragraph. The unglamorous tasks that compound the most over time.
If writing is the visible value of generative AI, analysis and extraction is the quieter, larger one. Most professional work involves taking messy input — emails, customer feedback, PDFs, meeting notes, transcripts — and turning it into structured information someone can act on. Modern models are exceptionally good at this, and the time savings compound.
This section is about three high-value patterns: extracting structured data, classifying documents, and summarising with traceability.
Pattern 1 — Extracting structured data
You have unstructured text. You want a clean, structured representation — a JSON object, a table, a list of fields. The model is dramatically better at this than most people realise.
A simple example. You receive 30 customer support tickets in plain text, each containing somewhere in it: the customer’s name, their phone number, the product they’re asking about, and the urgency of the issue.
Prompt:
Extract the following fields from each support ticket. Return your answer as a JSON array, one object per ticket. Use null for any field you cannot find. Do not invent values.
Fields:
- name (string)
- phone (string, 10-digit Nepali format)
- product (string)
- urgency (one of: low, medium, high)
Tickets: [paste 30 tickets]
Output: a JSON array you can load into a database or any downstream tool, or convert to CSV for a spreadsheet. What would have taken a person an hour takes the model under a minute, and it does not get tired or distracted.
Why this works so well: the model is good at pattern-matching — recognising a phone number, a product name, an urgency cue. The structure is doing most of the work; the model is finding the values.
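Because the prompt fixes the structure, the reply can be checked mechanically before it goes anywhere downstream. Below is a minimal sketch of that check, assuming the model's reply is a valid JSON array of objects; the function names `validate_tickets` and `to_csv` are illustrative, not part of any library.

```python
import csv
import io
import json
import re

ALLOWED_URGENCY = {"low", "medium", "high"}
FIELDS = ["name", "phone", "product", "urgency"]


def validate_tickets(raw_json: str) -> tuple[list[dict], list[str]]:
    """Parse the model's reply and return (clean_rows, problems)."""
    rows, problems = [], []
    for i, item in enumerate(json.loads(raw_json)):
        # Keep only the requested fields; a missing field becomes None.
        row = {field: item.get(field) for field in FIELDS}
        phone = row["phone"]
        if phone is not None:
            # Strip separators, then require exactly 10 digits.
            digits = re.sub(r"\D", "", str(phone))
            if len(digits) == 10:
                row["phone"] = digits
            else:
                problems.append(f"ticket {i}: phone {phone!r} is not 10 digits")
                row["phone"] = None
        if row["urgency"] is not None and row["urgency"] not in ALLOWED_URGENCY:
            problems.append(f"ticket {i}: unexpected urgency {row['urgency']!r}")
            row["urgency"] = None
        rows.append(row)
    return rows, problems


def to_csv(rows: list[dict]) -> str:
    """Serialise validated rows as CSV text; None becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Anything that lands in `problems` is a ticket to look at by hand. Blanking a bad value rather than repairing it follows the same rule the prompt gives the model: do not invent values.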
Pattern 2 — Classification
You have a category to assign. Spam or not. Which department should this ticket go to? Praise, complaint, or suggestion? Critical or routine?
For straightforward classification, modern models reach near-human accuracy with no fine-tuning and no training data, just a prompt.
The pattern:
- Define categories crisply. Each category needs a one-line description.
- Add 2–3 examples per category (few-shot, from the previous chapter).
- Ask for the category, optionally with confidence and a one-line justification.
Example:
Classify each customer complaint into ONE of these departments:
- billing — anything about charges, refunds, payment methods
- delivery — anything about shipping, lateness, missing items
- product — anything about defects, quality, instructions
- other — anything that doesn’t fit cleanly
For each, output: department, confidence (high/medium/low), one-sentence justification.
Examples:
- “Charged twice for one order.” → billing, high, “explicit double-charge complaint”
- “Package arrived broken.” → product, high, “physical defect on arrival”
- “Where is my order? It’s been 10 days.” → delivery, high, “shipping delay”
Now classify the following: [paste complaints]
This produces a routing table you can hand to operations. Or feed into automation. Or just use to triage your own day.
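If the routing table feeds automation, each line of the reply should be parsed and checked before it is trusted. Here is a rough sketch, assuming the model returns one line per complaint in the form `department, confidence, justification`. Sending everything below high confidence to a person is an assumed policy, not something the prompt enforces.

```python
DEPARTMENTS = {"billing", "delivery", "product", "other"}
CONFIDENCE = {"high", "medium", "low"}


def parse_classification(line: str) -> dict:
    """Parse one 'department, confidence, justification' line from the model."""
    # Split on the first two commas only; the justification may contain commas.
    parts = [p.strip() for p in line.split(",", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts: {line!r}")
    department, confidence, justification = parts
    department, confidence = department.lower(), confidence.lower()
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department!r}")
    if confidence not in CONFIDENCE:
        raise ValueError(f"unknown confidence: {confidence!r}")
    return {
        "department": department,
        "confidence": confidence,
        # Remove surrounding straight or curly quotes from the justification.
        "justification": justification.strip('"“”'),
        # Assumed policy: anything short of high confidence gets a human look.
        "needs_review": confidence != "high",
    }
```

A line that raises `ValueError` means the model stepped outside the categories you defined, which is worth knowing before the ticket reaches the wrong team.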
Pattern 3 — Summarising with traceability
A common problem with model summaries: you can’t tell what’s directly from the source and what the model filled in. The fix is to constrain the summary to include explicit pointers.
Summarise this 20-page report. For each statement in your summary, include the page number it was drawn from in parentheses. If a statement is your synthesis across multiple pages, mark it [synthesis]. Do not include any claim you cannot point to in the source.
This is sometimes called “grounding” the summary. It is slower and uses more tokens, because every statement has to carry a pointer, but the output is auditable: you can spot-check claims against the source. For high-stakes work (legal documents, government reports, anything you might be quoted on) this pattern is essential.
A variant for shorter texts: ask the model to include direct quotes alongside its paraphrase. The friction of having to quote forces it to stay closer to the source.
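The quote variant has a useful side effect: quotes can be checked automatically. The sketch below assumes you have already pulled the quoted strings out of the summary. It reports any quote that does not appear in the source once case, curly quotes and whitespace are normalised. The function names are illustrative.

```python
import re


def normalise(text: str) -> str:
    """Lower-case, straighten curly quotes, and collapse whitespace."""
    text = text.replace("“", '"').replace("”", '"')
    text = text.replace("‘", "'").replace("’", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(source: str, quotes: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in the source."""
    haystack = normalise(source)
    return [q for q in quotes if normalise(q) not in haystack]
```

An empty result means every quote is really in the document. It does not mean the paraphrase around the quote is fair, so the human spot-check still applies.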
A worked example: messy meeting transcript → action items
A real pattern. You record a 45-minute team meeting and use Whisper (chapter 4) to transcribe it. You now have 6,000 words of raw transcript. You need a clean list of what was decided, who is doing what, and by when.
Prompt:
Below is a transcript of a 45-minute team meeting. Extract:
1. Decisions made. One per bullet. Format: “Decision: [what]. Decided at line ~[X].”
2. Action items. One per bullet. Format: “[Person] will [action] by [date].”
3. Open questions. Items raised but not resolved. One per bullet.
Rules:
- Do not invent decisions or actions. If you are unsure, list them under “Possibly raised — verify”.
- Use the actual names in the transcript.
- If a deadline wasn’t stated, write “[no date stated]”.
Transcript: [paste 6,000 words]
Output: a meeting summary, ready in about 2 minutes, that you can send to the team. You still have a verification step (did anyone actually decide this?), but you are checking one structured artefact rather than combing through 6,000 words.
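Because the prompt fixes the action-item format, the bullets can be turned into rows for a tracker. This is a small sketch, assuming the model followed the “[Person] will [action] by [date].” format exactly. A line that returns `None` did not fit and should be read by hand.

```python
import re
from typing import Optional

# Person is everything before the first " will "; date is after the last " by ".
ACTION_RE = re.compile(r"^(?P<person>.+?) will (?P<action>.+) by (?P<date>.+?)\.?$")
NO_DATE = "[no date stated]"


def parse_action_item(line: str) -> Optional[dict]:
    """Parse one action-item bullet; return None if it does not fit the format."""
    # Drop a leading bullet marker and surrounding whitespace.
    cleaned = line.strip().lstrip("-• ").strip()
    match = ACTION_RE.match(cleaned)
    if match is None:
        return None
    date = match.group("date").strip()
    return {
        "person": match.group("person").strip(),
        "action": match.group("action").strip(),
        # The prompt's placeholder for a missing deadline becomes None.
        "date": None if date == NO_DATE else date,
    }
```

Rows with `date` set to `None` are the ones to chase: the meeting assigned the work but never set a deadline.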
Practical notes on extraction at scale
A few things that come up when you do this often:
Token limits. Each model has a context window — the maximum amount of text it can read in one prompt. Modern frontier models handle 100,000 to 200,000 tokens (~75,000 to 150,000 English words) per request. This is enough for most documents. For larger ones, split the source into chunks, run extraction on each, then merge.
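The split, extract, merge step can be sketched as follows. The sketch counts words as a rough stand-in for tokens, using the ratio above of about 0.75 words per token. The `extract` argument is a placeholder for whatever function sends one chunk to your model and returns a list of records; it is an assumption here, not a real API.

```python
from typing import Callable


def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def extract_in_chunks(
    text: str,
    extract: Callable[[str], list[dict]],
    max_words: int,
    overlap: int = 0,
) -> list[dict]:
    """Run `extract` on each chunk and merge results, dropping exact duplicates."""
    merged, seen = [], set()
    for chunk in chunk_words(text, max_words, overlap):
        for record in extract(chunk):
            # Records found twice in overlapping regions are kept only once.
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged
```

The overlap is there so that a fact straddling a chunk boundary is seen whole at least once. The cost is that some records are found twice, which is why the merge removes duplicates. Only exact duplicates are removed, so near-duplicates with slightly different wording still need a manual pass.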
Cost. Long extraction prompts use a lot of tokens. For high-volume work, switching to a cheaper model (Claude Haiku, GPT-4o-mini, Gemini Flash) often produces equivalent quality on extraction tasks at a fraction of the cost.
Verification. Extraction at scale will not be 100% accurate. Set the expectation: you will spot-check, and you will sometimes catch errors. For high-stakes extraction (legal, financial, medical) build in a human review step.
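Spot-checking works better when the sample is random rather than the first few rows. A minimal sketch, assuming the extracted records sit in a list; the 10% default rate is an illustrative choice, not a recommendation from this section.

```python
import random


def spot_check_sample(records: list[dict], rate: float = 0.1, seed: int = 0) -> list[dict]:
    """Pick a reproducible random sample of records for human review."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not records:
        return []
    # Always review at least one record, however small the batch.
    size = max(1, round(len(records) * rate))
    # A fixed seed means a colleague can reproduce exactly the same sample.
    return random.Random(seed).sample(records, size)
```

If the sample turns up errors, raise the rate or review the whole batch. For legal, financial or medical extraction the sample supplements full human review and does not replace it.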
Consistency. For repeated extraction tasks, write the prompt once, save it, and reuse it. Small wording changes can cause the output to drift; a fixed prompt keeps the output format far more stable.
What models are bad at, even here
Three honest limits on extraction:
- Things the source doesn’t actually say. If the document doesn’t mention a price, the model cannot extract one. Asking it to “guess” produces hallucinated prices that look real.
- Reasoning over many separate facts. “Who made the most decisions in this 30-page board minutes archive?” requires reading all 30 pages and counting. Some models do this well; many do it badly, missing decisions in the middle. Verify on these.
- Subtle semantic distinctions. “Was this a direct complaint or an indirect one?” requires a judgement call, and the model’s call may not match yours. For these, few-shot examples are essential.
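For the counting problem in particular, one workaround is to let the model extract and let ordinary code count. The sketch below assumes you have already run a per-page extraction that returns one record per decision. The `decided_by` field name is hypothetical.

```python
from collections import Counter


def top_decision_makers(decisions: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Count decisions per person from already-extracted records."""
    counts = Counter(
        d["decided_by"].strip()
        for d in decisions
        # Skip records where the model could not identify who decided.
        if d.get("decided_by")
    )
    return counts.most_common(n)
```

The count is only as good as the extraction underneath it, so a decision the model missed is still missed. The tally itself, though, no longer depends on the model keeping 30 pages in mind at once.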
What comes next
We’ve covered drafting, editing, and extraction. The third major text capability, code and translation, gets its own section. These are the tasks where the gap between what the model produces and what is correct matters most, so careful verification matters most too.