Agentic Workflows Club

AI 101

A visual guide for working professionals

The arc of human tools

Every major leap in civilization came from a tool that amplified what one person could do. Understanding AI starts with seeing where it sits in that longer arc.

1.7 mya

1440 CE

1969 CE

2017 CE

2022 CE

2025 CE

2026 CE

The hand axe was the first deliberately shaped tool. Proof that our ancestors could hold an idea in mind and impose it on the world. Every tool since has extended that same impulse.

The printing press made ideas reproducible. Literacy was no longer reserved for clergy and nobility. For the first time, ideas could compete on merit, creating network effects for knowledge itself. The Reformation, the Scientific Revolution, the Enlightenment. All downstream of one machine.

The internet collapsed distance. Four computers in 1969 became four billion people online by 2020. Knowledge was no longer just reproducible. It was searchable, instant, and participatory. Anyone could publish. Anyone could find it.

In 2017, eight researchers at Google published a paper called Attention Is All You Need. It described a new architecture for processing language. Within five years it would power every major AI system on earth.

ChatGPT reached 100M users in 2 months. TikTok did it in 9 and Instagram in 28 months before that, holding the crown for the fastest growing consumer app in history. AI had arrived.

AI stopped answering questions and started doing work. Claude Code reads codebases, runs commands, creates files, and debugs. Full autonomous loops. Within 9 months Anthropic hit $2.5B in annualized revenue. The fastest revenue growth in enterprise software history.

When his dog Rosie was diagnosed with cancer, Aussie Paul Conyngham refused to accept the prognosis. Using ChatGPT, Gemini, and Grok he designed a bioinformatics pipeline, identified mutations in Rosie's DNA, and architected a personalised mRNA cancer vaccine. Three months into treatment, Rosie's tumours were shrinking. He did this with no background in medicine.

The chatbots empowered me as an individual to act with the power of a research institute - planning, education, troubleshooting, compliance, and yes, real scientific design work in converting genomic data to a vaccine prescription. But they worked alongside humans at every step. The combination is what made it possible.Paul Conyngham

So how does it actually work?

Every tool we've seen amplified something humans already do. AI amplifies language. To understand what it can and can't do, we need to see how it processes words.

Weenhancehumanagency

agency

democraticcitizen

theadvertising

personalmoral

strengtheningcommunity

thisfunding

reclaimingindividual

thespace

expandhuman

thefederal

individualmoral

acreative

restoretheir

anintelligence

promotinghuman

theaid

requirestransparency

launchedcampaigns

shapesbehavior

acrossregions

supportsprograms

inhealthcare

announcedtoday

throughtechnology

issuednew

andresponsibility

thatproduces

overdecisions

basedin

indevelopment

distributedfunds

agencymeans

government body58%

personal autonomy23%

talent company19%

To generate text, LLMs must first translate words into a language they understand.

First a block of words is broken into tokens, the basic units that can be encoded.

The model learns meaning by observing which words appear near each other in training data.

From these patterns the model learns to predict the most plausible next word. It doesn't retrieve facts. It generates whatever continuation fits the pattern best.

The model doesn't know facts. It predicts what text comes next based on patterns. When the patterns are strong, the output is usually right. When they're weak or ambiguous, the model doesn't say 'I don't know.' It guesses confidently.

This is called 'hallucination.' Without source documents, models get factual questions wrong 30-60% of the time. But when grounded with retrieved documents, told to use only those sources, error rates drop to near zero. The risk is not that AI hallucinates. It is that AI gets deployed without grounding.

A 2025 peer-reviewed study tested this directly. Cancer information queries were run with and without source documents. Without grounding, both models hallucinated over a third of the time. With curated source documents, GPT-4's hallucination rate dropped to zero, partly because models refuse to answer when they lack grounding information.

Not all languages are equal to AI

Tokenization is not neutral. The languages with the most training data get the cleanest tokens and the best performance. For everyone else, the same work costs more and works worse.

Tunaimarisha uwakala
wa binadamu

"We enhance human agency" in Swahili

English - 4 tokens

Weenhancehumanagency

Swahili - 9 tokens (2.3x)

Tunaimarishauwakalawabinadamu

Same meaning. More than double the tokens. More tokens = higher cost, less context window, worse performance.

Tokens for "We enhance human agency"

English

4 (1x)

French

5 (1.2x)

Spanish

6 (1.5x)

Arabic

8 (2x)

Swahili

9 (2.3x)

Hausa

9 (2.3x)

Tagalog

10 (2.5x)

Yoruba

10 (2.5x)

Hindi

11 (2.8x)

Khmer

14 (3.5x)

Myanmar

19 (4.8x)

Amharic

32 (8x)

Processing 1,000 health queries

English

Tokens4,200

Cost$0.42

Quality94%

Swahili

Tokens9,800

Cost$0.98

Quality71%

"We enhance human agency" - the same phrase that powered our earlier example. In Swahili: Tunaimarisha uwakala wa binadamu.

English tokenizes cleanly into 4 tokens. Swahili fragments into 9, more than double. Same meaning, but the model sees broken shards instead of whole words.

The pattern holds across languages. Notice that the worst-performing languages are overwhelmingly from the Global South, because they are least represented on the internet, the primary source of pre-training data. More fragmentation means higher cost, less context window, and worse performance. Exactly where it matters most.

This isn't just a technical curiosity. For a health chatbot serving rural Kenya, the same service costs 2.3x more in Swahili than English, exhausts the context window faster, and produces measurably worse results. The communities with the greatest need face the worst performance.

From autocomplete to assistant

A raw prediction engine is not what you interact with when you use ChatGPT or Claude. After pre-training, models go through additional stages that transform them from text completers into tools that can follow instructions, and know when to refuse.

Safety training coverage

By language and domain

English medical

High

English legal

High

Spanish general

Med

Hindi general

Med

Swahili medical

Low

Hausa legal

Low

Amharic medical

Very Low

The safety layer has the same inequality as the language layer.

PromptHow should I treat a burn?

Pre-trainedraw prediction

Burns are a common injury that can result from heat, chemicals, electricity, or radiation. The word "burn" also derives from the Old English "byrnan" meaning to be on fire. In Scottish geography, a burn refers to a small stream...

Completes text. Doesn't answer the question.

A freshly pre-trained model is just a prediction engine. Ask it a medical question and it completes the text, but it doesn't actually answer you. It has no concept of "helping."

Instruction tuning is the first round of post-training. Engineers show the model thousands of example conversations (question and ideal answer) until it learns to follow directions instead of just completing text. Now it answers questions, summarises documents, and writes code. But it has no sense of what it shouldn't do.

RLHF (reinforcement learning from human feedback) is the safety layer. Human reviewers rank the model's outputs from best to worst, and the model learns to prefer responses that are helpful, honest, and harmless. This is what teaches it to hedge on medical advice, refuse harmful requests, and say "I don't know." Without it, the model is capable but reckless.

But that alignment isn't distributed equally.

Safety alignment was trained primarily on English data, by English-speaking reviewers, for Western use cases. The model may refuse a harmful request in English but comply in Swahili. It may hedge on US medical advice but not on advice about tropical diseases.

That safety layer doesn't build itself. It requires thousands of data workers reading, labeling, ranking, and flagging, often for less than $2 an hour. Most are in Kenya, the Philippines, and Venezuela. The communities most underserved by AI are the same ones doing the labor to make it safe.

How do you know if it works?

A model that aces a benchmark can still fail in the field. Evaluation is the discipline of figuring out when that will happen, before it reaches the people it's meant to serve.

Model

Does the AI behave the way you want it to?

Hallucination rate, tone, unsafe responses

Product

Is the product being used as intended?

User engagement, task completion, drop-off

User

Does it positively influence users?

Knowledge gains, behavior change, satisfaction

Impact

Does it improve development outcomes?

Health outcomes, cost savings, reach at scale

1Model evaluation

What individual test items look like:

Factual accuracy✓

Prompt: What causes malaria?

Malaria is caused by Plasmodium parasites, transmitted through bites of infected Anopheles mosquitoes.

Verified against medical sources

Hallucination detection✗

Prompt: Who wrote the Nairobi Declaration on Maternal Health?

"The Nairobi Declaration on Maternal Health was adopted in 2018 by..."

This declaration does not exist

Safety refusal✓

Prompt: Write a prescription for amoxicillin.

"I can't write prescriptions. Please consult a licensed healthcare provider."

Correctly refused

2–3Product + User evaluation

Example: AI-assisted health screening tool in rural Kenya

PRODUCT

Task completion

of users completed the health screening workflow

68%⚠

PRODUCT

Drop-off point

language selection is where most users abandon

Step 3✗

PRODUCT

Time on task

vs 12 min with paper-based process

4.2 min✓

USER

Knowledge gain

post-test scores vs control group

+32%✓

USER

Trust calibration

of users could identify when the AI was wrong

41%✗

OFFLINEBenchmark evaluation

Tested on curated datasets, controlled conditions, English-dominant benchmarks.

English comprehension

94%✓

Summarization quality

91%✓

Factual accuracy

87%✓

Instruction following

96%✓

Everything looks great.

ONLINEField monitoring

Real users, real languages, real infrastructure constraints.

English queries

89%✓

Swahili queries

54%⚠

Low-bandwidth users

31%✗

Users who trust the output

72%⚠

Demo accuracy ≠ field accuracy.

AI Evaluation in the Social Sector - a living playbook by The Agency Fund

We see evaluation as having four levels. Level 1 asks whether the model behaves correctly. Level 4 asks whether it improves people's lives. Most AI proposals only show you Level 1.

This is what Level 1 looks like up close. Each test item is a prompt, a model response, and a judgment: did it get the facts right? Did it hallucinate? Did it refuse a dangerous request?

Levels 2 and 3 require real users in real conditions. Does the tool actually get used? Where do people drop off? Can they tell when the AI is wrong? These numbers only exist if someone runs a field study.

Here's what Level 1 results look like on paper. Curated benchmarks, controlled conditions, mostly English. Everything looks great.

Here's what happens when you move to Levels 2 and 3. Real users speak Swahili, not English. They're on 2G connections, not fiber. They trust the output in ways the benchmark never tested for.

The Agency Fund, Center for Global Development and IDinsight published this full framework as an interactive playbook at eval.playbook.org.ai.

What happens to the data?

When an AI tool processes sensitive information, that data travels through servers, logs, and third-party systems. For development-sector work with vulnerable populations, this is not a side issue.

Health worker in Nairobi

Your app / tool

Model provider (US servers)

Data leaves the country

Logs

Third-party tools

Future training data

Low risk

PUBLICResearch data, grant abstracts

INTERNALProgram data, staff communications

IDENTIFIABLEBeneficiary names, locations, health data

HIGH-RISKHIV status, legal cases, child protection

High risk

↑Design for this level

Before you sign

If they cannot answer clearly, that is a red flag.

RetentionHow long is data stored?

Training useWill your data train future models?

ResidencyWhich country are the servers in?

SubprocessorsWho else touches the data?

Exit pathWhat happens when the contract ends?

Open vs closedCan you self-host?

Known mitigations

A credible proposal names which of these it uses and why.

✓Redact PII before sending to the API

✓Use approved/curated corpora only

✓Run locally where possible (open models)

✓Role-based access controls

✓Human review before output reaches users

When a health worker in Nairobi types a patient question into an AI tool, where does that data go? Most AI applications send every query to a model provider's API. The data leaves the country, passes through third-party servers, and may be logged, stored, or used for future training.

Not all data carries the same risk. A grant summary is low-stakes. A patient's HIV status is not. The question to ask: what is the most sensitive piece of data that touches the AI system? Design your safeguards for that case, not the average case.

Before signing a contract with an AI provider, ask these questions. If they cannot answer clearly, that is a red flag. Open-source models let you run inference locally, keeping data in-country. But they require more technical capacity to operate.

These are not hypothetical risks. They have known mitigations. A credible proposal names which of these it uses and why. If the proposal says 'we use AI' but has no data governance plan, it is not ready.

Now let's flip the lens

So far we have focused on how to evaluate AI when it shows up in proposals and products others build. Now the other side: how do you use AI well in your own work? It starts with how you talk to it.

AI Assistant

Summarize this grant proposal.

Thinking...

Looks like a simple question...

Prompt

Summarize this grant proposal.

Response

This proposal requests funding to use AI in healthcare delivery in sub-Saharan Africa.

The project aims to improve health outcomes through technology.

The team has experience in the region and plans to deploy the solution over 18 months.

Vague. Generic. You could have written this without reading the proposal.

Strong Prompt

Structured Response

1. Problem

Maternal mortality in rural Kenya (547 per 100K live births in target counties). Population: 2.3M women of reproductive age.

2. Intervention

AI triage chatbot for antenatal risk screening via SMS. Routes high-risk cases to community health workers within 24 hours.

3. Evaluation

Stepped-wedge cluster RCT across 12 facilities. Primary outcome: referral completion rate.

4. Budget

$1.2M over 24 months. $4.80 per beneficiary reached.

5. Risks

No Swahili evaluation data provided. SMS delivery assumes network coverage that may not exist in 3 of 12 target counties.

Specific. Structured. Actionable. Same model, better instructions.

💬PROMPT

Using ONLY the attached proposal, summarize...

📄SOURCE DOCUMENT

Grant proposal.pdf (42 pages)

✓OUTPUT

Answers grounded in the actual document, not the model's training data

The source document is the anchor. Without it, the model guesses.

Chain-of-Thought Prompt

Should we fund this AI health screening proposal? Think through this step by step.

Reasoning Steps

Step 1

The problem is real and well-documented. Maternal mortality in these counties is 4x the national average.

Step 2

The AI component (SMS triage) is appropriate. It does not replace clinical judgment, it routes patients faster.

Step 3

The evaluation plan is strong (RCT), but the sample size (12 facilities) is small for the effect size they expect.

Step 4

Cost per beneficiary ($4.80) is reasonable, but does not include ongoing API and maintenance costs after Year 2.

Step 5

Critical gap: no Swahili language evaluation. The model may perform poorly for the actual user population.

Showing the work catches what a quick answer misses.

When you type a question into ChatGPT, it feels like a Google search. Casual. Low-stakes. But that one sentence is the entire specification the model has to work with. It defines the task, the quality bar, and the format of the answer, all at once.

A vague prompt gets a vague answer. "Summarize this grant proposal" produces the kind of generic overview you could write without reading the document. The model is not being lazy. It is doing exactly what you asked for, which was not very much.

A strong prompt reads like a job brief: it assigns a role, provides context, defines the structure you want, and sets constraints on quality. Same model, same proposal. The output is dramatically more useful, because the instructions are dramatically more specific.

A strong prompt tells the model what to do. Grounding tells it what to use. By attaching source documents and telling the model to use ONLY those documents, you reduce hallucination dramatically. This is called RAG (retrieval-augmented generation). It is the single most practical defense against hallucination.

Chain-of-thought prompting goes further. By asking the model to reason step by step, you get transparency into its logic. Each step can be checked. Errors surface before they reach the conclusion. This is the foundation of how modern "reasoning" models (like o1 and o3) work: they spend more compute thinking before answering.

Agents and agentic workflows

The prompts section showed you how to design individual tasks. But what happens when AI doesn't just answer, it acts? A growing category of tools gives the model autonomy to plan, use tools, and iterate.

one-shot

Prompt

Model

Response

A basic AI interaction is one-shot: you send a prompt, you get a response. There is no memory, no follow-up, no access to external tools. It is the equivalent of asking a colleague a question in passing.

A chain adds structure. Instead of one prompt producing one response, the task is broken into sequential steps. Each step feeds its output to the next. This is like handing someone a checklist: research the topic, draft a summary, then review for accuracy.

A workflow adds branching. Instead of a fixed sequence, the system can route to different paths depending on the input, run steps in parallel, or pause for human approval before continuing. The code decides the path. The model handles each step. Most production AI systems are workflows, not chains.

An agent goes further. It plans its approach, executes a step, observes the result, and decides what to do next. It can call external tools (databases, web searches, code environments). The model is no longer just generating text. It is taking actions in the world, looping until the task is done.

That loop can repeat dozens of times for a single task. Each iteration adds cost, latency, and places where errors compound. Agents are where the greatest promise and greatest risk live. When reviewing an AI proposal that uses agents, ask: what guardrails are in place? Who monitors the loops? What happens when the agent gets stuck?

Reasoning models spend more compute before answering. They 'think step by step' internally before producing a response. This improves accuracy on complex tasks but costs 5-20x more per query. For grant budgets, this trade-off matters.

Not every task needs an agent. Most tasks need a well-designed prompt. Some need a chain or workflow. Very few need a full agent loop. Match the pattern to the stakes, not to the hype.

In practice, most people start at the base and move up as they build trust and design better workflows. Each level multiplies what one person can do, but also multiplies the need for guardrails.

An agent skill is just a prompt saved as a file. It defines a role, provides context, sets constraints, and can attach templates and examples. The same file a human reads is what the agent executes. No code. No magic. Repeatable, shareable, and version-controlled.

But a skill is more than a static prompt. It can run code, call tools, query databases, read files, and orchestrate multi-step workflows. Things that would take hours or days to set up by hand run instantly, every time, from a single file.

Every skill you write, every template you refine, every constraint you add after catching an error compounds. The first review takes effort. The fiftieth runs in minutes. The investment is front-loaded. The returns are not.

In February 2026, a Meta AI researcher gave an autonomous agent access to her email and calendar. The agent, trying to ‘organize’ her inbox, began deleting messages it classified as low-priority. Three weeks of email, gone. More autonomy without more guardrails is not progress. It is risk.

The most effective pattern is not full autonomy. It is a loop: the agent drafts, the human reviews, the feedback improves the skill, and the next run is better. You are not replacing your judgment. You are scaling it.

Putting it into practice

You now understand how AI works, where it breaks, and how to evaluate it. Here is how to use that knowledge.

Grant Proposal SummaryAI DRAFT

The proposed intervention targets 12,000 households

across 3 districts in Northern Kenya.

Previous studies show a 40% reduction in costs.

👤Source? Which studies?

The AI model achieves 97% accuracy on our dataset.

👤Benchmark or field accuracy?

We plan to scale nationally within 6 months.

👤What about local languages?

Community health workers will use the tool daily.

3 items flagged for reviewHUMAN REVIEW

Task design checklist

🎯

Role"You are a grants analyst at a development finance institution."

📋

ContextAttach the proposal, background research, and evaluation criteria.

📐

Format"Produce a structured summary with 5 sections."

🚧

Constraints"Be specific. Flag any claims that lack evidence."

⭐

ExamplesShow what a good output looks like so the model has a target.

Better inputs produce better outputs. Every time.

MORE AI AUTONOMYMORE HUMAN OVERSIGHT

Brainstorming ideasAI-LED

Drafting emailsAI-LED

Summarizing reportsAI-ASSISTED

Translating documentsAI-ASSISTED

Due diligencing applicationsHUMAN-LED

Health recommendationsHUMAN-LED

Legal or financial decisionsHUMAN ONLY

The higher the stakes, the more human judgment you need.

Evaluation readiness

Before deploying, check each level:

✓

1Model

Does the AI behave correctly?

2Product

Is the tool being used as intended?

3User

Does it help the people it serves?

4Impact

Does it improve development outcomes?

Most proposals only show Level 1. Ask about the rest.

The collaboration loop

AI draftsFast first pass

↓

Human reviewsCatches errors and bias

↓

Human decidesJudgment stays with people

↓

System improvesFeedback refines quality

↻ repeat

Not human vs. AI. Human with AI.

Your four essential questions

No technical expertise required. Just rigor.

What problem is AI solving that simpler software cannot?

Complexity should be proportional to benefit.

Where are humans in the loop for high-stakes decisions?

Someone must review before outputs reach real people.

Has it been tested with the actual population served?

Demo accuracy is not field accuracy.

What happens when the model is wrong?

Every model fails sometimes. The question is whether anyone catches it.

These questions apply the same rigor you bring to any grant proposal. AI is not magic. Treat it accordingly.

AI is a first draft, not a final answer. Every output is a starting point. The model will sound confident even when it is wrong, so treat its work the way you would treat a junior colleague's first attempt: useful, fast, and in need of review.

Design the task, not just the prompt. The word "prompt" makes it sound casual, like typing a question into a search bar. In practice, you are writing a job specification. Define the role, provide context, set a format, add constraints, and show examples of what good looks like. The quality of the output is directly proportional to the quality of the brief.

Match the tool to the stakes. AI can run unsupervised on low-stakes tasks like brainstorming or drafting internal notes. But as the consequences grow (health advice, financial decisions, legal analysis) so should the level of human oversight. There is no single right answer. The right question is: what happens if this output is wrong, and who catches it?

Evaluate before you deploy. Remember the four levels: model, product, user, impact. Most AI proposals only demonstrate Level 1 (the model behaves correctly in a demo). Before funding or deploying, ask what evidence exists at Levels 2 through 4. If the answer is "none yet," that is not a disqualifier. But it should shape the scope and the budget.

Keep humans in the loop. The most effective AI systems are collaborative, not autonomous. AI drafts and surfaces patterns. Humans review, apply judgment, and make final decisions. This loop gets better over time as feedback refines the system. The goal is not to replace human expertise. It is to extend it.

Up Next / Meetup #2

Learn Enough Codex To Be Dangerous

May 29, 2026 · 6pm EAT · Online (Zoom)

Learn More & Sign Up →