A minimally viable AI policy
A six-clause framework for reflective AI adoption, between reflexive enthusiasm and blanket bans.
A year ago, in April 2025, Shopify CEO Tobi Lütke’s internal memo leaked: “Reflexive AI usage is now a baseline expectation at Shopify.” Since then, Shopify has doubled down: universal adoption of AI code editors, thousands of Cursor licenses, thousands of internal prototype sites. At that scale, with in-house ML and the measurement infrastructure to catch regressions, reflex was a defensible posture, and it has worked for them so far.
Yet it hasn’t worked as well for others. Klarna, which claimed in early 2024 that one AI assistant did the work of 700 customer-service agents, quietly started hiring humans back because the replacement quality did not hold up. Duolingo walked back its AI-first memo’s most enforceable piece, removing AI usage as a performance-review criterion in April 2026. Both adopted the memo’s language without the scaffolding underneath it.
Reflexive AI usage was the right call for Shopify. It is not the right call for most teams, including ours. Institutions like NIST, the ICO, and the European Commission have published frameworks for enterprises and regulators. Here’s a shorter take, for small practitioner teams.
What any AI policy should be trying to do
Let’s take a moment to articulate what any AI policy should be trying to enable versus discourage inside an organization.
Most of what we would want a policy to prevent falls into a few named patterns.
- Shadow AI use, where the work happens with the help of AI but the team cannot see or audit it.
- Silent under-tooling, where someone who wants to try a tool holds back because no one said it was allowed.
- Evidence-free enthusiasm or fear, the Tobi trap of mandating what the evidence does not yet support, or the ban trap of forbidding what no incident has shown to be harmful.
None of these failures show up in the moment. They show up six months later as missed leverage or quiet attrition.
What we want to enable is harder to write down because it reads as fuzzy until you watch it happen.
- Real experimentation, done safely on real work rather than toy demos.
- Honest measurement of what is working and what is not.
- Knowledge compounding, where what one person figured out becomes leverage for the next.
- Safe disagreement, where someone can say “this tool made me slower” and the response is “show us what you tried,” not a performance review.
Our AI policy framework
What follows is the shortest policy we could write that still enables what we want and heads off what we don’t. We walk through how each clause plays out for us, but the framework itself is meant to be adapted to your organization.
1. Purpose
A Purpose clause is the one piece of a policy most readers will remember. It is also the clause doing the most philosophical work. It names what you want AI used for and what you refuse to accept. Done well, it lets a stranger predict which calls you would make without reading the rest.
Here is ours.
The Agency Fund uses AI to move faster on research, analysis, and the work of expanding human agency. We do not overstate what the tools did, and we do not bet the work on models that are not yet ready for it.
Our purpose clause does two things at once: it gives the team permission to move on AI where it helps the mission, and it binds us to an honesty floor about what the tools actually did. We phrased it this way because our work is research and analysis that compounds only when attribution stays clean, on behalf of a mission where overclaiming would cost real trust. Another organization will land somewhere different because the leverage and the stakes are different.
2. Data tiers
Most “AI policy violations” are really data violations. Someone pasted client material into a chat that trained on it. Naming a fixed set of tiers, and attaching approved tools and off-limits behavior to each, is the move that keeps a policy short as it ages. A hundred scenario-by-scenario judgments collapse into four. New tool? Slot it in a tier. New data type? Same move.
The tiers silently ask three questions: does the tool train on what you send it, can someone other than you see the usage, and can a connector reach data you didn’t explicitly hand over?
Here is ours.
Four AI data classification tiers, each with approved tools and off-limits behavior:
- Public. Approved: any AI service; consumer tier is fine. Off-limits: nothing tier-specific.
- Internal. Approved: business or API tiers with training disabled and feedback reporting off. Off-limits: consumer accounts; tools with no admin console.
- Client and IP. Approved: the approved business and API stack with verified retention and connector settings. Off-limits: general web services, screenshot-to-AI tools, unclear data residency.
- Regulated. Approved: enterprise contracts with DPAs, admin-enforced settings, logged access only. Off-limits: anything not pre-approved by legal.
The particulars will shift for your team. The shape should not.
For Internal data and above, “approved” means a business or API plan with verified training, retention, and feedback settings, not a checkbox in a personal account. A consumer account may train on our data where a business account does not, and feedback channels sometimes retain inputs for safety review even when training is disabled.
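As a rough illustration, the tier rules above can be written down as a small check that answers "may this tool receive data of this tier?". This is a minimal sketch, not our actual tooling: the `Tool` fields and the `max_tier` and `is_allowed` helpers are hypothetical names, and the Regulated tier's requirements are simplified to two flags.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Data tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CLIENT_AND_IP = 2
    REGULATED = 3


@dataclass(frozen=True)
class Tool:
    """Hypothetical description of an AI tool as the policy sees it."""
    name: str
    business_or_api_plan: bool = False          # False means a consumer account
    has_admin_console: bool = False
    training_disabled: bool = False
    feedback_reporting_off: bool = False
    retention_and_connectors_verified: bool = False
    enterprise_dpa: bool = False                # simplification of the Regulated row
    legal_preapproved: bool = False


def max_tier(tool: Tool) -> Tier:
    """Return the most sensitive tier this tool may receive."""
    internal_ok = (
        tool.business_or_api_plan
        and tool.has_admin_console
        and tool.training_disabled
        and tool.feedback_reporting_off
    )
    if not internal_ok:
        return Tier.PUBLIC  # any AI service is fine for public data
    if not tool.retention_and_connectors_verified:
        return Tier.INTERNAL
    if not (tool.enterprise_dpa and tool.legal_preapproved):
        return Tier.CLIENT_AND_IP
    return Tier.REGULATED


def is_allowed(tool: Tool, data_tier: Tier) -> bool:
    """A tool may handle data at or below its maximum tier."""
    return data_tier <= max_tier(tool)
```

Slotting a new tool into a tier then means filling in one `Tool` record rather than writing a new rule, which is the property that keeps the policy short as it ages.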
Meeting notetakers sit across every tier at once: a meeting drifts from public to regulated in five minutes and the bot cannot tell. We scope the notetaker to the most sensitive content likely to surface, not the topic on the calendar invite. Default off for HR, 1:1s, board discussions, and partner conversations with identifiable user data. Default on elsewhere, with transcripts stored in the matching tier and PII stripped before anything downstream sees them.
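The notetaker rule can be sketched the same way. The example below assumes a hypothetical set of meeting-type labels and a `notetaker_plan` helper; it only captures the two decisions described above (default on or off, and which tier the transcript is stored in) and leaves PII stripping out.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Data tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CLIENT_AND_IP = 2
    REGULATED = 3


# Hypothetical labels for the meeting types that default to "off".
DEFAULT_OFF = {"hr", "one_on_one", "board", "partner_identifiable_user_data"}


def notetaker_plan(meeting_type: str, likely_tiers: list[Tier]) -> tuple[bool, Tier]:
    """Return (notetaker_enabled, storage_tier) for a meeting.

    The storage tier follows the most sensitive content likely to
    surface, not the topic on the calendar invite.
    """
    storage_tier = max(likely_tiers, default=Tier.PUBLIC)
    enabled = meeting_type not in DEFAULT_OFF
    return enabled, storage_tier


# A project sync that might touch client material: the notetaker stays on,
# and the transcript is stored in the Client and IP tier.
print(notetaker_plan("project_sync", [Tier.PUBLIC, Tier.CLIENT_AND_IP]))
```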
Connectors deserve more ink than the rest of this section because the mistakes they enable stay quiet until they are not. Three security concerns are worth sitting with rather than skimming.
- Scope creep: a grant to “read Drive” reaches every document the user has ever been shared on, not just the folder they had in mind, and a “read email” grant pulls years of client attachments into a context window the model was never meant to see.
- Supply chain: the connector operator is a third party in the trust surface between us and the model, usually with its own subprocessors and retention posture that nobody on the team has read.
- Prompt injection: a document pulled through a connector can carry instructions that the agent then executes against other connectors it has been granted, which is the confused-deputy problem at agent scale and the reason a poisoned PDF in Drive can send email on someone’s behalf.
Our rule is that new connectors require a named owner, are approved per tier rather than globally, and are scoped to the narrowest permissions that still let the work happen.
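The connector rule (named owner, per-tier approval, narrowest scopes) is mechanical enough to check in code. The following is a sketch under assumptions: `ConnectorRequest`, `review`, and the scope strings are hypothetical and do not correspond to any real vendor's permission names.

```python
from dataclasses import dataclass


@dataclass
class ConnectorRequest:
    """Hypothetical request to enable a connector for one data tier."""
    name: str
    owner: str                  # named human owner; empty means nobody
    tier: str                   # the single tier this request covers
    requested_scopes: set[str]  # what the connector asks for
    needed_scopes: set[str]     # the narrowest scopes the work requires


def review(req: ConnectorRequest, approvals: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the request passes.

    `approvals` maps a connector name to the tiers it has been approved
    for, so approval is per tier rather than global.
    """
    problems = []
    if not req.owner.strip():
        problems.append("no named owner")
    if req.tier not in approvals.get(req.name, set()):
        problems.append(f"not approved for tier {req.tier!r}")
    extra = req.requested_scopes - req.needed_scopes
    if extra:
        problems.append("scopes broader than needed: " + ", ".join(sorted(extra)))
    return problems


# Illustrative data only: a whole-drive read grant where one folder would do.
approvals = {"drive": {"public", "internal"}}
request = ConnectorRequest(
    name="drive",
    owner="",
    tier="client_and_ip",
    requested_scopes={"drive.read_all"},
    needed_scopes={"drive.read_folder"},
)
for problem in review(request, approvals):
    print(problem)
```

A check like this catches scope creep at approval time. It does nothing for prompt injection, which still depends on limiting what an agent can do with the connectors it has been granted.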
3. Paid seat for daily drivers, API keys for secondaries
Every AI vendor has visible weaknesses and invisible ones. The visible ones are why your favorite is your favorite. The invisible ones are why monoculture quietly gets expensive, and a policy without a second vendor in rotation bakes them into the team’s judgment.
Here is ours.
Figure: hub-and-spoke diagram of one paid-seat daily driver with three secondary models on API keys; different teams configure the same shape differently.
The daily driver gets a paid seat. At frontier pricing today, a seat is more cost-effective than API usage once a team crosses even a modest volume threshold, and it opens the full product surface: memory, projects, connectors, the features that compound with daily use.
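The seat-versus-API comparison is simple arithmetic, and it is worth rerunning with your own numbers. The sketch below uses placeholder prices and token volumes chosen only to make the example work; they are not any vendor's actual pricing, and the function names are hypothetical.

```python
def monthly_api_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimated monthly API spend for one person's usage."""
    return (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )


def cheaper_option(seat_usd_per_month: float, api_usd_per_month: float) -> str:
    """Return "seat" when the API would cost more than the seat, else "api"."""
    return "seat" if api_usd_per_month > seat_usd_per_month else "api"


# Placeholder figures, NOT real vendor prices.
SEAT_USD = 25.0
USD_PER_M_INPUT = 3.0
USD_PER_M_OUTPUT = 15.0

heavy_user = monthly_api_cost(4_000_000, 1_000_000, USD_PER_M_INPUT, USD_PER_M_OUTPUT)
light_user = monthly_api_cost(1_000_000, 200_000, USD_PER_M_INPUT, USD_PER_M_OUTPUT)

print("heavy user:", cheaper_option(SEAT_USD, heavy_user))
print("light user:", cheaper_option(SEAT_USD, light_user))
```

The comparison covers cost only. It leaves out the product surface a seat opens (memory, projects, connectors), which is part of why the daily driver gets a seat in the first place.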
For everything else, API keys. When someone on the team wants to pressure-test work against a second model, or a third, we give them a key. The marginal cost is low and the return is asymmetric: one person discovering that Model B catches what Model A misses on a specific task type is an insight the whole team benefits from, if they share it.
We pick secondary models for distance from the primary. Different trainer, different architecture, different strengths. The goal is heterogeneity, not a backup. When someone finds a meaningful difference, that finding circulates. Over time the team builds a shared sense of which model to reach for and when, which is worth more than any single subscription.
4. Ownership
Our Purpose clause normalizes AI use within the team. What it does not settle is who stands behind the output once the tools have shaped it. A disclosure regime tries to settle that question with labels, and as AI use becomes universal the labels drift toward theater. The harder and more durable standard is ownership.
Here is ours.
Taking responsibility for the work we ship is a core value. It does not change when a model produced the first draft.
Ownership takes two forms inside our work. Internally, it shows up as specific attribution: “Claude drafted this section and I rewrote the middle third because it missed the funder’s constraint.” Vague credit, “AI helped,” starves the feedback loop the team relies on, and specificity turns one person’s ten hours of figuring something out into shared intuition about where the tools help and where they mislead. Externally, ownership shows up as work a named human has checked and is willing to put their name on, regardless of whether AI touched it. The question we ask before anything leaves the organization is not “did AI help write this?” but “am I willing to own every claim in it?”
5. Safety gates
Most work inside an AI-positive organization needs no special gate, but three surfaces do. The broadest is externally consumed content: anything a client, funder, grantee, reader, or regulator will judge the organization on. We do not ship that content without a named human at The Agency Fund (TAF) who has read it and is willing to own the work, because reputation erodes faster than trust recovers. The other two surfaces are narrower. One is autonomous actions, where an agent that deletes files, sends messages, or calls APIs can do meaningful damage at speed. The other is judgment-critical decisions, where the cost of being wrong is high and regulators agree; the EU AI Act explicitly classifies hiring and employee evaluation as high-risk. All three want humans in the loop, and the shape of that loop is different in each one.
Here are ours.
Three safety gates, each with its own requirements:
- Externally consumed content: a named TAF reviewer who reads before ship and is willing to own it.
- Decisions: a qualified reviewer, written sign-off, and a known escalation path.
- Autonomous actions: dry-run by default, bounded access, and audit logs.
For review: named means the reviewer owns what they signed off on, which a committee cannot. The gate applies regardless of how much AI shaped the content, zero percent or ninety-nine percent, and what leaves the organization has a human behind it.
For decisions: qualified means the reviewer can overrule the AI, written means there is a record someone can audit later, escalation means everyone knows who to call when the reviewer is unsure. The list is short on purpose; expand it to everything and we have written a ban.
For actions: dry-run lets the human see what is about to happen, bounded access keeps a bug small, audit logs let us reconstruct what occurred. The threshold is “could this be reversed cheaply?” If the answer is no, a human looks at the plan before the agent runs.
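One way to make the gates concrete is a small routing function that takes a description of a piece of work and returns the gates it must pass. This is an illustrative sketch: the `WorkItem` fields and gate labels are hypothetical, and a real version would need the team's own definition of "cheaply reversible".

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    """Hypothetical description of a piece of work headed out the door."""
    leaves_organization: bool = False      # externally consumed content
    judgment_critical: bool = False        # e.g. hiring, employee evaluation
    agent_acts_autonomously: bool = False  # deletes files, sends messages, calls APIs
    cheaply_reversible: bool = True        # only meaningful for autonomous actions


def required_gates(item: WorkItem) -> list[str]:
    """Return the safety gates that apply; most work returns an empty list."""
    gates = []
    if item.leaves_organization:
        gates.append("review: named reviewer reads before ship and owns it")
    if item.judgment_critical:
        gates.append("decision: qualified reviewer, written sign-off, known escalation")
    if item.agent_acts_autonomously:
        gates.append("action: dry-run default, bounded access, audit logs")
        if not item.cheaply_reversible:
            gates.append("action: human reviews the plan before the agent runs")
    return gates


# An agent that sends email to partners: external content plus an
# autonomous action that cannot be undone cheaply.
for gate in required_gates(
    WorkItem(leaves_organization=True, agent_acts_autonomously=True, cheaply_reversible=False)
):
    print(gate)
```

The empty-list default is deliberate in this sketch: internal drafts and exploratory work pass through with no gate, which keeps the list from growing into a ban.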
6. Shared learning
A team that uses AI individually plateaus. Each person learns what works for them, then the knowledge stays with them. A team that pools what it figures out compounds, because the tenth hour one person spent on a problem becomes someone else’s first hour. Two things are worth pooling and they pool differently: artifacts that someone else can pick up and run (prompts, agents, skills, MCP configs), and stories from the trenches (what worked, what broke, what nearly shipped before someone caught it).
Here are ours.
The skills repo is the mechanism for reuse; a skill someone authored for one partner situation becomes the starting point for the next. The wins-and-losses channel is the mechanism for context; the tenth-hour lesson about when a tool misleads is easier to post in three sentences than to polish into an artifact, and a team that reads each other’s wins and losses builds intuition faster than one that only reads each other’s prompts. Contributing is optional because mandates degrade contributions. Whatever any of us figures out, we leave a copy where the next person looks, and both the repo and the channel stay searchable by a new hire in under a minute, or nobody uses them.
Reflective over reflexive
Tobi’s memo said “reflexive AI usage is now a baseline expectation.” We think that is the wrong primitive.
Reflex is fast, unexamined, and good for things a person has already learned to do; a frontier tool that none of us has used for even a year does not qualify. Every clause above is an attempt to make reflection cheaper than its alternative.
What we deliberately cut is also in service of this: mandatory AI use, blanket tool bans, announce-every-use disclosure, keystroke surveillance, an ethics committee, an enumerated prohibited-use list beyond the EU AI Act’s, and mandatory certification. Each performs safety rather than producing it. Each would make reflection harder by raising the cost of trying a new thing above the cost of staying with the current one.
A policy that asks for reflection has to be willing to be wrong. We watch a handful of signals to find out where it is wrong: waiver rate (a low rate means the policy fits), activity in the skills repo and wins-and-losses channel, spend on the second-preferred model, and the incident set, where near-miss rate should be high, shipped-harm severity should trend down, and time-to-detect should shrink. Catching mistakes early and fixing the system that let them happen is the actual success criterion.
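The incident signals can be computed from a plain incident log. The sketch below is one possible reading, with assumptions stated in the code: near-miss rate is taken as the share of incidents caught before anything shipped, severity uses a hypothetical 1 to 5 scale, and the `Incident` fields are illustrative.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Incident:
    """Hypothetical incident record."""
    hours_to_detect: float
    shipped: bool   # True if harm left the organization; False is a near miss
    severity: int   # assumed 1 (minor) to 5 (severe) scale


def summarize(incidents: list[Incident]) -> dict[str, float]:
    """Summarize one period's incidents; compare periods to see the trend."""
    if not incidents:
        return {}
    near_misses = [i for i in incidents if not i.shipped]
    shipped = [i for i in incidents if i.shipped]
    return {
        # Assumption: near-miss rate = near misses / all incidents.
        "near_miss_rate": len(near_misses) / len(incidents),
        "max_shipped_severity": max((i.severity for i in shipped), default=0),
        "median_hours_to_detect": median([i.hours_to_detect for i in incidents]),
    }


# Illustrative data only.
quarter = [
    Incident(hours_to_detect=1, shipped=False, severity=2),
    Incident(hours_to_detect=2, shipped=False, severity=3),
    Incident(hours_to_detect=3, shipped=False, severity=1),
    Incident(hours_to_detect=10, shipped=True, severity=2),
]
print(summarize(quarter))
```

A single period's numbers say little on their own; the signal is whether near-miss rate stays high while shipped severity and time-to-detect fall from one period to the next.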
Reflexive AI usage was the wrong baseline. Reflective AI usage is the one we suggest, held loosely, open to being wrong, meant to evolve as we learn.