Accuracy Doesn't Drive Adoption.
What a century of human-machine trust teaches us about building AI products
Hey there! 👋
You're reading Open Scout 👀
Humans have been delegating control to machines for over a century, and the pattern for how trust gets built has been the same every single time.
Cars had manual transmissions before automatic. Power steering came before cruise control. Cruise control came before adaptive cruise. Adaptive cruise came before lane assist. Lane assist came before self-driving attempts. Each step delegated slightly more control. Each step required the driver to trust the previous one before accepting the next. Nobody went from cranking a Model T to sleeping in a Tesla.
ATMs showed your balance before letting you withdraw. You confirmed the number matched what you expected, THEN you trusted the machine with your money. Online banking previews every transaction before executing. Self-checkout displays each scanned item in real time. The microwave has a glass door so you can watch your food.
The pattern: reversibility first, preview second, graduated control third, explanation fourth, track record over time. Bottom-up. Always.
Many AI products launch without any of this. The user can’t undo, can’t preview, can’t control scope. The team spent six months improving model accuracy and zero weeks on what happens when the model is wrong.
There’s an old observation, usually called Goodhart’s law, that when a measure becomes a target, people optimize the number and forget what it was supposed to measure. That’s what happened with accuracy across the AI industry. The number went up. Adoption didn’t follow, because accuracy was never what users were evaluating. In most productivity and workflow contexts, a 95% accurate model that’s irreversible and opaque loses to an 80% model where mistakes are cheap and transparent. Users make a gut assessment: “what happens if this goes wrong, and can I fix it?” The benchmark score never enters that calculation.
There IS an accuracy floor below which no design can save you. An AI tool that’s wrong half the time won’t get adopted no matter how reversible it is. But above that floor, the gap between 85% and 95% accuracy matters far less than the gap between “I can undo this” and “I can’t.”
Before we begin... a big thank you to this week’s sponsor: EnergyX
How confidence actually gets built
Assaf Elovic at Monday.com and Harrison Chase at LangChain published a framework for this they call CAIR: Confidence in AI Results.
CAIR = Value ÷ (Risk × Correction Effort)
Value: how helpful the AI is when it works. Risk: what goes wrong when it doesn’t. Correction Effort: how hard it is to fix a bad output. High value with a low denominator means adoption. A high denominator means the AI gets ignored regardless of how good it is.
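To make the arithmetic concrete, here is a minimal sketch of the formula in Python. Only the formula comes from the framework. The 1-to-5 scoring scale, the function name, and the two example products are assumptions made up for illustration.

```python
def cair(value: float, risk: float, correction_effort: float) -> float:
    """CAIR = Value / (Risk * Correction Effort)."""
    if risk <= 0 or correction_effort <= 0:
        raise ValueError("risk and correction_effort must be positive")
    return value / (risk * correction_effort)


# Hypothetical scores on an assumed 1-5 scale (not from the framework's authors).
# Same value in both cases; only the cost of being wrong changes.
low_stakes = cair(value=5, risk=1, correction_effort=1)   # cheap, easy-to-fix mistakes
high_stakes = cair(value=5, risk=5, correction_effort=4)  # costly, hard-to-fix mistakes

print(low_stakes, high_stakes)
```

The value term is identical in both calls. The whole difference comes from the denominator, which is the part the product team controls through design rather than through the model.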
The formula is useful. The mechanism underneath it is more interesting. Confidence isn’t one thing. It’s layered, and the layers build in a specific order you cannot skip.
Cursor, the AI code editor, shows all six layers at once. Watch what happens during a typical interaction:
The AI suggests a code block inline. The developer reads it and either accepts it with Tab or rejects it by typing over it. If the accepted suggestion turns out to be wrong, Ctrl+Z. Gone.
The suggestion appeared in the local editor, nowhere near production. The developer can see what the AI wants to do before it executes. They control how much to accept, line by line. Over a few days, they develop a sense for which suggestions to trust and which to double-check.
Every confidence layer is present:
→ Reversibility. Ctrl+Z. Instant. Total. If Cursor auto-deployed to production, adoption would collapse overnight regardless of accuracy. This is the foundation everything else stands on.
→ Preview. Suggestion appears inline. Developer reads before accepting. The focus is on evaluating intent, not cleaning up damage.
→ Consequence isolation. The code runs in a local editor. Not on a staging server. Not anywhere near production. If the suggestion is terrible, the blast radius is zero. This is different from reversibility (undo after the fact) and different from preview (seeing before acting). Consequence isolation means the action happens in a sandbox where nothing real is at stake. Monday.com could apply this by letting users test automations on a duplicate board before deploying to the live one. Draft modes, test environments, sandbox accounts. The principle: let users experience the AI’s output in a space where mistakes cost nothing before trusting it in a space where they do.
→ Graduated control. Accept all, accept line by line, or reject entirely. The developer chooses the autonomy level on each suggestion.
→ Explainability. Suggestion shows context: where it fits, what it replaces, how it connects. The developer evaluates reasoning, not just output.
→ Track record. After a week, the developer trusts boilerplate suggestions but still double-checks complex logic. That calibrated trust comes from accumulated experience, not from a documentation page.
Elovic and Chase analyzed Cursor explicitly: Risk low (local environment), Correction low (one keystroke), Value high. CAIR is high. Adoption is explosive. Now imagine Cursor auto-committing to production. Same model. Risk jumps. Adoption collapses. The model didn’t change. The confidence architecture did.
Why the layers have an order
The intuition is that explainability matters most. “If users understand WHY, they’ll trust it.” Try explaining your reasoning to someone who can’t undo what you just did to their data. They don’t want your explanation. They want the damage reversed.
Reversibility first. Without it, no other signal matters. Preview second. Users need to see what’s happening before evaluating whether to allow it. Graduated control third. Trust in one context (“draft emails”) shouldn’t automatically extend to another (“send emails”). Explainability fourth. Only relevant once users are engaged enough to care about reasoning. Track record last. Earned, not designed.
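One way to picture graduated control is as a per-scope permission table, where trust granted for drafting never leaks into sending. The sketch below is an assumption about how a product might model this. The autonomy levels, scope names, and function names are all hypothetical.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST_ONLY = 0   # AI proposes, the human performs the action
    CONFIRM_EACH = 1   # AI acts only after explicit approval
    AUTONOMOUS = 2     # AI acts on its own


# Hypothetical grants: trust is recorded per scope, never globally.
grants = {
    "draft_email": Autonomy.AUTONOMOUS,
    "send_email": Autonomy.CONFIRM_EACH,
}


def allowed_autonomy(scope: str) -> Autonomy:
    # Scopes the user never granted fall back to the most conservative level.
    return grants.get(scope, Autonomy.SUGGEST_ONLY)


def may_act_without_approval(scope: str) -> bool:
    return allowed_autonomy(scope) == Autonomy.AUTONOMOUS
```

The design choice that matters is the default: an unknown scope gets the least autonomy, so new capabilities start at the bottom and earn their way up.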
Google’s AI Overviews showed what happens when layers get skipped. In 2024, the system recommended adding glue to pizza sauce. Delivered with the same confident tone as accurate answers. No uncertainty indicator. Maximum confidence on every response, which meant the confidence signal contained zero information. Users couldn’t tell right from wrong because the AI treated everything identically.
One incident like that erases months of correct answers. Users trust AI quickly but lose trust faster after unexpected outputs. The trust account fills slowly and drains fast. A bad experience in week one can permanently cap a user’s willingness to delegate.
The product design map
Different products have different natural CAIR profiles depending on what happens when the AI is wrong:
High CAIR (fast adoption):
Code editors (Cursor): local, instant undo, high value
Writing assistants (Grammarly): collaborative, user keeps editorial control
Image generation (Midjourney): output is suggestion, no real-world consequences
Moderate CAIR (adoption depends on design):
Workflow automation (Monday.com): real data, real impact. Elovic suggested that adding a preview screen showing what the AI will do before execution could shift risk from Medium to Low, fundamentally changing the adoption profile without changing the model.
Customer support AI: wrong answers reach real customers, but human review before send changes everything
Low CAIR (adoption struggles regardless of accuracy):
Healthcare diagnostics: autonomous AI diagnosing patients is low CAIR (wrong answer potentially fatal). But assistive AI that flags areas of concern for a human radiologist to review is surprisingly high CAIR, because the human retains decision authority and the AI just sharpens their attention. Same technology, different confidence profile based entirely on who makes the final call.
Financial tools: math errors create legal liability. Wealthfront handles this with a smart division of labor: AI does what AI is good at (pattern recognition, portfolio rebalancing based on market signals) and humans handle what requires precision (tax calculations, regulatory compliance, specific dollar amounts). The confidence comes from each side doing what it’s actually best at, not from the AI trying to do everything.
Legal analysis: hallucinated case law has already caused court sanctions
Low-CAIR industries don’t need better models. They need better confidence architecture. The most effective pattern in these industries is strategic human-in-the-loop, but specifically at KEY decision points, not everywhere. TurboTax doesn’t ask the human to verify every calculation. It runs thousands of computations autonomously and surfaces the big decisions: “Itemizing saves you $2,300 based on your home office deductions. Standard or itemized?” The human reviews the ones that matter. The AI handles the volume that would be impossible manually. That targeted handoff, AI handles scale while human handles judgment at critical moments, is how low-CAIR categories become usable without sacrificing speed.
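The targeted handoff can be sketched as a simple routing rule: most decisions run automatically, and only the ones above some impact threshold get surfaced to a person. This is an illustration, not how any tax product works internally. The threshold, the field names, and the sample decisions are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Decision:
    description: str
    dollar_impact: float


REVIEW_THRESHOLD = 500.0  # hypothetical cutoff for "big enough to ask a human"


def split_for_review(
    decisions: List[Decision], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Decision], List[Decision]]:
    """Return (handled_automatically, surfaced_to_human)."""
    auto, review = [], []
    for d in decisions:
        # Large swings in either direction deserve a human look.
        (review if abs(d.dollar_impact) >= threshold else auto).append(d)
    return auto, review


decisions = [
    Decision("round interest income to nearest dollar", 0.4),
    Decision("apply standard mileage rate", 85.0),
    Decision("itemize instead of standard deduction", 2300.0),
]
auto, review = split_for_review(decisions)
```

Here two small computations run without interruption and the single high-impact choice is put in front of the user, which mirrors the "AI handles scale, human handles judgment" split.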
The 40% warning
Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027. The argument here is that most of those won’t die from technical failure. They’ll die because users refuse to delegate enough to make agents useful.
Agentic AI is building exactly the products that need the confidence stack most and are least likely to have it. Agents take actions. Actions have consequences. Consequences create risk. Delegation only happens when the architecture builds the confidence stack into every agent action:
→ Every action reversible by default
→ Preview mode before autonomous execution
→ Start simple, earn the right to handle complexity
→ Visible reasoning for every action taken
→ Track record dashboard so humans can calibrate trust
Without these, technically capable agents sit unused. The 40% cancellation won’t be a technology failure story. It’ll be a confidence failure story.
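Here is a minimal sketch of what "reversible by default, preview before execution" could look like in code. It assumes every agent action ships with its own undo and a human-readable description. The class names, the `set_field` helper, and the sample record are hypothetical, not taken from any agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class AgentAction:
    description: str             # doubles as preview text and visible reasoning
    apply: Callable[[], None]
    undo: Callable[[], None]     # no undo, no action: reversibility is mandatory


@dataclass
class ActionLog:
    history: List[AgentAction] = field(default_factory=list)

    def preview(self, action: AgentAction) -> str:
        return f"Will {action.description}"

    def execute(self, action: AgentAction, approved: bool) -> bool:
        if not approved:          # nothing runs without approval
            return False
        action.apply()
        self.history.append(action)
        return True

    def undo_last(self) -> bool:
        if not self.history:
            return False
        self.history.pop().undo()
        return True


def set_field(record: dict, key: str, new_value: Any) -> AgentAction:
    old_value = record[key]       # captured now so the change can be reversed
    return AgentAction(
        description=f"set {key!r} from {old_value!r} to {new_value!r}",
        apply=lambda: record.__setitem__(key, new_value),
        undo=lambda: record.__setitem__(key, old_value),
    )


board = {"status": "open"}
log = ActionLog()
action = set_field(board, "status", "done")
preview_text = log.preview(action)
declined = log.execute(action, approved=False)   # user says no: nothing changes
log.execute(action, approved=True)
after_apply = board["status"]
log.undo_last()
after_undo = board["status"]
```

The history list is also the raw material for a track record dashboard: every executed action is recorded with a description a human can read later.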
The org gap
There’s a structural reason the confidence stack doesn’t get built even when people agree it should.
In most AI companies, the ML team owns accuracy. Product and design own UX. Nobody owns confidence. It falls between teams. ML doesn’t think about reversibility because that’s product’s job. Product doesn’t think about model uncertainty because that’s ML’s job. Accuracy gets measured religiously because one team owns it. Confidence doesn’t get measured because no team does.
The fix: assign ownership. Measure confidence directly. After AI-assisted tasks, ask users “How confident were you in this result?” Simple scale. Track over time. Segment by feature, user type, task complexity. This score is likely to predict retention better than any model metric because it captures the thing that actually determines whether users keep delegating.
When confidence scores are flat while accuracy improves, you have a design problem disguised as a model problem. You’ll never discover that if nobody’s looking.
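As a rough sketch of that measurement loop, the snippet below averages survey scores per feature and week, then flags the case where accuracy moved and confidence didn’t. The 1-to-5 scale, the feature names, the sample responses, and the tolerance are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (week, feature, confidence score on a 1-5 scale)
responses = [
    (1, "autocomplete", 4), (1, "autocomplete", 5),
    (1, "bulk_edit", 2), (1, "bulk_edit", 3),
    (2, "autocomplete", 5), (2, "autocomplete", 4),
    (2, "bulk_edit", 2), (2, "bulk_edit", 3),
]


def mean_confidence(rows):
    """Average confidence per (feature, week) segment."""
    buckets = defaultdict(list)
    for week, feature, score in rows:
        buckets[(feature, week)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def looks_like_design_problem(confidence_delta, accuracy_delta, tolerance=0.1):
    """Accuracy improved, but user confidence stayed flat or fell."""
    return accuracy_delta > 0 and confidence_delta <= tolerance


scores = mean_confidence(responses)
bulk_edit_delta = scores[("bulk_edit", 2)] - scores[("bulk_edit", 1)]
# Assume offline accuracy for bulk_edit rose by 5 points over the same period.
flagged = looks_like_design_problem(bulk_edit_delta, accuracy_delta=0.05)
```

Segmenting by feature is what makes the signal useful: an overall average would hide that one feature is trusted while another is stuck.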
Confidence as a moat
Product teams consistently miss this: confidence isn’t just an adoption lever. It’s a competitive moat.
Users who trust your AI have built something specific over months: calibrated confidence. They know “trust it for boilerplate code, double-check complex logic.” They know “its email drafts are solid but its subject lines need editing.” They’ve developed a nuanced mental model of when to trust and when to verify. When a competitor launches with better accuracy but no confidence stack, switching means losing all that calibration and starting from zero. That’s expensive in a way feature comparisons and pricing sheets don’t capture.
Accuracy advantages are temporary. Models improve every quarter. A competitor with a worse model today might have a better one in six months. Confidence advantages compound because track record, the final layer, only grows through sustained use. A user who’s been calibrating trust for a year carries a switching cost that better benchmark scores can’t overcome.
But the moat requires maintenance. If your model regresses, if a bad update introduces new failure modes, accumulated trust drains fast. One surprising error in a context where the user had learned to trust the AI can undo months of earned confidence. The moat compounds only as long as the product continues meeting or exceeding the user’s calibrated expectations. A major regression can destroy years of trust in one release.
Every month of consistent performance deepens the moat. The accumulated weight of “I know when to trust this and when to verify” is the most durable advantage available in a market where models are commoditizing and accuracy gaps close faster than confidence gaps do.
Thanks for reading.
If you enjoyed this issue, send it to a friend—it helps more than you think.
Back in your inbox Thursday,
Energy Exploration Technologies, Inc. (“EnergyX”) has engaged Open Scout to publish this communication in connection with EnergyX’s ongoing Regulation A offering. Open Scout has been paid in cash and may receive additional compensation. Open Scout and/or its affiliates do not currently hold securities of EnergyX.
This compensation and any current or future ownership interest could create a conflict of interest. Please consider this disclosure alongside EnergyX’s offering materials. EnergyX’s Regulation A offering has been qualified by the SEC. Offers and sales may be made only by means of the qualified offering circular. Before investing, carefully review the offering circular, including the risk factors. The offering circular is available at invest.energyx.com/.
Comparisons to other companies are for informational purposes only and should not imply similar results. Past performance is not indicative of future results. Market shortfall figures are forward‑looking estimates and are subject to substantial uncertainty.


