Where the Hallucinations Got Through
Two retractions in a week, one in consulting and one in research preprints. The shared failure isn't the model. It's what the review layer was never built to catch.
Ernst & Young pulled a published study this week because someone reading it noticed the citations didn't exist. That's the version that made the Financial Times. The version that should make procurement officers pause is that the citations made it through whatever internal review EY runs before a study with its name on it goes out the door.
The firm's framing is that this is a cautionary tale about AI misuse. That framing is convenient. It locates the problem in the tool (the technology that misled the professionals) and lets the institution off the hook for the part that actually broke. The study didn't slip past review because the model is too clever. It slipped past because the review wasn't checking the thing the model is known to get wrong.
A day later, on the other end of the credibility economy, arXiv's moderators announced they would start banning submitters whose papers contain AI-generated hallucinations. The announcement came via a moderator's social-media post, not a formal policy document, which is its own tell. A social-media post doesn't carry the institutional weight of a board resolution, and the casual framing undersells what's happening. The world's largest preprint server is admitting that its existing screening can't catch what's coming in.
Same gap, different uniforms
Strip away the contexts (Big Four consulting and open-access physics preprints share almost nothing in audience, incentive, or workflow) and what's left is a single shape. AI output enters the pipeline. A human review layer exists to catch errors. The review layer is calibrated for the errors humans used to make: typos, weak arguments, sloppy reasoning, citations that go to the wrong page. It is not calibrated for plausible-looking citations to papers that don't exist, or for confident summaries of statutes that were never passed. The model produces a new error class. The review process catches the old one.
Call it the review-layer gap. Every organization deploying generative AI into knowledge work has one, whether they've audited for it or not. EY has it. arXiv has it. The midmarket consultancy pitching your CFO has it, and probably hasn't named it.
The pricing question
Here's the part the procurement conversation hasn't caught up to. When you buy advisory work from a brand-name firm, you're paying for the review layer at least as much as you're paying for the analyst's keystrokes. The whole pricing premium of a Big Four engagement over a freelance shop is the institutional check that says: this passed through people whose careers depend on it being right. A hallucination in the deliverable doesn't just damage the deliverable. It devalues the premium.
I don't know whether EY's clients on this engagement will pursue contractual remedies, and that uncertainty matters. If they don't, the market signal is that buyers will absorb the cost of bad output rather than litigate the standard. If they do, every Big Four AI policy gets rewritten by Q4. Right now the procurement market is pricing this risk at roughly zero.
The next vendor meeting
The right question for a consulting vendor is no longer "do you use AI." Every honest answer is yes. The right question is: what does your review layer specifically do to catch fabricated citations, invented case law, and confidently wrong numerical summaries before they reach me? If the answer is "our consultants review the output," you have your answer about the review-layer gap, and it's the same answer EY just gave the world.
arXiv has the easier version of this problem: refuse the submission, ban the submitter, write the policy. A consulting firm can't ban its own juniors. It has to rebuild the review layer for an error class the layer wasn't designed to catch. That's a months-long process, not a memo.
Does the contract you signed last quarter say who eats the cost when the deliverable hallucinates?
