Clinical AI's Missing Denominator

The Ontario audit reads as another AI hallucination story. Read it again. The detail the coverage is missing isn't in the failures — it's in the question the audit forces every procurement officer to answer.

May 15, 2026 · 4 min read

The Ontario audit landed this week with predictable headlines. AI notetakers hallucinating. Made-up therapy referrals. Prescriptions in the chart the physician never wrote. The consensus reading is that this is another AI safety story — vendors shipped tools that weren't ready, and the failures finally got documented in a place regulators couldn't ignore.

That reading is not wrong. It's just shallow.

The detail worth dwelling on isn't in the failure list itself. It's in the question the audit forces on every procurement officer in healthcare, a question nobody has a good answer to: what error rate, exactly, would have been acceptable?

Before go-live

A clinical lab test gets approved with a defined sensitivity and specificity. A surgical device ships with a documented adverse event threshold. A drug label reports the adverse reaction rates seen in its trials. The number is never zero. The number is also never undefined. Somewhere in the regulatory file, somebody wrote down what "working" means in numbers, and what "broken" means in numbers, and the gap between them is the operational definition of acceptable risk.

Clinical AI notetakers entered Ontario hospitals without that number. Not a stricter version of the number. The number, full stop. No documented hallucination rate the system had to clear before deployment. No agreed denominator against which the audit's findings could be benchmarked.

This is what separates the Ontario audit from a vendor disclaimer or a researcher's bench test. A government body went looking, found the harm, and discovered nothing in the procurement trail against which the harm could be measured.

Adjectives and numbers

The reason this matters isn't philosophical. It's contractual.

Most enterprise AI SLAs in circulation right now — clinical, financial, legal — contain some version of a vendor promise about safety, accuracy, or compliance. Those promises are written in adjectives. "Industry-leading." "Rigorously tested." "Enterprise-grade safeguards." The promises lack the one thing that would make them enforceable: a number both sides agreed to before the contract was signed.

(Aside: what does breach even look like in a system designed to produce probabilistic outputs? If the vendor warrants "no hallucinations," the first bad output is breach. If the vendor warrants nothing specific, nothing is breach. The middle ground, say fewer than X material hallucinations per thousand documents, measured by Y methodology and adjudicated by Z, exists in almost no contract on the market. The procurement teams didn't ask. The vendors didn't offer.)

Without that middle ground, the SLA is theater. The hospital can't sue for breach because no breach was defined. The vendor can't be held to a standard because no standard was negotiated. The audit can find harms, but the harms don't connect to a contractual remedy. Everyone is operating in a defined absence of definition.
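For concreteness, here is a minimal Python sketch of what a numeric clause of that kind could look like once it has to be checked against an audit sample. Everything in it is an assumption for illustration: the function names, the 5-per-thousand ceiling, the roughly 95% one-sided confidence level, and the choice of a Wilson score interval are mine, not terms from any real contract or from the Ontario audit. It also assumes a reviewer has already counted material hallucinations in a random sample of documents.

```python
import math


def wilson_bounds(errors: int, sampled: int, z: float = 1.645) -> tuple[float, float]:
    """Return (lower, upper) Wilson score bounds on the true error rate.

    z = 1.645 gives roughly 95% one-sided confidence per bound (an assumed choice).
    """
    if sampled <= 0 or not 0 <= errors <= sampled:
        raise ValueError("need sampled > 0 and 0 <= errors <= sampled")
    p = errors / sampled
    denom = 1 + z * z / sampled
    centre = p + z * z / (2 * sampled)
    margin = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled))
    lower = max(0.0, (centre - margin) / denom)
    upper = min(1.0, (centre + margin) / denom)
    return lower, upper


def classify(errors: int, sampled: int, max_per_thousand: float) -> str:
    """Compare a sampled audit result with a hypothetical contractual ceiling."""
    threshold = max_per_thousand / 1000
    lower, upper = wilson_bounds(errors, sampled)
    if upper <= threshold:
        return "compliant"      # even the pessimistic estimate clears the ceiling
    if lower > threshold:
        return "breach"         # even the optimistic estimate exceeds the ceiling
    return "inconclusive"       # the sample cannot settle it either way


print(classify(0, 1000, 5))   # compliant
print(classify(5, 1000, 5))   # inconclusive
print(classify(20, 1000, 5))  # breach
```

The three-way result is the design choice worth noticing. A clause that names a number but not a sample size and a confidence level leaves a wide "inconclusive" band, which is where a dispute would end up.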

The lab-side problem

The New York Times piece this week on AI safety controls, the one arguing that bypassing them three years post-ChatGPT is "almost trivial," overstates its case in places. Not every model bypass is trivial. Not every safety control fails the same way. The framing flattens a real range of difficulty into a single rhetorical claim, and on that specific point the piece is more polemic than survey.

But the underlying observation holds. The labs themselves, the entities with the deepest possible incentive to harden their own products, have not produced safety controls that work reliably under adversarial pressure. If the supply side cannot deliver a controllable failure rate, the demand side cannot procure against one. The hospital can't write "hallucination rate below 0.5%" into a contract when the vendor can't credibly commit to it and the auditor can't credibly measure it.
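The measurement half of that problem is easy to underestimate. As a rough sketch, assume audited documents are independent random draws and that reviewers reliably spot a material hallucination when one is present (both assumptions of mine, not findings from the audit). The function below computes how many error-free documents an auditor would need to review before claiming, at a given confidence, that the true rate sits below a ceiling. The function name and the 95% default are illustrative.

```python
import math


def clean_sample_size(max_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that n error-free documents rule out a rate >= max_rate.

    If the true rate were max_rate, the chance of n clean documents in a row
    is (1 - max_rate) ** n. We need that chance to be 1 - confidence or less.
    """
    if not 0 < max_rate < 1 or not 0 < confidence < 1:
        raise ValueError("max_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


print(clean_sample_size(0.005))  # 598
```

Under those assumptions, a 0.5% ceiling takes roughly 600 clean, expert-reviewed documents to support, and any hallucination found in the sample pushes the required sample size higher.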

This is the part the Ontario coverage isn't connecting. The provincial audit is one jurisdiction's documented harm. The lab-side limitation is global. The intersection is that enterprise AI procurement, in every sector that buys this stuff, is sitting on contracts built around vendor promises that the vendors' own engineering cannot underwrite.

Before CLIA

There's a precedent worth knowing. In the United States, clinical laboratory testing operated for decades on vendor-stated accuracy claims with no uniform federal floor. The Clinical Laboratory Improvement Amendments of 1988 imposed proficiency testing, defined performance standards, and made labs accountable for documented error rates. The legislation didn't appear because labs were uniformly bad. It appeared because Pap smear misreads were killing women, the misreads were documentable, and no acceptable rate had ever been defined.

The shape rhymes. A diagnostic technology gets deployed at scale. A pattern of harm emerges. A jurisdiction audits, finds the harm, discovers the absence of a measurable standard. The standard gets written into law afterward, often poorly, often by people who don't fully understand the technology, often with carve-outs the lobbyists asked for. The interim — the years between the harm becoming visible and the rule becoming binding — is where the lawsuits and the reputational damage live.

Healthcare AI is in that interim now. So is enterprise AI more broadly, with healthcare just running first because the harm is most visible.

Eighteen months

I don't know which jurisdiction codifies an acceptable error rate first, and I don't know whether it'll come from healthcare regulators, state AGs, or a federal agency the current administration hasn't decided to fund. The Ontario audit is provincial; nothing in it automatically constrains a US health system. The lab-side problem is global, but global problems get regulated jurisdictionally and inconsistently.

What I'll commit to is the direction. Inside eighteen months, at least one major enterprise AI procurement standard — issued by a regulator, an insurer, or a hospital association — will include a defined acceptable error rate as a precondition for deployment. Vendors who can credibly commit to a number will gain procurement advantage. Vendors who can't will face contract risk they don't currently price.

The Ontario auditors found the missing denominator. The next people to find it will be litigators.
