The “Review and Sign-Off” Fallacy
What Ontario’s Scribe Procurement Tells Us About Shifting Liability
The Canadian experience with ambient AI scribes recently soured when the Ontario Auditor General released a special report on AI governance. Spoiler alert… the results were not good. Let’s dive into the details. As always, keep sending in your ideas for future newsletters, and don’t forget to subscribe and tell a friend.
Sam
When government audit reports touch clinical medicine, they usually focus on infrastructure, budgets, or wait times. Every so often, however, an audit exposes a tension that reaches directly into our daily practice.
The Special Report on AI Governance recently released by Ontario Auditor General Shelley Spence is one of those documents. The report evaluated how Ontario’s public sector manages emerging technologies, including a detailed review of the province’s growing use of “AI scribes,” ambient documentation systems promoted through initiatives like the Ontario AI Scribe Program to reduce physician administrative burden and burnout.
The timeline is important. The testing that exposed significant performance gaps occurred in mid-2024, when Supply Ontario ran procurement Tender 20123 to establish an approved vendor pool. The Auditor General’s office then spent much of 2025 reviewing what happened after those systems entered provincial workflows.
The baseline findings were striking. Evaluators ran two simulated physician-patient encounters through 20 approved commercial systems. Every product demonstrated some form of clinical inaccuracy or fabricated content.
Specifically:
45% of systems hallucinated treatment plans, blood tests, or referrals that were never discussed
60% of systems documented incorrect medication names or dosages
85% of systems omitted critical aspects of mental health history in at least one scenario
Despite these findings, all 20 systems were added to the province’s approved vendor list and recommended to physicians. Several vendors also failed to complete the requested security checks, including third-party security audits and required privacy assessments.
The interesting question is not whether these systems demonstrated limitations during early evaluation. Most emerging technologies do. The issue is how health systems define physician oversight once these tools enter routine clinical practice, because buried inside the Ontario report is a much larger change in clinical workflow.
Shifting the Cognitive Burden
When provincial officials addressed the discrepancy rates identified during testing, the response reflected a now-familiar framework: AI scribes function as productivity tools, while physicians remain responsible for reviewing, correcting, and signing every note.
On the surface, this appears reasonable. For many straightforward encounters, ambient documentation systems perform remarkably well. A short, routine URI visit or an uncomplicated musculoskeletal complaint often produces a clean, organized note that feels clinically usable within seconds.
At the same time, the more clinically capable these systems become, the easier it becomes to overlook a structural problem: large language models generate probable language patterns, not clinical understanding.
That’s important because relying on physician “review and sign-off” as the primary safety mechanism does not eliminate cognitive burden. It redistributes it. The physician’s role gradually shifts from primary historian to high-speed auditor of machine-generated summaries.
The gray zone appears once conversations become emotionally layered, fragmented, or non-linear. Patients discussing depression, trauma, chronic pain, substance use, or psychiatric history rarely present information in orderly chronological form. They hesitate. They circle back. They contradict themselves. Critical details emerge indirectly and often late in the encounter.
An ambient scribe optimized for readability and SOAP-note structure must continuously determine what constitutes signal versus conversational background.
When a system fabricates a blood test or medication, the error can sometimes be obvious. An omission behaves differently. A missing psychiatric detail simply disappears from the final narrative, while the note itself still looks polished and clinically fluent. That is precisely what makes omissions difficult to detect.
A recent Medical Economics analysis highlighted omissions as one of the hardest discrepancy categories for busy clinicians to identify during review. Independent linguistic analyses suggest that omissions may account for the majority of meaningful automated documentation errors.
Functionally, this changes the nature of chart review itself.
Traditional proofreading assumes the underlying narrative originates from the clinician. Ambient documentation introduces a probabilistic intermediary that organizes, compresses, and selectively reconstructs clinical dialogue before the physician ever sees the final text. The cognitive task becomes less about correcting grammar or formatting and more about reconstructing what may have been lost.
Divergent Definitions of Success
The Ontario audit also exposes a broader industry tension:
Physician satisfaction and documentation accuracy are not measuring the same thing.
In the United States, ambient documentation platforms such as Abridge have expanded rapidly across major health systems. Published literature evaluating large-scale deployments is consistently positive. Studies involving millions of patient encounters, including a large NEJM AI evaluation, report reductions in after-hours charting time, improved clinician satisfaction, and lower burnout scores.
Those findings are real. Documentation fatigue is itself a meaningful clinical problem. Physicians experiencing less administrative exhaustion may communicate more effectively, maintain better attention during encounters, and preserve more direct patient engagement throughout the workday.
At the same time, a separate body of literature evaluating the actual text output of these systems reveals persistent reliability concerns.
A multi-center JMIR study found that 70% of AI-generated notes contained at least one error. Research published in npj Digital Medicine identified hallucinations in a relatively small percentage of generated sentences overall, yet nearly half of those hallucinations qualified as “major” errors capable of altering diagnosis or management if left uncorrected.
Both realities can exist simultaneously. Ambient scribes may substantially improve the lived experience of clinical documentation while still producing error patterns that differ fundamentally from traditional dictation systems.
Even industry leaders implicitly acknowledge this tension. In technical materials discussing “confabulation elimination,” Abridge engineers describe building secondary validation layers designed specifically to suppress plausible but unsupported generated text before it reaches the physician.
The interesting issue is not whether safeguards exist. The issue is what those safeguards reveal about the underlying architecture.
The continued need for increasingly sophisticated safeguards suggests that hallucinations and omissions are not isolated edge cases. They appear to be recurring features of probabilistic language generation. That has important implications for procurement, regulation, and workflow design.
Reframing the Physician’s Role
If ambient documentation becomes a permanent layer within modern medicine, the physician-tool relationship needs a different operational model.
Reviewing flat blocks of generated text line by line is unlikely to remain sustainable as patient encounter volume increases. The more productive these systems become, the more notes clinicians will be expected to verify under increasing time compression.
A more durable framework involves shifting physicians from passive proofreaders toward targeted auditors.
Several practical changes could support that transition:
Linked Traceability: This should become standard. Clinicians should be able to click any sentence within a generated note and immediately access the corresponding transcript segment or audio timestamp.
Local Testing: Health systems should conduct adversarial local testing before deployment, using complex psychiatric, multi-complaint, or medication-heavy encounters rather than idealized demonstrations.
Omission Checklist: Clinicians may benefit from focused omission checks targeting high-risk areas such as medication dosages, explicit exclusions, and subjective psychiatric histories.
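For readers who want to see what traceability and omission checks might look like in data terms, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the function names, the sample encounter, and the watch-term list are hypothetical and do not describe any vendor's product or the Ontario program. It also assumes the scribe exposes, for each note sentence, the transcript segments it was drawn from. The matching is a crude substring check; a real system would need proper clinical language processing.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptSegment:
    start_s: float  # audio offset in seconds (hypothetical field)
    text: str


@dataclass
class NoteSentence:
    text: str
    # Indices of transcript segments that support this sentence.
    # An empty list means the sentence has no traceable source.
    source_segments: list = field(default_factory=list)


def unsupported_sentences(note):
    """Return note sentences with no linked transcript segment."""
    return [s.text for s in note if not s.source_segments]


def possible_omissions(transcript, note, watch_terms):
    """Flag watch terms that were spoken but never appear in the note."""
    note_text = " ".join(s.text for s in note).lower()
    flagged = []
    for index, segment in enumerate(transcript):
        spoken = segment.text.lower()
        for term in watch_terms:
            if term in spoken and term not in note_text:
                flagged.append((term, index, segment.start_s))
    return flagged


# Illustrative encounter, invented for this sketch.
transcript = [
    TranscriptSegment(12.0, "My knee has been hurting for about a week."),
    TranscriptSegment(95.5, "I still take sertraline, fifty milligrams."),
    TranscriptSegment(210.0, "I was hospitalized for depression two years ago."),
]
note = [
    NoteSentence("Patient reports one week of knee pain.", [0]),
    NoteSentence("Continue sertraline 50 mg.", [1]),
    NoteSentence("Referral to orthopedics placed.", []),
]
watch_terms = ["sertraline", "hospitalized", "suicid"]

fabrication_candidates = unsupported_sentences(note)
omission_candidates = possible_omissions(transcript, note, watch_terms)

print(fabrication_candidates)  # ['Referral to orthopedics placed.']
print(omission_candidates)     # [('hospitalized', 2, 210.0)]
```

In this toy example the referral sentence is flagged because nothing in the transcript supports it, and the psychiatric hospitalization is flagged because it was spoken at 210 seconds but never reached the note. The point is the shift in task: the physician is directed to two specific places to look instead of rereading the whole note.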
None of this diminishes the legitimate value of ambient documentation. These systems can meaningfully reduce clerical burden and restore attention to the patient sitting in the room. Many physicians already describe them as indispensable.
At the same time, the Ontario audit illustrates how quickly health systems can confuse usability with reliability.
The more clinically embedded these systems become, the more medicine will need to distinguish between fluent language generation and faithful clinical representation. Those are related concepts, but they are not identical ones.


