How to cite this paper

Prescod, Paul, and Phill Tornroth. “Clean SOAP: Evaluating AI-based Structured Document Generation in a Medical Context.” Presented at Balisage: The Markup Conference 2024, Washington, DC, July 29 - August 2, 2024. In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Prescod01.

Balisage: The Markup Conference 2024
July 29 - August 2, 2024

Balisage Paper: Clean SOAP

Evaluating AI-based Structured Document Generation in a Medical Context

Paul Prescod

Staff Engineer

Elation Health

Paul Prescod has more than 20 years of software design and development experience, spanning enterprise software, document processing, healthcare, and even games. His early contributions to the field include advising on the design of XML at the World Wide Web Consortium and authoring “The XML Handbook.” During this period, he played a key role in popularizing the REST Architectural Style and advised organizations on integrating it into their web services stack.

At Elation Health, Paul is dedicated to applying AI to reduce documentation burnout for primary care physicians.

Phill Tornroth

VP, Technical Strategy

Elation Health

As one of Elation Health’s earliest and longest-tenured employees, Phill Tornroth has, in the words of Forbes Magazine, “quietly built a leading electronic health record for primary care physicians” in close collaboration with Elation’s co-founders, Kyna and Conan Fong. Phill is passionate about using disruptive technology to reduce the documentation burden for PCPs. Since his first “physician shadowing” in 2010, Phill’s approach has always been to frequently engage with doctors and express empathy through engineering and leadership through listening.

In his current role as VP, Phill leads Elation’s strategy to find safe, effective, and ethical ways to adapt AI for this purpose.

Copyright 2024, Elation Health. Republished with permission.

Abstract

Healthcare systems worldwide face a crisis: there are far too few Primary Care Physicians (PCPs), and they are severely overworked. Modern Electronic Health Records (EHR) systems, while solving critical issues with paper-based records, have significantly increased the burden of medical case documentation. An “AI Scribe” is an artificial intelligence system designed to streamline clinical documentation by automatically generating “doctor visit notes,” which all physicians must produce following a patient visit. AI Scribe systems leverage natural language processing (NLP) and machine learning to interpret and document patient encounters in a format called SOAP (Subjective, Objective, Assessment, Plan). However, implementing AI in medical documentation raises multiple risks, including hallucinations, critical detail omissions, misclassifications, narrative quality and organization issues, security and privacy concerns, bias and discrimination, and legal and ethical challenges. Automated testing of AI-generated SOAP Notes is essential but poses its own challenges. Using Large Language Models to test the output of other Large Language Models can automate and scale semantic and contextual SOAP testing. The critical questions are: what can we test successfully, and how do we avoid an infinite regress of validators?

Table of Contents

Context: The Documentation-effort Crisis in PCP Healthcare
The Risks to Quality
AI Scribe and SOAP Notes
Correctness in a Stochastic System
Evaluating Transcription and Transcript Cleanup
Automating Textual Quality Assessment
Dealing with Ambiguity
Meta-Validation
The Role of Human Expertise
Results
Related Work
Conclusions

The practice of medicine is the way you handle data and think with it. And the way you handle it determines the way you think.

— Dr. Lawrence Weed

Context: The Documentation-effort Crisis in PCP Healthcare

Healthcare systems around the world are in crisis due to a shortage of Primary Care Physicians (PCPs). This shortage has well-documented negative effects on health outcomes and systemic costs.[1]

There are many reasons for the PCP shortage, and not all of them are related to technology or can be fixed by it. However, some issues do relate to technology. One doctor characterizes the medical documentation situation as follows:

😢 “I just want to take care of people. That’s my job. I’m good at it. But Electronic Health Records Systems reduce the number of people that I can serve. Electronic medical records have changed me and my entire profession. Not in a good way. Before EHR, I saw 24 patients in a 12-hour day. Now I can only see 10 patients, and I still work 12-hour days. Six hours to spend with patients and then six hours documenting BS that has no impact on their care.”

— Guy Culpepper, MD, on LinkedIn

The Markup Technologies community has been at the forefront of promoting the use of digital record-keeping technologies, and rightly so. The old way of data management was disastrous. Previously, vital patient information was trapped in single-location paper files inaccessible to other providers. That was no way to run a life-or-death information system in the 21st century.

However, we must now reverse the negative trend in documentation effort. Our North Star should be that doctors view medical documentation management systems the way programmers view version control systems or accountants view spreadsheets: as powerful tools assisting them in getting their job done, rather than corporate and regulatory-imposed obstacles to patient care.

An “AI Scribe” refers to an artificial intelligence system designed to assist healthcare professionals by automatically generating “visit notes” in the SOAP (Subjective, Objective, Assessment, Plan) format. These systems leverage natural language processing (NLP) and machine learning to interpret and document patient encounters, thereby streamlining the process of clinical documentation. The AI Scribe “listens in” to a doctor-patient interaction (with dual consent!) and automatically generates the documentation that the doctor would usually type up on evenings and weekends.

One doctor’s transformative experience with AI Scribes was described in the Canadian news:

During the summer of 2023, Dr. Rosemary Lall, a family physician working at a bustling medical clinic in Scarborough, hit her breaking point.

“I lost all my joy of work,” Lall told Global News. “I was coming into work really dreading the day.”

The administrative burden would often take her up to two hours per day. Ontario’s Medical Association has estimated family doctors spend 19 hours per week on administrative tasks, including four hours spent writing notes or completing forms for patients.

The solution, Lall said, was new artificial intelligence note-taking apps that are designed to mimic doctor’s notes and reduce the amount of paperwork a physician would have to manually compile.

Lall said the true benefit comes after the appointment when the AI Scribe compiles the information into a so-called SOAP note, a standard requirement for family physicians as prescribed by the College of Physicians and Surgeons of Ontario.

If the physician is unhappy with the note, Lall said, they can ask the AI model to regenerate the information or add more detail to any one of the categories. While the tool has some imperfections, she said, the improvements have been noticeable over the 10 months since she began using it.

“I really feel this should be the next gold standard for all of our doctors. It decreases the cognitive load you feel at the end of the day,” she said.

Lall said that after 29 years as a family physician, last Christmas was the first celebration that wasn’t interrupted by the need to update patient notes thanks to the AI notetaking software.

“For me, this has changed things,” Lall said. “It’s made me really happy.”

This is far from a unique case. When Kaiser Permanente rolled out AI Scribe, thousands of doctors voluntarily enrolled, and one said: “I use it for every visit I can and it is making my notes more concise and my visits better. I know I’m gushing, but this has been the biggest game changer for me.”

Dr. Lall’s work on notes every Christmas is an extreme case of what many physicians call “Pajama time”.[2] This is the time that they spend after their kids are in bed where they try to complete the paperwork that they could not fit into the normal work day. Beyond the distressing implications from the point of view of work-life balance, what does it mean for health delivery quality if doctors are writing up their notes from memory, hours after the visit?

Newspaper article: Physicians report 15 hours of 'pajama time'. While 93% of physicians said that they feel burned out in athenahealth's third Physician Sentiment Survey, conducted by the Harris Poll, 83% said that AI had the potential to reduce administrative burdens.

Although most readers likely have deep compassion for the challenging situation of these doctors, it is also appropriate to be concerned about the substantial risks posed by a naive implementation of AI into the medical documentation generation system.

The Risks to Quality

As described earlier, AI Scribe is a rapidly evolving product category consisting of tools that are authorized to listen in on a patient/doctor conversation and generate a “Visit Note” in the SOAP format.

SOAP is a loose format for Physician Visit Notes that was developed by Dr. Lawrence Weed in the 1960s at the University of Vermont. Lawrence Weed explained his zeal for organized records like this:

“We really aren’t taking care of records — we’re taking care of people. And we’re trying to get across the idea that this record cannot be separated from the caring of that patient. This is not something [distinct]: the practice of medicine over here and the record over here. This is the practice of medicine. It’s intertwined with it. It determines what you do in the long run. You’re a victim of it or you’re a triumph because of it. The human mind simply cannot carry all the information about all the patients in the practice without error. And so the record becomes part of your practice.”

While AI Scribes offer significant advantages in the automation of SOAP note generation, several risks are associated with potential mistakes they could make. These risks can impact patient care, data integrity, legal compliance, and reimbursement. Key risks include:

  1. Incorrect Data Entry:

    • Misinterpretation of Information: The AI Scribe might misunderstand or misinterpret the patient’s statements or clinical data, leading to inaccurate documentation.

    • Hallucination[3]: Generative AIs are prone to creating and injecting non-factual information to paper over uncertainty, for example when they predict the wrong next word and then continue from it.

  2. Incomplete Documentation:

    • Omission of Critical Information: The AI might fail to capture important details such as key symptoms, relevant medical history, or specific findings from the physical examination.

    • Lack of Context: Missing contextual information can lead to incomplete or ambiguous notes, which can affect clinical decision-making.

  3. Bias and Discrimination:

    • Algorithmic Bias: If the AI Scribe's algorithms are biased, they might systematically produce notes that reflect or reinforce existing healthcare disparities.

    • Inconsistent Documentation: AI systems might perform inconsistently across different patient populations or clinical settings, leading to unequal quality of care.

  4. Narrative and Presentation Quality Issues:

    • Logical Narrative: The AI might generate notes that lack a coherent narrative, making it difficult for healthcare providers to follow the patient's story or understand the clinical context.

    • Logical Order: Information might be presented in a disorganized manner, disrupting the logical flow of the SOAP note and complicating the interpretation of clinical data.

    • Voice and Tone: The output might not sound natural or might not follow medical best practices for clinical writing.

  5. Technical Issues:

    • Software Bugs: Technical glitches or software bugs can lead to erroneous data entry or loss of information.

    • Integration Problems: Issues with integrating the AI Scribe with other Electronic Health Record (EHR) systems can result in data mismatches or duplications.

  6. Privacy and Security Concerns:

    • Data Breaches: As with any digital system, there is a risk of data breaches which could expose sensitive patient information.

    • Unauthorized Access: Improper access controls could allow unauthorized individuals to view or alter patient records.

  7. Legal and Ethical Issues:

    • Liability: Incorrect notes could result in legal action against healthcare providers or institutions if the errors lead to patient harm.

    • Ethical Concerns: There may be ethical issues around the reliance on AI for clinical documentation, especially if it impacts the quality of patient-provider interactions.

This paper is focused on the first four issues, although Elation Health takes all seven very seriously. Contracts prevent us from publishing evaluation results for specific Note Generation models, but this paper describes our evaluation mechanism for these factors.

AI Scribe and SOAP Notes

Having established the stakes, let us turn to the problem. Suppose we are given an audio transcript like this as input text (from the ACI-Bench corpus):

hey brandon you know glad to see you in here today i see on your chart that you're experiencing some neck pain could you tell me a bit about what happened
yeah i was in a car crash
wow okay when was that
well which car crash
okay so multiple car crashes alright so let's see if we can how many let's start
my therapist said well my well actually my mother said i should go see the therapist and the therapist said i should see the lawyer but my neck's hurting
okay so i'm glad that you know you're getting some advice and so let's let's talk about this neck pain how many car crashes have we had recently
well the ones that are my fault or all of them
all of them
i was fine after the second crash although i was in therapy for a few months and then after the third crash i had surgery but i was fine until this crash
okay the most recent crash when was that
that's when i was coming home from the pain clinic because my neck hurt and my back hurt but that was in february
okay alright so we had a car crash in february
what year it was which february it was
okay so let's let's try with this one see what happens hopefully you remember i need you to start writing down these car crashes that this is becoming a thing but you know it's okay so let's let's say maybe you had a
you're not judging me are you
no there's no judgment here whatsoever i want to make sure that i'm giving you the best advice possible and in order to do that i need the most information that you can provide me makes sense
yes
alright so we're gon na say hope maybe that you had a car crash and we can verify this in february of this year and you've been experiencing some neck pain since then right
yes
okay alright on a scale of one to ten what ten is your arm is being cut off by a chainsaw severe how bad is your pain
twelve
okay terrible pain now i know you mentioned you had previous car crashes and you've been to therapy has anyone prescribed you any medication it's you said you went to a pain clinic yes
well they had prescribed it recently i was i was on fentanyl
oh
i haven't gotten a prescription for several weeks
okay alright and so we will be able to check on that when you take your medication so before you take your medication rather like are you able to move like are you experiencing any stiffness
…

We would like an output like this:

CHIEF COMPLAINT
    Neck pain.

HISTORY OF PRESENT ILLNESS
    Brandon Green is a 46-year-old male who presents to the clinic today for the
    evaluation of neck pain. His pain began when he was involved in a motor vehicle
    accident in 02/2022 when he was on his way home from a pain clinic. The patient
    notes that he has been in 4 motor vehicle accidents; however, he notes that he was
    fine after the first two accidents, but the third motor vehicle accident is when his
    neck and back pain began. He states that he was in therapy following the second
    accident and had surgery after his third accident. The patient was seen at a pain
    clinic secondary to neck and back pain. He was prescribed fentanyl; however, he has
    not received a prescription for several weeks. Today, he reports that his pain is a
    12 out of 10. He describes his pain as sharp and incapacitating with stiffness and
    pain. The patient also reports headaches, occasional dizziness. He denies any recent
    visual disturbances. He also reports numbness in his left arm and right leg. The
    patient also reports spasms throughout his body. He states that he has been
    experiencing fatigue since the accident. He notes that he is unable to work with
    this much pain.

REVIEW OF SYSTEMS
    Constitutional: Reports fatigue.
    Eyes: Denies any recent visual disturbances.
    Musculoskeletal: Reports neck and back pain, and occasional swelling and bruising
        of the neck.
    Neurological: Reports headaches, dizziness, spasms, and numbness.

PHYSICAL EXAM
    SKIN: No lacerations.
    MSK: Examination of the cervical spine: Pain on palpation on the bony process and
        muscle. Moderate ROM. No bruising or edema noted.

RESULTS
    X-rays of the neck reveal no fractures.

ASSESSMENT
    Neck sprain.

PLAN
    After reviewing the patient's examination and radiographic findings today, I have
    had a lengthy discussion with the patient in regards to his current symptoms. I have
    explained to him that his x-rays did not reveal any signs of a fracture. I
    recommended an MRI for further evaluation. I have also prescribed the patient
    Robaxin 1500 mg every 6 to 8 hours to treat his pain. I have also advised him to
    utilize ice, a heating pad, IcyHot, or Biofreeze on his neck as needed. I have also
    provided him with a home exercise program to work on his range of motion. I advised
    the patient that he will not be able to work until we have the MRI results.

INSTRUCTIONS
    The patient will follow up with me after his MRI for results.

Or…depending on the physician’s tastes, perhaps like this:

CHIEF COMPLAINT
    Patient is experiencing neck pain.

HISTORY OF PRESENT ILLNESS
    * Motor Vehicle Accident (MVA) - February 2023 (V43.52XA)
        - Most recent car crash was in February of this year.
        - Patient reports fatigue since the accident.
    * Pain (R52)
        - Patient rates pain as 12 out of 10.
        - Patient experiences stiffness and sharp, incapacitating pain.
        - Patient reports it really hurts to bend neck forward and backward.
    * Headaches (R51)
        - Patient reports headaches, occasional dizziness, and numbness in the left arm
        and right leg.
    * Muscle Spasms (M62.838)
        - Patient reports muscle spasms.

ASSESSMENT AND PLAN
    * Neck Sprain (ICD10: S13.4XXA)
        - Doctor's assessment is neck sprain.
        - Doctor orders an MRI for a more thorough image.
        - Doctor prescribes Robaxin, 1500 mg every 6-8 hours.
        - Doctor suggests using a heat pad or biofreeze if Robaxin is not effective.
        - Doctor plans to refer patient to physical therapy and pain medicine for
        possible local injections.

PAST HISTORY
    Patient was in multiple car crashes.

CURRENT MEDICATIONS

    Patient was previously prescribed fentanyl but has not had a prescription for
        several weeks.

PHYSICAL EXAM
    Neck: Pain on palpation both on the bony process and on the muscle. Patient has
        moderate range of movement in the neck.
    Skin: No bruising, swelling, or lacerations observed during the exam.

DATA REVIEWED
    X-ray shows no fracture.

PATIENT INSTRUCTIONS
    * Doctor advises using ice for bruising and swelling.
    * Doctor provides exercises to improve neck movement.
    * Patient will be given time off work until MRI results are available.

Can an automated system achieve such a task? And can we be confident enough in it to trust it with patient transcripts?

Correctness in a Stochastic System

For many reasons, the behaviour of AIs is not as predictable as traditional imperative software systems. Stochasticity (randomness) is built into their design. They are black boxes with billions or trillions of parameters. They are sensitive to floating point rounding errors. They utilize randomness in their token selection. They are constantly evolving as corporations upgrade or streamline their models.

As in other parts of computing, when a system grows to the point that we can rely on neither our intuition nor formal methods, we must use testing.

In traditional software, there are two types of testing: manual and automated. Similarly, in testing the output of generative AIs, there are two types of testing. “Vibes checking” is the lowest standard, where humans eyeball outputs to validate that they “look right”. Sometimes in software development this is called “smoke testing”.

More formalized types of human-in-the-loop testing are good for identifying certain kinds of problems, just as structured Quality Assurance testing is appropriate with other software.

At some point, however, you may decide that you want to check 1000 generated SOAP notes against 1000 transcripts to validate that not a single incorrect assertion (“hallucination”) was introduced. And you might want to do it on a daily, weekly or monthly basis.

At that point, you are either going to need a very large number of humans, who have an enormous amount of time and a gigantic budget, or you will need an automated solution.

Evaluating Transcription and Transcript Cleanup

Most of these systems operate in two steps: first audio is transcribed into textual transcripts, and then transcripts are converted into SOAP Notes. One concern we had was that the first part of this process might make transcription errors and that the second step would compound the errors.

Instead, we generally found the opposite: the second step would usually correct most errors from the first step. This is because the note generation process has more time and more context to work with than the real-time transcription process.

We also found that it is possible to inject an explicit “cleanup this transcript” step between the audio transcription and the note generation, to improve the output quality even more.

One particular question we had was about bias and unfairness in transcription. We hypothesized that the system might work less effectively for doctors with strong accents or speech impediments. Using co-workers with accents, we did an experiment to determine whether this was a serious issue.

Early indications confirm that the transcription may exhibit bias in the form of higher error rates for some voices, but that a transcript cleanup step with a general-purpose Large Language Model not only corrects the majority of errors but also significantly narrows the variability in error rates across voice samples.

Table I

                      Transcription Accuracy (No Correction)    Transcription Accuracy after LLM Cleanup
Sample                WER     MER     WIL     WIP     CER       WER     MER     WIL     WIP     CER
Voice 1               0.16    0.16    0.28    0.72    0.04      0.08    0.08    0.13    0.87    0.04
Voice 2               0.17    0.17    0.29    0.71    0.04      0.06    0.06    0.10    0.90    0.03
Voice 3               0.18    0.17    0.29    0.71    0.05      0.07    0.07    0.12    0.88    0.04
Voice 4               0.22    0.21    0.33    0.67    0.07      0.08    0.08    0.14    0.86    0.05
Std Dev               0.026   0.022   0.022   0.022   0.014     0.010   0.010   0.017   0.017   0.008

Change after cleanup  WER     MER     WIL     WIP     CER
% Improvement (Avg)   90.03%  90.48%  85.56%  81.66%  93.00%
% Reduction Std Dev   63.60%  56.82%  22.98%  22.98%  42.26%

This table compares transcription accuracy metrics before and after applying LLM (Large Language Model) cleanup. The table includes multiple metrics: Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL), Word Information Preserved (WIP), and Character Error Rate (CER). Lower scores are better for every metric other than WIP.
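To make two of these metrics concrete, the following is a minimal sketch of WER and CER computed from edit distance. It is not the tooling used to produce Table I, which the paper does not name. The function names and the example sentences are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one misrecognized word out of eight.
reference = "i was in a car crash in february"
hypothesis = "i was in a car crush in february"
wer = word_error_rate(reference, hypothesis)
cer = char_error_rate(reference, hypothesis)
print(wer, cer)
```

One wrong word out of eight gives a WER of 0.125, while the same error costs only one character out of 32, which is why CER values in the table are much smaller than WER values. WIL is the complement of WIP, which is why those two columns always sum to 1.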

As the uncorrected columns show, Voice 1 (0.16), a speaker with a standard American accent, had a noticeably lower Word Error Rate in the initial transcription than Voice 4 (0.22), a participant who learned English as a Second Language.

After LLM cleanup, however, both had a Word Error Rate of 0.08, and their information preservation rates were nearly identical. In this case, AI is a technology which reduces rather than exacerbates inequality.
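The narrowing of variability can be checked from the WER columns of Table I. This is a worked sketch that assumes the Std Dev row is a sample standard deviation over the four voices; the paper does not state which estimator was used.

```python
from statistics import stdev

# WER for Voices 1-4, copied from Table I.
wer_before = [0.16, 0.17, 0.18, 0.22]  # no correction
wer_after = [0.08, 0.06, 0.07, 0.08]   # after LLM cleanup

spread_before = stdev(wer_before)  # sample standard deviation
spread_after = stdev(wer_after)
reduction_pct = 100 * (1 - spread_after / spread_before)

print(f"{spread_before:.3f} {spread_after:.3f} {reduction_pct:.1f}%")
```

Under that assumption the computation gives 0.026 and 0.010, a reduction of roughly 63.6%, in line with the WER entries of the Std Dev and % Reduction Std Dev rows.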

Although our sample size is small, we have consistently seen (in formal and informal tests) that most errors are minor and inconsequential. We have highlighted a couple of consequential ones in the diagram, but the vast majority are extremely minor.

We will watch these metrics closely as we scale up the test corpus volume.

Once we are comfortable with the text transcript, we can move on to evaluating the Note Generation aspect.

Automating Textual Quality Assessment

Our challenge in evaluating Note Generation is that the AI is not the only stochastic factor in the system. Our gold standard notes are written (or at least finalized) by humans, and there is a large subjective factor in their decisions about how to structure the information. Both the AI and the human reference documenter have considerable flexibility in the order in which they state facts and even in which section of the SOAP note they place them. This makes it difficult to use simplistic text-similarity metrics like BLEU and ROUGE, which are common in the NLP field for translation and summarization tasks.

A bigger problem with these tools, however, is that they are not designed to describe – in English – what is wrong with the output. They are designed to provide a score, but not a reason for the score. This is a problem because we want to be able to provide feedback to the AI developers about what is wrong with the output.

Instead of using these sorts of text-similarity metrics, what we chose to do is automate – and scale – tests like:

  • “Is every statement in the SOAP note grounded in the transcript? If not, which are extra?”

  • “Is every relevant fact in the transcript reflected in the SOAP note? If not, which are missing?”

  • “Is every fact properly categorized? Does it show up in the right part of the note? Which are not correct?”

In general, traditional imperative code is not sophisticated enough to do this, so we need to rely on NLP tools, and Large Language Models in particular. This raises the question of meta-evaluation: how do we trust the evaluator itself? This is discussed in the section Meta-Validation.

Our approach to scoring is strongly influenced by and partially implemented with the open source project called DeepEval.[4]

For example, here is a prompt that can be used to verify if a certain fact is grounded in the transcript:

Given the evaluation steps, return a JSON with two keys:
1) a score key ranging from 0 - 10, with 10 being that it follows the
criteria outlined in the steps and 0 being that it does not, and
2) a reason key, a reason for the given score, but DO NOT QUOTE THE
SCORE in your reason.
Please mention specific information from Input and Transcript in your reason, but
be very concise with it!

Evaluation Steps:
1. Extract the key fact from the input.
2. Search the transcript for any mention of the key fact from the input.
3. If the key fact is found anywhere in the transcript, assign a high score.
4. If the key fact is not found in the transcript, assign a low score.

Input:
Neck Pain: Patient cannot swivel neck side to side.
Transcript: hey brandon you know glad to see you in here today… the full
transcript goes here … okay okay any other questions? Not right now. Alright...

Notice how the prompt uses both a score for quantitative and programmatic purposes and a reason for transparency and human review. Our reports make both parts accessible to human reviewers and downstream processes.
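A minimal sketch of how such a score-plus-reason response could be consumed programmatically follows. It is not the DeepEval implementation. The `ask_llm` callable, the `Verdict` type, the pass threshold of 5.0, and the stubbed response are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    fact: str
    score: float   # 0-10, as requested by the prompt
    reason: str    # kept for human review
    passed: bool


def check_grounding(fact: str, transcript: str, instructions: str,
                    ask_llm: Callable[[str], str],
                    threshold: float = 5.0) -> Verdict:
    """Ask an evaluator LLM whether one fact is grounded in the transcript.

    `ask_llm` is a hypothetical function that sends a prompt to some model
    and returns its raw text response; `threshold` is an assumed cut-off.
    """
    prompt = f"{instructions}\n\nInput:\n{fact}\nTranscript: {transcript}"
    data = json.loads(ask_llm(prompt))
    score = float(data["score"])
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return Verdict(fact, score, str(data["reason"]), score >= threshold)


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call; the response is canned."""
    return json.dumps({"score": 1, "reason": "Canned reason for illustration."})


verdict = check_grounding(
    fact="Neck Pain: Patient cannot swivel neck side to side.",
    transcript="hey brandon you know glad to see you in here today ...",
    instructions="Evaluation Steps: ...",
    ask_llm=fake_llm,
)
print(verdict.passed, verdict.score, verdict.reason)
```

Keeping the reason alongside the numeric score means the same record can drive a pass/fail gate and appear in a report for human reviewers.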

In addition to using the score, DeepEval does some wizardry with the log probabilities of the tokens in the output to judge how confident the Evaluator LLM was in its scoring.
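The paper does not spell out this step. One common approach, described in the G-Eval literature, is to replace the single emitted score with a probability-weighted average over the candidate score tokens. The sketch below assumes that approach and assumes the evaluator API exposes log probabilities for the score position; the function name and the example values are hypothetical.

```python
import math
from typing import Mapping


def weighted_score(top_logprobs: Mapping[str, float]) -> float:
    """Expected score given log probabilities of candidate score tokens.

    Tokens that are not integers in the range 0-10 are ignored, and the
    remaining probabilities are renormalized.
    """
    probs = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit() and 0 <= int(token) <= 10:
            probs[int(token)] = probs.get(int(token), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no numeric score tokens among the candidates")
    return sum(score * p for score, p in probs.items()) / total


# Hypothetical candidates for the score token: mostly 9, sometimes 8 or 2.
candidates = {
    "9": math.log(0.60),
    "8": math.log(0.30),
    "2": math.log(0.05),
    "high": math.log(0.05),  # non-numeric token, ignored
}
expected = weighted_score(candidates)
print(round(expected, 2))
```

A confident evaluator concentrates its probability on one score, so the weighted value stays close to the emitted score; a hesitant one is pulled toward the alternatives it considered.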

The primary tests that we currently do are:

  • Hallucination Tests:

    • Are there assertions in the Note which are not grounded in the transcript? The example above is a Hallucination test.

    • We verify each assertion in the Note against the transcript one by one, although we may batch them for efficiency in the future.

  • Missing Facts Test:

    • Are facts extracted from the gold-standard human-authored note missing from the AI generated note?

    • We verify each fact one by one.

  • Categorization Consistency Tests:

    • Is each fact appropriate for the section of the note in which it turns up?

  • Non-redundancy Score:

    • Are the facts in each section unique and non-duplicative or does the Note Generator just say the same thing several times?

  • Latency:

    • How long did it take to generate the note using the provided tool? Sometimes there is a time/quality trade-off, but in some cases, the extra time may have no value.

The output is a two-dimensional matrix with different Note Generation System versions and configurations as columns and criteria like these as rows. For example, a system might support different SOAP Note styles, and we would like to compare the generation fidelity across all of them.
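A minimal sketch of how per-test scores could be rolled up into such a matrix follows. The configuration names, criteria labels, and scores are invented placeholders, not results, and averaging per cell is an assumed aggregation.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records: (configuration, criterion, score between 0 and 1).
# These values are invented for illustration; they are not results.
records = [
    ("config-a", "hallucination", 1.0),
    ("config-a", "hallucination", 0.5),
    ("config-a", "missing-facts", 1.0),
    ("config-b", "hallucination", 1.0),
    ("config-b", "missing-facts", 0.0),
]


def build_matrix(records):
    """Average the scores for each (criterion, configuration) cell."""
    cells = defaultdict(list)
    for config, criterion, score in records:
        cells[(criterion, config)].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


def render(matrix):
    """Lay out criteria as rows and configurations as columns."""
    criteria = sorted({criterion for criterion, _ in matrix})
    configs = sorted({config for _, config in matrix})
    lines = ["criterion".ljust(16) + "".join(c.ljust(12) for c in configs)]
    for criterion in criteria:
        row = "".join(
            f"{matrix.get((criterion, c), float('nan')):.2f}".ljust(12)
            for c in configs
        )
        lines.append(criterion.ljust(16) + row)
    return "\n".join(lines)


matrix = build_matrix(records)
print(render(matrix))
```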

None of our testing Transcripts or SOAP notes are from real patients. They are either authored by Elation Health or part of the ACI-Bench Dataset which was designed specifically for doing these kinds of evaluations.

Dealing with Ambiguity

There are subtle, usually subjective, cases where humans might disagree with the evaluator.

For example, the transcript might say:

okay so … let's wait for the mri result… and we'll see what the mri says about
what whether or not we can get you like true local injections

The Note Generator might output:

Follow up after MRI results.

The Evaluator might say that this is not explicitly stated:

Hallucination: Data Reviewed: Follow-up appointment after MRI results are available ❌
Reason: The actual output does not mention the follow-up appointment after MRI results are available.
Score: 6.0%

It is largely correct: the intent to follow up after the MRI is implied, not stated explicitly. The Generator and Evaluator are both “trying” to be helpful, but with different “goals”. The Generator “wants”[5] to find something to add to the Follow-Up section, even if it is implied rather than stated. The Evaluator “wants” to find facts that are not explicitly grounded.

In a case like this, we would adjudicate that the evaluator is working fine, but that the evaluation dataset needs to be tweaked to allow this particular fact to be accepted without criticism. We are developing a sort of “exceptions” mechanism for this.
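The paper does not describe the design of this mechanism. One simple possibility, sketched here under that caveat, is a human-maintained allowlist keyed by transcript, test, and fact, which overrides a failing verdict. All names and the matching rule are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Finding:
    transcript_id: str
    test: str      # e.g. "hallucination"
    fact: str
    passed: bool


def normalize(fact: str) -> str:
    """Case-fold and collapse whitespace so trivial variations still match."""
    return " ".join(fact.lower().split())


def apply_exceptions(findings: Iterable[Finding],
                     accepted: Set[Tuple[str, str, str]]) -> List[Finding]:
    """Mark failing findings as passed when a human has accepted them."""
    adjusted = []
    for finding in findings:
        key = (finding.transcript_id, finding.test, normalize(finding.fact))
        if not finding.passed and key in accepted:
            finding = replace(finding, passed=True)
        adjusted.append(finding)
    return adjusted


# Human-adjudicated exception (hypothetical identifiers).
accepted = {("demo-transcript", "hallucination", "follow up after mri results.")}

findings = [
    Finding("demo-transcript", "hallucination", "Follow up after MRI results.", False),
    Finding("demo-transcript", "hallucination", "Follow-up in four weeks.", False),
]
adjusted = apply_exceptions(findings, accepted)
print([f.passed for f in adjusted])
```

Exact matching after normalization is deliberately conservative: a reworded fact falls through and is flagged again, which sends it back to a human rather than silently accepting it.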

This is one of the steps in our process which is the least scalable, at the moment. This exception labeling does require human annotation. Perhaps by this time next year we will have trained a third AI to adjudicate these corner cases!

One interesting note on the topic of follow-ups: one of the few hallucinations that we do see persistently is “Follow-up in four weeks”, and we need to watch for it and ensure that it does not erroneously slip into real notes. Sometimes the follow-up is implied, but very occasionally it really is hallucinated.

Meta-Validation

Using Large Language Models to check the output of other Large Language Models does open up a different can of worms, however. If we cannot trust the LLMs underlying the Note Generation tools, what makes us feel that we can trust the tester LLMs? Quis custodiet ipsos custodes? How do we avoid an infinite regress of validators?

What we do to evaluate the evaluators is create negative examples of tests that should fail, due to hallucinations, missed facts and other similar factors. By running tests of what happens if we inject false facts into SOAP notes (or delete facts from Notes), we can evaluate whether the Evaluator would properly pick up on those facts.
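A minimal sketch of this kind of negative testing follows. The evaluator is any callable that says whether a fact is grounded; the keyword-matching stand-in, the sample transcript, and the injected facts are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Sequence

# (fact, transcript) -> True if the evaluator considers the fact grounded.
Evaluator = Callable[[str, str], bool]


def detection_rate(evaluator: Evaluator, transcript: str,
                   injected_facts: Sequence[str]) -> float:
    """Share of deliberately false facts that the evaluator rejects."""
    caught = sum(1 for fact in injected_facts if not evaluator(fact, transcript))
    return caught / len(injected_facts)


def false_alarm_rate(evaluator: Evaluator, transcript: str,
                     true_facts: Sequence[str]) -> float:
    """Share of genuinely grounded facts that the evaluator wrongly rejects."""
    flagged = sum(1 for fact in true_facts if not evaluator(fact, transcript))
    return flagged / len(true_facts)


def keyword_evaluator(fact: str, transcript: str) -> bool:
    """Toy stand-in for an evaluator LLM: every word must appear verbatim."""
    words = set(transcript.lower().split())
    return all(word in words for word in fact.lower().split())


transcript = "i was in a car crash in february and my neck hurts"
true_facts = ["car crash in february", "neck hurts"]
injected_facts = ["broken arm", "crash in march"]  # deliberately false

caught = detection_rate(keyword_evaluator, transcript, injected_facts)
alarms = false_alarm_rate(keyword_evaluator, transcript, true_facts)
print(caught, alarms)
```

Tracking both rates matters: an evaluator that rejects everything would catch every injected fact while being useless in practice.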

We find that Evaluator LLMs are more reliable than SOAP Note Generation Systems for the simple reason that they are doing much simpler work with much more focus. They are generally focused on one, or a small number, of facts at a time. This allows us to build confidence in the Evaluators faster than in the Note Generation Systems themselves.
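The negative-example approach described above can be sketched as a small harness. Everything here is a hedged illustration: `toy_evaluator` is a trivial word-overlap stand-in for an LLM-based Evaluator, and `inject_fact` and `detection_rate` are hypothetical helper names. The sketch covers only injected false facts; deleting facts to test omission detection would follow the same shape.

```python
from typing import Callable, List

# An evaluator takes (transcript, note) and returns the note lines it flags
# as hallucinated. In practice this would wrap an LLM call.
Evaluator = Callable[[str, str], List[str]]


def toy_evaluator(transcript: str, note: str) -> List[str]:
    """Stand-in evaluator: flag note lines that share no words with the transcript."""
    vocab = set(transcript.lower().split())
    return [
        line for line in note.splitlines()
        if line and not (set(line.lower().split()) & vocab)
    ]


def inject_fact(note_lines: List[str], false_fact: str) -> List[str]:
    """Return a mutated copy of the note with one ungrounded statement appended."""
    return note_lines + [false_fact]


def detection_rate(evaluator: Evaluator, transcript: str,
                   note_lines: List[str], false_facts: List[str]) -> float:
    """Fraction of injected false facts that the evaluator flags."""
    caught = 0
    for fact in false_facts:
        mutated = inject_fact(note_lines, fact)
        if fact in evaluator(transcript, "\n".join(mutated)):
            caught += 1
    return caught / len(false_facts)


# Illustrative data only.
transcript = "my knee hurts when i climb stairs so we will order an mri"
note_lines = ["Knee pain when climbing stairs", "MRI ordered"]
false_facts = ["Patient reports chest tightness", "Follow-up in four weeks"]

baseline_flags = toy_evaluator(transcript, "\n".join(note_lines))
rate = detection_rate(toy_evaluator, transcript, note_lines, false_facts)
```

Running the evaluator on the unmodified note as well (`baseline_flags`) matters: an evaluator that flags everything would score a perfect detection rate while being useless.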

The Role of Human Expertise

Evaluation cannot be left entirely to software, no matter how many billions of parameters it has. Human expert evaluators have several important jobs to do:

  1. Check for things that we do not yet have an automated check for. For example, our evaluators have pointed out that some Note Generation models use very repetitive sentence structures, which makes the notes sound robotic. We could ask an LLM to grade this, but it has not become urgent enough yet.

  2. Point out any true errors that the evaluator made, so that we can investigate them.

  3. Point out ambiguous false positives and false negatives from the evaluators, so that we can make exceptions for them.

  4. Sanity check that all of the technology works!

Results

The overall goal of this paper is not to review Note Generation software. Our contracts with our vendors do not permit us to publish reviews of them. We can, however, provide a few high-level results:

  • Hallucinations and missing facts are common enough that a Physician must review every note before signing it (which is, after all, the point of “signing” it).

  • Hallucinations and missing facts are infrequent enough to be a minor inconvenience. As reported publicly, this technology is a major time saver.

  • Getting every fact into the right section of the Note is a subtle challenge for today’s Note Generators. For example, if a patient says: “My A1C is 8.5”, this is supposed to go into the subjective, rather than objective, part of the SOAP Note, because the doctor did not measure it themselves. That a measurement like that could be considered “subjective” is one of the subtle concepts to convey to an LLM.[6]

Related Work

Academia has also started the process of evaluating this technology. “Aci-bench” is a “Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation” which is to say that it is a dataset and proposed methodology for validating AI scribes.

The Aci-bench team used a quite different evaluation approach. They used a blend of several standard NLP techniques:

Specifically, we measure at least one lexical n-gram metric, an embedding-based similarity metric, a learned metric, and finally an information extraction metric. We evaluate the note generation performance both in the full note and in each division.

To consolidate the various evaluation metrics, we first take the average of the three ROUGE submetrics as ROUGE, and the average of ROUGE, BERTScore, BLEURT, and MEDCON scores as the final evaluation score.
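Read literally, the consolidation rule quoted above is a two-level average. The sketch below shows only that arithmetic; the function name is hypothetical, the input numbers are invented for illustration and are not results from Aci-bench or from any system, and computing the underlying metrics themselves is out of scope.

```python
from statistics import mean
from typing import Sequence


def consolidated_score(rouge_submetrics: Sequence[float], bertscore: float,
                       bleurt: float, medcon: float) -> float:
    """Average three ROUGE submetrics into ROUGE, then average the four metric scores."""
    if len(rouge_submetrics) != 3:
        raise ValueError("expected exactly three ROUGE submetrics")
    rouge = mean(rouge_submetrics)
    return mean([rouge, bertscore, bleurt, medcon])


# Made-up inputs, chosen so the arithmetic is easy to follow by hand:
# ROUGE = (0.5 + 0.25 + 0.75) / 3 = 0.5
# final = (0.5 + 0.75 + 0.25 + 0.5) / 4 = 0.5
example = consolidated_score([0.5, 0.25, 0.75], bertscore=0.75, bleurt=0.25, medcon=0.5)
```

The result is a single number per note, with no indication of which statement in the note raised or lowered it.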

While these metrics have the virtue of being industry standard and extremely scalable, they trade off interpretability and transparency. With our own internal system, most failures are tied to a specific statement which is marked as a hallucination, an omission, a poorly worded section, and so on. The Evaluator even explains its reasoning.

This transparency is important to us as a product vendor. We need to be able to report specific failures to our vendors or development teams.

Conclusions

AI Scribing is a large and growing industry. It seems inevitable that it will be the standard form of data entry for many medical specialties including Primary Care. Every citizen has a stake in its quality and reliability.

AI Scribing offered in a Certified EHR Technology (CEHRT) most likely falls into the regulatory category that the Office of the National Coordinator for Health IT (ONC) calls Predictive Decision Support Interventions: “technology that supports decision-making based on algorithms or models that derive relationships from training data and then produces an output that results in prediction, classification, recommendation, evaluation, or analysis”. As such, it is subject to certain documentation and transparency requirements, but not to any task-specific quality standard. Regulators have not established discrete requirements because the technology's functions are evolving rapidly. The requirements for transparency do include validity, evaluation, and regular testing, but the regulators do not define what constitutes sufficient testing or a passing score.

As AI in health care emerged, the US Department of Health and Human Services (HHS) established baseline requirements that cast a wide net, giving CEHRT vendors the flexibility to build AI tools under transparency requirements that apply across multiple technology use cases. The regulations have already begun to be refined (the second proposed update to the HTI regulations has been released), and the ONC and HHS have stated their intention to keep refining them as the technology matures and becomes better understood.

In the meantime, we as vendors must proactively evaluate our own technologies to minimize patient risk and clinician effort. In fact, we will always want to exceed, rather than just meet, whatever standards are imposed by regulation. To meet this challenge, Elation Health has made large investments in both automated and human evaluation, and plans to continue doing so.

Using this mix of automated and human evaluation we have built a feedback loop wherein the automated tests progressively approach parity with human feedback, which allows our experts to focus on increasingly strict definitions of SOAP note quality.



[3] The term “hallucination” is, of course, a form of anthropomorphization. Talking about AIs without using any anthropomorphizing terminology is extremely tedious, so we ask the reader for some leeway.

[4] https://github.com/confident-ai/deepeval “DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which uses LLMs and various other NLP models that runs locally on your machine for evaluation.”

[5] Anthropomorphization!

[6] Although, in fairness, when asked, GPT-4 says: “If a patient says that their A1C is 8.5, this information would go into the Subjective part of a note. The Subjective section contains information reported by the patient, including their personal health history, symptoms, and self-reported measurements or lab results.” The challenge is that it may not always remember that in the process of generating a real note.