
In a viewpoint published in JAMA, Katherine E. Goodman examines the risks of using Large Language Models (LLMs) to summarize Electronic Health Record (EHR) data and urges comprehensive standards and FDA oversight. Goodman warns that LLM summarization tools can harm patients yet may not fall under current FDA medical device oversight. Because LLM-generated clinical summaries vary from run to run, they can shape how clinicians interpret a chart, and sycophancy bias and subtle errors could lead to faulty decision-making. The solution, according to Goodman, lies in establishing industry-wide standards, rigorous clinical testing, and FDA regulatory clarifications. By addressing these concerns, the healthcare industry can responsibly harness the transformative benefits of generative AI in EHR data summarization.
The promise of generative AI for summarizing EHR data comes with inherent risks, as the viewpoint highlights. Despite their potential benefits, LLMs used for clinical data summarization lack clear coverage under existing FDA safeguards. The inefficiencies of current EHR systems contribute to clinician burnout, and LLM-generated summaries could alleviate some of that burden. Goodman nonetheless underscores the need for comprehensive standards and FDA oversight, emphasizing the potential for patient harm and the influence that varied summaries can have on clinical decision-making.
Current Landscape and Challenges
Goodman highlights the limitations of current EHR systems, which were originally designed for documentation and billing: relevant information is hard to find, and lengthy, cut-and-pasted content is common. These shortcomings contribute to physician burnout and clinical errors. LLM-generated summaries, by contrast, could transform how clinicians interact with the EHR, improving efficiency and reducing clinician burden.
However, the author points out a critical gap in regulatory coverage: despite their transformative potential in summarization tasks, LLMs do not fit neatly into the FDA's existing framework for medical device oversight. This is particularly concerning given the influence these tools can have on clinical decision-making.
Challenges in Standardization
One of the key challenges highlighted by Goodman is the lack of comprehensive standards for LLM-generated clinical summaries. While it is generally acknowledged that summaries should be consistently accurate and concise, the absence of specific guidelines leaves room for variations in summary length, organization, and tone. These variations can influence how clinicians interpret information and make subsequent decisions.
To illustrate this variability, Goodman conducted an experiment using ChatGPT-4 to summarize deidentified clinical documents. Identical prompts produced varied summaries that listed patient conditions differently and emphasized different elements of the clinical history. These discrepancies, Goodman argues, can have real clinical implications and warrant further investigation through clinical studies.
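To make the idea concrete, the sketch below shows one rough way such run-to-run variability could be probed. It is not the procedure used in the viewpoint: it assumes the OpenAI Python SDK, a placeholder model name, a hypothetical deidentified note in deidentified_note.txt, and an illustrative list of conditions to check for.

    # Probe run-to-run variability of an LLM summary of a single deidentified note.
    # The SDK, model name, note file, and condition list are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment
    note = open("deidentified_note.txt").read()
    conditions = ["hypertension", "diabetes", "heart failure", "pneumonia"]

    prompt = f"Summarize this clinical note for a physician in under 150 words:\n\n{note}"

    for run in range(5):
        response = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        summary = response.choices[0].message.content.lower()
        # Crude substring check of which conditions each run happens to mention.
        mentioned = [c for c in conditions if c in summary]
        print(f"Run {run + 1}: mentions {mentioned}")

Even this crude check can reveal whether repeated runs surface the same conditions; a clinical study would of course require far more rigorous evaluation.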
Sycophancy Bias and Clinical Significance
Another noteworthy concern is "sycophancy" bias, in which an LLM tailors its responses to perceived user expectations. The article emphasizes that even small differences in prompts can change the output: when summarizing previous admissions for a hypothetical patient, the LLM's responses differed depending on whether the stated concern was myocardial infarction or pneumonia.
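The sketch below illustrates the kind of framing difference described here, requesting the same summary under two stated clinical concerns so the outputs can be compared side by side. It is only a rough illustration under assumed inputs: the framing sentences, model name, and note file are placeholders rather than the viewpoint's actual materials.

    # Compare summaries produced under two different stated clinical concerns.
    # The framings, model name, and note file are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()
    note = open("deidentified_note.txt").read()

    framings = {
        "myocardial infarction": "The admitting team is concerned about myocardial infarction.",
        "pneumonia": "The admitting team is concerned about pneumonia.",
    }

    for concern, framing in framings.items():
        prompt = (
            f"{framing}\n"
            "Summarize this patient's previous admissions:\n\n"
            f"{note}"
        )
        response = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- Stated concern: {concern} ---")
        print(response.choices[0].message.content)

If the two summaries emphasize different admissions or findings, that divergence is exactly the sycophancy-style sensitivity Goodman describes.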
Goodman also draws attention to the fact that a seemingly accurate summary may contain small errors of clinical significance. In one example involving a chest radiography report, the LLM summary added the word "fever", a one-word mistake. Such errors, though subtle, could lead to faulty decision-making and misdiagnosis.
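One very simple way to surface this kind of error is to flag clinically loaded terms that appear in a summary but never appear in the source document. The sketch below is a deliberately naive version of such a check, with a made-up watchlist and toy report text; a real safeguard would need negation handling, synonym matching, and clinician review.

    # Flag clinically loaded terms that a summary introduces but the source never mentions.
    # The watchlist and example texts are illustrative, not a validated screening tool.
    import re

    WATCHLIST = {"fever", "sepsis", "hemorrhage", "fracture", "malignancy"}

    def unsupported_terms(source: str, summary: str) -> set[str]:
        source_words = set(re.findall(r"[a-z]+", source.lower()))
        summary_words = set(re.findall(r"[a-z]+", summary.lower()))
        # Watchlist terms the summary contains that the source does not.
        return {t for t in WATCHLIST if t in summary_words and t not in source_words}

    report = "Chest radiograph: no focal consolidation. No pleural effusion."
    summary = "Chest imaging shows no consolidation; the patient has a fever."
    print(unsupported_terms(report, summary))  # prints {'fever'}

A check along these lines targets exactly the kind of single-word addition described above, though it says nothing about errors of omission or misstatement.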
The Call for Safeguards and Oversight
In light of these challenges, Goodman outlines a three-pronged approach to address the risks associated with LLM-generated clinical summaries:
1. Comprehensive Standards: The industry needs to establish comprehensive standards for LLM-generated summaries, extending beyond accuracy to include stress testing for sycophancy and identifying small yet clinically important errors. These standards should be developed with input from diverse stakeholders, ensuring scientific and clinical consensus rather than being dominated by a few technology companies.
2. Clinical Testing: LLMs designed for clinical summarization should undergo rigorous clinical testing to quantify both potential harms and benefits before widespread deployment. This step is crucial to understanding the real-world impact of these tools on patient care and clinical decision-making.
3. Regulatory Clarifications: Acknowledging the evolving landscape, Goodman calls on the FDA to issue regulatory clarifications, particularly for LLMs that permit more open-ended clinician prompting. The FDA should recognize in advance that certain prompts cause LLMs to function as medical devices, even when those prompts are nominally restricted to summarization tasks.
Overall, Goodman advocates a proactive approach to the challenges of LLM-generated clinical summaries. Recognizing the limitations of existing FDA regulation, she proposes comprehensive industry standards, rigorous clinical testing, and regulatory clarifications. By grounding standard development and FDA guidance in scientific and clinical consensus, she aims to balance the transformative potential of generative AI in EHR data summarization against the imperative of patient safety and sound clinical decision-making.