NIH findings shed light on risks and benefits of integrating AI into medical decision-making

July 23, 2024: Researchers at the National Institutes of Health (NIH) found that an artificial intelligence (AI) model answered questions on a medical questionnaire (designed to assess health care professionals’ ability to diagnose patients based on clinical images and a brief text summary) with high accuracy. However, the evaluating physicians found that the AI model made errors when describing the images and explaining how its decision-making led to the correct answer. The findings, which shed light on the potential of AI in the clinical setting, were published in npj Digital Medicine. The study was led by researchers at the NIH’s National Library of Medicine (NLM) and Weill Cornell Medicine in New York City.

“Integrating AI into health care holds great promise as a tool to help medical professionals diagnose patients more quickly, allowing them to begin treatment sooner,” said NLM Interim Director Dr. Stephen Sherry. “However, as this study demonstrates, AI is not yet advanced enough to replace human expertise, which is crucial for accurate diagnosis.”

The AI model and human doctors answered questions from the New England Journal of Medicine (NEJM) Image Challenge. The challenge is an online quiz that provides real clinical images and a brief text description including details about the patient’s symptoms and presentation, and then asks users to choose the correct diagnosis from multiple-choice answers.

The researchers asked the AI model to answer 207 image challenge questions and provide a written justification for each answer. The prompt specified that the justification should include a description of the image, a summary of the relevant medical knowledge, and step-by-step reasoning for how the model chose its answer.

Nine physicians from various institutions, each with a different medical specialty, were recruited and answered their assigned questions first in a “closed-book” setting (without consulting any external materials such as online resources) and then in an “open-book” setting (using external resources). The researchers then provided the physicians with the correct answer, along with the AI model’s response and corresponding justification. Finally, the physicians were asked to rate the AI model’s ability to describe the image, summarize the relevant medical knowledge, and provide its step-by-step reasoning.

The researchers found that both the AI model and the physicians scored highly in selecting the correct diagnosis. Interestingly, the AI model selected the correct diagnosis more often than physicians in the closed-book setting, while physicians using open-book resources outperformed the AI model, especially on questions rated more difficult.

Importantly, according to the physicians’ assessments, the AI model often made mistakes when describing the medical image and explaining the reasoning behind the diagnosis, even in cases where it made the correct final decision. In one example, the AI model was provided with a photo of a patient’s arm with two lesions. A physician would readily recognize that both lesions were caused by the same condition. However, because the lesions were presented at different angles, creating the illusion of different colors and shapes, the AI model failed to recognize that both could be related to the same diagnosis.

The researchers say these findings underscore the importance of further evaluating multimodal AI technology before introducing it into the clinical setting.

“This technology has the potential to help physicians augment their capabilities with data-driven insights that can lead to better clinical decision making,” said NLM senior investigator and corresponding author Dr. Zhiyong Lu. “Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine.”

The study used an AI model known as GPT-4V (Generative Pretrained Transformer 4 with Vision), which is a “multimodal AI model” that can process combinations of multiple types of data, including text and images. The researchers note that while this is a small study, it sheds light on the potential for multimodal AI to help doctors make medical decisions. More research is needed to understand how these models compare to doctors’ ability to diagnose patients.

The study was co-authored by collaborators from the National Eye Institute and the NIH Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center, Dallas; New York University Grossman School of Medicine, New York City; Harvard Medical School and Massachusetts General Hospital, Boston; Case Western Reserve University School of Medicine, Cleveland; the University of California, San Diego, La Jolla; and the University of Arkansas, Little Rock.

The National Library of Medicine (NLM) is a leader in biomedical informatics and data science research and the largest biomedical library in the world. NLM conducts and supports research on methods for recording, storing, retrieving, preserving, and communicating health information. NLM creates resources and tools that millions of people use billions of times a year to access and analyze information on molecular biology, biotechnology, toxicology, environmental health, and health services.

For more information: https://www.nlm.nih.gov

Reference

Qiao Jin, et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine (2024). DOI: 10.1038/s41746-024-01185-7.