AI outperforms its peers on medical oncology quiz, but some mistakes could be harmful

In a recent study published in JAMA Network Open, researchers evaluated the accuracy and safety of large language models (LLMs) in answering medical oncology exam questions.

Study: Performance of large language models on medical oncology exam questions. Image Credit: CHILD ANTHONY/Shutterstock.com

Background

LLMs have the potential to revolutionize healthcare by assisting doctors with clinical tasks and interacting with patients. These models, trained on vast corpora of text, can be fine-tuned to answer questions with human-like responses.

LLMs encode extensive medical knowledge and have demonstrated the ability to pass the United States (US) Medical Licensing Examination, indicating both understanding and reasoning. However, their performance varies across medical subspecialties.

With rapidly evolving knowledge and a high volume of publications, medical oncology presents a unique challenge.

More research is needed to ensure that LLMs can reliably and safely apply their medical knowledge to dynamic and specialized fields such as medical oncology, improving medical support and patient care.

About the study

The present study, conducted from May 28 to October 11, 2023, followed Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines and did not require ethics board approval or informed consent due to the lack of human participants.

The American Society of Clinical Oncology (ASCO) publicly accessible question bank provided 52 multiple-choice questions, each with a correct answer and explanatory references. Similarly, the 2021 and 2022 European Society for Medical Oncology (ESMO) exam questions provided 75 questions after image-based ones were excluded, with answers developed by oncologists.

To control for the possibility that the publicly available questions had appeared in the models' training data, oncologists also created 20 original questions in the same multiple-choice format.

Chat Generative Pre-trained Transformer (ChatGPT)-3.5 and ChatGPT-4 were used to answer these questions, labeled proprietary LLM 1 and proprietary LLM 2 for consistent comparison. Six open-source LLMs were also evaluated, including BioMistral-7B DARE, a model adapted to the biomedical domain.

Responses were recorded, with explanations rated on a four-level error scale. Statistical analysis, performed in R version 4.3.0, assessed accuracy, error distribution, and agreement between the oncologist raters.

The analysis used the binomial distribution, McNemar's test, Fisher's exact test, weighted κ, and the Wilcoxon rank-sum test, with a two-sided P value below 0.05 indicating statistical significance.
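To make the reported tests concrete, the following R sketch shows how an accuracy analysis of this kind could be run. It is an illustration, not the authors' code: the answer counts come from the results below, the 20% chance level assumes five answer options per question, and the paired correctness vectors are invented for demonstration.

    # Exact binomial test: is 125/147 correct better than chance?
    # The 20% chance level assumes five answer options per question (an assumption).
    binom.test(x = 125, n = 147, p = 0.2, alternative = "greater")

    # McNemar's test for a paired comparison of two models on the same questions;
    # these per-question correctness vectors are purely illustrative.
    llm1_correct <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
    llm2_correct <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
    mcnemar.test(table(llm1_correct, llm2_correct))

In the same spirit, agreement between raters on the error scale could be quantified with a weighted κ, for example via irr::kappa2 with squared weights, though the study's exact implementation is not specified here.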

Study results

The evaluation of the LLMs covered 147 exam questions: 52 from ASCO, 75 from ESMO, and 20 original questions. Hematology was the most common category (15.0%), though the questions spanned a range of topics.

The ESMO questions were more general, addressing the mechanisms and toxic effects of systemic therapies. Notably, 27.9% of the questions required knowledge of evidence published from 2018 onward. The LLMs provided prose answers to all questions, and proprietary LLM 2 required additional prompting to commit to a specific answer in 22.4% of cases.

One selected ASCO question involved a 62-year-old woman with metastatic breast cancer who presented with symptoms of pulmonary embolism. Proprietary LLM 2 correctly identified the best treatment as low-molecular-weight heparin or a direct oral anticoagulant, taking into account the patient's cancer and her travel history.

Another ASCO question described a 61-year-old woman with metastatic colon cancer experiencing neuropathy from her chemotherapy regimen. The LLM recommended switching to targeted therapy with encorafenib and cetuximab, given the tumor's B-Raf proto-oncogene serine/threonine kinase (BRAF) V600E mutation and the neurotoxicity of her current regimen.

Proprietary LLM 2 demonstrated the highest accuracy, answering 85.0% of questions correctly (125 of 147), significantly outperforming both random guessing and the other models. Performance was consistent across the ASCO (80.8%), ESMO (88.0%), and original (85.0%) questions.

When given a second attempt, proprietary LLM 2 corrected 54.5% of its initially incorrect answers. Proprietary LLM 1 and the best open-source LLM, Mixtral-8x7B-v0.1, had lower accuracies of 60.5% and 59.2%, respectively. BioMistral-7B DARE, despite its biomedical adaptation, reached only 33.6%.

Qualitative evaluation of prose responses by physicians showed that the proprietary LLM 2 provided correct and error-free answers for 83.7% of the questions.

Incorrect answers were more frequent when questions required knowledge of recent publications, and the errors identified involved knowledge retrieval, reasoning, and reading comprehension.

Clinicians classified 63.6% of errors as having a medium probability of causing harm, with a high probability in 18.2% of cases. No hallucinations were observed in LLM responses.

Conclusions

In this study, LLMs performed remarkably well on medical oncology exam questions of the kind used to assess trainees approaching clinical practice. Proprietary LLM 2 answered 85.0% of the multiple-choice questions correctly and provided accurate explanations, demonstrating substantial medical oncology knowledge and reasoning ability.

However, incorrect answers, particularly those involving recent publications, raised significant safety concerns. Proprietary LLM 2 outperformed its predecessor, proprietary LLM 1, and demonstrated superior accuracy compared with the other LLMs.

The study showed that while LLM capabilities are improving, errors in knowledge retrieval, especially regarding more recent evidence, pose risks. Improved training and frequent model updates are essential to keep the medical oncology knowledge encoded in LLMs current.