AI Chatbot 'Hallucinates' Faulty Medical Intelligence

Megan Brooks

October 18, 2023

Artificial intelligence (AI) models are typically a year out of date and have this "charming problem of hallucinating made-up data and saying it with all the certainty of an attending on rounds," Isaac Kohane, MD, PhD, of Harvard Medical School, told a packed plenary audience at IDWeek 2023 in Boston, Massachusetts.

Kohane, chair of the Department of Biomedical Informatics, said the future intersection of AI and healthcare is "muddy."

Underscoring those questions about the accuracy of new AI tools, researchers at the meeting presented the results of their own test of ChatGPT.

The AI chatbot is designed for language processing — not scientific accuracy — and does not guarantee that responses to medical queries are fully factual.

To test the accuracy of ChatGPT version 3.5, the researchers asked it whether there are any boxed warnings on the US Food and Drug Administration's (FDA's) labels for common antibiotics and, if so, what they are.
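The poster does not describe exactly how the queries were run. Purely as an illustration, here is a minimal Python sketch of how a similar boxed-warning query could be scripted against OpenAI's API; the drug list and prompt wording here are assumptions, not the researchers' protocol.

# Illustrative sketch only; not the researchers' actual method.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical subset of the 41 antibiotics described in the poster
antibiotics = ["fidaxomicin", "cefepime", "aztreonam"]

for drug in antibiotics:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # version 3.5, the version tested in the study
        messages=[{
            "role": "user",
            "content": f"Are there any FDA boxed warnings on the label for {drug}? "
                       "If so, what are they?",
        }],
    )
    # Each answer would still need to be checked by hand against the actual
    # FDA label; that comparison is what the researchers scored.
    print(drug, "->", response.choices[0].message.content)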

ChatGPT provided correct answers about FDA boxed warnings for only 12 of the 41 antibiotics queried — a matching rate of just 29%.

For the other 29 antibiotics, ChatGPT either "incorrectly reported that there was an FDA boxed warning when there was not, or inaccurately or incorrectly reported the boxed warning," Rebecca Linfield, MD, infectious diseases fellow, Stanford University, California, told Medscape Medical News.

Uncritical AI Use Risky

Nine of the 41 antibiotics included in the query have boxed warnings. ChatGPT correctly identified all nine as having a boxed warning, but the warning it described matched the actual adverse event for only three of them (33%). For the 32 antibiotics without an FDA boxed warning, ChatGPT correctly reported that only 9 (28%) do not have one. Those three matches plus the nine correct negatives account for the 12 fully correct answers.

For example, ChatGPT stated that the antibiotic fidaxomicin has a boxed warning for increased risk for C difficile infection, "but it is the first-line antibiotic used to treat C difficile," Linfield pointed out.

ChatGPT also reported that cefepime increased the risk for death in those with pneumonia and fabricated a study supporting that assertion. "However, cefepime is a first-line drug for those with hospital-acquired pneumonia," Linfield explained.

"I can imagine a worried family member finding this through ChatGPT, and needing to have extensive reassurances from the patient's physicians about why this antibiotic was chosen," she said.

ChatGPT also incorrectly stated that aztreonam has a boxed warning for increased mortality.

"The risk is that both physicians and the public uncritically use ChatGPT as an easily-accessible, readable source of clinically-validated information, when these large language models are meant to generate fluid text, and not necessarily accurate information," Linfield told Medscape Medical News.

Linfield said that the next step is to compare the ChatGPT 3.5 used in this analysis with ChatGPT 4, as well as with Google's Med-PaLM 2 after it is released to the public.

Advancing Fast

In his plenary talk, Kohane pointed out that AI is a quick learner and that improvements in these tools are coming fast.

As an example, just 3 years ago, the best AI tool could score about as well as the worst student taking the medical boards, he told the audience. "Three years later, the leading large language models are scoring better than 90% of all the candidates. What's it going to be doing next year?" he asked.

"I don't know," Kohane said, "but it will be better than this year." AI will "transform healthcare."

IDWeek 2023 Annual Meeting: Poster 338. Presented October 12, 2023.
