Is ChatGPT Smarter Than a PCP?

Becky McCall

October 20, 2023

GLASGOW, Scotland ― ChatGPT failed to pass the UK's National Primary Care examinations in a new study, highlighting how artificial intelligence (AI) does not necessarily match human perceptions of medical complexity.

ChatGPT also provided novel explanations: it frequently "hallucinates," presenting inaccurate information as if it were fact, according to Shathar Mahmood, BA, a fifth-year medical student at the University of Cambridge School of Clinical Medicine, Cambridge, UK, who presented the findings at the Royal College of General Practitioners (RCGP) Annual Conference 2023. The study was published in JMIR Medical Education earlier this year.

"Artificial intelligence has generated impressive results across medicine, and with the release of ChatGPT there is now discussion about these large language models taking over clinicians' jobs," Arun James Thirunavukarasu, MB BChir, of the University of Oxford and Oxford University Hospitals NHS Foundation Trust, who is the lead author of the study, told Medscape Medical News.

The performance of AI on medical school examinations has prompted much of this discussion, even though that performance often does not reflect real-world clinical practice, he said. "We used the Applied Knowledge Test instead, and this allowed us to explore the potential and pitfalls of deploying large language models in primary care and to explore what further development of medical large language model applications is required."

The researchers investigated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test. The computer-based, multiple-choice assessment is part of the UK's specialty training to become a general practitioner (GP). It tests the knowledge underpinning general practice within the context of the UK's National Health Service.

The researchers entered a series of 674 questions into ChatGPT on two occasions, or "runs." "By putting the questions into two separate dialogues, we hoped to avoid the influence of one dialogue on the other," Mahmood said. To validate that the answers were correct, the ChatGPT responses were compared with the answers provided by the GP self-test and past articles.

Docs 1, AI 0

Overall performance of the algorithm was similar across both runs (59.94% and 60.39%); 83.23% of questions produced the same answer on both runs.

But 17% of the answers didn't match, Mahmood reported, a statistically significant difference. "And the overall performance of ChatGPT was 10% lower than the average RCGP pass mark in the last few years, which informs one of our conclusions about it not being very precise at expert-level recall and decision-making," she said.

Also, a small percentage of questions (1.48% and 2.25% in the first and second runs, respectively) produced an uncertain answer or no answer at all.
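For readers who want to see how figures of this kind are derived, the following is a minimal sketch in Python, not the researchers' own analysis. It computes per-run accuracy, the share of questions answered identically on both runs, and the share left unanswered. The function names and the five-question answer lists are hypothetical and purely illustrative; they are not data from the study, and the sketch does not reproduce the study's statistical test.

```python
# Illustrative only: toy data, not the study's 674 questions or its results.

def run_accuracy(answers, key):
    """Fraction of questions where the model's answer matches the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def run_agreement(run_a, run_b):
    """Fraction of questions that received the same answer on both runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)

def unanswered_rate(answers):
    """Fraction of questions with no usable answer (recorded here as None)."""
    return sum(a is None for a in answers) / len(answers)

# Hypothetical answer key and two hypothetical runs over five questions.
key = ["A", "C", "B", "D", "A"]
run_1 = ["A", "C", "D", None, "A"]
run_2 = ["A", "B", "D", "D", "A"]

print(f"Run 1 accuracy: {run_accuracy(run_1, key):.2%}")
print(f"Run 2 accuracy: {run_accuracy(run_2, key):.2%}")
print(f"Same answer on both runs: {run_agreement(run_1, run_2):.2%}")
print(f"Run 1 unanswered: {unanswered_rate(run_1):.2%}")
```

In this toy example the two runs score the same overall yet disagree on individual questions, the same pattern the researchers describe: similar headline accuracy can conceal inconsistent answers.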

Say What?

Running a question through ChatGPT generated novel explanations in the form of an extended answer, Mahmood said. When the accuracy of the extended answers was checked against the correct answers, no correlation was found. "ChatGPT can hallucinate answers, and there's no way a nonexpert reading this could know it is incorrect," she said.

Regarding the application of ChatGPT and similar algorithms to clinical practice, Mahmood was clear. "As they stand, [AI systems] will not be able to replace the healthcare professional workforce, in primary care at least," she said. "I think larger and more medically specific datasets are required to improve their outputs in this field."

Sandip Pramanik, MBChB, a GP in Watford, Hertfordshire, UK, said the study "clearly showed ChatGPT's struggle to deal with the complexity of the exam questions that are based on the primary care system. In essence, this is indicative of the human factors involved in decision-making in primary care."

The Applied Knowledge Test is designed to test the knowledge required to be a generalist in the primary care setting, and as such, there are lots of nuances reflecting this within the questions, Pramanik said.

"ChatGPT may look at these in a more black and white way, whereas the generalist needs to be reflective of the complexities involved and the different possibilities that can present rather than take a binary 'yes' or 'no' stance," he said. "In fact, this highlights a lot about the nature of general practice in managing uncertainty, and this is reflected in the questions asked in the exam," he remarked. He noted, "Being a generalist is about factoring in human emotion and human perception as well as knowledge."

Mahmood, Thirunavukarasu, and Pramanik have disclosed no relevant financial relationships.

Royal College of General Practitioners (RCGP) Annual Conference 2023: Poster presented October 19, 2023.

JMIR Med Educ. Published April 21, 2023.

Becky McCall is a freelance medical journalist based in London, UK. She has written for Medscape for nearly 15 years.
