Comparative evaluation of large language models on multiple-choice and image-based rheumatology questions

Abstract

Large language models (LLMs) are increasingly used in medical education and clinical decision support, including applications in rheumatology. We evaluated seven publicly accessible LLM tools, ChatGPT (GPT-3.5 and GPT-4.0), Claude Sonnet 4, Gemini, Perplexity AI, DeepSeek, and OpenEvidence, using 50 multiple-choice questions (MCQs) and 25 image-based diagnostic prompts. We assessed accuracy, self-reported confidence, and hallucination rates. In MCQs, DeepSeek achieved the highest accuracy (96%), followed by Claude (94%), GPT-3.5 (92%), GPT-4.0 (92%), Gemini (92%), OpenEvidence (92%), and Perplexity (90%). Image-based performance was lower and more variable, ranging from 16% (Claude) to 56% (Gemini). All models showed significantly reduced odds of correct responses to image questions compared to MCQs (p < 0.01). Claude performed significantly worse than GPT-3.5 on image-based questions (OR 0.24; 95% CI: 0.06–0.86, p = 0.04); no model significantly outperformed GPT-3.5 on MCQs. Confidence scores remained high across all models, ranging from 7 to 10 for MCQs and 8 to 10 for image-based questions. Hallucinations were rare for MCQs (n = 3; 1 Gemini, 2 Perplexity) but common in image responses, ranging from 40% (Gemini) to 84% (Claude). Publicly available LLMs demonstrate high accuracy on text-based rheumatology questions but show limited capability in image interpretation. High confidence in incorrect image responses and frequent hallucinations highlight the need for caution when integrating these tools into clinical education or decision-making.
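The gap between text and image performance can be illustrated with a crude, unadjusted odds ratio computed from the reported percentages. The sketch below is only an illustration: the abstract does not state which statistical model produced the published odds ratios, so these values are not expected to match them. The correct-answer counts are back-calculated from the reported accuracies and the 50 MCQ and 25 image items, and the function names are hypothetical.

```python
# Counts are back-calculated from the percentages in the abstract
# (50 MCQs, 25 image prompts); they are not taken from the paper's tables.
N_MCQ, N_IMAGE = 50, 25

# model: (reported MCQ accuracy, reported image accuracy)
reported_accuracy = {
    "Claude": (0.94, 0.16),
    "Gemini": (0.92, 0.56),
}


def correct_count(accuracy, n_items):
    """Convert a reported accuracy into a whole number of correct answers."""
    return round(accuracy * n_items)


def crude_odds_ratio(correct_a, n_a, correct_b, n_b):
    """Unadjusted odds ratio of a correct answer in set A versus set B.

    Assumes neither set is answered entirely correctly (odds would be undefined).
    """
    odds_a = correct_a / (n_a - correct_a)
    odds_b = correct_b / (n_b - correct_b)
    return odds_a / odds_b


for model, (mcq_acc, image_acc) in reported_accuracy.items():
    mcq_correct = correct_count(mcq_acc, N_MCQ)
    image_correct = correct_count(image_acc, N_IMAGE)
    ratio = crude_odds_ratio(image_correct, N_IMAGE, mcq_correct, N_MCQ)
    print(
        f"{model}: MCQ {mcq_correct}/{N_MCQ}, "
        f"image {image_correct}/{N_IMAGE}, crude OR {ratio:.3f}"
    )
```

An odds ratio well below 1 means a correct answer was much less likely on image prompts than on MCQs. Because every model answered the same questions, a model that accounts for repeated measures would be the more appropriate choice for formal inference.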
Authors
Pannathorn Nakaphan
Ivan Damara
Bhoowit Lerttiendamrong
Varote Shotelersuk
Nattanicha Chaisrimaneepan
Publication date
01-01-2026
Publisher
Springer Berlin Heidelberg
Published in
Rheumatology International / Issue 1/2026
Print ISSN: 0172-8172
Electronic ISSN: 1437-160X
DOI
https://doi.org/10.1007/s00296-025-06053-5