Comparative evaluation of large language models on multiple-choice and image-based rheumatology questions
- 01-01-2026
- Artificial Intelligence
- Observational Research
- Authors
- Pannathorn Nakaphan
- Ivan Damara
- Bhoowit Lerttiendamrong
- Varote Shotelersuk
- Nattanicha Chaisrimaneepan
- Published in
- Rheumatology International | Issue 1/2026
Abstract
Large language models (LLMs) are increasingly used in medical education and clinical decision support, including applications in rheumatology. We evaluated seven publicly accessible LLM tools, ChatGPT (GPT-3.5 and GPT-4.0), Claude Sonnet 4, Gemini, Perplexity AI, DeepSeek, and OpenEvidence, using 50 multiple-choice questions (MCQs) and 25 image-based diagnostic prompts. We assessed accuracy, self-reported confidence, and hallucination rates. In MCQs, DeepSeek achieved the highest accuracy (96%), followed by Claude (94%), GPT-3.5 (92%), GPT-4.0 (92%), Gemini (92%), OpenEvidence (92%), and Perplexity (90%). Image-based performance was lower and more variable, ranging from 16% (Claude) to 56% (Gemini). All models showed significantly reduced odds of correct responses to image questions compared to MCQs (p < 0.01). Claude performed significantly worse than GPT-3.5 on image-based questions (OR 0.24; 95% CI: 0.06–0.86, p = 0.04); no model significantly outperformed GPT-3.5 on MCQs. Confidence scores remained high across all models, ranging from 7 to 10 for MCQs and 8 to 10 for image-based questions. Hallucinations were rare for MCQs (n = 3; 1 Gemini, 2 Perplexity) but common in image responses, ranging from 40% (Gemini) to 84% (Claude). Publicly available LLMs demonstrate high accuracy on text-based rheumatology questions but show limited capability in image interpretation. High confidence in incorrect image responses and frequent hallucinations highlight the need for caution when integrating these tools into clinical education or decision-making.
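As a rough illustration of the arithmetic behind these figures, the sketch below converts the reported percentages into counts (out of 50 MCQs and 25 image prompts) and computes a crude, unadjusted odds ratio for image versus MCQ accuracy. The abstract does not state which statistical model produced the published odds ratios, so this is only a back-of-the-envelope check and is not expected to reproduce them. The function names `correct_count` and `unadjusted_odds_ratio` are hypothetical.

```python
from fractions import Fraction

N_MCQ = 50     # multiple-choice questions per model (from the abstract)
N_IMAGE = 25   # image-based prompts per model (from the abstract)

# Reported accuracy in percent, taken from the abstract.
ACCURACY_PCT = {
    "Claude": {"mcq": 94, "image": 16},
    "Gemini": {"mcq": 92, "image": 56},
}


def correct_count(pct, n_items):
    """Convert a reported percentage into a whole number of correct answers."""
    count = Fraction(pct, 100) * n_items
    if count.denominator != 1:
        raise ValueError(f"{pct}% of {n_items} is not a whole number")
    return int(count)


def unadjusted_odds_ratio(correct_a, n_a, correct_b, n_b):
    """Crude odds ratio of a correct answer in condition A versus condition B.

    Assumption: a plain 2x2 table with no adjustment for question or model
    effects; the paper's actual analysis may differ.
    """
    wrong_a = n_a - correct_a
    wrong_b = n_b - correct_b
    return Fraction(correct_a * wrong_b, wrong_a * correct_b)


for model, pct in ACCURACY_PCT.items():
    img = correct_count(pct["image"], N_IMAGE)
    mcq = correct_count(pct["mcq"], N_MCQ)
    ratio = unadjusted_odds_ratio(img, N_IMAGE, mcq, N_MCQ)
    print(f"{model}: image {img}/{N_IMAGE}, MCQ {mcq}/{N_MCQ}, "
          f"crude OR = {float(ratio):.3f}")
```

An odds ratio well below 1 here simply restates that a model was far less likely to answer an image prompt correctly than an MCQ, which is consistent with the direction of the finding reported above.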
- Publisher
- Springer Berlin Heidelberg
- Keyword
- Artificial Intelligence
- Published in
- Rheumatology International | Issue 1/2026
- Print ISSN: 0172-8172
- Electronic ISSN: 1437-160X
- DOI
- https://doi.org/10.1007/s00296-025-06053-5