25-09-2023 | Methotrexate | Observational Research

Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use

Authors: Belkis Nihan Coskun, Burcu Yagiz, Gokhan Ocakoglu, Ediz Dalkilic, Yavuz Pehlivan

Published in: Rheumatology International | Issue 3/2024

Abstract

We aimed to assess the accuracy and completeness of Large Language Models (LLMs), namely ChatGPT 3.5 and 4, BARD, and Bing, when answering questions about methotrexate (MTX) use in the treatment of rheumatoid arthritis. We took 23 questions on MTX-related concerns from an earlier study and entered them into each LLM. Two reviewers rated the accuracy and completeness of every response on Likert scales. Both GPT models achieved a 100% correct answer rate, while BARD and Bing each scored 73.91%. For accuracy of the outputs (completely correct responses), GPT-4 scored 100%, GPT-3.5 scored 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.70% non-responses, while Bing produced 13.04% incorrect responses and 13.04% non-responses. The ChatGPT models gave significantly more accurate responses than Bing in the “mechanism of action” category, and GPT-4 was significantly more accurate than BARD in the “side effects” category. There were no statistically significant differences in accuracy among the models in the “lifestyle” category. For completeness, GPT-4 produced comprehensive outputs for 100% of the questions, followed by GPT-3.5 at 86.96%, BARD at 60.87%, and Bing at 0%. In the “mechanism of action” category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. In the “side effects” and “lifestyle” categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
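
The percentages in the abstract can be read as fractions of the 23 questions. The short Python sketch below is not part of the study; it only illustrates that arithmetic for BARD's accuracy figures. The helper name `count_from_percent` and the label for the remaining questions are assumptions made for illustration, as is the reading that every percentage is a share of all 23 questions.

```python
TOTAL_QUESTIONS = 23  # question count stated in the abstract


def count_from_percent(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Convert a reported percentage back to a whole number of questions."""
    count = round(percent * total / 100)
    # Reject values that do not match a whole count after two-decimal rounding.
    if abs(100 * count / total - percent) > 0.05:
        raise ValueError(f"{percent}% is not a whole count out of {total}")
    return count


# BARD accuracy figures as reported in the abstract.
bard_reported = {"completely correct": 60.87, "incorrect": 17.39, "no response": 8.70}
bard_counts = {label: count_from_percent(p) for label, p in bard_reported.items()}

# Assumption: the questions not covered above were correct but not completely so.
bard_counts["correct but not complete (assumed)"] = TOTAL_QUESTIONS - sum(bard_counts.values())

for label, n in bard_counts.items():
    print(f"{label}: {n}/{TOTAL_QUESTIONS}")
```

Under this reading, BARD gave 14 completely correct answers and 3 further correct answers, which together make 17 of 23 and match the 73.91% correct answer rate reported for BARD.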
Metadata

Title: Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use
Authors: Belkis Nihan Coskun, Burcu Yagiz, Gokhan Ocakoglu, Ediz Dalkilic, Yavuz Pehlivan
Publication date: 25-09-2023
Publisher: Springer Berlin Heidelberg
Published in: Rheumatology International | Issue 3/2024
Print ISSN: 0172-8172
Electronic ISSN: 1437-160X
DOI: https://doi.org/10.1007/s00296-023-05473-5
