
Open Access 05-03-2024 | Artificial Intelligence | 2024 SAGES Poster

Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis

Authors: Yazid K. Ghanem, Armaun D. Rouhi, Ammr Al-Houssan, Zena Saleh, Matthew C. Moccia, Hansa Joshi, Kristoffel R. Dumon, Young Hong, Francis Spitz, Amit R. Joshi, Michael Kwiatt

Published in: Surgical Endoscopy | Issue 5/2024


Abstract

Introduction

Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis.

Methods

A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16–80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators, blinded to the identity of the AI platforms, independently scored the generated texts.
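The FRE and FKGL metrics used above are fixed formulas over sentence, word, and syllable counts. The sketch below is a minimal illustration of those formulas, not the tooling used in the study; in particular, the vowel-group syllable counter is a crude assumption, whereas established readability calculators rely on dictionary-based syllabification.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (minimum 1 per word).
    # Illustrative only; real tools syllabify against a pronunciation dictionary.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = n_words / sentences
    syllables_per_word = n_syllables / n_words

    # Flesch Reading Ease: 0-100, higher = easier to read.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade required.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

if __name__ == "__main__":
    sample = "Appendicitis is an inflammation of the appendix that often requires surgery."
    fre, fkgl = readability(sample)
    print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
```

On the FRE scale, scores below roughly 50 correspond to college-level text, which is the interpretation applied to the chatbot outputs in the Results.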

Results

ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16–80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score than ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, whereas Claude-2 provided fabricated sources. All chatbots except Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores were 14.6 and 23.8 for ChatGPT-3.5, 11.9 and 33.9 for ChatGPT-4, 8.6 and 52.8 for Bard, and 11.0 and 36.6 for Claude-2, indicating difficult readability at a college reading skill level.

Conclusion

AI-generated medical information on appendicitis scored favorably on quality assessment, but most chatbots either fabricated their sources or provided none at all. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.
Metadata
Title
Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis
Authors
Yazid K. Ghanem
Armaun D. Rouhi
Ammr Al-Houssan
Zena Saleh
Matthew C. Moccia
Hansa Joshi
Kristoffel R. Dumon
Young Hong
Francis Spitz
Amit R. Joshi
Michael Kwiatt
Publication date
05-03-2024
Publisher
Springer US
Published in
Surgical Endoscopy / Issue 5/2024
Print ISSN: 0930-2794
Electronic ISSN: 1432-2218
DOI
https://doi.org/10.1007/s00464-024-10739-5
