Published in: Surgical Endoscopy 5/2024

Open Access 12-03-2024 | Bariatric Surgery

Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources

Authors: Nitin Srinivasan, Jamil S. Samaan, Nithya D. Rajeev, Mmerobasi U. Kanu, Yee Hui Yeo, Kamran Samakar


Abstract

Background

The readability of online bariatric surgery patient education materials (PEMs) often surpasses the recommended 6th grade level. Large language models (LLMs), like ChatGPT and Bard, have the potential to revolutionize PEM delivery. We aimed to evaluate the readability of PEMs produced by U.S. medical institutions compared to LLMs, as well as the ability of LLMs to simplify their responses.

Methods

Responses to frequently asked questions (FAQs) related to bariatric surgery were gathered from top-ranked health institutions. FAQ responses were also generated from GPT-3.5, GPT-4, and Bard. The LLMs were then prompted to improve the readability of their initial responses. The readability of institutional responses, initial LLM responses, and simplified LLM responses was graded using validated readability formulas. The accuracy and comprehensiveness of initial and simplified LLM responses were also compared.
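As an illustration of the kind of validated readability formula the study applied (the authors' exact implementation is not shown here), the sketch below computes the Automated Readability Index (ARI; Smith & Senter, 1967), which estimates a U.S. grade level from character, word, and sentence counts. The tokenization rules are simplified assumptions for demonstration only.

```python
import re


def automated_readability_index(text: str) -> float:
    """Estimate a U.S. grade level using the ARI formula:
    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

    Tokenization here is deliberately naive (split on whitespace and
    sentence-ending punctuation); production tools use more robust rules.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    # Count letters/digits only, ignoring attached punctuation.
    chars = sum(len(w.strip(".,!?;:\"'()")) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43
```

Longer words and sentences raise the score, so dense clinical prose grades far above the recommended 6th grade level, while short declarative sentences grade low.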

Results

Responses to 66 FAQs were included. All institutional and initial LLM responses had poor readability, with average reading levels ranging from 9th grade to college graduate. Simplified responses from the LLMs had significantly improved readability, with reading levels ranging from 6th grade to college freshman. Among the simplified responses, GPT-4's demonstrated the highest readability, with reading levels ranging from 6th to 9th grade. Accuracy was similar between initial and simplified responses from all LLMs. Comprehensiveness was similar between initial and simplified responses from GPT-3.5 and GPT-4. However, 34.8% of Bard's simplified responses were graded as less comprehensive than its initial responses.

Conclusion

Our study highlights the efficacy of LLMs in enhancing the readability of bariatric surgery PEMs. GPT-4 outperformed the other models, generating simplified PEMs at 6th to 9th grade reading levels. Unlike those of GPT-3.5 and GPT-4, a portion of Bard's simplified responses were graded as less comprehensive than its initial responses. We advocate for future studies examining the potential role of LLMs as dynamic and personalized sources of PEMs for diverse patient populations of all literacy levels.
Metadata
Title
Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources
Authors
Nitin Srinivasan
Jamil S. Samaan
Nithya D. Rajeev
Mmerobasi U. Kanu
Yee Hui Yeo
Kamran Samakar
Publication date
12-03-2024
Publisher
Springer US
Published in
Surgical Endoscopy / Issue 5/2024
Print ISSN: 0930-2794
Electronic ISSN: 1432-2218
DOI
https://doi.org/10.1007/s00464-024-10720-2
