
Open Access 01-12-2024 | Research

Assessing the research landscape and clinical utility of large language models: a scoping review

Authors: Ye-Jean Park, Abhinav Pillai, Jiawen Deng, Eddie Guo, Mehul Gupta, Mike Paget, Christopher Naugler

Published in: BMC Medical Informatics and Decision Making | Issue 1/2024


Abstract

Importance

Large language models (LLMs) like OpenAI’s ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base.

Objective

This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs’ clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications.

Evidence review

We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv, covering January 2023 (inception of the search) to June 26, 2023, for English-language papers, and analyzed findings from 55 studies conducted worldwide. The quality of evidence was reported according to the Oxford Centre for Evidence-Based Medicine recommendations.

Findings

Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and, to some extent, supporting clinical decision-making when combined with human oversight. However, their use is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating the effectiveness and feasibility of LLMs.

Conclusions and relevance

This review highlights potential future directions and research questions to address these limitations and to further explore the potential of LLMs to enhance healthcare delivery.
Metadata
Publisher
BioMed Central
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-024-02459-6
