Open Access 01-12-2024 | Research

Analyzing evaluation methods for large language models in the medical field: a scoping review

Authors: Junbok Lee, Sungkyung Park, Jaeyong Shin, Belong Cho

Published in: BMC Medical Informatics and Decision Making | Issue 1/2024


Abstract

Background

Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.

Objective

This study reviews published evaluations of LLMs in the medical field and analyzes the research methods they used, aiming to provide a reference for future researchers designing LLM evaluation studies.

Methods & materials

We conducted a scoping review, searching three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of evaluation methods, the number of questions (queries), the evaluators, repeated measurements, additional analysis methods, the use of prompt engineering, and metrics other than accuracy.
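
For illustration only (this is not the review's actual search procedure), the sketch below shows how a comparable PubMed query restricted to the same publication window could be run programmatically with Biopython's Entrez utilities; the contact e-mail address and search terms are placeholder assumptions.

    from Bio import Entrez  # Biopython wrapper around the NCBI E-utilities

    Entrez.email = "reviewer@example.org"  # placeholder; NCBI asks for a contact address

    # Hypothetical search terms; the review's exact strategy is not reproduced here.
    query = '("large language model" OR ChatGPT OR GPT-4) AND (medicine OR clinical)'

    handle = Entrez.esearch(
        db="pubmed",
        term=query,
        datetype="pdat",        # filter on publication date
        mindate="2023/01/01",
        maxdate="2023/09/30",
        retmax=1000,
    )
    record = Entrez.read(handle)
    handle.close()

    print(record["Count"], "candidate records")
    print(record["IdList"][:10])  # first ten PubMed IDs to screen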

Results

A total of 142 articles met the inclusion criteria. LLM evaluations fell primarily into two categories: administering test examinations (n = 53, 37.3%) or having responses evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two approaches (n = 4, 2.8%). Among the examination-based studies, most used 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. Among the studies evaluated by medical professionals, most used 50 or fewer queries (n = 54, 64.3%), most had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.
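
To make the examination-style design concrete, the minimal sketch below poses each multiple-choice question to a model several times (a simple form of repeated measurement) and aggregates accuracy and answer consistency. The ask_model callable is a hypothetical stand-in for whichever LLM interface a study uses; it does not correspond to any specific study reviewed here.

    from collections import Counter

    def evaluate_exam(questions, ask_model, n_repeats=3):
        """Score a model on multiple-choice questions, repeating each query
        to gauge answer consistency."""
        results = []
        for q in questions:
            answers = [ask_model(q["prompt"]) for _ in range(n_repeats)]
            majority, votes = Counter(answers).most_common(1)[0]
            results.append({"correct": majority == q["answer"],
                            "consistency": votes / n_repeats})
        accuracy = sum(r["correct"] for r in results) / len(results)
        consistency = sum(r["consistency"] for r in results) / len(results)
        return accuracy, consistency

    # Toy usage with a one-question bank and a dummy model interface.
    bank = [{"prompt": "2 + 2 = ? (A) 3 (B) 4", "answer": "B"}]
    print(evaluate_exam(bank, ask_model=lambda prompt: "B"))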

Conclusions

More research is required on the application of LLMs in healthcare. Whereas previous studies have focused on evaluating performance, future studies are likely to focus on improving it, and a well-structured methodology is required for such studies to be conducted systematically.
Metadata
Publisher: BioMed Central
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-024-02709-7