
01-12-2024 | Artificial Intelligence | Original Article

Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings

Authors: Chung-You Tsai, Shang-Ju Hsieh, Hung-Hsiang Huang, Juinn-Horng Deng, Yi-You Huang, Pai-Yu Cheng

Published in: World Journal of Urology | Issue 1/2024


Abstract

Purpose

To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics to minimize score penalties from incorrect responses across 12 urology domains.

Methods

450 multiple-choice questions from the TUBE (2020–2022) were presented to the two models. Three urologists assessed the correctness and consistency of each response. Accuracy was defined as the proportion of correct answers; consistency as the proportion of responses whose explanations were logical and coherent. A penalty-reduction experiment with prompt variations was also conducted. Univariate logistic regression was applied for subgroup comparisons.
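As an illustration only, not the authors' actual analysis pipeline, the minimal Python sketch below shows how the described metrics and a univariate logistic regression comparing the two models could be computed; the file name and column layout (model, correct, consistent) are assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical grading table: one row per (question, model) response,
# with reviewer-assigned binary flags for correctness and consistency.
df = pd.read_csv("tube_responses.csv")  # columns: model, correct, consistent (assumed)

# Accuracy and explanation-consistency rates per model
print(df.groupby("model")[["correct", "consistent"]].mean())

# Univariate logistic regression: does the model version predict a correct answer?
is_gpt4 = (df["model"] == "gpt-4").astype(int).rename("is_gpt4")
fit = sm.Logit(df["correct"], sm.add_constant(is_gpt4)).fit()

# Odds ratio and 95% CI for ChatGPT-4 vs ChatGPT-3.5
print("OR =", np.exp(fit.params["is_gpt4"]))
print("95% CI =", np.exp(fit.conf_int().loc["is_gpt4"]).values)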

Results

ChatGPT-4 showed clear strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%; OR = 2.68, 95% CI [2.05–3.52]). It could have passed the TUBE written exams had scoring been based solely on accuracy, but it failed on the final score because of penalties for incorrect answers. ChatGPT-4 also displayed a declining accuracy trend over time. Accuracy varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% across all domains indicates reliable delivery of coherent and logical explanations. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency towards overconfidence, which may hinder medical decision-making.
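As a sanity check on the reported statistics, assuming both models answered all 450 questions and rounding the correct-answer counts from the stated accuracies, the odds ratio and its 95% confidence interval can be reproduced as follows:

import math

n = 450
gpt4_correct = round(0.578 * n)    # ~260 correct, 190 incorrect
gpt4_wrong = n - gpt4_correct
gpt35_correct = round(0.338 * n)   # ~152 correct, 298 incorrect
gpt35_wrong = n - gpt35_correct

odds_ratio = (gpt4_correct / gpt4_wrong) / (gpt35_correct / gpt35_wrong)
se_log_or = math.sqrt(1/gpt4_correct + 1/gpt4_wrong + 1/gpt35_correct + 1/gpt35_wrong)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI [{ci_low:.2f}-{ci_high:.2f}]")
# -> OR = 2.68, 95% CI [2.05-3.52], matching the reported values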

Conclusions

ChatGPT-4's high accuracy and consistent explanations on a urology board examination demonstrate its potential for medical information processing. However, its limitations in self-assessment and its overconfidence necessitate caution in its application, especially by inexperienced users. These insights call for continued development of urology-specific AI tools.
Metadata
Title
Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings
Authors
Chung-You Tsai
Shang-Ju Hsieh
Hung-Hsiang Huang
Juinn-Horng Deng
Yi-You Huang
Pai-Yu Cheng
Publication date
01-12-2024
Publisher
Springer Berlin Heidelberg
Published in
World Journal of Urology / Issue 1/2024
Print ISSN: 0724-4983
Electronic ISSN: 1433-8726
DOI
https://doi.org/10.1007/s00345-024-04957-8
