
Open Access 01-12-2024 | Research article

Assessing ChatGPT’s orthopedic in-service training exam performance and applicability in the field

Authors: Neil Jain, Caleb Gottlich, John Fisher, Dominic Campano, Travis Winston

Published in: Journal of Orthopaedic Surgery and Research | Issue 1/2024


Abstract

Background

ChatGPT has gained widespread attention for its ability to understand and provide human-like responses to inputs. However, few works have focused on its use in Orthopedics. This study assessed ChatGPT’s performance on the Orthopedic In-Service Training Exam (OITE) and evaluated its decision-making process to determine whether adoption as a resource in the field is practical.

Methods

ChatGPT’s performance on three OITE exams was evaluated by inputting multiple-choice questions. Questions were classified by their orthopedic subject area. Yearly OITE technical reports were used to compare scores against those of resident physicians. ChatGPT’s rationales were compared with testmaker explanations and classified into six groups denoting answer accuracy and logic consistency. Variables were analyzed using contingency table construction and Chi-squared analyses.
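For readers less familiar with this type of analysis, the sketch below illustrates its general form: a topic-by-outcome contingency table, a Chi-squared test of association, and adjusted residuals (values beyond ±1.96 flag cells contributing disproportionately at the 0.05 level). This is a minimal illustration only; the counts, topic rows, and outcome columns are hypothetical placeholders, not the study's data.

# Minimal sketch of a contingency-table Chi-squared analysis with adjusted
# residuals (illustrative counts only; not the authors' code or data).
import numpy as np
from scipy.stats import chi2_contingency

# Rows: hypothetical topics (e.g., Basic Science, Sports, other pooled)
# Columns: hypothetical answer/logic categories
observed = np.array([
    [12,  3,  8,  2],
    [15,  4,  6,  1],
    [20, 10, 18,  5],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Adjusted (standardized) residuals: (O - E) / sqrt(E * (1 - row_prop) * (1 - col_prop));
# |residual| > 1.96 corresponds to a two-tailed 0.05 threshold for a single cell.
n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n
adj_resid = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))
print(np.round(adj_resid, 2))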

Results

Of 635 questions, 360 (56.7%) were usable as inputs. ChatGPT-3.5 scored 55.8%, 47.7%, and 54.0% on the 2020, 2021, and 2022 exams, respectively. Of 190 correct outputs, 179 (94.2%) provided consistent logic. Of 170 incorrect outputs, 133 (78.2%) provided inconsistent logic. Significant associations were found between tested topic and correct answer (p = 0.011) and between tested topic and type of logic used (p < 0.001). Basic Science and Sports had adjusted residuals greater than 1.96, as did the cells pairing Basic Science with correct, no logic; Basic Science with incorrect, inconsistent logic; Sports with correct, no logic; and Sports with incorrect, inconsistent logic.

Conclusions

Based on annual OITE technical reports for resident physicians, ChatGPT-3.5 performed at approximately the PGY-1 level. When answering correctly, it displayed reasoning congruent with that of the testmakers. When answering incorrectly, it still exhibited some understanding of the correct answer. It performed comparatively better in Basic Science and Sports, likely due to its ability to output rote facts. These findings suggest that, in its current form, ChatGPT lacks the fundamental capabilities to serve as a comprehensive tool in Orthopedic Surgery.
Level of Evidence: II.
Metadata
Title
Assessing ChatGPT’s orthopedic in-service training exam performance and applicability in the field
Authors
Neil Jain
Caleb Gottlich
John Fisher
Dominic Campano
Travis Winston
Publication date
01-12-2024
Publisher
BioMed Central
Published in
Journal of Orthopaedic Surgery and Research / Issue 1/2024
Electronic ISSN: 1749-799X
DOI
https://doi.org/10.1186/s13018-023-04467-0
