Published in: Insights into Imaging 1/2023

Open Access 01-12-2023 | Original Article

Natural language processing for automatic evaluation of free-text answers — a feasibility study based on the European Diploma in Radiology examination

Authors: Fabian Stoehr, Benedikt Kämpgen, Lukas Müller, Laura Oleaga Zufiría, Vanesa Junquero, Cristina Merino, Peter Mildenberger, Roman Kloeckner


Abstract

Background

Written medical examinations consist of multiple-choice questions and/or free-text answers. The latter require manual evaluation and rating, which is time-consuming and potentially error-prone. We tested whether natural language processing (NLP) can be used to automatically analyze free-text answers to support the review process.

Methods

The European Board of Radiology of the European Society of Radiology provided representative datasets comprising sample questions, answer keys, participant answers, and reviewer markings from European Diploma in Radiology (EDiR) examinations. Three free-text questions with the highest number of corresponding answers were selected: Questions 1 and 2 were “unstructured” and required a typical free-text answer, whereas Question 3 was “structured” and offered a selection of predefined wordings/phrases for participants to use in their free-text answer. The NLP engine was designed using word lists, rule-based synonyms, and decision tree learning based on the answer keys, and its performance was tested against the gold standard of reviewer markings.
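To illustrate the word-list/synonym idea, a minimal rule-based marking sketch could look like the following. The scoring points and synonym lists here are purely illustrative assumptions, not the actual EDiR answer keys, and the matching logic is far simpler than the engine described above:

```python
import re

# Hypothetical answer key for one question: each scoring point maps to a
# word list of acceptable wordings/synonyms (illustrative values only).
ANSWER_KEY = {
    "diagnosis": ["pneumothorax", "collapsed lung"],
    "laterality": ["right", "right sided"],
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def contains_phrase(answer: str, phrase: str) -> bool:
    """Whole-word/phrase match, so 'right' does not match 'bright'."""
    return f" {phrase} " in f" {answer} "

def mark_answer(answer: str) -> dict[str, bool]:
    """Return one boolean mark per scoring point, as a reviewer would tick."""
    norm = normalize(answer)
    return {
        point: any(contains_phrase(norm, normalize(syn)) for syn in synonyms)
        for point, synonyms in ANSWER_KEY.items()
    }
```

For example, `mark_answer("Right-sided pneumothorax.")` would credit both scoring points, while `mark_answer("Bright lung nodule")` would credit neither, since the whole-word match prevents “right” from matching inside “bright”.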

Results

After implementing the NLP approach in Python, F1 scores were calculated as a measure of NLP performance: 0.26 (unstructured question 1, n = 96), 0.33 (unstructured question 2, n = 327), and 0.5 (more structured question, n = 111). The respective precision/recall values were 0.26/0.27, 0.4/0.32, and 0.62/0.55.
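The reported values compare the engine’s marks against the reviewer markings. As a sketch of how precision, recall, and F1 are derived, assuming marks are represented as per-item booleans (an assumption for illustration, not the study’s actual code):

```python
def precision_recall_f1(predicted: list[bool], gold: list[bool]) -> tuple[float, float, float]:
    """Score NLP marks (predicted) against reviewer markings (gold standard)."""
    tp = sum(p and g for p, g in zip(predicted, gold))        # true positives
    fp = sum(p and not g for p, g in zip(predicted, gold))    # false positives
    fn = sum(g and not p for p, g in zip(predicted, gold))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, which is why it sits between the two reported values for each question.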

Conclusion

This study showed the successful design of an NLP-based approach for the automatic evaluation of free-text answers in the EDiR examination. Thus, as a future field of application, NLP could serve as a decision-support system for reviewers and inform the design of examinations adjusted to the requirements of an automated, NLP-based review process.

Clinical relevance statement

Natural language processing can be successfully used to automatically evaluate free-text answers, performing better with more structured question-answer formats. Furthermore, this study provides a baseline for further work applying, e.g., more elaborate NLP approaches or large language models.

Key points

• Free-text answers require manual evaluation, which is time-consuming and potentially error-prone.
• We developed a simple NLP-based approach — requiring only minimal effort/modeling — to automatically analyze and mark free-text answers.
• Our NLP engine has the potential to support the manual evaluation process.
• NLP performance is better on a more structured question-answer format.

Metadata
Publisher: Springer Vienna
Electronic ISSN: 1869-4101
DOI: https://doi.org/10.1186/s13244-023-01507-5
