
Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

  • 14-10-2024
  • Original Research

Abstract

Background

The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments; however, providing individualized feedback on these requires significant faculty time. As large language models (LLMs) can score work and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses.

Objective

To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.

Design and Participants

This was a cross-sectional study of pre-clinical students’ critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022–2023 academic year.

Intervention

An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.

Main Measures

Differences in scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency, measured as inter-rater reliability (IRR), was calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were studied.
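The two consistency measures above can be sketched as follows. This is an illustrative example only, not the study's code: the item scores are hypothetical binary ratings (1 = criterion met, 0 = not met), percent exact agreement is computed directly, and AUCPR is computed with scikit-learn's `average_precision_score`, treating the faculty scores as the reference standard.

```python
# Illustrative sketch with hypothetical data (not the study's actual scores).
from sklearn.metrics import average_precision_score

# Hypothetical binary item scores from the two raters:
# 1 = criterion met, 0 = criterion not met
faculty = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
llm     = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

# Percent exact agreement: fraction of items where the ratings match
agreement = sum(f == g for f, g in zip(faculty, llm)) / len(faculty)

# AUCPR, with faculty scores as the reference standard
aucpr = average_precision_score(faculty, llm)

print(f"Exact agreement: {agreement:.0%}")
print(f"AUCPR: {aucpr:.2f}")
```

With these hypothetical scores, the two raters disagree on 2 of 10 items, giving 80% exact agreement; the study reported 67% agreement across real assignments.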

Key Results

In this cross-sectional study, 111 pre-clinical students’ faculty-graded assignments were compared with ChatGPT’s scoring, and scoring of individual items was comparable. Overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61–0.76). Internal-consistency reliability of ChatGPT scoring was 0.64, and its use resulted in a fivefold reduction in faculty time, with potential savings of 150 faculty hours.

Conclusions

This study of the psychometric characteristics of ChatGPT scoring demonstrates a potential role for LLMs in assisting faculty with assessing and providing feedback on formative assignments.
Title
Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education
Authors
Radhika Sreedhar, MD, MS
Linda Chang, PharmD, MPH
Ananya Gangopadhyaya, MD
Peggy Woziwodzki Shiels, MD
Julie Loza, MD
Euna Chi, MD
Elizabeth Gabel, MD
Yoon Soo Park, PhD
Publication date
14-10-2024
Publisher
Springer International Publishing
Published in
Journal of General Internal Medicine / Issue 1/2025
Print ISSN: 0884-8734
Electronic ISSN: 1525-1497
DOI
https://doi.org/10.1007/s11606-024-09050-9