Abstract
Despite the frequent use of state-of-the-art psychometric models in medical education, a growing body of literature questions their usefulness for the assessment of medical competence. Essentially, several authors have raised doubts about the appropriateness of psychometric models as a guiding framework for securing and refining current approaches to assessing medical competence. One phenomenon, known as case specificity, is central to this controversy. Broadly speaking, case specificity is the finding that performance is unstable across clinical cases, tasks, or problems. Because stability of performance is, generally speaking, a central assumption of psychometric models, case specificity may limit their applicability, and it has arguably supplied critiques of psychometrics with a substantial body of potential empirical evidence. This article aims to explain the fundamental ideas employed in psychometric theory and how they might be problematic in the context of assessing medical competence. We further aim to show why and how some critiques hold not for the field of psychometrics as a whole, but only for specific psychometric approaches. Accordingly, we highlight approaches that, from our perspective, offer promising possibilities when applied to the assessment of medical competence. In conclusion, we advocate a more differentiated view of psychometric models and their use.
Notes
R scripts for this simulation are available upon request from the corresponding author.
Schauber, S.K., Hecht, M. & Nouns, Z.M. Why assessment in medical education needs a solid foundation in modern test theory. Adv in Health Sci Educ 23, 217–232 (2018). https://doi.org/10.1007/s10459-017-9771-4