Abstract
Despite the frequent use of state-of-the-art psychometric models in medical education, a growing body of literature questions their usefulness for the assessment of medical competence. Essentially, several authors have raised doubts about the appropriateness of psychometric models as a guiding framework for securing and refining current approaches to assessing medical competence. One phenomenon, known as case specificity, is central to this controversy. Broadly speaking, case specificity is the finding that performance is unstable across clinical cases, tasks, or problems. Because stability of performance is, generally speaking, a central assumption of psychometric models, case specificity may limit their applicability, and it has arguably supplied critiques of psychometrics with a substantial body of potential empirical evidence. This article aims to explain the fundamental ideas employed in psychometric theory and how they might be problematic in the context of assessing medical competence. We further aim to show why and how some critiques hold not for the field of psychometrics as a whole, but only for specific psychometric approaches. Accordingly, we highlight approaches that, from our perspective, offer promising possibilities when applied to the assessment of medical competence. In conclusion, we advocate a more differentiated view of psychometric models and their use.
Notes
R scripts for this simulation are available upon request from the corresponding author.
Schauber, S.K., Hecht, M. & Nouns, Z.M. Why assessment in medical education needs a solid foundation in modern test theory. Adv in Health Sci Educ 23, 217–232 (2018). https://doi.org/10.1007/s10459-017-9771-4