Published in: BMC Medical Informatics and Decision Making 1/2021

Open Access 01-12-2021 | Research article

An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design

Authors: Jenna M. Reps, Peter Rijnbeek, Alana Cuthbert, Patrick B. Ryan, Nicole Pratt, Martijn Schuemie


Abstract

Background

Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up.

Methods

We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: (1) include all patients (including those lost to follow-up), (2) exclude all patients lost to follow-up or (3) only exclude patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance.
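The four strategies can be sketched as different ways of turning the same cohort into training data. This is a minimal illustration, not the authors' implementation: the `Patient` fields, the strategy names, and the definition of "lost to follow-up" as observation ending before the time-at-risk ends are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Patient:
    patient_id: int
    followup_days: int          # days observed after cohort entry (hypothetical field)
    outcome_day: Optional[int]  # day of first recorded outcome, None if never recorded


def has_outcome(p: Patient, tar_days: int) -> bool:
    """Outcome recorded while still observed and within the time-at-risk."""
    return p.outcome_day is not None and p.outcome_day <= min(p.followup_days, tar_days)


def is_lost(p: Patient, tar_days: int) -> bool:
    """Assumed definition: observation ends before the time-at-risk ends."""
    return p.followup_days < tar_days


def binary_training_set(
    patients: List[Patient], tar_days: int, strategy: str
) -> List[Tuple[int, int]]:
    """Return (patient_id, label) rows for one of the three binary-classifier strategies."""
    rows = []
    for p in patients:
        lost = is_lost(p, tar_days)
        label = int(has_outcome(p, tar_days))
        if strategy == "keep_all":                        # strategy 1
            keep = True
        elif strategy == "exclude_all_lost":              # strategy 2
            keep = not lost
        elif strategy == "exclude_lost_without_outcome":  # strategy 3
            keep = not (lost and label == 0)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        if keep:
            rows.append((p.patient_id, label))
    return rows


def survival_training_set(
    patients: List[Patient], tar_days: int
) -> List[Tuple[int, int, int]]:
    """Strategy 4: keep everyone as (patient_id, time, event); lost patients are censored."""
    rows = []
    for p in patients:
        event = int(has_outcome(p, tar_days))
        time = p.outcome_day if event else min(p.followup_days, tar_days)
        rows.append((p.patient_id, time, event))
    return rows


# Toy cohort with a 365-day time-at-risk.
cohort = [
    Patient(1, 365, None),  # complete follow-up, no outcome
    Patient(2, 100, None),  # lost, no outcome observed
    Patient(3, 200, 150),   # outcome observed, then lost
    Patient(4, 365, 300),   # complete follow-up, outcome
]
for name in ("keep_all", "exclude_all_lost", "exclude_lost_without_outcome"):
    print(name, binary_training_set(cohort, 365, name))
print("survival", survival_training_set(cohort, 365))
```

In the toy cohort, strategy 3 drops patient 2 but keeps patient 3, so the retained lost patients all carry a positive label. This is the asymmetry the Results section warns about.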

Results

The partially synthetic data study shows that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of how to address it had negligible impact on model discrimination performance. Our empirical real-world data results showed that the four design choices investigated for dealing with loss to follow-up resulted in comparable performance when the time-at-risk was 1 year but demonstrated differential bias when the time-at-risk was 3 years. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided.

Conclusion

Based on this study, we therefore recommend (1) developing models using data that include patients who are lost to follow-up and (2) evaluating the discrimination and calibration of models twice: once on a test set including patients lost to follow-up and once on a test set excluding them.
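The second recommendation can be sketched as follows. This is an illustrative example only: the row layout `(label, predicted_risk, lost)`, the function names, and the choice of AUC and calibration-in-the-large as the discrimination and calibration summaries are assumptions, not the specific metrics implementation used in the paper.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, float, bool]  # (label, predicted_risk, lost_to_follow_up): assumed layout


def auc(labels: List[int], scores: List[float]) -> float:
    """Discrimination: probability that a random case outranks a random non-case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(labels: List[int], scores: List[float]) -> float:
    """Calibration summary: mean predicted risk minus observed outcome rate."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)


def evaluate(rows: List[Row]) -> Dict[str, float]:
    labels = [y for y, _, _ in rows]
    scores = [s for _, s, _ in rows]
    return {
        "auc": auc(labels, scores),
        "calibration_in_the_large": calibration_in_the_large(labels, scores),
    }


def evaluate_twice(test_rows: List[Row]) -> Dict[str, Dict[str, float]]:
    """Report metrics on the full test set and on the subset with complete follow-up."""
    complete = [r for r in test_rows if not r[2]]
    return {
        "including_lost": evaluate(test_rows),
        "excluding_lost": evaluate(complete),
    }


# Toy test set: predictions from any already-fitted model.
test_rows = [
    (1, 0.9, False),
    (0, 0.2, False),
    (1, 0.6, True),
    (0, 0.7, True),
    (0, 0.1, False),
]
report = evaluate_twice(test_rows)
for name, metrics in report.items():
    print(name, metrics)
```

A large gap between the two reports suggests that performance depends on how patients lost to follow-up are handled, which is worth investigating before the model is used.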
Metadata
Title
An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design
Authors
Jenna M. Reps
Peter Rijnbeek
Alana Cuthbert
Patrick B. Ryan
Nicole Pratt
Martijn Schuemie
Publication date
01-12-2021
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2021
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-021-01408-x
