Published in: BMC Medical Research Methodology 1/2010

Open Access 01-12-2010 | Research article

Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

Authors: Andrea Marshall, Douglas G Altman, Patrick Royston, Roger L Holder

Abstract

Background

There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods

Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five proportions of incomplete cases, from 5% to 75%, were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
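The difference between the three missingness mechanisms can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the covariate names `x1` and `x2`, the logistic form of the missingness model, the `slope` parameter and the bisection used to hit the target rate are all assumptions made for illustration, and only one covariate is made incomplete, whereas the study imposed multivariate missingness on four.

```python
import math
import random


def _logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=2000, seed=1):
    """Toy data: x1 is skewed (log-normal), x2 is normal and always observed."""
    rng = random.Random(seed)
    return [
        {"x1": rng.lognormvariate(0.0, 0.5), "x2": rng.gauss(0.0, 1.0)}
        for _ in range(n)
    ]


def impose_missing(rows, mechanism, target="x1", driver="x2",
                   rate=0.25, slope=1.0, seed=0):
    """Return a copy of rows with `target` set to None for roughly `rate` of rows.

    MCAR: the missingness probability is the same for every row.
    MAR:  the probability depends on the fully observed covariate `driver`.
    MNAR: the probability depends on the value of `target` itself.
    """
    if mechanism == "MCAR":
        scores = [0.0 for _ in rows]
    elif mechanism == "MAR":
        scores = [r[driver] for r in rows]
    elif mechanism == "MNAR":
        scores = [r[target] for r in rows]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Bisection on the intercept so the mean missingness probability is `rate`.
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_p = sum(_logistic(mid + slope * s) for s in scores) / len(scores)
        if mean_p < rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    rng = random.Random(seed)
    out = []
    for row, s in zip(rows, scores):
        new_row = dict(row)
        if rng.random() < _logistic(intercept + slope * s):
            new_row[target] = None
        out.append(new_row)
    return out


if __name__ == "__main__":
    data = simulate()
    for mech in ("MCAR", "MAR", "MNAR"):
        incomplete = impose_missing(data, mech)
        observed = [r["x1"] for r in incomplete if r["x1"] is not None]
        frac = 1.0 - len(observed) / len(incomplete)
        print(mech, round(frac, 3), round(sum(observed) / len(observed), 3))
```

Under MNAR in this sketch, larger values of `x1` are more likely to be removed, so the mean of the observed values falls below the mean of the complete data. That is the kind of distortion an imputation model cannot recover from the observed data alone.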

Results

A CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates, and better model performance, for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
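One way to see why SI understates variability is to look at how MI estimates are usually combined. Under Rubin's rules, the standard combination method for multiply imputed data, the total variance adds a between-imputation component to the average within-imputation variance, and a single imputed dataset can supply only the latter. The sketch below is illustrative only: the abstract does not state which combination rules were used, and the function name `pool_rubin` and the numbers in the usage example are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, within variance, between variance, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = sum(estimates) / m                                # pooled estimate
    within = sum(variances) / m                               # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between                # Rubin's total variance
    return q_bar, within, between, total


if __name__ == "__main__":
    # Hypothetical log hazard ratios and squared standard errors from m = 3 imputations.
    q_bar, within, between, total = pool_rubin([0.50, 0.60, 0.40], [0.04, 0.05, 0.03])
    print(q_bar, math.sqrt(within), math.sqrt(total))
```

In this made-up example the pooled standard error, based on the total variance, is larger than the one based on the within-imputation variance alone. An analysis of a single imputed dataset would report only the smaller figure.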

Conclusion

The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
Metadata
Title
Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
Authors
Andrea Marshall
Douglas G Altman
Patrick Royston
Roger L Holder
Publication date
01-12-2010
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2010
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-10-7
