Published in: BMC Medical Research Methodology 1/2014

Open Access 01-12-2014 | Research article

Validation of prediction models based on lasso regression with multiply imputed data

Authors: Jammbe Z Musoro, Aeilko H Zwinderman, Milo A Puhan, Gerben ter Riet, Ronald B Geskus

Abstract

Background

In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data.
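
The bootstrap estimate of optimism mentioned above can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the article: a continuous outcome, R² as the performance measure, a fixed penalty `lam` rather than one tuned by cross-validation, and complete data. The helper names (`lasso_fit`, `r_squared`, `bootstrap_optimism`) are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1.

    Predictors are standardised internally; coefficients are returned
    on the original scale together with the intercept.
    """
    n, p = X.shape
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0
    Xs = (X - mu) / sd
    b0 = y.mean()
    beta = np.zeros(p)
    resid = y - b0
    for _ in range(n_iter):
        for j in range(p):
            resid += Xs[:, j] * beta[j]          # remove predictor j from the fit
            rho = Xs[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam)   # column has mean square 1
            resid -= Xs[:, j] * beta[j]
    coef = beta / sd
    return b0 - mu @ coef, coef

def r_squared(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def bootstrap_optimism(X, y, lam, n_boot=100, seed=0):
    """Return (apparent, optimism, optimism-corrected) R^2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    b0, b = lasso_fit(X, y, lam)
    apparent = r_squared(y, b0 + X @ b)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample subjects with replacement
        c0, c = lasso_fit(X[idx], y[idx], lam)
        boot_perf = r_squared(y[idx], c0 + X[idx] @ c)   # performance in the bootstrap sample
        test_perf = r_squared(y, c0 + X @ c)             # same model on the original data
        diffs.append(boot_perf - test_perf)
    optimism = float(np.mean(diffs))
    return apparent, optimism, apparent - optimism
```

In a fuller version the penalty would be re-tuned inside every bootstrap sample, so that the whole model-building procedure is validated rather than only the final coefficient estimation.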

Method

The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first three approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI.
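
The two orderings described above, resampling already completed data sets versus resampling the incomplete data and then imputing, can be contrasted in a short sketch. It does not reproduce the article's four approaches in detail. The imputation step here is a deliberately crude stand-in (random draws from observed values) for a proper MI method such as chained equations, and all function names are hypothetical.

```python
import numpy as np

def impute_once(X, rng):
    """Fill NaNs column by column with random draws from the observed values.

    Crude placeholder for a real imputation model; for illustration only.
    """
    Xc = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(Xc[:, j])
        if miss.any():
            observed = Xc[~miss, j]
            if observed.size == 0:
                raise ValueError(f"column {j} has no observed values")
            Xc[miss, j] = rng.choice(observed, size=int(miss.sum()))
    return Xc

def multiply_impute(X, m, rng):
    """Return m completed copies of X."""
    return [impute_once(X, rng) for _ in range(m)]

def resample_completed(X, m, rng):
    """Impute first, then bootstrap: the same subjects are drawn from each completed data set."""
    completed = multiply_impute(X, m, rng)
    idx = rng.integers(0, len(X), len(X))
    return [Xc[idx] for Xc in completed]

def resample_then_impute(X, m, rng):
    """Bootstrap the incomplete data, then run MI inside the bootstrap sample."""
    idx = rng.integers(0, len(X), len(X))
    return multiply_impute(X[idx], m, rng)
```

In `resample_completed`, repeated copies of one subject share a single imputed value within each completed data set, and the imputations were produced with knowledge of the full original sample. In `resample_then_impute`, the imputation is repeated from scratch within each bootstrap sample, so it becomes part of the procedure being validated.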

Results

The discriminative model performance of the lasso was optimistic. There was suboptimal calibration due to over-shrinkage. The estimate of optimism was sensitive to the choice of handling imputed data in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, selected individuals differ over the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit slightly too large.

Conclusion

Performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
Metadata
Title
Validation of prediction models based on lasso regression with multiply imputed data
Authors
Jammbe Z Musoro
Aeilko H Zwinderman
Milo A Puhan
Gerben ter Riet
Ronald B Geskus
Publication date
01-12-2014
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2014
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-14-116
