Published in: BMC Medical Research Methodology 1/2007

Open Access 01-12-2007 | Research article

Variable selection under multiple imputation using the bootstrap in a prognostic study

Authors: Martijn W Heymans, Stef van Buuren, Dirk L Knol, Willem van Mechelen, Henrica CW de Vet


Abstract

Background

Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology that combines MI with bootstrapping techniques for studying prognostic variable selection.

Method

In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the percentage of missing data ranged from 0% to 48.1%. We used four methods to investigate the respective influence of sampling and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of the prognostic models developed by the four methods were assessed at different inclusion levels.
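The inclusion frequency described above can be illustrated with a minimal sketch. It assumes that each fitted model (one per imputed data set, bootstrap sample, or combination of the two) has already been reduced to the set of variables it retained. The function names, the variable names and the example selections are hypothetical and are not taken from the study.

```python
from collections import Counter


def inclusion_frequencies(selected_sets, variables):
    """Proportion of fitted models in which each candidate variable appears."""
    selected_sets = [set(s) for s in selected_sets]
    if not selected_sets:
        raise ValueError("need at least one fitted model")
    counts = Counter(v for s in selected_sets for v in s)
    return {v: counts[v] / len(selected_sets) for v in variables}


def select_at_level(freqs, level):
    """Variables whose inclusion frequency is at least `level` (0.0 keeps all)."""
    return [v for v, f in freqs.items() if f >= level]


# Hypothetical selections: one set per (imputed data set, bootstrap sample) pair.
runs = [
    {"pain", "age"},
    {"pain"},
    {"pain", "age", "fear"},
    {"pain", "fear"},
]
candidates = ["pain", "age", "fear", "sex"]

freqs = inclusion_frequencies(runs, candidates)
full_model = select_at_level(freqs, 0.0)   # inclusion level 0%: all candidates
strict_model = select_at_level(freqs, 0.9)  # inclusion level 90%
```

In this toy example the 0% level reproduces the full model, while the 90% level keeps only the variable that was retained in every run.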

Results

We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, inclusion levels ranging from 0% (full model) to 90% yielded bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86.
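For readers unfamiliar with the c-index, the sketch below shows the usual definition for a binary outcome: the proportion of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as one half. This is an illustration only. It does not include the bootstrap correction mentioned above, and the function name and example values are hypothetical.

```python
def c_index(y_true, y_score):
    """Concordance index for a binary outcome (ties count as 0.5)."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    non_cases = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for non_case_score in non_cases:
            if case_score > non_case_score:
                concordant += 1.0
            elif case_score == non_case_score:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


# Hypothetical outcomes (1 = chronic, 0 = not chronic) and predicted risks.
outcomes = [1, 1, 0, 0]
risks = [0.9, 0.4, 0.5, 0.2]
print(c_index(outcomes, risks))  # 3 of 4 pairs are concordant -> 0.75
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, which places the reported values of 0.70 to 0.71 between the two.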

Conclusion

We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values.
Metadata
Title
Variable selection under multiple imputation using the bootstrap in a prognostic study
Authors
Martijn W Heymans
Stef van Buuren
Dirk L Knol
Willem van Mechelen
Henrica CW de Vet
Publication date
01-12-2007
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2007
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-7-33
