Published in: BMC Medical Research Methodology 1/2015

Open Access 01-12-2015 | Research article

Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia

Authors: Olga Morozova, Olga Levina, Anneli Uusküla, Robert Heimer

Abstract

Background

Automatic stepwise subset selection methods in linear regression often perform poorly, both in variable selection and in the estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet stepwise algorithms remain the dominant method in medical and epidemiological research.

Methods

We investigated the performance of stepwise methods (backward elimination and forward selection using AIC, BIC, and the likelihood ratio test (LRT) at p = 0.05) and of alternative subset selection methods in linear regression, including Bayesian model averaging (BMA) and penalized regression (lasso, adaptive lasso, and adaptive elastic net), in a dataset from a cross-sectional study of drug users in St. Petersburg, Russia, conducted in 2012–2013. The dependent variable measured health-related quality of life, and the independent correlates comprised 44 variables measuring demographic, behavioral, and structural factors.
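
As an illustration of the stepwise family described above, the following is a minimal sketch of forward selection driven by an information criterion. It is not the authors' implementation: the function names `information_criterion` and `forward_selection`, the `penalty` parameter, and the simulated data are all assumptions made for this example. Setting `penalty=2` gives AIC and `penalty=log(n)` gives BIC, both up to an additive constant for a Gaussian linear model.

```python
import numpy as np

def information_criterion(X, y, columns, penalty):
    """n*log(RSS/n) + penalty * (number of coefficients) for an OLS fit.

    Hypothetical helper: penalty=2 corresponds to AIC, penalty=log(n) to BIC
    (constants that do not depend on the model are dropped).
    """
    n = len(y)
    # Design matrix: intercept plus the currently chosen columns of X.
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + penalty * design.shape[1]

def forward_selection(X, y, penalty=2.0):
    """Greedy forward selection: add the variable that lowers the criterion most."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = information_criterion(X, y, selected, penalty)
    while remaining:
        # Score every candidate model that adds exactly one more variable.
        scores = [(information_criterion(X, y, selected + [j], penalty), j)
                  for j in remaining]
        score, j = min(scores)
        if score >= best:  # no candidate improves the criterion: stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Simulated data (an assumption for illustration): only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)

aic_model = forward_selection(X, y, penalty=2.0)
bic_model = forward_selection(X, y, penalty=np.log(len(y)))
print("AIC selects columns:", aic_model)
print("BIC selects columns:", bic_model)
```

Because BIC penalizes each extra coefficient more heavily than AIC once n exceeds about 7, it tends to return smaller models, which is one reason different stepwise variants can disagree on the same data.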

Results

In our case study, all methods returned models of different size and composition, ranging from 41 to 11 variables. The percentage of significant variables among those selected in the final model varied from 100 % to 27 %. Model selection with stepwise methods was highly unstable, and most of the selected variables were significant, meaning that the 95 % confidence interval for the coefficient did not include zero (all of them were significant for backward elimination with BIC, forward selection with BIC, and backward elimination with LRT). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors compared to stepwise methods. By incorporating model uncertainty into subset selection and into the estimation of coefficients and their standard deviations, BMA returned a parsimonious model with the most conservative results in terms of covariate significance.
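
The abstract does not say how stability was assessed. One common way to probe it, shown here purely as an assumption, is to rerun a selection procedure on bootstrap resamples and record how often each variable is chosen. In the sketch below, `inclusion_frequencies` and the toy selector `significant_columns` are hypothetical names, and the data are simulated; any selection routine that returns a list of column indices could be plugged in instead.

```python
import numpy as np

def significant_columns(X, y, threshold=1.96):
    """Toy selector (an assumption): columns whose full-model OLS |t| > threshold."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - p - 1)           # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t = beta / np.sqrt(np.diag(cov))
    # Skip index 0 of t, which belongs to the intercept.
    return [j for j in range(p) if abs(t[j + 1]) > threshold]

def inclusion_frequencies(X, y, selector, n_boot=200, seed=0):
    """Share of bootstrap resamples in which each column is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        for j in selector(X[idx], y[idx]):
            counts[j] += 1
    return counts / n_boot

# Simulated data (an assumption for illustration): only columns 0 and 3 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)

freq = inclusion_frequencies(X, y, significant_columns, n_boot=100)
print(np.round(freq, 2))
```

Variables with inclusion frequencies near 1 are selected consistently, while frequencies well below 1 indicate that a variable's presence in the final model depends on the particular sample.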

Conclusions

BMA and adaptive elastic net performed best in our analysis. Based on our results and previous theoretical studies, stepwise methods may be outperformed by alternative methods in medical and epidemiological research in cases such as ours. In situations of high uncertainty, it is beneficial to apply several methodologically sound subset selection methods and to explore where their outputs do and do not agree. We recommend that researchers, at a minimum, explore model uncertainty and stability as part of their analyses and report these details in epidemiological papers.
Metadata
Title
Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia
Authors
Olga Morozova
Olga Levina
Anneli Uusküla
Robert Heimer
Publication date
01-12-2015
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2015
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-015-0066-2