Published in: BMC Medical Research Methodology 1/2020

Open Access 01-12-2020 | Prostate Cancer | Research article

Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Authors: Elizabeth Handorf, Yinuo Yin, Michael Slifker, Shannon Lynch


Abstract

Background

Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.

Methods

We compared several popular machine learning methods, including penalized regressions (e.g., lasso and elastic net) and tree ensemble methods. Via simulation, we assessed the methods’ ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We then applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
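As an illustration of how a simulation of this kind can be scored, the sketch below counts true and false positives for a selected variable set when the truly associated variables are known. The function name `selection_metrics`, the two example selectors, and the use of an F1-style score as the summary are assumptions made for illustration; the abstract does not state which combined measure was used, and the numbers are not results from the study.

```python
def selection_metrics(selected, truth):
    """Score selected variable indices against the known truly associated set."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true variables that were selected
    fp = len(selected - truth)   # noise variables that were selected
    fn = len(truth - selected)   # true variables that were missed
    precision = tp / (tp + fp) if selected else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    # F1-style summary (an assumed choice of combined measure).
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Layout taken from the abstract: 1000 candidate variables, 10 truly associated.
# Indices 0..9 stand in for the true variables (hypothetical labeling).
truth = set(range(10))

# Two hypothetical selectors, invented for illustration only:
# one finds most true variables but admits many false positives,
# the other finds fewer true variables but admits almost none.
liberal = set(range(9)) | set(range(100, 130))
conservative = set(range(5)) | {500}

liberal_scores = selection_metrics(liberal, truth)
conservative_scores = selection_metrics(conservative, truth)
print(liberal_scores)
print(conservative_scores)
```

A summary of this kind makes the trade-off explicit: a selector that recovers nearly all true variables can still score poorly if it also admits many noise variables.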

Results

In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman’s correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
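The grouping step that precedes a sparse group lasso can be sketched as follows: compute Spearman's correlation between variables, convert it to a distance, and merge variables agglomeratively so that highly correlated variables fall into the same group. This is a minimal sketch under stated assumptions, not the authors' implementation: the distance 1 - |rho|, the average linkage, the cut-off of 0.3, and the toy variable names and values are all hypothetical, and the sparse group lasso fit itself is not shown.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cluster_variables(columns, max_distance):
    """Average-linkage agglomerative clustering on distance 1 - |rho|."""
    names = list(columns)
    dist = {(a, b): 1 - abs(spearman(columns[a], columns[b]))
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break  # remaining clusters are too dissimilar to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical census-style variables for six areas (invented values).
columns = {
    "median_income": [30, 45, 52, 61, 75, 88],
    "pct_poverty": [28, 22, 18, 15, 9, 5],
    "pct_college": [10, 18, 25, 24, 40, 55],
    "median_age": [41, 33, 38, 45, 36, 39],
}
groups = cluster_variables(columns, max_distance=0.3)
print(groups)
```

In this toy data the three socioeconomic variables merge into one group and the age variable stays on its own. Groups formed this way could then be passed as the group structure to a sparse group lasso, which selects both whole groups and individual variables within them.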

Conclusions

This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.
Metadata
Title
Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches
Authors
Elizabeth Handorf
Yinuo Yin
Michael Slifker
Shannon Lynch
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2020
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-020-01183-9
