Published in: BMC Medical Research Methodology

Open Access 01-12-2020 | Prostate Cancer | Research article

Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Authors: Elizabeth Handorf, Yinuo Yin, Michael Slifker, Shannon Lynch

Published in: BMC Medical Research Methodology | Issue 1/2020

Abstract

Background

Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.

Methods

We compared several popular machine learning methods, including penalized regressions (e.g., lasso and elastic net) and tree ensemble methods. Via simulation, we assessed each method's ability to identify census variables truly associated with binary and continuous outcomes while minimizing false-positive results (10 true associations among 1000 total variables). We then applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
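The simulation idea can be illustrated with a minimal sketch. This is not the authors' code: it assumes independent Gaussian predictors rather than real census variables, uses much smaller dimensions (5 true associations among 50 variables), a continuous outcome only, and a fixed penalty `lam` instead of a tuned one. The helper names `soft_threshold` and `lasso_cd` are hypothetical; the lasso is fitted with plain coordinate descent in NumPy.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso via coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ beta
    col_ss = (X ** 2).sum(axis=0) / n     # per-column mean square
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes beta[j]
            rho = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


# Simulated data (assumed setup, not the paper's census-based design)
rng = np.random.default_rng(0)
n, p, n_true = 200, 50, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize predictors
true_idx = np.arange(n_true)
beta_true = np.zeros(p)
beta_true[true_idx] = 2.0
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()                          # center the outcome

beta_hat = lasso_cd(X, y, lam=0.5)
selected = set(np.flatnonzero(beta_hat != 0).tolist())
truth = set(true_idx.tolist())
tp = len(selected & truth)                # true variables recovered
fp = len(selected - truth)                # null variables wrongly selected
print(f"true positives: {tp}, false positives: {fp}")
```

In practice the penalty would usually be chosen by cross-validation, and correlated predictors (as in census data) make the selection problem considerably harder than in this independent-predictor sketch.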

Results

In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation followed by sparse group lasso regression performed best overall. Bayesian Additive Regression Trees outperformed the other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
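To make the trade-off between true-positive and false-positive selections concrete, the sketch below scores a selected variable set against the known truth. The abstract does not state which combined measure was used; the F1 score here is an assumption chosen for illustration, and `selection_metrics` is a hypothetical helper name.

```python
def selection_metrics(selected, truth):
    """Score a set of selected variables against the truly associated ones."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # truly associated and selected
    fp = len(selected - truth)   # selected but not truly associated
    fn = len(truth - selected)   # truly associated but missed
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(selected) if selected else 0.0
    # F1 (assumed combined measure): harmonic mean of precision and sensitivity
    f1 = 2 * precision * sensitivity / (precision + sensitivity) if tp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision, "f1": f1}


# Example: 10 true variables; a method picks 8 of them plus 2 null variables
truth = range(10)
selected = list(range(8)) + [100, 101]
metrics = selection_metrics(selected, truth)
print(metrics)
```

A method that selects many variables raises sensitivity at the cost of precision, while a conservative method does the reverse; a combined score of this kind rewards methods that balance the two.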

Conclusions

This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true-positive variables while controlling false-positive discoveries.

Title: Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches
Authors: Elizabeth Handorf, Yinuo Yin, Michael Slifker, Shannon Lynch
Publication date: 01-12-2020
Publisher: BioMed Central
Keywords: Prostate Cancer
Published in: BMC Medical Research Methodology / Issue 1/2020
Electronic ISSN: 1471-2288
DOI: https://doi.org/10.1186/s12874-020-01183-9
