Published in: BMC Medical Informatics and Decision Making 1/2020

Open Access 01-12-2020 | Type 2 Diabetes | Research article

Bayesian variable selection for high dimensional predictors and self-reported outcomes

Authors: Xiangdong Gu, Mahlet G Tadesse, Andrea S Foulkes, Yunsheng Ma, Raji Balasubramanian


Abstract

Background

The onset of silent diseases such as type 2 diabetes is often registered through self-report in large prospective cohorts. Self-reported outcomes are cost-effective; however, they are subject to error. Silent events may also be diagnosed using imperfect laboratory-based diagnostic tests. In this paper, we describe an approach for variable selection in high-dimensional datasets for settings in which the outcome is observed with error.

Methods

We adapt the spike-and-slab Bayesian variable selection approach to the context of error-prone, self-reported outcomes. The performance of the proposed approach is studied through simulation studies. An illustrative application is included using data from the Women’s Health Initiative SNP Health Association Resource, which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women.
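To make the two ingredients named above concrete, the following is a minimal sketch and not the authors' implementation (which is in R and is not reproduced here). It assumes a simplified binary logistic outcome, a point-mass spike at zero with a normal slab, and a self-report with known sensitivity and specificity. All function names and parameters (`draw_spike_and_slab`, `inclusion_prob`, `slab_sd`, `self_report_prob`) are hypothetical and chosen only for illustration.

```python
import math
import random


def draw_spike_and_slab(n_predictors, inclusion_prob, slab_sd, rng):
    """Draw inclusion indicators and coefficients from a spike-and-slab prior.

    Assumed form (illustrative): gamma_j ~ Bernoulli(inclusion_prob);
    beta_j = 0 when gamma_j = 0 (spike), and
    beta_j ~ Normal(0, slab_sd^2) when gamma_j = 1 (slab).
    """
    gamma = [1 if rng.random() < inclusion_prob else 0
             for _ in range(n_predictors)]
    beta = [rng.gauss(0.0, slab_sd) if g else 0.0 for g in gamma]
    return gamma, beta


def true_event_prob(x, beta, intercept):
    """Probability of the true (unobserved) event under a logistic model."""
    eta = intercept + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def self_report_prob(p_true, sensitivity, specificity):
    """Probability of a positive self-report given the true event probability.

    A positive report arises either from a true event that is reported
    (sensitivity) or from a non-event that is falsely reported
    (1 - specificity).
    """
    return sensitivity * p_true + (1.0 - specificity) * (1.0 - p_true)
```

A likelihood built on `self_report_prob` rather than on `true_event_prob` directly is what distinguishes an error-aware analysis from a naive one in this simplified setting; with sensitivity and specificity both equal to 1, the two coincide.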

Results

Simulation studies show improved sensitivity of our proposed method when compared to a naive approach that ignores error in the self-reported outcomes. Application of the proposed method resulted in the discovery of several single nucleotide polymorphisms (SNPs) that are associated with risk of type 2 diabetes in a dataset of 9,873 African American and Hispanic participants in the Women’s Health Initiative. There was little overlap among the top-ranking SNPs associated with type 2 diabetes risk between the racial groups, adding support to previous observations in the literature that disease-associated genetic loci are often not generalizable across racial/ethnic populations. The adapted Bayesian variable selection algorithm is implemented in R. The source code for the simulations is available in the Supplement.

Conclusions

Variable selection accuracy is reduced when the outcome is ascertained by error-prone self-reports. For this setting, our proposed algorithm has improved variable selection performance when compared to approaches that neglect to account for the error-prone nature of self-reports.
Metadata
Title
Bayesian variable selection for high dimensional predictors and self-reported outcomes
Authors
Xiangdong Gu
Mahlet G Tadesse
Andrea S Foulkes
Yunsheng Ma
Raji Balasubramanian
Publication date
01-12-2020
Publisher
BioMed Central
Keyword
Type 2 Diabetes
Published in
BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-020-01223-w
