Published in: BMC Medical Research Methodology 1/2014

Open Access 01-12-2014 | Research article

Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections

Authors: Yohann Mansiaux, Fabrice Carrat

Abstract

Background

The use of big data is steadily growing in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome.

Methods

We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using four approaches: two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and the Least Absolute Shrinkage and Selection Operator (LASSO), a penalized multivariate logistic regression that achieves a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the True Positive Rate (TPR) and False Positive Rate (FPR) associated with these methods.
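
The idea behind a permutation test of association can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the test statistic, so the sketch assumes a simple absolute difference in means in place of an RF or BRT importance score, and the function names and toy data are hypothetical.

```python
import random

def mean_difference(values, outcome):
    """Absolute difference in covariate means between outcome groups (1 vs 0)."""
    cases = [v for v, y in zip(values, outcome) if y == 1]
    controls = [v for v, y in zip(values, outcome) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_p_value(values, outcome, n_permutations=999, seed=0):
    """Estimate a p-value by shuffling the outcome to break any association."""
    rng = random.Random(seed)
    observed = mean_difference(values, outcome)
    shuffled = list(outcome)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps the number of cases, breaks the link
        if mean_difference(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical toy data: 10 infected and 40 non-infected subjects.
outcome = [1] * 10 + [0] * 40
associated = [1.0] * 10 + [0.0] * 40          # perfectly separates the groups
noise = [(i * 7) % 11 for i in range(50)]     # arbitrary pattern unrelated to group order

p_associated = permutation_p_value(associated, outcome)
p_noise = permutation_p_value(noise, outcome)
print(p_associated, p_noise)
```

With a statistic derived from a fitted model, the same loop applies, but the model would have to be refitted on each shuffled outcome, which is considerably more expensive.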

Results

Depending on the method, between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At the 5% nominal significance level, the TPR was 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. Conversely, the FPR was 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
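
As a rough illustration of how TPR and FPR can be estimated from simulated datasets, the sketch below assumes the standard definitions: the share of truly associated covariates that a method selects, and the share of non-associated covariates that it selects. The abstract does not spell out the exact definitions used, and the covariate names and selections here are hypothetical.

```python
def positive_rates(selected, truly_associated, all_covariates):
    """Return (TPR, FPR) for one simulated dataset."""
    selected = set(selected)
    truth = set(truly_associated)
    null = set(all_covariates) - truth
    tpr = len(selected & truth) / len(truth)
    fpr = len(selected & null) / len(null)
    return tpr, fpr

def average_rates(selections_per_dataset, truly_associated, all_covariates):
    """Average TPR and FPR over the selections made in each simulated dataset."""
    rates = [
        positive_rates(selected, truly_associated, all_covariates)
        for selected in selections_per_dataset
    ]
    n = len(rates)
    return sum(t for t, _ in rates) / n, sum(f for _, f in rates) / n

# Hypothetical toy setting: 10 covariates, 2 of which are truly associated.
covariates = [f"x{i}" for i in range(10)]
truth = {"x0", "x1"}
selections = [{"x0", "x5"}, {"x0", "x1"}]  # covariates selected in two simulated datasets

mean_tpr, mean_fpr = average_rates(selections, truth, covariates)
print(mean_tpr, mean_fpr)  # 0.75 0.0625
```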

Conclusions

Data mining methods and LASSO should be considered valuable methods for detecting independent associations in large epidemiologic datasets.
Metadata
Title
Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections
Authors
Yohann Mansiaux
Fabrice Carrat
Publication date
01-12-2014
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2014
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/1471-2288-14-99