Published in: BMC Medical Informatics and Decision Making 1/2021

Open Access 01-12-2021 | Obesity | Research

An ensemble-based feature selection framework to select risk factors of childhood obesity for policy decision making

Authors: Xi Shi, Gorana Nikolic, Gorka Epelde, Mónica Arrúe, Joseba Bidaurrazaga Van-Dierdonck, Roberto Bilbao, Bart De Moor

Abstract

Background

The rising prevalence of childhood obesity makes it essential to study its risk factors in a population-representative sample that covers a broad range of health topics, so that preventive policies and interventions can be better targeted. We aimed to develop an ensemble feature selection framework for large-scale data that identifies risk factors of childhood obesity with good interpretability and clinical relevance.

Methods

We analyzed data collected from 426,813 children under 18 years of age between 2000 and 2019. Overweight was defined as a BMI above the 90th percentile for children of the same age and gender. We proposed an ensemble feature selection framework, the Bagging-based Feature Selection framework integrating MapReduce (BFSMR), to identify risk factors. The framework comprises five models drawn from the filter, wrapper, and embedded families of feature selection methods: a mutual-information filter, SVM-RFE, Lasso, Ridge, and Random Forest. Each model identified 10 variables based on variable importance. Based on accuracy, F-score, and model characteristics, the models were grouped into three levels with different weights: Lasso/Ridge, Filter/SVM-RFE, and Random Forest. A voting strategy then aggregated the selected features, taking both feature weights and model weights into account. We compared this voting strategy with two others for selecting top-ranked features across six dimensions of interpretability.
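
The weighted voting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the numeric model weights, which level receives the highest weight, or how feature weights are computed. The values in MODEL_WEIGHTS, the rank-based feature weight, and the function name weighted_vote are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical model weights. The abstract names three levels
# (Lasso/Ridge, Filter/SVM-RFE, Random Forest) but gives neither the
# numeric values nor their ordering, so these numbers are assumed.
MODEL_WEIGHTS = {
    "lasso": 3.0,
    "ridge": 3.0,
    "filter_mi": 2.0,
    "svm_rfe": 2.0,
    "random_forest": 1.0,
}


def weighted_vote(rankings, model_weights, top_k=10):
    """Aggregate per-model ranked feature lists into a single top_k list.

    rankings maps a model name to its selected features, most important
    first. A feature at 0-based rank r in a list of length n receives the
    feature weight (n - r) / n; this rank-based weight is an assumption.
    Each vote is the product of the model weight and the feature weight.
    """
    scores = defaultdict(float)
    for model, features in rankings.items():
        n = len(features)
        for rank, feature in enumerate(features):
            scores[feature] += model_weights[model] * (n - rank) / n
    # Sort by descending score; break ties alphabetically so the result
    # is deterministic.
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return ordered[:top_k]


if __name__ == "__main__":
    # Toy rankings with three features per model (the paper uses 10).
    toy_rankings = {
        "lasso": ["age", "sex", "exercise"],
        "ridge": ["age", "exercise", "sex"],
        "filter_mi": ["sex", "age", "birth_year"],
        "svm_rfe": ["birth_year", "age", "sex"],
        "random_forest": ["exercise", "birth_year", "age"],
    }
    print(weighted_vote(toy_rankings, MODEL_WEIGHTS, top_k=3))
```

Multiplying a model-level weight by a rank-based feature weight lets a feature ranked highly by several trusted models outscore one that appears in only a single list, which is the general idea behind combining feature weights with model weights.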

Results

Our method performed best at selecting features with good interpretability and clinical relevance. The top 10 features selected by BFSMR are age, sex, birth year, breastfeeding type, smoking habit and diet-related knowledge of both children and mothers, exercise, and mother’s systolic blood pressure.

Conclusion

Our framework provides a solution for identifying a diverse and interpretable feature set without model bias from large-scale data, which can help identify risk factors of childhood obesity and potentially some other diseases for future interventions or policies.
Metadata
Title
An ensemble-based feature selection framework to select risk factors of childhood obesity for policy decision making
Authors
Xi Shi
Gorana Nikolic
Gorka Epelde
Mónica Arrúe
Joseba Bidaurrazaga Van-Dierdonck
Roberto Bilbao
Bart De Moor
Publication date
01-12-2021
Publisher
BioMed Central
Keywords
Obesity
Published in
BMC Medical Informatics and Decision Making / Issue 1/2021
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-021-01580-0
