Published in: BMC Medical Research Methodology 1/2024

Open Access 01-12-2024 | Research

Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets

Authors: JiaHang Li, ShuXia Guo, RuLin Ma, Jia He, XiangHui Zhang, DongSheng Rui, YuSong Ding, Yu Li, LeYao Jian, Jing Cheng, Heng Guo

Abstract

Background

Missing data are frequently unavoidable in cohort studies and can adversely affect a study's findings. We assess the effectiveness of eight commonly used statistical and machine learning (ML) imputation methods for handling missing data in predictive modelling of cohort study datasets. The evaluation is based on real data and predictive models for cardiovascular disease (CVD) risk.

Methods

The data come from a real-world cohort study in Xinjiang, China, and include personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects, with 37 variables in total. The imputation methods chosen were simple imputation (Simple), regression imputation (Regression), expectation-maximization (EM), multiple imputation (MICE), K-nearest neighbor (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart). Root mean square error (RMSE) and mean absolute error (MAE) were used to assess the performance of each imputation method at a missing rate of 20%. The datasets produced by the different imputation methods were then used to construct CVD risk prediction models with a support vector machine (SVM), and predictive performance was compared using the area under the curve (AUC).
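The evaluation step described above (mask a share of known values, impute them, then score the imputed values against the truth with RMSE and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the function names (`mask_values`, `mean_impute`, `rmse_mae`) are hypothetical, cells are masked completely at random, and column-mean imputation stands in for the "Simple" method because the abstract does not specify how any method was implemented.

```python
import math
import random

def mask_values(rows, rate, rng):
    """Copy `rows` and set a fraction `rate` of all cells to None (chosen at random)."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = rng.sample(cells, int(round(rate * len(cells))))
    damaged = [list(r) for r in rows]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked

def mean_impute(rows):
    """Fill each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def rmse_mae(truth, imputed, masked):
    """Score imputed values against the true values, on the masked cells only."""
    errors = [imputed[i][j] - truth[i][j] for i, j in masked]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic stand-in for a complete dataset: 200 subjects, 5 numeric variables.
complete = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]

damaged, masked = mask_values(complete, 0.20, rng)  # 20% missing rate, as in the study
imputed = mean_impute(damaged)
rmse, mae = rmse_mae(complete, imputed, masked)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}")
```

Swapping `mean_impute` for another imputer while keeping the same mask gives a like-for-like comparison, since every method is then scored on identical cells.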

Results

KNN (MAE: 0.2032, RMSE: 0.7438, AUC: 0.730, CI: 0.719-0.741) and RF (MAE: 0.3944, RMSE: 1.4866, AUC: 0.777, CI: 0.769-0.785) achieved the most effective imputation results. EM, Cart, and MICE performed next best, while Simple, Regression, and Cluster performed worst. The CVD risk prediction model constructed using the complete data (AUC: 0.804, CI: 0.796-0.812) differed from all other models with p<0.05.
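For readers unfamiliar with the AUC figures reported above, the statistic can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case. The sketch below computes it directly from that definition. The labels and scores are made-up illustrative values, the function name `auc` is hypothetical, and no SVM is trained here; in the study the scores would come from the fitted model.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = event) and predicted risk scores for four subjects.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # prints 0.75: three of the four case/non-case pairs are ordered correctly
```

The pairwise form is quadratic in the number of subjects, so it suits illustration rather than a cohort of this size, where a rank-based implementation would be the usual choice.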

Conclusion

KNN and RF exhibit superior performance and are more adept at imputing missing data in predictive modelling of cohort study datasets.
Metadata
Title
Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets
Authors
JiaHang Li
ShuXia Guo
RuLin Ma
Jia He
XiangHui Zhang
DongSheng Rui
YuSong Ding
Yu Li
LeYao Jian
Jing Cheng
Heng Guo
Publication date
01-12-2024
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2024
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-024-02173-x