Published in: BMC Medical Informatics and Decision Making 1/2023

Open Access 01-12-2023 | Research

Classification of imbalanced data using machine learning algorithms to predict the risk of renal graft failures in Ethiopia

Authors: Getahun Mulugeta, Temesgen Zewotir, Awoke Seyoum Tegegne, Leja Hamza Juhar, Mahteme Bekele Muleta

Abstract

Introduction

The rising prevalence of end-stage renal disease has increased the need for renal replacement therapy over recent decades. Although a kidney transplant offers a better quality of life and a lower cost of care than dialysis, the graft can still fail after transplantation. This study therefore aimed to predict the risk of graft failure among post-transplant recipients in Ethiopia using selected machine learning prediction models.

Methodology

The data were extracted from a retrospective cohort of kidney transplant recipients at the Ethiopian National Kidney Transplantation Center, covering September 2015 to February 2022. Because the data were imbalanced, we applied hyperparameter tuning, probability threshold moving, tree-based ensemble learning, stacking ensemble learning, and probability calibration to improve the prediction results. Two groups of models, selected on merit, were applied: probabilistic models (logistic regression, naive Bayes, and artificial neural network) and tree-based ensemble models (random forest, bagged tree, and stochastic gradient boosting). The models were compared in terms of discrimination and calibration performance. The best-performing model was then used to predict the risk of graft failure.
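As an illustration of how stacking and probability calibration can be combined, the following is a minimal sketch using scikit-learn. It is not the authors' code: the library, the synthetic data set (600 samples, about 10% positives), the choice of base learners, and all hyperparameter values are assumptions made for the example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced data (assumption): roughly 10% positive class.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Base learners: two probabilistic models and one tree-based ensemble.
base_learners = [
    ("logit", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Meta-learner: gradient boosting with subsample < 1, i.e. stochastic
# gradient boosting. The values 0.8 and 50 are illustrative only.
meta_learner = GradientBoostingClassifier(
    subsample=0.8, n_estimators=50, random_state=0
)

# The meta-learner is trained on out-of-fold base-learner probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    stack_method="predict_proba",
    cv=5,
)

# Sigmoid (Platt) calibration wrapped around the whole stack.
calibrated = CalibratedClassifierCV(stack, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

risk = calibrated.predict_proba(X_test)[:, 1]  # predicted risk per case
auc = roc_auc_score(y_test, risk)              # discrimination
brier = brier_score_loss(y_test, risk)         # calibration / accuracy
print(f"AUC-ROC = {auc:.3f}, Brier score = {brier:.3f}")
```

Stratified splitting is used throughout so that every fold contains some events, which matters when the positive class is rare.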

Results

A total of 278 complete cases were analyzed, with 21 graft failures and 3 events per predictor. Of these recipients, 74.8% were male and 25.2% were female, and the median age was 37. Among the individual models, the bagged tree and the random forest had the highest and equal discrimination performance (AUC-ROC = 0.84), while the random forest had the best calibration performance (Brier score = 0.045). When each individual model was tested as the meta-learner for stacking ensemble learning, stochastic gradient boosting as the meta-learner gave the top discrimination (AUC-ROC = 0.88) and calibration (Brier score = 0.048) performance. Regarding feature importance, chronic rejection, blood urea nitrogen, number of post-transplant admissions, phosphorus level, acute rejection, and urological complications were the top predictors of graft failure.
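For readers less familiar with the two metrics, the sketch below computes them by hand. It is illustrative only and not the authors' code; the function names and the toy data are assumptions. It also computes the Brier score of a non-informative model that assigns every patient the observed failure rate (21 of 278), which gives a reference point for the reported values.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc_roc(y_true, y_prob):
    """Probability that a random event gets a higher predicted risk than a
    random non-event (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption), not taken from the study.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_roc(y_toy, p_toy)        # 3 of 4 event/non-event pairs ranked correctly
toy_brier = brier_score(y_toy, p_toy)

# Reference model: everyone receives the observed failure rate 21/278.
n_cases, n_events = 278, 21
prevalence = n_events / n_cases
y_ref = [1] * n_events + [0] * (n_cases - n_events)
p_ref = [prevalence] * n_cases
baseline_brier = brier_score(y_ref, p_ref)
print(round(baseline_brier, 4))  # 0.0698
```

A constant prediction at the event rate p always has a Brier score of p(1 - p), so with rare events even an uninformative model scores close to zero. The reported scores of 0.045 and 0.048 are best read against this reference of about 0.070 rather than against the 0.25 that applies to balanced data.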

Conclusions

Bagging, boosting, and stacking, combined with probability calibration, are good choices for clinical risk prediction on imbalanced data. A data-driven probability threshold improves prediction results from imbalanced data more than the natural threshold of 0.5. Integrating these techniques in a systematic framework is an effective strategy for improving prediction results from imbalanced data. Clinical experts in kidney transplantation are encouraged to use the final calibrated model as a decision support system to predict the risk of graft failure for individual patients.
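The effect of replacing the 0.5 threshold with a data-driven one can be sketched as follows. The abstract does not state which criterion the authors used to choose the threshold; this example assumes Youden's J statistic (sensitivity + specificity - 1), one common choice. The function names and the toy data are hypothetical.

```python
def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when risk >= threshold is called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_j(y_true, y_prob, threshold):
    sens, spec = sens_spec(y_true, y_prob, threshold)
    return sens + spec - 1.0


def best_threshold(y_true, y_prob):
    """Scan every observed predicted risk as a candidate cut-off and keep
    the one with the largest Youden's J (first one wins on ties)."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(y_prob)):
        j = youden_j(y_true, y_prob, t)
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy imbalanced data (assumption): 8 non-events, 2 events.
y_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_toy = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.25, 0.60]

j_default = youden_j(y_toy, p_toy, 0.5)      # natural threshold
t_star, j_star = best_threshold(y_toy, p_toy)  # data-driven threshold
print(f"J at 0.5 = {j_default}, best threshold = {t_star}, J = {j_star}")
```

In this toy case the 0.5 threshold detects only one of the two events, whereas the data-driven threshold of 0.25 detects both at the cost of one false positive. In practice the threshold should be chosen on training or validation data, not on the test set used to report performance.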
Metadata
Title
Classification of imbalanced data using machine learning algorithms to predict the risk of renal graft failures in Ethiopia
Authors
Getahun Mulugeta
Temesgen Zewotir
Awoke Seyoum Tegegne
Leja Hamza Juhar
Mahteme Bekele Muleta
Publication date
01-12-2023
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2023
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-023-02185-5
