Published in: BMC Medical Informatics and Decision Making 4/2018

Open Access 01-12-2018 | Research

Chronic Kidney Disease stratification using office visit records: Handling data imbalance via hierarchical meta-classification

Authors: Moumita Bhattacharya, Claudine Jurkovitz, Hagit Shatkay

Published in: BMC Medical Informatics and Decision Making | Special Issue 4/2018

Abstract

Background

Chronic Kidney Disease (CKD) is one of several conditions that affect a growing percentage of the US population. The disease is accompanied by multiple co-morbidities and is hard to diagnose in its own right. In its advanced forms it carries severe outcomes and can lead to death. It is thus important to detect the disease as early as possible, so that effective intervention and treatment plans can be devised.
Here we investigate ways to use information available in electronic health records (EHRs) from regular office visits of more than 13,000 patients, in order to distinguish among several stages of the disease. While clinical data stored in EHRs provide valuable information for risk stratification, one of the major challenges in using them arises from data imbalance. That is, records associated with a more severe condition are typically under-represented compared to those associated with a milder manifestation of the disease. To address this imbalance, we propose and develop a sampling-based ensemble approach, hierarchical meta-classification, which aims to stratify CKD patients into severity stages using simple quantitative non-text features gathered from standard office visit records.

Methods

The proposed hierarchical meta-classification method frames the multiclass classification task as a hierarchy of two subtasks. The first is binary classification, separating records associated with the majority class from those associated with all minority classes combined, using meta-classification. The second subtask separates the records assigned to the combined minority classes into the individual constituent classes.
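The two subtasks described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the base learners, the sampling scheme, or the meta-classifier, so the sketch assumes (a) the majority class is split into `n_chunks` disjoint subsets, (b) each base learner is trained on one subset plus all minority records, (c) a meta-classifier is trained on the base learners' outputs, and (d) a tiny 1-nearest-neighbour learner stands in for every classifier. The class names, the `n_chunks` parameter, and the toy data are all hypothetical.

```python
import random


class OneNN:
    """Stand-in learner (1-nearest-neighbour); any classifier could take its place."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def nearest_label(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest_label(x) for x in X]


class HierarchicalMetaClassifier:
    """Sketch: majority-vs-minority meta-classification, then minority subclassing."""

    def __init__(self, majority_label, n_chunks=3, seed=0):
        self.majority_label = majority_label
        self.n_chunks = n_chunks  # assumption: one base learner per majority chunk
        self.seed = seed

    def _base_outputs(self, X):
        # One column per base learner: 0 = majority, 1 = combined minority.
        per_learner = [clf.predict(X) for clf in self.base_learners]
        return [list(row) for row in zip(*per_learner)]

    def fit(self, X, y):
        majority_X = [x for x, label in zip(X, y) if label == self.majority_label]
        minority = [(x, label) for x, label in zip(X, y) if label != self.majority_label]
        minority_X = [x for x, _ in minority]
        minority_y = [label for _, label in minority]

        # Subtask 1, sampling step: each base learner sees one chunk of the
        # majority class together with all minority records.
        random.Random(self.seed).shuffle(majority_X)
        chunks = [majority_X[i::self.n_chunks] for i in range(self.n_chunks)]
        self.base_learners = [
            OneNN().fit(chunk + minority_X, [0] * len(chunk) + [1] * len(minority_X))
            for chunk in chunks
        ]

        # Subtask 1, meta step: a classifier over the base learners' outputs.
        # For brevity it is fitted on the training records themselves; held-out
        # predictions would be the safer choice in practice.
        binary_y = [0 if label == self.majority_label else 1 for label in y]
        self.meta = OneNN().fit(self._base_outputs(X), binary_y)

        # Subtask 2: separate the combined minority into its constituent classes.
        self.minority_clf = OneNN().fit(minority_X, minority_y)
        return self

    def predict(self, X):
        is_minority = self.meta.predict(self._base_outputs(X))
        fine_grained = self.minority_clf.predict(X)
        return [fine if flag == 1 else self.majority_label
                for flag, fine in zip(is_minority, fine_grained)]


# Toy, made-up data: one numeric feature, class 3 is the majority.
X = [[float(v)] for v in range(30, 60, 2)] + [[65.0], [75.0], [85.0], [18.0], [22.0], [27.0]]
y = [3] * 15 + [2, 2, 2, 4, 4, 4]
model = HierarchicalMetaClassifier(majority_label=3).fit(X, y)
preds = model.predict([[44.0], [70.0], [20.0]])
print(preds)  # [3, 2, 4]
```

Splitting the majority class across several base learners means each one trains on a less skewed class ratio, while the second-level classifier never sees majority records at all.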

Results

The proposed method identifies a significant proportion of patients suffering from the more advanced stages of the condition, while also correctly identifying most of the less severe cases, maintaining high sensitivity, specificity and F-measure (≥ 93%). Our results show that the high level of performance attained by our method is preserved even when the size of the training set is significantly reduced, demonstrating the stability and generalizability of our approach.
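For reference, the three reported measures can be computed per class as sketched below. The abstract does not say how the measures were aggregated across classes, so the one-vs-rest treatment here is an assumption, and the function name and the label lists are invented for illustration only.

```python
def per_class_metrics(y_true, y_pred, label):
    """Sensitivity, specificity and F-measure for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f_measure": f_measure}


# Made-up labels, purely to exercise the function.
y_true = [3, 3, 3, 3, 4, 4, 2, 2]
y_pred = [3, 3, 3, 4, 4, 4, 2, 3]
stage4 = per_class_metrics(y_true, y_pred, label=4)
```

Under imbalance, specificity for a minority class tends to look high almost by default, which is why it is reported alongside sensitivity and F-measure rather than on its own.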

Conclusion

We present a new classification approach that addresses data imbalance, which is inherent in the biomedical domain. Our model effectively identifies severity stages of CKD patients, using information readily available in office visit records within the realistic context of high data imbalance.
Footnotes
1
Stage 1 is defined by kidney damage (protein or blood in the urine) with normal eGFR (≥ 90 ml/min/1.73 m²); stage 2 by kidney damage with mildly decreased eGFR (60 to < 90); stage 3 by eGFR 30 to < 60; stage 4 by eGFR 15 to < 30; and stage 5 by eGFR < 15.
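The thresholds in this footnote can be written as a small lookup, sketched here for illustration. The function name is hypothetical, and returning `None` for eGFR ≥ 60 without evidence of kidney damage is an assumption, since the footnote does not cover that case.

```python
from typing import Optional


def ckd_stage(egfr: float, kidney_damage: bool) -> Optional[int]:
    """Map eGFR (ml/min/1.73 m^2) and a kidney-damage flag to a CKD stage (1 to 5)."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr < 15:
        return 5
    if egfr < 30:
        return 4
    if egfr < 60:
        return 3
    # Stages 1 and 2 additionally require kidney damage (protein or blood in urine).
    if not kidney_damage:
        return None  # assumption: this case is not defined in the footnote
    return 2 if egfr < 90 else 1
```

For example, `ckd_stage(45, kidney_damage=False)` falls in the 30 to < 60 band and returns 3, while `ckd_stage(75, kidney_damage=True)` returns 2.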
 
2
The dataset was approved by Christiana Care’s IRB, with a waiver of consent according to 45 CFR 46.116(d).
 
Metadata
Title
Chronic Kidney Disease stratification using office visit records: Handling data imbalance via hierarchical meta-classification
Authors
Moumita Bhattacharya
Claudine Jurkovitz
Hagit Shatkay
Publication date
01-12-2018
Publisher
BioMed Central
DOI
https://doi.org/10.1186/s12911-018-0675-x
