Published in: BMC Medical Informatics and Decision Making 1/2011

Open Access 01-12-2011 | Research article

Predicting disease risks from highly imbalanced data using random forest

Authors: Mohammed Khalilia, Sounak Chakraborty, Mihail Popescu

Abstract

Background

We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict individuals' disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

Methods

We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases.
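
The balancing step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `balanced_subsamples`, the 0/1 label encoding, and the choice to keep every minority case while drawing an equal number of majority cases without replacement are all assumptions made for illustration. Each yielded index list would then be used to train one classifier of the ensemble.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Yield index lists that each contain equally many 0s and 1s.

    Hypothetical helper: assumes binary labels encoded as 0 and 1.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is treated as the minority class.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_subsamples):
        # Keep all minority cases; draw as many majority cases at random,
        # without replacement within a single sub-sample.
        drawn = rng.sample(majority, len(minority))
        yield sorted(minority + drawn)


# Toy data: 5 positive cases among 100 records.
labels = [1] * 5 + [0] * 95
subsamples = list(balanced_subsamples(labels, n_subsamples=10))
```

Because each sub-sample draws a different random part of the majority class, the ensemble as a whole sees much more of the majority data than any single balanced sub-sample does.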

Results

We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
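
For readers unfamiliar with the evaluation metric, the following Python sketch computes AUC through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Hypothetical helper: assumes binary labels encoded as 0 and 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    # Count positive/negative pairs where the positive is ranked higher;
    # a tie contributes half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

This pairwise form is quadratic in the number of cases, so it is suited to explanation rather than to datasets the size of the NIS.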

Conclusions

By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.
Metadata
Title
Predicting disease risks from highly imbalanced data using random forest
Authors
Mohammed Khalilia
Sounak Chakraborty
Mihail Popescu
Publication date
01-12-2011
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2011
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-11-51