BMC Medical Informatics and Decision Making

Open Access 01-12-2011 | Research article

Predicting disease risks from highly imbalanced data using random forest

Authors: Mohammed Khalilia, Sounak Chakraborty, Mihail Popescu

Published in: BMC Medical Informatics and Decision Making | Issue 1/2011

Abstract

Background

We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

Methods

We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data are highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF classifiers in predicting the risk of eight chronic diseases.
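
The balanced sub-sampling step described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: labels are assumed binary (1 for the rarer disease class, 0 otherwise), every sub-sample is assumed to reuse all minority records alongside an equally sized, disjoint random chunk of the majority records, and the function name `balanced_subsamples` and the toy labels are hypothetical.

```python
import random


def balanced_subsamples(labels, seed=0):
    """Split record indices into sub-samples with equal counts of each class.

    Assumption: binary labels, 1 = minority (positive), 0 = majority.
    Each sub-sample holds all minority indices plus one disjoint random
    chunk of majority indices of the same size.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(majority) < len(minority):
        raise ValueError("expected a non-empty minority class no larger than the majority")
    rng.shuffle(majority)
    k = len(minority)
    # Majority indices that do not fill a whole chunk are dropped.
    chunks = [majority[s:s + k] for s in range(0, len(majority) - k + 1, k)]
    return [minority + chunk for chunk in chunks]


# Toy labels: 3 positive cases among 12 records (hypothetical, not HCUP data).
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
subsamples = balanced_subsamples(labels, seed=42)
for indices in subsamples:
    print(sorted(indices))
```

In an ensemble of this kind, one classifier (for example an RF) would typically be trained on each sub-sample and the individual predictions combined, for instance by averaging or voting; the exact combination rule is not stated in the abstract.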

Results

We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
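
For readers unfamiliar with the AUC metric used in this comparison, a minimal sketch of one standard way to compute it is given below. It relies on the rank interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical and do not come from the paper.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Assumption: binary labels (1 = positive, 0 = negative). The pairwise
    form is quadratic in the number of records, which is fine for a small
    illustration but not for a dataset the size of HCUP.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical): two positive and three negative cases.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.1, 0.8]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.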

Conclusions

By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/11/51/prepub

Title: Predicting disease risks from highly imbalanced data using random forest
Authors: Mohammed Khalilia, Sounak Chakraborty, Mihail Popescu
Publication date: 01-12-2011
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2011
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/1472-6947-11-51
