Published in: BMC Medical Informatics and Decision Making 1/2017

Open Access 01-12-2017 | Research article

Imbalanced target prediction with pattern discovery on clinical data repositories

Authors: Tak-Ming Chan, Yuxi Li, Choo-Chiap Chiau, Jane Zhu, Jie Jiang, Yong Huo



Abstract

Background

Clinical data repositories (CDRs) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data collection efforts, and sophisticated modeling techniques before a hypothesis can be tested. We aim to bridge this gap so that clinical domain users can run first-hand predictions on existing repository data without complicated handling, and obtain insightful patterns for imbalanced targets before a formal study is conducted. We specifically target interpretability, so that domain users can conveniently explain the model and apply it in clinical practice.

Methods

We propose an interpretable pattern model that tolerates noise (missing values) in practice data. To address the challenge of imbalanced targets of interest in clinical research, e.g., deaths accounting for less than a few percent of cases, we adopt the geometric mean of sensitivity and specificity (G-mean) as the optimization criterion and develop a simple but effective heuristic algorithm for it.
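To illustrate why G-mean suits imbalanced targets, the following minimal sketch computes it from confusion-matrix counts and contrasts it with plain accuracy. The function names and the 1000-patient cohort are hypothetical and chosen only for illustration; this is not the heuristic algorithm itself, which the abstract does not specify.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical cohort (an assumption, not data from the paper):
# 1000 patients, 50 deaths (5% positive class).
# A trivial model that predicts "survives" for every patient.
always_negative = dict(tp=0, fn=50, tn=950, fp=0)
# A model that detects 40 of the 50 deaths at the cost of 190 false alarms.
balanced = dict(tp=40, fn=10, tn=760, fp=190)

for name, counts in [("always negative", always_negative), ("balanced", balanced)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f}, G-mean={g_mean(**counts):.2f}")
```

In this sketch the trivial model has the higher accuracy but a G-mean of zero, because it never identifies a death. Optimizing G-mean therefore rewards models that perform well on both classes.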

Results

We compared pattern discovery to clinically interpretable methods on two retrospective clinical datasets, which contain 14.9% deaths in 1 year in the thoracic dataset and 9.1% deaths in the cardiac dataset, respectively. Despite the imbalance challenge evident in the other methods, pattern discovery consistently shows competitive cross-validated prediction performance. Compared to logistic regression, Naïve Bayes, and decision tree, pattern discovery achieves statistically significantly better (p-values < 0.01, Wilcoxon signed rank test) averaged testing G-means and F1-scores (harmonic mean of precision and sensitivity). Without requiring sophisticated technical processing or tweaking of the data, the prediction performance of pattern discovery is consistently comparable to the best achievable performance.
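The F1-score mentioned above can be sketched as follows, using the definition given in the text (harmonic mean of precision and sensitivity). The function name and the example counts are hypothetical and are not taken from the paper's datasets.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts (an assumption for illustration only):
# 40 deaths detected, 10 deaths missed, 190 false alarms.
example_f1 = f1_score(tp=40, fp=190, fn=10)
print(f"F1 = {example_f1:.3f}")
```

Because F1 depends on precision, it penalizes false alarms on the rare class directly, which makes it a useful complement to G-mean when positives are scarce.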

Conclusions

Pattern discovery has been demonstrated to be robust and valuable for target prediction on existing clinical data repositories with imbalance and noise. The prediction results and interpretable patterns can provide insights for potential formal studies in an agile and inexpensive way.
Metadata
Title
Imbalanced target prediction with pattern discovery on clinical data repositories
Authors
Tak-Ming Chan
Yuxi Li
Choo-Chiap Chiau
Jane Zhu
Jie Jiang
Yong Huo
Publication date
01-12-2017
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2017
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-017-0443-3
