Published in: Systematic Reviews 1/2019

Open Access 01-12-2019 | Methodology

Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

Authors: Corrado Lanera, Paola Berchialla, Abhinav Sharma, Clara Minto, Dario Gregori, Ileana Baldi


Abstract

Background

The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques with data preprocessing for class imbalance in order to identify the best-performing strategy for screening PubMed articles for inclusion in systematic reviews.

Methods

We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and random oversampling, each with 50:50 and 35:65 positive-to-negative class ratios, plus no resampling as a benchmark. We used textual data from 14 systematic reviews as case studies. The difference between the cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy.
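The two resampling schemes can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `resample`, its parameters, and the example class counts are hypothetical, and it assumes that positives (relevant citations) are the minority class and that resampling is applied to training data only.

```python
import random

def resample(labels, positive_fraction, method, seed=0):
    """Return indices of a resampled training set.

    labels: sequence of 0/1, where 1 marks a relevant citation
            (assumed here to be the minority class).
    positive_fraction: target share of positives, e.g. 0.5 for 50:50
                       or 0.35 for 35:65.
    method: "under" drops negatives; "over" duplicates positives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if method == "under":
        # Keep every positive; sample negatives without replacement.
        n_neg = round(len(pos) * (1 - positive_fraction) / positive_fraction)
        n_neg = min(n_neg, len(neg))
        return pos + rng.sample(neg, n_neg)
    if method == "over":
        # Keep every negative; draw extra positives with replacement.
        n_pos = round(len(neg) * positive_fraction / (1 - positive_fraction))
        extra = max(n_pos - len(pos), 0)
        return pos + rng.choices(pos, k=extra) + neg
    raise ValueError("method must be 'under' or 'over'")

# Hypothetical review: 50 relevant and 950 irrelevant abstracts.
labels = [1] * 50 + [0] * 950
under_35 = resample(labels, 0.35, "under")  # 50 positives + 93 negatives
over_50 = resample(labels, 0.50, "over")    # 950 positives + 950 negatives
print(len(under_35), len(over_50))          # 143 1900
```

In this toy example the undersampled set is more than ten times smaller than the oversampled one, which is the kind of size difference that makes undersampling cheaper to train on.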

Results

Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was predominantly above 90%. Except for k-nearest neighbor, machine learning techniques achieved the greatest improvement in conjunction with random oversampling 50:50 and random undersampling 35:65.
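The improvements summarized here are delta AUCs pooled across reviews with fixed-effect models. A minimal sketch of such pooling follows, assuming the usual inverse-variance weighting; the abstract does not state the estimator used. The function name `pool_fixed_effect` and the input values are hypothetical and are not results from the study.

```python
import math

def pool_fixed_effect(deltas, std_errors):
    """Inverse-variance fixed-effect pooling of per-review delta AUCs.

    deltas: delta AUC estimated within each systematic review.
    std_errors: standard error of each delta AUC.
    Returns (pooled estimate, pooled standard error, 95% CI).
    """
    if len(deltas) != len(std_errors) or not deltas:
        raise ValueError("need one standard error per delta AUC")
    # Each review is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, deltas)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Normal-approximation 95% confidence interval.
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Made-up inputs for three reviews, for illustration only.
pooled, pooled_se, ci = pool_fixed_effect(
    deltas=[0.010, 0.030, 0.020],
    std_errors=[0.010, 0.010, 0.020],
)
print(pooled, pooled_se, ci)
```

Reviews with smaller standard errors dominate the pooled estimate, so a single precise review can outweigh several noisy ones.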

Conclusions

Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
Metadata
Title
Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
Authors
Corrado Lanera
Paola Berchialla
Abhinav Sharma
Clara Minto
Dario Gregori
Ileana Baldi
Publication date
01-12-2019
Publisher
BioMed Central
Published in
Systematic Reviews / Issue 1/2019
Electronic ISSN: 2046-4053
DOI
https://doi.org/10.1186/s13643-019-1245-8
