ABSTRACT
Active learning (AL) is an increasingly popular strategy for reducing the amount of labeled data required to train classifiers, and hence the annotation effort. We describe a real-world, deployed application of AL to the problem of biomedical citation screening for systematic reviews at the Tufts Medical Center's Evidence-based Practice Center. We propose a novel active learning strategy that exploits a priori domain knowledge provided by the expert (specifically, labeled features) and extend this model via a linear programming algorithm for situations where the expert can provide ranked labeled features. Our methods outperform existing AL strategies on three real-world systematic review datasets. We argue that evaluation must be specific to the scenario under consideration. To this end, we propose a new evaluation framework for finite-pool scenarios, wherein the primary aim is to label a fixed set of examples rather than to simply induce a good predictive model. We use a method from medical decision theory to elicit the relative costs of false positives and false negatives from the domain expert, constructing a utility measure of classification performance that integrates the expert's preferences. Our findings suggest that the expert can, and should, provide more information than instance labels alone. In addition to achieving strong empirical results on the citation screening problem, this work outlines many important steps for moving away from simulated active learning and toward deploying AL for real-world applications.
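The abstract does not give the exact form of the cost-sensitive utility measure, but the idea of integrating elicited error costs can be sketched as follows. This is an illustrative example only: the function name, the linear benefit-minus-cost form, and the default cost ratio are assumptions, not the paper's actual formulation. In citation screening, a false negative (a missed relevant citation) is typically far costlier than a false positive (an extra citation to screen), so the elicited ratio weights the two error types asymmetrically.

```python
def finite_pool_utility(tp, fp, fn, tn, fn_to_fp_cost_ratio=10.0):
    """Cost-weighted utility over a fixed (finite) pool of citations.

    Hypothetical linear form: credit each correct decision one unit,
    debit each false positive one unit, and debit each false negative
    `fn_to_fp_cost_ratio` units, reflecting the expert-elicited
    relative cost of missing a relevant citation.
    """
    benefit = tp + tn
    cost = fp + fn_to_fp_cost_ratio * fn
    return benefit - cost


# Under a high FN cost ratio, a recall-oriented classifier (more FPs,
# fewer FNs) scores higher than a precision-oriented one.
recall_oriented = finite_pool_utility(tp=40, fp=30, fn=2, tn=928)
precision_oriented = finite_pool_utility(tp=30, fp=5, fn=12, tn=953)
```

With these hypothetical counts, the recall-oriented classifier yields a utility of 918 versus 858 for the precision-oriented one, even though the latter makes fewer total errors, which is the behavior a finite-pool screening evaluation should reward.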
Index Terms
- Active learning for biomedical citation screening