DOI: 10.1145/1835804.1835829

Active learning for biomedical citation screening

Published: 25 July 2010

ABSTRACT

Active learning (AL) is an increasingly popular strategy for reducing the amount of labeled data required to train classifiers, thereby mitigating annotator effort. We describe a real-world, deployed application of AL to the problem of biomedical citation screening for systematic reviews at the Tufts Medical Center's Evidence-based Practice Center. We propose a novel active learning strategy that exploits a priori domain knowledge provided by the expert (specifically, labeled features), and we extend this model via a linear programming algorithm for situations in which the expert can provide ranked labeled features. Our methods outperform existing AL strategies on three real-world systematic review datasets. We argue that evaluation must be specific to the scenario under consideration. To this end, we propose a new evaluation framework for finite-pool scenarios, wherein the primary aim is to label a fixed set of examples rather than simply to induce a good predictive model. We use a method from medical decision theory to elicit the relative costs of false positives and false negatives from the domain expert, constructing a utility measure of classification performance that integrates the expert's preferences. Our findings suggest that the expert can, and should, provide more information than instance labels alone. In addition to achieving strong empirical results on the citation screening problem, this work outlines important steps for moving away from simulated active learning and toward deploying AL in real-world applications.
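
The sketch below is a minimal illustration, not the authors' implementation, of the setup the abstract describes: pool-based active learning over a finite set of citations with uncertainty sampling, scored by a utility measure that weights false negatives against false positives. The linear SVM, the margin-based query rule, and the `fn_to_fp_cost` parameter are illustrative assumptions; the cost ratio merely stands in for the preferences elicited from the domain expert, and `X`/`y` are assumed to be a dense NumPy feature matrix (e.g., bag-of-words vectors) and {0, 1} labels.

```python
# Minimal sketch (not the paper's method): pool-based active learning with
# uncertainty sampling for citation screening, plus a cost-weighted utility.
import numpy as np
from sklearn.svm import LinearSVC


def utility(y_true, y_pred, fn_to_fp_cost=10.0):
    """(TP + TN) - (FP + c * FN): missing a relevant citation (FN) costs
    `fn_to_fp_cost` times as much as screening an irrelevant one (FP).
    The cost ratio is an illustrative stand-in for the expert's preferences."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return (tp + tn) - (fp + fn_to_fp_cost * fn)


def screen_pool(X, y, n_seed=20, batch=5, n_rounds=40, rng_seed=0):
    """Uncertainty sampling over a finite pool of citation feature vectors.

    `y` plays the role of the human screener: labels are revealed only for
    the citations the learner queries.
    """
    rng = np.random.default_rng(rng_seed)
    labeled = list(rng.choice(len(X), size=n_seed, replace=False))
    unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

    clf = LinearSVC()
    for _ in range(n_rounds):
        if not unlabeled:
            break
        clf.fit(X[labeled], y[labeled])        # assumes both classes are seeded
        margins = np.abs(clf.decision_function(X[unlabeled]))
        picks = np.argsort(margins)[:batch]    # citations nearest the boundary
        for p in sorted(picks, reverse=True):  # "ask the expert" for these
            labeled.append(unlabeled.pop(p))
    return clf, labeled, unlabeled
```

In the finite-pool setting the abstract emphasizes, the quantity of interest would be something like `utility(y[unlabeled], clf.predict(X[unlabeled]))`, computed over the citations never shown to the expert, rather than accuracy on a held-out test set.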

Supplemental Material

kdd2010_wallace_albc_01.mov (MOV video, 99.1 MB)

Published in

KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
July 2010, 1240 pages
ISBN: 9781450300551
DOI: 10.1145/1835804
      Copyright © 2010 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
