DOI: 10.1145/1741906.1741927
research-article

Biomarker information extraction tool (BIET) development using natural language processing and machine learning

Published: 26 February 2010

ABSTRACT

In recent years, there has been rising interest in extracting entities and relations from biomedical literature. A vast number of systems and approaches have been proposed to extract biological relations, but none of them achieves satisfactory results, because they fail to handle the grammatical complexities and subtle features of biomedical text. In this paper, we detail an approach to a specific information extraction task: extracting biomarker information from biomedical literature. Starting with the abstract of a given publication, we first identify the evaluative sentence(s) by recognizing words and phrases in the text that belong to semantic categories of interest for biomedical entities (semantic category recognition). For entities such as proteins, genes and diseases, we then determine whether the statement refers to a biomarker relationship (assertion classification). Finally, we identify the biomarker relationship among the biomedical entities (semantic relationship classification). Our approach uses a series of statistical models that rely heavily on local lexical and syntactic context and achieve competitive results compared to more complex NLP solutions. We conclude the paper by presenting the design of a system, the Biomarker Information Extraction Tool (BIET). BIET combines our solutions to semantic category recognition, assertion classification and semantic relationship classification into a single application that facilitates the extraction of semantic information from medical text. We designed and implemented the ML-based BIET system for biomarker extraction using support vector machines, and trained and tested it on a corpus of oncology-related PubMed/MEDLINE literature hand-annotated with biomarker information.
Several tests were performed to assess the performance of the system's components, namely the semantic category recognizer, the assertion classifier and the semantic relationship classifier. The system achieves an average F-score of 86% on the task of biomarker information extraction when compared against the human-annotated dataset (i.e. the gold standard).
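The three stages described above, and the F-score used to evaluate them, can be illustrated with a minimal sketch. This is not the BIET implementation: the paper trains support vector machines for each stage, whereas the sketch substitutes simple lexicon and cue-phrase lookups so that the control flow stays visible. The lexicon entries, cue phrases, and all function and class names are hypothetical, and the F-score is assumed to be the balanced F1 measure, which the abstract does not state explicitly.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon and cue phrases; the paper's system learns
# these decisions with SVMs instead of looking them up.
ENTITY_LEXICON = {
    "her2": "gene",
    "psa": "protein",
    "breast cancer": "disease",
    "prostate cancer": "disease",
}
CUE_PHRASES = ("biomarker for", "marker of", "predictive of")
NEGATION_CUES = ("not a", "no evidence")


@dataclass(frozen=True)
class Relation:
    marker: str
    disease: str


def recognize_entities(sentence):
    """Stage 1 stand-in: semantic category recognition by lexicon lookup."""
    text = sentence.lower()
    return [(term, cat) for term, cat in ENTITY_LEXICON.items() if term in text]


def asserts_biomarker(sentence):
    """Stage 2 stand-in: assertion classification by cue phrases."""
    text = sentence.lower()
    if any(neg in text for neg in NEGATION_CUES):
        return False
    return any(cue in text for cue in CUE_PHRASES)


def extract_relations(sentence):
    """Stage 3 stand-in: pair each gene/protein with each disease."""
    if not asserts_biomarker(sentence):
        return set()
    entities = recognize_entities(sentence)
    markers = [t for t, c in entities if c in ("gene", "protein")]
    diseases = [t for t, c in entities if c == "disease"]
    return {Relation(m, d) for m in markers for d in diseases}


def f_score(predicted, gold):
    """Balanced F-score: harmonic mean of precision and recall."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentences = [
        "HER2 is a biomarker for breast cancer.",
        "PSA is not a reliable biomarker for prostate cancer.",
    ]
    predicted = set().union(*(extract_relations(s) for s in sentences))
    gold = {Relation("her2", "breast cancer")}
    print(predicted, f_score(predicted, gold))
```

In a learned version of this pipeline, each stand-in function would be replaced by a classifier trained on the hand-annotated corpus, while the comparison of predicted relations against the gold standard would stay the same.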


Published in

ICWET '10: Proceedings of the International Conference and Workshop on Emerging Trends in Technology
February 2010
1070 pages
ISBN: 9781605588124
DOI: 10.1145/1741906

Copyright © 2010 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
