ABSTRACT
We explore the use of morphological analysis as a preprocessing step for protein name tagging. Our method finds protein names by chunking over morphemes, the smallest units determined by the morphological analysis, which helps to recognize the exact boundaries of protein names. Moreover, our morphological analyzer can handle compounds, which offers a simple way to adapt name descriptions from biomedical resources for language processing. On the GENIA corpus 3.01, our method attains an F-score of 70 points for protein molecule names and 75 points for protein names including molecules, families, and domains.
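To illustrate why chunking over units smaller than whitespace tokens can help with exact boundaries, here is a minimal sketch in Python. It is not the analyzer or chunker described in the paper: `split_units` is a hypothetical regex stand-in for a morphological analyzer, the B/I/O tags are written by hand rather than predicted by a trained model, and `f_score` assumes exact-match scoring of character spans.

```python
import re

# Hypothetical stand-in for a morphological analyzer: runs of letters,
# runs of digits, or a single non-space symbol each form one unit.
UNIT = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def split_units(text):
    """Return (surface, start_char, end_char) for each unit in text."""
    return [(m.group(), m.start(), m.end()) for m in UNIT.finditer(text)]


def decode_bio(tags):
    """Turn a B/I/O tag sequence into (start, end) unit spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:  # lenient: a stray "I" opens a chunk
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def f_score(gold, predicted):
    """Exact-match F-score over sets of spans (an assumed scoring scheme)."""
    gold, predicted = set(gold), set(predicted)
    hits = len(gold & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "IL-2-mediated activation"
units = split_units(text)
# Hand-written tags for illustration; a real system would predict these.
tags = ["B", "I", "I", "O", "O", "O"]
names = [text[units[s][1]:units[e - 1][2]] for s, e in decode_bio(tags)]
print(names)  # ['IL-2']
```

With whitespace tokens, the smallest taggable unit here would be "IL-2-mediated", so the name "IL-2" could not be recovered exactly; under exact-match scoring the character span (0, 13) counts as a miss against the gold span (0, 4).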
- Protein name tagging for biomedical annotation in text