DOI: 10.3115/990820.990850

Extracting the names of genes and gene products with a hidden Markov model

Published: 31 July 2000

ABSTRACT

We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation, we achieved an F-score of 0.73, and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
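As a rough illustration of two ideas named in the abstract, the sketch below shows how linear interpolation can blend a sparse bigram estimate with a more robust unigram estimate, and how an F-score is computed from precision and recall. It is not the authors' model: the function names, the toy token sequence, and the fixed weight `lam` are assumptions made for illustration only, and the paper's actual interpolation terms and weights are not given in the abstract.

```python
from collections import Counter


def train_counts(tokens):
    """Collect unigram, context, and bigram counts from a token list."""
    unigrams = Counter(tokens)
    contexts = Counter(tokens[:-1])  # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, contexts, bigrams


def interpolated_prob(prev, word, unigrams, contexts, bigrams, lam=0.7):
    """Blend bigram and unigram relative frequencies.

    lam is a hypothetical fixed weight (an assumption, not from the paper).
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration.
tokens = ["the", "p53", "protein", "binds", "the", "DNA"]
unigrams, contexts, bigrams = train_counts(tokens)

seen = interpolated_prob("the", "p53", unigrams, contexts, bigrams)
unseen = interpolated_prob("protein", "DNA", unigrams, contexts, bigrams)
```

In this toy setup the bigram ("protein", "DNA") never occurs, so its bigram estimate is zero, yet the interpolated value stays above zero because the unigram term contributes. That is the sense in which interpolation helps with data sparseness.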

Published in: COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1
July 2000, 616 pages
ISBN: 155860717X
Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 1,537 of 1,537 submissions (100%)
