ABSTRACT
We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation methods, we achieved an F-score of 0.73, and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
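As a rough illustration of how linear interpolation counters data sparseness, the sketch below mixes a bigram estimate with a unigram estimate over term-class labels, so that a class transition never seen in training still receives non-zero probability. It is not the model from the paper: the abstract does not give the interpolation levels or weights, so the two-level mixture, the weight `lam`, the function names, and the toy label sequences are all assumptions made for the example.

```python
from collections import Counter


def train_counts(label_sequences):
    """Collect unigram, bigram, and bigram-context counts from label sequences."""
    unigrams = Counter()
    bigrams = Counter()
    contexts = Counter()  # how often each label appears as the left side of a bigram
    for labels in label_sequences:
        unigrams.update(labels)
        bigrams.update(zip(labels, labels[1:]))
        contexts.update(labels[:-1])
    return unigrams, bigrams, contexts


def interpolated_prob(prev, cur, unigrams, bigrams, contexts, lam=0.7):
    """P(cur | prev) = lam * P_bigram(cur | prev) + (1 - lam) * P_unigram(cur).

    lam is a hypothetical fixed weight; in practice it would be tuned on held-out data.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[cur] / total if total else 0.0
    p_bi = bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


# Toy label sequences (assumed for illustration; "O" marks a non-term token).
toy = [["PROTEIN", "O", "DNA"], ["O", "O", "PROTEIN"]]
uni, bi, ctx = train_counts(toy)

seen = interpolated_prob("O", "DNA", uni, bi, ctx)          # bigram observed in training
unseen = interpolated_prob("PROTEIN", "DNA", uni, bi, ctx)  # bigram never observed
print(f"seen={seen:.4f} unseen={unseen:.4f}")
```

The unseen transition falls back on the unigram term alone, which is the behaviour that lets a model trained on a small corpus avoid zero probabilities.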
- Extracting the names of genes and gene products with a hidden Markov model