Extracting the names of genes and gene products with a hidden Markov model

Authors:
Nigel Collier, University of Tokyo, Tokyo, Japan
Chikashi Nobata, University of Tokyo, Tokyo, Japan
Jun-ichi Tsujii, University of Tokyo, Tokyo, Japan

COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1, July 2000, Pages 201–207
https://doi.org/10.3115/990820.990850

Published: 31 July 2000

ABSTRACT

We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation methods, we achieved an F-score of 0.73, and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
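The idea of linear interpolation against data sparseness can be illustrated with a minimal Python sketch. It is not the authors' model, which interpolates several bigram estimates built from lexical and character features. Here a class-bigram estimate is simply mixed with a class-unigram estimate, so that unseen class transitions still receive non-zero probability. The function names, the toy data, and the weight `lam=0.7` are assumptions made only for illustration.

```python
from collections import Counter


def train_counts(tagged_sentences):
    """Count class bigrams, their left contexts, and class unigrams.

    tagged_sentences: list of sentences, each a list of (word, class) pairs.
    """
    bigrams, contexts, unigrams = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for _word, cls in sentence:
            bigrams[(prev, cls)] += 1
            contexts[prev] += 1
            unigrams[cls] += 1
            prev = cls
    return bigrams, contexts, unigrams


def interpolated_prob(cls, prev, bigrams, contexts, unigrams, lam=0.7):
    """P(cls | prev) as a weighted mix of bigram and unigram estimates.

    lam is an arbitrary illustrative weight, not a value from the paper.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[cls] / total if total else 0.0
    p_bi = bigrams[(prev, cls)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


# Toy data with made-up annotations (not taken from the paper's corpus).
toy = [
    [("IL-2", "PROTEIN"), ("gene", "DNA")],
    [("the", "O"), ("IL-2", "PROTEIN")],
]
bigrams, contexts, unigrams = train_counts(toy)
seen = interpolated_prob("DNA", "PROTEIN", bigrams, contexts, unigrams)
unseen = interpolated_prob("DNA", "O", bigrams, contexts, unigrams)
print(seen, unseen)
```

The transition from `O` to `DNA` never occurs in the toy data, yet it keeps a small probability through the unigram term. That fallback is the role interpolation plays when a training corpus is as small as 100 abstracts.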

Published in
COLING '00: Proceedings of the 18th conference on Computational linguistics - Volume 1
July 2000, 616 pages
ISBN: 155860717X
Program Chair: Martin Kay
Publisher: Association for Computational Linguistics, United States