ABSTRACT
We present an outline of the genome information acquisition (GENIA) project for automatically extracting biochemical information from journal papers and abstracts. GENIA will be available over the Internet and is designed to aid in information extraction, retrieval and visualisation and to help reduce information overload on researchers. The vast repository of papers available online in databases such as MEDLINE is a natural environment in which to develop language engineering methods and tools and is an opportunity to show how language engineering can play a key role on the Internet.
- The GENIA project: corpus-based knowledge acquisition and information extraction from genome research papers