A baseline feature set for learning rhetorical zones using full articles in the biomedical domain

Published: 01 June 2005

Abstract

At a time when experimental throughput in the field of molecular biology is increasing, it is necessary for biologists and people working in related fields to have access to sophisticated tools to enable them to efficiently process large amounts of information in order to stay abreast of current research.

Rhetorical zone analysis is an application of natural language processing in which areas of text in scientific papers are classified in terms of argumentation and intellectual contribution in order to pinpoint and distinguish certain types of information. Such analysis can be employed to assist in information extraction, helping to assess and integrate data generated by experiments into the scientific community's store of knowledge.

We present results for several experiments in automatic zone identification on the ZAISA-1 dataset, a new dataset composed of full biomedical research papers hand-annotated for rhetorical zones. We concentrate on general-purpose and linguistically motivated features, and report results for a variety of sets of features. It is our intention to provide a baseline feature set for modeling, which can be extended in future work using combinations of heuristics and more sophisticated and task-specific modeling techniques.


Published in ACM SIGKDD Explorations Newsletter, Volume 7, Issue 1
Natural language processing and text mining
June 2005
81 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1089815

Copyright © 2005 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
