Open Access 01-12-2018 | Research article

Identification of research hypotheses and new knowledge from scientific literature

Authors: Matthew Shardlow, Riza Batista-Navarro, Paul Thompson, Raheel Nawaz, John McNaught, Sophia Ananiadou

Published in: BMC Medical Informatics and Decision Making | Issue 1/2018

Abstract

Background

Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author’s intended knowledge gain) and New Knowledge (an author’s findings). The method incorporates various features, including a combination of simple MK dimensions.

Methods

We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated.
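As a rough illustration of how simple MK dimensions might be combined with linguistic features, the sketch below turns one event into a single feature vector of the kind a random forest classifier takes as input. The dimension values, cue words, example sentence and the names `MK_VALUES`, `CUE_WORDS`, `Event` and `featurize` are all hypothetical assumptions for illustration; they are not the authors' actual feature set.

```python
from dataclasses import dataclass

# Hypothetical value sets for simple MK dimensions (illustrative only).
MK_VALUES = {
    "knowledge_type": ["Investigation", "Observation", "Analysis", "Other"],
    "certainty": ["Low", "Medium", "High"],
    "polarity": ["Positive", "Negative"],
}

# Hypothetical cue words standing in for richer linguistic features.
CUE_WORDS = ["investigate", "hypothesize", "show", "demonstrate"]


@dataclass
class Event:
    sentence: str  # sentence containing the event
    mk: dict       # maps MK dimension name -> annotated value


def one_hot(value, choices):
    """Return a 0/1 list with a 1 at the position of `value`, if present."""
    return [1 if value == choice else 0 for choice in choices]


def featurize(event):
    """Concatenate one-hot MK dimensions with cue-word indicator features."""
    vector = []
    for dimension, choices in MK_VALUES.items():
        vector.extend(one_hot(event.mk.get(dimension), choices))
    tokens = event.sentence.lower().split()
    vector.extend(1 if cue in tokens else 0 for cue in CUE_WORDS)
    return vector


# Made-up example event; the resulting vector has 4 + 3 + 2 + 4 = 13 entries.
example = Event(
    sentence="We investigate whether IL-17 regulates NF-kB activation",
    mk={"knowledge_type": "Investigation", "certainty": "High", "polarity": "Positive"},
)
features = featurize(example)
```

Vectors built this way, paired with Research Hypothesis or New Knowledge labels, could then be passed to any random forest implementation for training.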

Results

We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method improves upon the previously reported state-of-the-art performance for an existing dimension, i.e., Knowledge Type. Secondly, we demonstrate high F1-scores in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR: 0.802) and New Knowledge (GENIA: 0.829, EU-ADR: 0.836).

Conclusion

We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications.

Precision: the proportion of results returned by the system which are correct.

Recall: the proportion of correct results returned by the system as a fraction of all the correct results that should have been found.

F1-score: the balanced harmonic mean of precision and recall, providing a single overall measure of performance.
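The three measures defined above (precision, recall and their balanced harmonic mean, the F1-score) can be computed from counts of true positives, false positives and false negatives. The following is a minimal sketch; the function name `precision_recall_f1` and the example counts are hypothetical and are not taken from the article's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from raw counts.

    tp: correct results returned by the system
    fp: incorrect results returned by the system
    fn: correct results the system failed to return
    """
    # Precision: share of returned results that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of all correct results that were actually returned.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: balanced harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
```

Because the harmonic mean is pulled towards the lower of its two inputs, a system cannot reach a high F1-score by maximising precision or recall alone.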

Title: Identification of research hypotheses and new knowledge from scientific literature
Authors: Matthew Shardlow, Riza Batista-Navarro, Paul Thompson, Raheel Nawaz, John McNaught, Sophia Ananiadou
Publication date: 01-12-2018
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2018
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-018-0639-1
