Published in: BMC Medical Informatics and Decision Making 1/2017

Open Access 01-12-2017 | Research article

Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Authors: Yongjun Zhu, Erjia Yan, Fei Wang

Abstract

Background

Understanding semantic relatedness and similarity between biomedical terms matters for a variety of applications, such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on how the recency, size, and section of biomedical publication data affect the performance of word2vec.

Methods

We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine recency, size, and section effects.
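The evaluation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the term vectors and reference scores are made-up toy values, the function names are hypothetical, and Spearman rank correlation is an assumption (the abstract does not say which correlation coefficient was used). A pair counts as "identified" here when both terms have a vector in the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(vectors, reference):
    """Return (number of pairs covered by the model, rank correlation)."""
    model_scores, human_scores = [], []
    for (term1, term2), score in reference.items():
        if term1 in vectors and term2 in vectors:
            model_scores.append(cosine(vectors[term1], vectors[term2]))
            human_scores.append(score)
    return len(model_scores), spearman(model_scores, human_scores)

# Toy stand-ins for a trained model and a reference standard (assumed values).
vectors = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "fracture": [0.0, 1.0],
    "bone": [0.2, 0.8],
}
reference = {
    ("aspirin", "ibuprofen"): 3.5,
    ("fracture", "bone"): 3.0,
    ("aspirin", "fracture"): 1.0,
    ("aspirin", "stent"): 1.5,  # "stent" has no vector, so this pair is skipped
}

covered, rho = evaluate(vectors, reference)
print(covered, round(rho, 3))
```

Reporting the number of covered pairs alongside the correlation keeps the two quantities separate: a model can cover more pairs and still correlate less well with the reference scores.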

Results

Models trained on more recent datasets did not perform better. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets, both in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and in the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that correlated more highly with the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the model trained on article bodies identified more pairs of biomedical terms than the one trained on abstracts (498 vs. 344 in the similarity task and 503 vs. 339 in the relatedness task).

Conclusions

Increasing the size of a dataset does not always enhance performance. It can lead to the identification of more relations between biomedical terms, but it does not guarantee better precision. As summaries of research articles, abstracts yield higher accuracy than article bodies but cover fewer identifiable relations.
Metadata
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2017
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-017-0498-1