Published in: BMC Medical Informatics and Decision Making 2/2018

Open Access 01-07-2018 | Research

Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases

Authors: Zhiwei Chen, Zhe He, Xiuwen Liu, Jiang Bian


Abstract

Background

In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations produced by word embedding models largely act as a black box in the downstream applications that use them, which limits their interpretability. Even though word embeddings are able to capture semantic regularities in free-text documents, it is not clear how different kinds of semantic relations are represented by word embeddings or how semantically related terms can be retrieved from them.

Methods

To improve the transparency of word embeddings and the interpretability of the applications that use them, in this study we propose a novel approach for evaluating the semantic relations captured by word embeddings against external knowledge bases: Wikipedia, WordNet, and the Unified Medical Language System (UMLS). We trained multiple word embeddings on health-related Wikipedia articles and then evaluated their performance on an analogy task and a semantic relation term retrieval task. We also assessed whether the evaluation results depend on the domain of the textual corpora by comparing embeddings trained on health-related Wikipedia articles with those trained on general Wikipedia articles.
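The analogy task described above tests whether relations are encoded as vector offsets: for an analogy a : b :: c : d, the vector vec(b) - vec(a) + vec(c) should lie closest to vec(d). A minimal sketch of this evaluation follows; all words and vector values here are invented for illustration, not taken from the study's trained embeddings:

```python
import numpy as np

# Toy 4-dimensional embeddings (hypothetical values). Real models in this
# setting use hundreds of dimensions trained on Wikipedia text.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.0, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8, 0.1]),
    "queen": np.array([0.3, 0.9, 0.9, 0.2]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves (the usual convention)."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))  # → queen
```

An analogy benchmark then scores an embedding by the fraction of such queries for which the top-ranked candidate matches the expected answer.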

Results

Regarding the retrieval of semantic relations, we found that semantically related terms could indeed be retrieved from the trained word embeddings. Meanwhile, the two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings performed much worse on both tasks. We also found that word embeddings trained on health-related Wikipedia articles outperformed those trained on general Wikipedia articles in the health-related relation retrieval tasks.
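The semantic relation retrieval task amounts to nearest-neighbor search in the embedding space: given a query term, rank all other terms by cosine similarity and check whether terms holding a known relation to the query (per WordNet or the UMLS) appear near the top. A sketch under invented vectors (the terms and values below are hypothetical placeholders, not results from the study):

```python
import numpy as np

# Hypothetical low-dimensional vectors for a few terms; in the study,
# neighbors would come from embeddings trained on Wikipedia articles.
emb = {
    "diabetes":      np.array([0.9, 0.8, 0.1]),
    "insulin":       np.array([0.8, 0.9, 0.2]),
    "hyperglycemia": np.array([0.9, 0.7, 0.2]),
    "guitar":        np.array([0.1, 0.0, 0.9]),
}

def neighbors(query, k=2):
    """Return the k terms most similar to the query by cosine similarity."""
    q = emb[query]
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted((w for w in emb if w != query),
                    key=lambda w: cos(emb[w]), reverse=True)
    return ranked[:k]

print(neighbors("diabetes"))  # → ['hyperglycemia', 'insulin']
```

Evaluation against a knowledge base then reduces to checking how many of a term's known related concepts (e.g., UMLS relation pairs) fall within its top-k neighbors.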

Conclusion

It is evident from this study that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have an impact on the semantic relations represented by the word embeddings. We thus recommend using a domain-specific corpus to train word embeddings for domain-specific text mining tasks.
Metadata
Publisher: BioMed Central
DOI: https://doi.org/10.1186/s12911-018-0630-x
