Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 4/2018

Open Access 01-12-2018 | Research

EHR phenotyping via jointly embedding medical concepts and words into a unified vector space

Authors: Tian Bai, Ashis Kumar Chanda, Brian L. Egleston, Slobodan Vucetic

Published in: BMC Medical Informatics and Decision Making | Special Issue 4/2018

Login to get access

Abstract

Background

There has been an increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about the patterns of care and health outcomes. EHRs contain structured data such as diagnostic codes and laboratory tests, as well as unstructured free text data in form of clinical notes, which provide more detail about condition and treatment of patients.

Methods

In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns those relationships by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code.

Results

In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our and baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit.

Conclusions

The jointly learned representations of medical concepts and words capture not only similarity between codes or words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for construction of patient features.
Literature
1.
go back to reference Yan Y, Birman-Deych E, Radford MJ, Nilasena DS, Gage BF. Comorbidity indices to predict mortality from medicare data: results from the national registry of atrial fibrillation. Med Care. 2005; 43:1073–7.CrossRef Yan Y, Birman-Deych E, Radford MJ, Nilasena DS, Gage BF. Comorbidity indices to predict mortality from medicare data: results from the national registry of atrial fibrillation. Med Care. 2005; 43:1073–7.CrossRef
2.
go back to reference Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand S-LT. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation. 2006; 113(13):1693–701.CrossRef Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand S-LT. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation. 2006; 113(13):1693–701.CrossRef
3.
go back to reference Klabunde CN, Potosky AL, Legler JM, Warren JL. Development of a comorbidity index using physician claims data. J Clin Epidemiol. 2000; 53(12):1258–67.CrossRef Klabunde CN, Potosky AL, Legler JM, Warren JL. Development of a comorbidity index using physician claims data. J Clin Epidemiol. 2000; 53(12):1258–67.CrossRef
4.
go back to reference Levitan N, Dowlati A, Remick S, Tahsildar H, Sivinski L, Beyth R, Rimm A. Rates of initial and recurrent thromboembolic disease among patients with malignancy versus those without malignancy. Risk Anal Medicare Claims Data. Med (Baltimore). 1999; 78(5):285–91. Levitan N, Dowlati A, Remick S, Tahsildar H, Sivinski L, Beyth R, Rimm A. Rates of initial and recurrent thromboembolic disease among patients with malignancy versus those without malignancy. Risk Anal Medicare Claims Data. Med (Baltimore). 1999; 78(5):285–91.
5.
go back to reference Taylor Jr DH, Østbye T, Langa KM, Weir D, Plassman BL. The accuracy of medicare claims as an epidemiological tool: the case of dementia revisited. J Alzheimers Dis. 2009; 17(4):807–15.CrossRef Taylor Jr DH, Østbye T, Langa KM, Weir D, Plassman BL. The accuracy of medicare claims as an epidemiological tool: the case of dementia revisited. J Alzheimers Dis. 2009; 17(4):807–15.CrossRef
6.
go back to reference Schneeweiss S, Seeger JD, Maclure M, Wang PS, Avorn J, Glynn RJ. Performance of comorbidity scores to control for confounding in epidemiologic studies using claims data. Am J Epidemiol. 2001; 154(9):854–64.CrossRef Schneeweiss S, Seeger JD, Maclure M, Wang PS, Avorn J, Glynn RJ. Performance of comorbidity scores to control for confounding in epidemiologic studies using claims data. Am J Epidemiol. 2001; 154(9):854–64.CrossRef
7.
go back to reference Nattinger AB, Laud PW, Bajorunaite R, Sparapani RA, Freeman JL. An algorithm for the use of medicare claims data to identify women with incident breast cancer. Health Serv Res. 2004; 39(6p1):1733–50.CrossRef Nattinger AB, Laud PW, Bajorunaite R, Sparapani RA, Freeman JL. An algorithm for the use of medicare claims data to identify women with incident breast cancer. Health Serv Res. 2004; 39(6p1):1733–50.CrossRef
8.
go back to reference Winkelmayer WC, Schneeweiss S, Mogun H, Patrick AR, Avorn J, Solomon DH. Identification of individuals with ckd from medicare claims data: a validation study. Am J Kidney Dis. 2005; 46(2):225–32.CrossRef Winkelmayer WC, Schneeweiss S, Mogun H, Patrick AR, Avorn J, Solomon DH. Identification of individuals with ckd from medicare claims data: a validation study. Am J Kidney Dis. 2005; 46(2):225–32.CrossRef
9.
go back to reference Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the seer-medicare data: content, research applications, and generalizability to the united states elderly population. Med Care. 2002;40:3–18. Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the seer-medicare data: content, research applications, and generalizability to the united states elderly population. Med Care. 2002;40:3–18.
10.
go back to reference Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc. 2016; 23(4):731–40.CrossRef Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc. 2016; 23(4):731–40.CrossRef
11.
go back to reference Wang Y, Patrick J. Mapping clinical notes to medical terminology at point of care. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. Stroudsburg: Association for Computational Linguistics: 2008. p. 102–3. Wang Y, Patrick J. Mapping clinical notes to medical terminology at point of care. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. Stroudsburg: Association for Computational Linguistics: 2008. p. 102–3.
12.
go back to reference Pivovarov R, Perotte AJ, Grave E, Angiolillo J, Wiggins CH, Elhadad N. Learning probabilistic phenotypes from heterogeneous ehr data. J Biomed Inform. 2015; 58:156–65.CrossRef Pivovarov R, Perotte AJ, Grave E, Angiolillo J, Wiggins CH, Elhadad N. Learning probabilistic phenotypes from heterogeneous ehr data. J Biomed Inform. 2015; 58:156–65.CrossRef
13.
go back to reference Joshi S, Gunasekar S, Sontag D, Ghosh J. Identifiable phenotyping using constrained non-negative matrix factorization; 2016, pp. 17–41. arXiv preprint arXiv:1608.00704. Joshi S, Gunasekar S, Sontag D, Ghosh J. Identifiable phenotyping using constrained non-negative matrix factorization; 2016, pp. 17–41. arXiv preprint arXiv:1608.00704.
14.
go back to reference Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems.2013. p. 3111–9. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems.2013. p. 3111–9.
15.
go back to reference Moen H, Ginter F, Marsi E, Peltonen L-M, Salakoski T, Salanterä S. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. In: BMC Medical Informatics and Decision Making, vol. 15. BioMed Central: 2015. p. 2. https://doi.org/10.1186/1472-6947-15-S2-S2. Moen H, Ginter F, Marsi E, Peltonen L-M, Salakoski T, Salanterä S. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. In: BMC Medical Informatics and Decision Making, vol. 15. BioMed Central: 2015. p. 2. https://​doi.​org/​10.​1186/​1472-6947-15-S2-S2.
16.
go back to reference Wu Y, Xu J, Jiang M, Zhang Y, Xu H. A study of neural word embeddings for named entity recognition in clinical text. In: AMIA Annual Symposium Proceedings, vol. 2015. American Medical Informatics Association: 2015. p. 1326. Wu Y, Xu J, Jiang M, Zhang Y, Xu H. A study of neural word embeddings for named entity recognition in clinical text. In: AMIA Annual Symposium Proceedings, vol. 2015. American Medical Informatics Association: 2015. p. 1326.
17.
go back to reference De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a neural language model. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. New York: ACM: 2014. p. 1819–22. De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a neural language model. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. New York: ACM: 2014. p. 1819–22.
18.
go back to reference Amunategui M, Markwell T, Rozenfeld Y. Prediction using note text: Synthetic feature creation with word2vec; 2015. arXiv preprint arXiv:1503.05123. Amunategui M, Markwell T, Rozenfeld Y. Prediction using note text: Synthetic feature creation with word2vec; 2015. arXiv preprint arXiv:1503.05123.
20.
go back to reference Henriksson A. Representing clinical notes for adverse drug event detection. In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis.2015. p. 152–8. Henriksson A. Representing clinical notes for adverse drug event detection. In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis.2015. p. 152–8.
22.
go back to reference Choi Y, Chiu CY-I, Sontag D. Learning low-dimensional representations of medical concepts. AMIA Summits Transl Sci Proc. 2016; 2016:41.PubMed Choi Y, Chiu CY-I, Sontag D. Learning low-dimensional representations of medical concepts. AMIA Summits Transl Sci Proc. 2016; 2016:41.PubMed
23.
go back to reference Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Multi-layer representation learning for medical concepts. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2016. p. 1495–504. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Multi-layer representation learning for medical concepts. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2016. p. 1495–504.
24.
go back to reference Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc. 2016; 24(2):361–70.PubMedCentral Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc. 2016; 24(2):361–70.PubMedCentral
25.
go back to reference Stojanovic J, Gligorijevic D, Radosavljevic V, Djuric N, Grbovic M, Obradovic Z. Modeling healthcare quality via compact representations of electronic health records. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2017; 14(3):545–54.CrossRef Stojanovic J, Gligorijevic D, Radosavljevic V, Djuric N, Grbovic M, Obradovic Z. Modeling healthcare quality via compact representations of electronic health records. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2017; 14(3):545–54.CrossRef
26.
go back to reference Henriksson A, Zhao J, Boström H, Dalianis H. Modeling electronic health records in ensembles of semantic spaces for adverse drug event detection. In: Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference On. IEEE: 2015. p. 343–50. https://doi.org/10.1109/BIBM.2015.7359705. Henriksson A, Zhao J, Boström H, Dalianis H. Modeling electronic health records in ensembles of semantic spaces for adverse drug event detection. In: Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference On. IEEE: 2015. p. 343–50. https://​doi.​org/​10.​1109/​BIBM.​2015.​7359705.
27.
go back to reference Henriksson A, Zhao J, Dalianis H, Boström H. Ensembles of randomized trees using diverse distributed representations of clinical events. BMC Med Inform Decis Mak. 2016; 16(2):69.CrossRef Henriksson A, Zhao J, Dalianis H, Boström H. Ensembles of randomized trees using diverse distributed representations of clinical events. BMC Med Inform Decis Mak. 2016; 16(2):69.CrossRef
28.
go back to reference Ramage D, Hall D, Nallapati R, Manning CD. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Stroudsburg: Association for Computational Linguistics: 2009. p. 248–56. Ramage D, Hall D, Nallapati R, Manning CD. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Stroudsburg: Association for Computational Linguistics: 2009. p. 248–56.
29.
go back to reference Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003; 3(Jan):993–1022. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003; 3(Jan):993–1022.
30.
go back to reference Chan KR, Lou X, Karaletsos T, Crosbie C, Gardos S, Artz D, Ratsch G. An empirical analysis of topic modeling for mining cancer clinical notes. In: Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference On. IEEE: 2013. p. 56–63. https://doi.org/10.1109/ICDMW.2013.91. Chan KR, Lou X, Karaletsos T, Crosbie C, Gardos S, Artz D, Ratsch G. An empirical analysis of topic modeling for mining cancer clinical notes. In: Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference On. IEEE: 2013. p. 56–63. https://​doi.​org/​10.​1109/​ICDMW.​2013.​91.
31.
go back to reference Arnold CW, El-Saden SM, Bui AA, Taira R. Clinical case-based retrieval using latent topic analysis. In: AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association: 2010. p. 26. Arnold CW, El-Saden SM, Bui AA, Taira R. Clinical case-based retrieval using latent topic analysis. In: AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association: 2010. p. 26.
32.
go back to reference Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, Szolovits P. Unfolding physiological state: Mortality modelling in intensive care units. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2014. p. 75–84. Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, Szolovits P. Unfolding physiological state: Mortality modelling in intensive care units. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2014. p. 75–84.
33.
go back to reference Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. Mimic-iii, a freely accessible critical care database. Sci Data. 2016; 3:160035.CrossRef Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. Mimic-iii, a freely accessible critical care database. Sci Data. 2016; 3:160035.CrossRef
34.
go back to reference Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor ai: Predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference: 2016. p. 301–18. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor ai: Predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference: 2016. p. 301–18.
35.
go back to reference Esteban C, Staeck O, Baier S, Yang Y, Tresp V. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In: Healthcare Informatics (ICHI), 2016 IEEE International Conference On. IEEE: 2016. p. 93–101. Esteban C, Staeck O, Baier S, Yang Y, Tresp V. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In: Healthcare Informatics (ICHI), 2016 IEEE International Conference On. IEEE: 2016. p. 93–101.
Metadata
Title
EHR phenotyping via jointly embedding medical concepts and words into a unified vector space
Authors
Tian Bai
Ashis Kumar Chanda
Brian L. Egleston
Slobodan Vucetic
Publication date
01-12-2018
Publisher
BioMed Central
DOI
https://doi.org/10.1186/s12911-018-0672-0

Other articles of this Special Issue 4/2018

BMC Medical Informatics and Decision Making 4/2018 Go to the issue