Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2020

Open Access 01-12-2020 | Research article

Combining structured and unstructured data for predictive models: a deep learning approach

Authors: Dongdong Zhang, Changchang Yin, Jucheng Zeng, Xiaohui Yuan, Ping Zhang

Published in: BMC Medical Informatics and Decision Making | Issue 1/2020

Login to get access

Abstract

Background

The broad adoption of electronic health records (EHRs) provides great opportunities to conduct health care research and solve various clinical problems in medicine. With recent advances and success, methods based on machine learning and deep learning have become increasingly popular in medical informatics. However, while many research studies utilize temporal structured data on predictive modeling, they typically neglect potentially valuable information in unstructured clinical notes. Integrating heterogeneous data types across EHRs through deep learning techniques may help improve the performance of prediction models.

Methods

In this research, we proposed 2 general-purpose multi-modal neural network architectures to enhance patient representation learning by combining sequential unstructured notes with structured data. The proposed fusion models leverage document embeddings for the representation of long clinical note documents and either convolutional neural network or long short-term memory networks to model the sequential clinical notes and temporal signals, and one-hot encoding for static information representation. The concatenated representation is the final patient representation which is used to make predictions.

Results

We evaluate the performance of proposed models on 3 risk prediction tasks (i.e. in-hospital mortality, 30-day hospital readmission, and long length of stay prediction) using derived data from the publicly available Medical Information Mart for Intensive Care III dataset. Our results show that by combining unstructured clinical notes with structured data, the proposed models outperform other models that utilize either unstructured notes or structured data only.

Conclusions

The proposed fusion models learn better patient representation by combining structured and unstructured data. Integrating heterogeneous data types across EHRs helps improve the performance of prediction models and reduce errors.
Appendix
Available only for authorised users
Literature
1.
go back to reference Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of electronic health record systems among US non-federal acute care hospitals: 2008–2015. ONC Data Brief. 2016;35:1–9. Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of electronic health record systems among US non-federal acute care hospitals: 2008–2015. ONC Data Brief. 2016;35:1–9.
2.
go back to reference Bisbal M, Jouve E, Papazian L, de Bourmont S, Perrin G, Eon B, et al. Effectiveness of SAPS III to predict hospital mortality for post-cardiac arrest patients. Resuscitation. 2014;85(7):939–44.CrossRef Bisbal M, Jouve E, Papazian L, de Bourmont S, Perrin G, Eon B, et al. Effectiveness of SAPS III to predict hospital mortality for post-cardiac arrest patients. Resuscitation. 2014;85(7):939–44.CrossRef
3.
go back to reference Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med. 2006;34(5):1297–310.CrossRef Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med. 2006;34(5):1297–310.CrossRef
4.
go back to reference van Walraven C, Dhalla IA, Bell C, Etchells E, Stiell IG, Zarnke K, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ. 2010;182(6):551–7.CrossRef van Walraven C, Dhalla IA, Bell C, Etchells E, Stiell IG, Zarnke K, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ. 2010;182(6):551–7.CrossRef
5.
go back to reference Donzé J, Aujesky D, Williams D, Schnipper JL. Potentially avoidable 30-day hospital readmissions in medical patients: derivation and validation of a prediction model. JAMA Internal Med. 2013;173(8):632–8.CrossRef Donzé J, Aujesky D, Williams D, Schnipper JL. Potentially avoidable 30-day hospital readmissions in medical patients: derivation and validation of a prediction model. JAMA Internal Med. 2013;173(8):632–8.CrossRef
6.
go back to reference Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2015. p. 1721–1730. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2015. p. 1721–1730.
7.
go back to reference Tang F, Xiao C, Wang F, Zhou J. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open. 2018;1(1):87–98.CrossRef Tang F, Xiao C, Wang F, Zhou J. Predictive modeling in urgent care: a comparative study of machine learning approaches. JAMIA Open. 2018;1(1):87–98.CrossRef
8.
go back to reference Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Med. 2018;1(1):18.CrossRef Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Med. 2018;1(1):18.CrossRef
9.
go back to reference Min X, Yu B, Wang F. Predictive modeling of the hospital readmission risk from patients’ claims data using machine learning: a case study on COPD. Sci Rep. 2019;9(1):1–10.CrossRef Min X, Yu B, Wang F. Predictive modeling of the hospital readmission risk from patients’ claims data using machine learning: a case study on COPD. Sci Rep. 2019;9(1):1–10.CrossRef
10.
go back to reference Purushotham S, Meng C, Che Z, Liu Y. Benchmarking deep learning models on large healthcare datasets. J Biomed Inform. 2018;83:112–34.CrossRef Purushotham S, Meng C, Che Z, Liu Y. Benchmarking deep learning models on large healthcare datasets. J Biomed Inform. 2018;83:112–34.CrossRef
11.
go back to reference Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6(1):96.CrossRef Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6(1):96.CrossRef
12.
go back to reference Grnarova P, Schmidt F, Hyland SL, Eickhoff C. Neural document embeddings for intensive care patient mortality prediction. arXiv preprint arXiv:161200467. 2016. Grnarova P, Schmidt F, Hyland SL, Eickhoff C. Neural document embeddings for intensive care patient mortality prediction. arXiv preprint arXiv:​161200467. 2016.
13.
go back to reference Ghassemi M, Naumann T, Joshi R, Rumshisky A. Topic models for mortality modeling in intensive care units. In: ICML machine learning for clinical data analysis workshop; 2012. p. 1–4. Ghassemi M, Naumann T, Joshi R, Rumshisky A. Topic models for mortality modeling in intensive care units. In: ICML machine learning for clinical data analysis workshop; 2012. p. 1–4.
14.
go back to reference Boag W, Doss D, Naumann T, Szolovits P. What’s in a note? Unpacking predictive value in clinical note representations. AMIA Summi Transl Sci Proc. 2018;2018:26. Boag W, Doss D, Naumann T, Szolovits P. What’s in a note? Unpacking predictive value in clinical note representations. AMIA Summi Transl Sci Proc. 2018;2018:26.
15.
go back to reference Liu J, Zhang Z, Razavian N. Deep EHR: chronic disease prediction using medical notes. J Mach Learn Res (JMLR). 2018 Liu J, Zhang Z, Razavian N. Deep EHR: chronic disease prediction using medical notes. J Mach Learn Res (JMLR). 2018
16.
go back to reference Sushil M, Šuster S, Luyckx K, Daelemans W. Patient representation learning and interpretable evaluation using clinical notes. J Biomed Inform. 2018;84:103–13.CrossRef Sushil M, Šuster S, Luyckx K, Daelemans W. Patient representation learning and interpretable evaluation using clinical notes. J Biomed Inform. 2018;84:103–13.CrossRef
17.
go back to reference Jin M, Bahadori MT, Colak A, Bhatia P, Celikkaya B, Bhakta R, et al. Improving hospital mortality prediction with medical named entities and multimodal learning. arXiv preprint arXiv:181112276. 2018 Jin M, Bahadori MT, Colak A, Bhatia P, Celikkaya B, Bhakta R, et al. Improving hospital mortality prediction with medical named entities and multimodal learning. arXiv preprint arXiv:​181112276. 2018
18.
go back to reference LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
19.
go back to reference Wan J, Wang D, Hoi SCH, Wu P, Zhu J, Zhang Y, et al. Deep learning for content-based image retrieval: A comprehensive study. In: Proceedings of the 22nd ACM international conference on multimedia. ACM; 2014. p. 157–166. Wan J, Wang D, Hoi SCH, Wu P, Zhu J, Zhang Y, et al. Deep learning for content-based image retrieval: A comprehensive study. In: Proceedings of the 22nd ACM international conference on multimedia. ACM; 2014. p. 157–166.
20.
go back to reference Deng L, Hinton G, Kingsbury B. New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8599–8603. Deng L, Hinton G, Kingsbury B. New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8599–8603.
21.
go back to reference Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning. ACM; 2008. p. 160–167. Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning. ACM; 2008. p. 160–167.
22.
go back to reference Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.CrossRef Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.CrossRef
23.
go back to reference Hu Z, Melton GB, Arsoniadis EG, Wang Y, Kwaan MR, Simon GJ. Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. J Biomed Inform. 2017;68:112–20.CrossRef Hu Z, Melton GB, Arsoniadis EG, Wang Y, Kwaan MR, Simon GJ. Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. J Biomed Inform. 2017;68:112–20.CrossRef
24.
go back to reference Luo YF, Rumshisky A. Interpretable topic features for post-icu mortality prediction. In: AMIA annual symposium proceedings. vol. 2016. American medical informatics association; 2016. p. 827. Luo YF, Rumshisky A. Interpretable topic features for post-icu mortality prediction. In: AMIA annual symposium proceedings. vol. 2016. American medical informatics association; 2016. p. 827.
25.
go back to reference Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, et al. Risk prediction models for hospital readmission: a systematic review. JAMA. 2011;306(15):1688–98.CrossRef Kansagara D, Englander H, Salanitro A, Kagen D, Theobald C, Freeman M, et al. Risk prediction models for hospital readmission: a systematic review. JAMA. 2011;306(15):1688–98.CrossRef
26.
go back to reference Campbell AJ, Cook JA, Adey G, Cuthbertson BH. Predicting death and readmission after intensive care discharge. Br J Anaesth. 2008;100(5):656–62.CrossRef Campbell AJ, Cook JA, Adey G, Cuthbertson BH. Predicting death and readmission after intensive care discharge. Br J Anaesth. 2008;100(5):656–62.CrossRef
27.
go back to reference Futoma J, Morris J, Lucas J. A comparison of models for predicting early hospital readmissions. J Biomed Inform. 2015;56:229–38.CrossRef Futoma J, Morris J, Lucas J. A comparison of models for predicting early hospital readmissions. J Biomed Inform. 2015;56:229–38.CrossRef
28.
go back to reference Liu V, Kipnis P, Gould MK, Escobar GJ. Length of stay predictions: improvements through the use of automated laboratory and comorbidity variables. Med Care. 2010; p. 739–744. Liu V, Kipnis P, Gould MK, Escobar GJ. Length of stay predictions: improvements through the use of automated laboratory and comorbidity variables. Med Care. 2010; p. 739–744.
29.
go back to reference Hackbarth G, Reischauer R, Miller M. Report to the congress: promoting greater efficiency in medicare. Washington, DC: MedPAC; 2007. Hackbarth G, Reischauer R, Miller M. Report to the congress: promoting greater efficiency in medicare. Washington, DC: MedPAC; 2007.
30.
go back to reference Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning; 2014. p. 1188–1196. Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning; 2014. p. 1188–1196.
31.
go back to reference Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer; 2010. Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer; 2010.
32.
go back to reference Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(Oct):2825–30. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(Oct):2825–30.
33.
go back to reference Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems; 2019. p. 8024–8035. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems; 2019. p. 8024–8035.
Metadata
Title
Combining structured and unstructured data for predictive models: a deep learning approach
Authors
Dongdong Zhang
Changchang Yin
Jucheng Zeng
Xiaohui Yuan
Ping Zhang
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-020-01297-6

Other articles of this Issue 1/2020

BMC Medical Informatics and Decision Making 1/2020 Go to the issue