Published in: BMC Medical Informatics and Decision Making 1/2020

Open Access 01-12-2020 | Research article

Customization scenarios for de-identification of clinical notes

Authors: Tzvika Hartman, Michael D. Howell, Jeff Dean, Shlomo Hoory, Ronit Slyper, Itay Laish, Oren Gilon, Danny Vainstein, Greg Corrado, Katherine Chou, Ming Jack Po, Jutta Williams, Scott Ellis, Gavin Bee, Avinatan Hassidim, Rony Amira, Genady Beryozkin, Idan Szpektor, Yossi Matias

Abstract

Background

Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets.

Objective

We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized.

Methods

We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset.

Results

Fully customized systems remove 97–99% of personally identifying information. Off-the-shelf performance varies by dataset but is mostly above 90%. Fine-tuning on a small labeled dataset or a large unlabeled dataset improves performance over off-the-shelf systems.
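
The abstract reports performance as the share of personally identifying information removed but does not state how that share is computed. As a minimal sketch, assuming it corresponds to character-level recall over annotated identifier spans, the figure could be calculated as below. The function name `phi_recall` and the span representation are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# Assumed representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def phi_recall(gold: List[Span], predicted: List[Span]) -> float:
    """Fraction of gold identifier characters covered by predicted spans.

    Hypothetical helper for illustration only; the paper's actual
    evaluation procedure may differ (for example, token-level scoring).
    """
    gold_chars: Set[int] = {i for start, end in gold for i in range(start, end)}
    pred_chars: Set[int] = {i for start, end in predicted for i in range(start, end)}
    if not gold_chars:
        # Nothing to remove, so nothing was missed.
        return 1.0
    return len(gold_chars & pred_chars) / len(gold_chars)


# Two annotated identifiers (20 characters); the system finds the first
# fully and only half of the second.
gold = [(0, 10), (25, 35)]
predicted = [(0, 10), (25, 30)]
print(f"{phi_recall(gold, predicted):.2f}")  # prints 0.75
```

Under this assumed definition, a result of 97–99% would mean that 1–3% of annotated identifier characters remain in the output text.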

Conclusion

Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.
Metadata
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-020-1026-2
