Published in: BMC Medical Informatics and Decision Making 1/2008

Open Access 01-12-2008 | Research article

Automated de-identification of free-text medical records

Authors: Ishna Neamatullah, Margaret M Douglass, Li-wei H Lehman, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark, Gari D Clifford

Abstract

Background

Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, so large-scale de-identification requires automated methods.

Methods

We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, and X-ray reports. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus.
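The combination of look-up tables, regular expressions, and heuristics described above can be illustrated with a small sketch. It is written in Python rather than Perl and is not the published software: the name list, the patterns, the replacement tags, and the function name `deidentify` are all hypothetical choices made for illustration only.

```python
import re

# Hypothetical, tiny look-up table; a real system would load large dictionaries.
KNOWN_NAMES = {"smith", "jones", "garcia"}

# Illustrative patterns only; these are assumptions, not the published rules.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Heuristic: a capitalized word directly after a title is treated as a name.
TITLED_NAME_RE = re.compile(r"\b(Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)")
WORD_RE = re.compile(r"\b[A-Za-z]+\b")


def deidentify(text: str) -> str:
    """Replace candidate PHI with placeholder tags (illustrative sketch)."""
    # 1. Regular expressions for strongly patterned PHI.
    text = DATE_RE.sub("[**DATE**]", text)
    text = PHONE_RE.sub("[**PHONE**]", text)
    # 2. Context heuristic: keep the title, mask the word that follows it.
    text = TITLED_NAME_RE.sub(lambda m: f"{m.group(1)}. [**NAME**]", text)

    # 3. Dictionary look-up on every remaining alphabetic token.
    def mask_known(match: re.Match) -> str:
        word = match.group(0)
        return "[**NAME**]" if word.lower() in KNOWN_NAMES else word

    return WORD_RE.sub(mask_known, text)


if __name__ == "__main__":
    note = "Pt seen by Dr. Lee on 3/14/2007; call Smith at 555-123-4567."
    print(deidentify(note))
```

In this sketch the title heuristic catches "Lee" even though it is absent from the name table, which shows why context rules are useful alongside dictionaries. The price of such rules is a higher chance of false positives.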

Results

Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, a precision of 0.749, and a fallout of approximately 0.002. On the test corpus, a total of 90 false negatives were found, or 27 per 100,000 words, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus.
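For readers unfamiliar with the three measures reported above, the following sketch gives their standard definitions in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The counts in the usage example are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind the reported figures.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of real PHI instances that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of flagged items that were real PHI: TP / (TP + FP)."""
    return tp / (tp + fp)


def fallout(fp: int, tn: int) -> float:
    """Fraction of non-PHI items wrongly flagged: FP / (FP + TN)."""
    return fp / (fp + tn)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    tp, fp, fn, tn = 90, 30, 10, 870
    print(f"recall    = {recall(tp, fn):.3f}")
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"fallout   = {fallout(fp, tn):.3f}")
```

Because non-PHI words vastly outnumber PHI in clinical text, fallout can be very small even when precision is modest. The two measures share the same numerator but use very different denominators.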

Conclusion

We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software outperforms a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
Metadata
Publisher: BioMed Central
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/1472-6947-8-32