BMC Medical Informatics and Decision Making

Open Access 01-12-2008 | Research article

Automated de-identification of free-text medical records

Authors: Ishna Neamatullah, Margaret M Douglass, Li-wei H Lehman, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark, Gari D Clifford

Published in: BMC Medical Informatics and Decision Making | Issue 1/2008

Abstract

Background

Text-based patient medical records are a vital resource in medical research. To preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale de-identification.

Methods

We describe an automated Perl-based de-identification software package that is usable on most free-text medical records, e.g., nursing notes, discharge summaries and X-ray reports. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus.
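To make the combination of look-up tables, regular expressions, and heuristics concrete, here is a minimal Python sketch of that general style of de-identification. It is not the authors' Perl implementation: the name list, the patterns, the title heuristic, the placeholder tags, and the function name deidentify are all assumptions chosen for illustration, and a real system would rely on far larger dictionaries and many more patterns.

```python
import re

# Hypothetical look-up tables (assumed for illustration). A real system
# would load large lists of known names and of common or medical words.
KNOWN_NAMES = {"smith", "jones", "garcia"}
COMMON_WORDS = {"pt", "seen", "by", "on", "wife", "called", "at", "stable"}

# Illustrative patterns only; real notes contain many more formats.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")
TITLE_RE = re.compile(r"\b(Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)")
WORD_RE = re.compile(r"\b[A-Za-z]+\b")


def deidentify(text):
    """Replace suspected PHI in `text` with placeholder tags."""
    # 1. Regular expressions for structured identifiers.
    text = PHONE_RE.sub("[PHONE]", text)
    text = DATE_RE.sub("[DATE]", text)

    # 2. Simple heuristic: a capitalized word right after a title is
    #    treated as a name even if it is not in the name list.
    text = TITLE_RE.sub(lambda m: m.group(1) + ". [NAME]", text)

    # 3. Lexical look-up: flag words found in the name list unless they
    #    are also ordinary vocabulary.
    def replace_if_name(match):
        word = match.group(0)
        lower = word.lower()
        if lower in KNOWN_NAMES and lower not in COMMON_WORDS:
            return "[NAME]"
        return word

    return WORD_RE.sub(replace_if_name, text)


note = "Pt seen by Dr. Patel on 3/14/2007; wife Garcia called at 617-555-0123."
print(deidentify(note))
```

In this sketch "Patel" is caught only by the title heuristic and "Garcia" only by the look-up table, which illustrates why the abstract describes the three mechanisms as working together.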

Results

Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, a precision of 0.749, and a fallout of approximately 0.002. On the test corpus, 90 false negatives were found (27 per 100,000 words), with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus.
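For readers unfamiliar with the three measures, the sketch below uses their standard information-retrieval definitions: recall is the fraction of true PHI that was removed, precision is the fraction of removed items that were truly PHI, and fallout is the fraction of non-PHI words wrongly removed. The function name evaluation_metrics and the counts in the usage example are hypothetical and are not taken from the paper.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Return (recall, precision, fallout) from word-level counts.

    tp: PHI instances correctly removed
    fp: non-PHI words wrongly removed
    fn: PHI instances missed
    tn: non-PHI words correctly left in place
    """
    recall = tp / (tp + fn)      # sensitivity
    precision = tp / (tp + fp)   # positive predictive value
    fallout = fp / (fp + tn)     # false positive rate
    return recall, precision, fallout


# Hypothetical counts for illustration only (not the study's data).
recall, precision, fallout = evaluation_metrics(tp=90, fp=30, fn=10, tn=9870)
print(f"recall={recall:.3f} precision={precision:.3f} fallout={fallout:.4f}")
```

A low fallout can coexist with a modest precision, as in the reported results, because non-PHI words vastly outnumber PHI instances in clinical text.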

Conclusion

We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/8/32/prepub

Title: Automated de-identification of free-text medical records
Authors: Ishna Neamatullah, Margaret M Douglass, Li-wei H Lehman, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark, Gari D Clifford
Publication date: 01-12-2008
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2008
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/1472-6947-8-32
