Published in: BMC Medical Informatics and Decision Making 1/2012

Open Access 01-12-2012 | Correspondence

Text data extraction for a prospective, research-focused data mart: implementation and validation

Authors: Monique Hinchcliff, Eric Just, Sofia Podlusky, John Varga, Rowland W Chang, Warren A Kibbe

Abstract

Background

Translational research typically requires data abstracted from medical records as well as data collected specifically for research. Unfortunately, much of the data in electronic health records is stored as text that is not amenable to aggregation for analysis. We present Regextractor, a scalable, open source SQL Server Integration Services package for incorporating regular expression parsers into a classic extract, transform, and load workflow. We have used Regextractor to abstract discrete data from textual reports from a number of ‘machine generated’ sources. To validate this package, we created a pulmonary function test data mart and compared the quality of its data with manual chart review.
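
The idea of applying regular expression parsers to machine-generated report text can be illustrated with a minimal Python sketch. This is not the Regextractor package itself, which is a SQL Server Integration Services package; the report layout, the variable names, and the `extract` helper are all assumptions invented for illustration.

```python
import re

# Hypothetical machine-generated report fragment (layout invented for this sketch):
# label, measured value, percent of predicted.
REPORT = """\
FVC (L)        3.10   78
FEV1 (L)       2.45   74
DLCO           15.2   61
"""

# One pattern per variable, anchored to a whole line (re.M makes ^ and $ per-line).
PATTERNS = {
    "fvc": re.compile(
        r"^FVC\s*\(L\)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<pct>\d+)\s*$", re.M),
    "fev1": re.compile(
        r"^FEV1\s*\(L\)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<pct>\d+)\s*$", re.M),
    "dlco": re.compile(
        r"^DLCO\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<pct>\d+)\s*$", re.M),
}


def extract(report):
    """Turn one text report into a flat row of discrete values.

    Variables whose pattern does not match are stored as None, so a
    missing value stays visible instead of being silently dropped.
    """
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        row[name] = float(match.group("value")) if match else None
        row[name + "_pct_pred"] = int(match.group("pct")) if match else None
    return row


row = extract(REPORT)
```

In an extract, transform, and load setting, a function like `extract` would sit in the transform step, with each returned row loaded into a data mart table. Keeping unmatched variables as `None` makes parsing failures easy to audit against the source text.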

Methods

Eleven variables from pulmonary function tests performed closest to the initial clinical evaluation date were studied for 100 randomly selected subjects with scleroderma. One research assistant manually reviewed, abstracted, and entered relevant data into a database. Correlation with data obtained from the automated pulmonary function test data mart within the Northwestern Medical Enterprise Data Warehouse was determined.
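
A comparison of manually abstracted and automatically extracted values could be scored as in the sketch below. The abstract does not say how agreement was computed, so the cell-by-cell percent agreement used here, the `percent_agreement` name, the `tolerance` parameter, and the sample values are all assumptions for illustration.

```python
def percent_agreement(manual, automated, tolerance=0.0):
    """Percentage of (subject, variable) cells on which two sources agree.

    Both arguments map (subject_id, variable_name) to a number or None.
    Two missing values count as agreement; a missing value against a
    present one counts as disagreement.
    """
    total = 0
    matches = 0
    for key, manual_value in manual.items():
        total += 1
        auto_value = automated.get(key)
        if manual_value is None or auto_value is None:
            if manual_value is None and auto_value is None:
                matches += 1
        elif abs(manual_value - auto_value) <= tolerance:
            matches += 1
    return 100.0 * matches / total if total else float("nan")


# Hypothetical values for two subjects and two variables.
manual = {(1, "fvc"): 3.10, (1, "fev1"): 2.45, (2, "fvc"): 2.90, (2, "fev1"): 2.00}
automated = {(1, "fvc"): 3.10, (1, "fev1"): 2.45, (2, "fvc"): 2.90, (2, "fev1"): 2.10}

agreement = percent_agreement(manual, automated)
```

Scoring per cell rather than per subject means a single discrepant variable does not mark the whole subject as a mismatch, and the disagreeing keys can be listed for targeted chart re-review.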

Results

There was near-perfect agreement (99.5%) between results generated by the Regextractor package and those obtained via manual chart abstraction. The pulmonary function test data mart has been used subsequently to monitor disease progression of patients in the Northwestern Scleroderma Registry. In addition to the pulmonary function test example presented in this manuscript, the Regextractor package has been used to create cardiac catheterization and echocardiography data marts. The Regextractor package was released as open source software in October 2009 and had been downloaded 552 times as of 6/1/2012.

Conclusions

Collaboration between clinical researchers and biomedical informatics experts enabled the development and validation of a tool (Regextractor) to parse, abstract, and assemble structured data from text data contained in the electronic health record. Regextractor has been successfully used to create additional data marts in other medical domains and is available to the public.
Metadata
Title
Text data extraction for a prospective, research-focused data mart: implementation and validation
Authors
Monique Hinchcliff
Eric Just
Sofia Podlusky
John Varga
Rowland W Chang
Warren A Kibbe
Publication date
01-12-2012
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2012
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-12-106
