Published in: Journal of Digital Imaging 4/2009

01-08-2009

Development of a Google-Based Search Engine for Data Mining Radiology Reports

Authors: Joseph P. Erinjeri, Daniel Picus, Fred W. Prior, David A. Rubin, Paul Koppel

Published in: Journal of Imaging Informatics in Medicine | Issue 4/2009

Abstract

The aim of this study is to develop a secure, Google-based data-mining tool for radiology reports using free and open source technologies and to explore its use within an academic radiology department. A Health Insurance Portability and Accountability Act (HIPAA)-compliant data repository, search engine and user interface were created to facilitate treatment, operations, and reviews preparatory to research. The Institutional Review Board waived review of the project, and informed consent was not required. Comprising 7.9 GB of disk space, 2.9 million text reports were downloaded from our radiology information system to a fileserver. Extensible markup language (XML) representations of the reports were indexed using Google Desktop Enterprise search engine software. A hypertext markup language (HTML) form allowed users to submit queries to Google Desktop, and Google’s XML response was interpreted by a practical extraction and report language (PERL) script, presenting ranked results in a web browser window. The query, reason for search, results, and documents visited were logged to maintain HIPAA compliance. Indexing averaged approximately 25,000 reports per hour. Keyword search of a common term like “pneumothorax” yielded the first ten most relevant results of 705,550 total results in 1.36 s. Keyword search of a rare term like “hemangioendothelioma” yielded the first ten most relevant results of 167 total results in 0.23 s; retrieval of all 167 results took 0.26 s. Data mining tools for radiology reports will improve the productivity of academic radiologists in clinical, educational, research, and administrative tasks. By leveraging existing knowledge of Google’s interface, radiologists can quickly perform useful searches.
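
The query pipeline described in the abstract (search engine returns XML, a script interprets it, and each search is logged with its reason) can be illustrated with a minimal sketch. This is not the authors' Perl script: it is written in Python, and the XML layout (`results`, `result`, `title`, `url`, the `count` attribute), the function names `parse_results` and `audit_entry`, and the log fields are all assumptions made for illustration, since the abstract does not give the actual Google Desktop response schema or log format.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical response layout; the real Google Desktop XML schema
# is not described in the abstract.
SAMPLE_RESPONSE = """<results count="167">
  <result>
    <title>Report 1</title>
    <url>file:///reports/1.xml</url>
  </result>
  <result>
    <title>Report 2</title>
    <url>file:///reports/2.xml</url>
  </result>
</results>"""


def parse_results(xml_text):
    """Return (total_hits, list of {title, url}) in the order received."""
    root = ET.fromstring(xml_text)
    total = int(root.get("count", "0"))
    hits = [
        {
            "title": r.findtext("title", default=""),
            "url": r.findtext("url", default=""),
        }
        for r in root.findall("result")
    ]
    return total, hits


def audit_entry(user, query, reason, total, hits, when=None):
    """Build one audit record: who searched, for what, why, and what came back."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "user": user,
        "query": query,
        "reason": reason,
        "total_results": total,
        "returned_urls": [h["url"] for h in hits],
    }


total, hits = parse_results(SAMPLE_RESPONSE)
entry = audit_entry(
    "jdoe", "hemangioendothelioma", "review preparatory to research", total, hits
)
```

Requiring a stated reason alongside each query mirrors the logging the abstract describes; recording which documents a user subsequently opens would need a further record per document visit, which this sketch omits.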
Metadata
Publisher
Springer-Verlag
Published in
Journal of Imaging Informatics in Medicine / Issue 4/2009
Print ISSN: 2948-2925
Electronic ISSN: 2948-2933
DOI
https://doi.org/10.1007/s10278-008-9110-7
