Published in: Breast Cancer Research and Treatment 2/2017

01-01-2017 | Preclinical Study

Using machine learning to parse breast pathology reports

Authors: Adam Yala, Regina Barzilay, Laura Salama, Molly Griffin, Grace Sollender, Aditya Bardia, Constance Lehman, Julliette M. Buckley, Suzanne B. Coopey, Fernanda Polubriaginof, Judy E. Garber, Barbara L. Smith, Michele A. Gadd, Michelle C. Specht, Thomas M. Gudewicz, Anthony J. Guidi, Alphonse Taghian, Kevin S. Hughes

Abstract

Purpose

Extracting information from electronic medical records is a time-consuming and expensive process when done manually. Rule-based and machine learning techniques are two approaches to solving this problem. In this study, we trained a machine learning model on pathology reports to extract pertinent tumor characteristics, which enabled us to create a large database of attribute-searchable pathology reports. This database can be used to identify cohorts of patients with characteristics of interest.

Methods

We collected a total of 91,505 breast pathology reports from three Partners hospitals: Massachusetts General Hospital, Brigham and Women’s Hospital, and Newton-Wellesley Hospital, covering the period from 1978 to 2016. We trained our system with annotations from two datasets, consisting of 6,295 and 10,841 manually annotated reports. The system extracts 20 separate categories of information, including atypia types and various tumor characteristics such as receptors. We also report a learning curve analysis to show how many annotated reports the model needs to perform reasonably.
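A learning curve analysis of this kind can be sketched as follows. This is a minimal illustration only: the abstract does not describe the model, so the token-voting classifier, the toy report snippets, and the names `train`, `predict`, and `learning_curve` are all assumptions made here for the example, not the authors' method or data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (report text, label for a single category)
Model = Tuple[Dict[str, Counter], Counter]


def train(examples: List[Example]) -> Model:
    """Count how often each token co-occurs with each label (toy stand-in model)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in examples:
        for token in set(text.lower().split()):
            counts[token][label] += 1
    prior = Counter(label for _, label in examples)
    return counts, prior


def predict(model: Model, text: str) -> str:
    """Sum the label counts of known tokens; fall back to the most common label."""
    counts, prior = model
    votes: Counter = Counter()
    for token in set(text.lower().split()):
        votes.update(counts.get(token, {}))
    if not votes:
        return prior.most_common(1)[0][0]
    return votes.most_common(1)[0][0]


def learning_curve(train_set: List[Example], test_set: List[Example],
                   sizes: List[int]) -> List[Tuple[int, float]]:
    """Train on the first n annotations for each n and score on a fixed test set."""
    curve = []
    for n in sizes:
        model = train(train_set[:n])
        correct = sum(predict(model, text) == label for text, label in test_set)
        curve.append((n, correct / len(test_set)))
    return curve


# Hypothetical snippets, invented for illustration; not taken from the study.
train_set = [
    ("ductal carcinoma in situ identified", "present"),
    ("no evidence of malignancy", "absent"),
    ("benign breast tissue", "absent"),
    ("dcis with necrosis", "present"),
    ("negative for carcinoma", "absent"),
    ("focal ductal carcinoma in situ", "present"),
]
test_set = [
    ("ductal carcinoma in situ with necrosis", "present"),
    ("benign tissue no malignancy", "absent"),
    ("dcis identified", "present"),
    ("negative for malignancy", "absent"),
]

curve = learning_curve(train_set, test_set, sizes=[1, 3, 6])
for n, accuracy in curve:
    print(f"{n} annotated reports -> accuracy {accuracy:.2f}")
```

The key design point is that the test set stays fixed while only the number of training annotations changes, so any change in accuracy can be attributed to the amount of annotation.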

Results

Model accuracy was tested on 500 reports that did not overlap with the training set. The model achieved an accuracy of 90% for correctly parsing all carcinoma and atypia categories for a given patient. The average accuracy for individual categories was 97%. Using this classifier, we created a database of 91,505 parsed pathology reports.
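The two figures above measure different things: the 97% is an average over individual categories, while the 90% requires every carcinoma and atypia category to be correct at once, so it can never exceed the weakest single category. The sketch below illustrates the distinction on invented data. The category names, labels, and function names are assumptions for illustration, and each report is treated as one patient for simplicity.

```python
from typing import Dict, List

Labels = Dict[str, str]  # category name -> extracted value for one report

# Hypothetical categories and values, invented for this example.
CATEGORIES = ["DCIS", "ILC", "ADH"]

gold: List[Labels] = [
    {"DCIS": "present", "ILC": "absent", "ADH": "absent"},
    {"DCIS": "absent", "ILC": "absent", "ADH": "present"},
    {"DCIS": "absent", "ILC": "absent", "ADH": "absent"},
    {"DCIS": "present", "ILC": "present", "ADH": "absent"},
]
predicted: List[Labels] = [
    {"DCIS": "present", "ILC": "absent", "ADH": "absent"},
    {"DCIS": "absent", "ILC": "absent", "ADH": "absent"},   # ADH missed
    {"DCIS": "absent", "ILC": "absent", "ADH": "absent"},
    {"DCIS": "present", "ILC": "absent", "ADH": "absent"},  # ILC missed
]


def per_category_accuracy(gold: List[Labels], predicted: List[Labels],
                          categories: List[str]) -> Dict[str, float]:
    """Fraction of reports where each category, taken alone, is correct."""
    n = len(gold)
    return {
        c: sum(g[c] == p[c] for g, p in zip(gold, predicted)) / n
        for c in categories
    }


def all_categories_accuracy(gold: List[Labels], predicted: List[Labels],
                            categories: List[str]) -> float:
    """Fraction of reports where every listed category is correct at once."""
    hits = sum(
        all(g[c] == p[c] for c in categories)
        for g, p in zip(gold, predicted)
    )
    return hits / len(gold)


by_category = per_category_accuracy(gold, predicted, CATEGORIES)
average = sum(by_category.values()) / len(by_category)
joint = all_categories_accuracy(gold, predicted, CATEGORIES)

print("per category:", by_category)
print("average over categories:", round(average, 3))
print("all categories correct:", joint)
```

In this toy case two single-category errors on different reports leave the per-category average above 0.8 but halve the all-categories figure, which is the same direction of gap as between the reported 97% and 90%.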

Conclusions

Our learning curve analysis shows that the model can achieve reasonable results even when trained on a few annotations. We developed a user-friendly interface to the database that allows physicians to easily identify patients with target characteristics and export the matching cohort. This model has the potential to reduce the effort required for analyzing large amounts of data from medical records, and to minimize the cost and time required to glean scientific insight from these data.
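Identifying a cohort by attribute and exporting it can be pictured with the short sketch below. The abstract does not describe how the database or interface is built, so the record layout, field names, patient identifiers, and the functions `find_cohort` and `export_csv` are all hypothetical and serve only to illustrate attribute-based search over parsed reports.

```python
import csv
import io
from typing import Dict, List

ParsedReport = Dict[str, str]  # one parsed report: field name -> extracted value

# Invented records for illustration; not real patient data.
reports: List[ParsedReport] = [
    {"patient_id": "P001", "DCIS": "present", "ER": "positive"},
    {"patient_id": "P002", "DCIS": "absent", "ER": "unknown"},
    {"patient_id": "P003", "DCIS": "present", "ER": "negative"},
]


def find_cohort(reports: List[ParsedReport], **criteria: str) -> List[ParsedReport]:
    """Return the reports whose fields match every requested attribute value."""
    return [
        r for r in reports
        if all(r.get(field) == value for field, value in criteria.items())
    ]


def export_csv(cohort: List[ParsedReport], fields: List[str]) -> str:
    """Serialize the selected fields of a cohort as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(cohort)
    return buffer.getvalue()


cohort = find_cohort(reports, DCIS="present", ER="positive")
exported = export_csv(cohort, ["patient_id", "DCIS", "ER"])
print(exported)
```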
Metadata
Title
Using machine learning to parse breast pathology reports
Authors
Adam Yala
Regina Barzilay
Laura Salama
Molly Griffin
Grace Sollender
Aditya Bardia
Constance Lehman
Julliette M. Buckley
Suzanne B. Coopey
Fernanda Polubriaginof
Judy E. Garber
Barbara L. Smith
Michele A. Gadd
Michelle C. Specht
Thomas M. Gudewicz
Anthony J. Guidi
Alphonse Taghian
Kevin S. Hughes
Publication date
01-01-2017
Publisher
Springer US
Published in
Breast Cancer Research and Treatment / Issue 2/2017
Print ISSN: 0167-6806
Electronic ISSN: 1573-7217
DOI
https://doi.org/10.1007/s10549-016-4035-1
