Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2023

Open Access 01-12-2023 | Research

Extracting laboratory test information from paper-based reports

Authors: Ming-Wei Ma, Xian-Shu Gao, Ze-Yu Zhang, Shi-Yu Shang, Ling Jin, Pei-Lin Liu, Feng Lv, Wei Ni, Yu-Chen Han, Hui Zong

Published in: BMC Medical Informatics and Decision Making | Issue 1/2023

Login to get access

Abstract

Background

In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is a significant demand for the digitization of information from these paper-based reports. However, the digitization of paper-based laboratory reports into a structured data format can be challenging due to their non-standard layouts, which includes various data types such as text, numeric values, reference ranges, and units. Therefore, it is crucial to develop a highly scalable and lightweight technique that can effectively identify and extract information from laboratory test reports and convert them into a structured data format for downstream tasks.

Methods

We developed an end-to-end Natural Language Processing (NLP)-based pipeline for extracting information from paper-based laboratory test reports. Our pipeline consists of two main modules: an optical character recognition (OCR) module and an information extraction (IE) module. The OCR module is applied to locate and identify text from scanned laboratory test reports using state-of-the-art OCR algorithms. The IE module is then used to extract meaningful information from the OCR results to form digitalized tables of the test reports. The IE module consists of five sub-modules, which are time detection, headline position, line normalization, Named Entity Recognition (NER) with a Conditional Random Fields (CRF)-based method, and step detection for multi-column. Finally, we evaluated the performance of the proposed pipeline on 153 laboratory test reports collected from Peking University First Hospital (PKU1).

Results

In the OCR module, we evaluate the accuracy of text detection and recognition results at three different levels and achieved an averaged accuracy of 0.93. In the IE module, we extracted four laboratory test entities, including test item name, test result, test unit, and reference value range. The overall F1 score is 0.86 on the 153 laboratory test reports collected from PKU1. With a single CPU, the average inference time of each report is only 0.78 s.

Conclusion

In this study, we developed a practical lightweight pipeline to digitalize and extract information from paper-based laboratory test reports in diverse types and with different layouts that can be adopted in real clinical environments with the lowest possible computing resources requirements. The high evaluation performance on the real-world hospital dataset validated the feasibility of the proposed pipeline.
Literature
1.
go back to reference Adler-Milstein J, DesRoches CM, Kralovec P, Foster G, Worzala C, Charles D, et al. Electronic health record adoption in US hospitals: Progress continues, but challenges persist. Health Aff (Millwood). 2015;34:2174–80.CrossRefPubMed Adler-Milstein J, DesRoches CM, Kralovec P, Foster G, Worzala C, Charles D, et al. Electronic health record adoption in US hospitals: Progress continues, but challenges persist. Health Aff (Millwood). 2015;34:2174–80.CrossRefPubMed
2.
go back to reference Liang J, Li Y, Zhang Z, Shen D, Xu J, Zheng X, et al. Adoption of electronic health records (EHRs) in China during the past 10 years: consecutive survey data analysis and comparison of sino-american challenges and experiences. J Med Internet Res. 2021;23:e24813.CrossRefPubMedPubMedCentral Liang J, Li Y, Zhang Z, Shen D, Xu J, Zheng X, et al. Adoption of electronic health records (EHRs) in China during the past 10 years: consecutive survey data analysis and comparison of sino-american challenges and experiences. J Med Internet Res. 2021;23:e24813.CrossRefPubMedPubMedCentral
3.
go back to reference Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23:1007–15.CrossRefPubMedPubMedCentral Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23:1007–15.CrossRefPubMedPubMedCentral
4.
go back to reference Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34–49.CrossRefPubMed Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34–49.CrossRefPubMed
5.
go back to reference Cai T, Zhang L, Yang N, Kumamaru KK, Rybicki FJ, Cai T, et al. EXTraction of EMR numerical data: an efficient and generalizable tool to EXTEND clinical research. BMC Med Inform Decis Mak. 2019;19:226.CrossRefPubMedPubMedCentral Cai T, Zhang L, Yang N, Kumamaru KK, Rybicki FJ, Cai T, et al. EXTraction of EMR numerical data: an efficient and generalizable tool to EXTEND clinical research. BMC Med Inform Decis Mak. 2019;19:226.CrossRefPubMedPubMedCentral
6.
go back to reference Mishra N, Duke J, Karki S, Choi M, Riley M, Ilatovskiy AV, et al. A modified public health automated case event reporting platform for enhancing electronic laboratory reports with clinical data: design and implementation study. J Med Internet Res. 2021;23:e26388.CrossRefPubMedPubMedCentral Mishra N, Duke J, Karki S, Choi M, Riley M, Ilatovskiy AV, et al. A modified public health automated case event reporting platform for enhancing electronic laboratory reports with clinical data: design and implementation study. J Med Internet Res. 2021;23:e26388.CrossRefPubMedPubMedCentral
7.
go back to reference Dikmen ZG, Pinar A, Akbiyik F. Specimen rejection in laboratory medicine: necessary for patient safety? Biochem Med (Zagreb). 2015;25:377–85.CrossRefPubMed Dikmen ZG, Pinar A, Akbiyik F. Specimen rejection in laboratory medicine: necessary for patient safety? Biochem Med (Zagreb). 2015;25:377–85.CrossRefPubMed
8.
go back to reference Pylypchuk Y, Meyerhoefer CD, Encinosa W, Searcy T. The role of electronic health record developers in hospital patient sharing. J Am Med Inform Assoc. 2022;29:435–42.CrossRefPubMed Pylypchuk Y, Meyerhoefer CD, Encinosa W, Searcy T. The role of electronic health record developers in hospital patient sharing. J Am Med Inform Assoc. 2022;29:435–42.CrossRefPubMed
9.
go back to reference Furukawa MF, King J, Patel V, Hsiao C-J, Adler-Milstein J, Jha AK. Despite substantial progress in EHR adoption, health information exchange and patient engagement remain low in office settings. Health Aff (Millwood). 2014;33:1672–9.CrossRefPubMed Furukawa MF, King J, Patel V, Hsiao C-J, Adler-Milstein J, Jha AK. Despite substantial progress in EHR adoption, health information exchange and patient engagement remain low in office settings. Health Aff (Millwood). 2014;33:1672–9.CrossRefPubMed
10.
go back to reference Laique SN, Hayat U, Sarvepalli S, Vaughn B, Ibrahim M, McMichael J, et al. Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports. Gastrointest Endosc. 2021;93:750–7.CrossRefPubMed Laique SN, Hayat U, Sarvepalli S, Vaughn B, Ibrahim M, McMichael J, et al. Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports. Gastrointest Endosc. 2021;93:750–7.CrossRefPubMed
11.
go back to reference Hsu E, Malagaris I, Kuo Y-F, Sultana R, Roberts K. Deep learning-based NLP data pipeline for EHR-scanned document information extraction. JAMIA Open. 2022;5:ooac045.CrossRefPubMedPubMedCentral Hsu E, Malagaris I, Kuo Y-F, Sultana R, Roberts K. Deep learning-based NLP data pipeline for EHR-scanned document information extraction. JAMIA Open. 2022;5:ooac045.CrossRefPubMedPubMedCentral
12.
go back to reference Cassim N, Mapundu M, Olago V, Celik T, George JA, Glencross DK. Using text mining techniques to extract prostate cancer predictive information (Gleason score) from semi-structured narrative laboratory reports in the Gauteng province, South Africa. BMC Med Inform Decis Mak. 2021;21:330.CrossRefPubMedPubMedCentral Cassim N, Mapundu M, Olago V, Celik T, George JA, Glencross DK. Using text mining techniques to extract prostate cancer predictive information (Gleason score) from semi-structured narrative laboratory reports in the Gauteng province, South Africa. BMC Med Inform Decis Mak. 2021;21:330.CrossRefPubMedPubMedCentral
13.
go back to reference Liu P, Guo Y, Wang F, Li G. Chinese named entity recognition: the state of the art. Neurocomputing. 2022;7(473):37–53.CrossRef Liu P, Guo Y, Wang F, Li G. Chinese named entity recognition: the state of the art. Neurocomputing. 2022;7(473):37–53.CrossRef
14.
go back to reference Lafferty J, McCallum A, Pereira FC. Conditional random fields: probabilistic models for segmenting and labeling sequence data. 2001. Lafferty J, McCallum A, Pereira FC. Conditional random fields: probabilistic models for segmenting and labeling sequence data. 2001.
15.
go back to reference Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2016;39(11):2298–304.CrossRefPubMed Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2016;39(11):2298–304.CrossRefPubMed
16.
go back to reference Du Y, Li C, Guo R, Yin X, Liu W, Zhou J, et al. PP-OCR: A practical ultra lightweight OCR system. 2020. Du Y, Li C, Guo R, Yin X, Liu W, Zhou J, et al. PP-OCR: A practical ultra lightweight OCR system. 2020.
17.
go back to reference Liao M, Wan Z, Yao C, Chen K, Bai X. Real-time scene text detection with differentiable binarization. 2019. Liao M, Wan Z, Yao C, Chen K, Bai X. Real-time scene text detection with differentiable binarization. 2019.
18.
go back to reference Xue W, Li Q, Xue Q. Text detection and recognition for images of medical laboratory reports with a deep learning approach. IEEE Access. 2020;8:407–16.CrossRef Xue W, Li Q, Xue Q. Text detection and recognition for images of medical laboratory reports with a deep learning approach. IEEE Access. 2020;8:407–16.CrossRef
19.
go back to reference Batra P, Phalnikar N, Kurmi D, Tembhurne J, Sahare P, Diwan T. OCR-MRD: performance analysis of different optical character recognition engines for medical report digitization. Batra P, Phalnikar N, Kurmi D, Tembhurne J, Sahare P, Diwan T. OCR-MRD: performance analysis of different optical character recognition engines for medical report digitization.
20.
go back to reference Goodrum H, Roberts K, Bernstam EV. Automatic classification of scanned electronic health record documents. Int J Med Informatics. 2020;1(144):104302.CrossRef Goodrum H, Roberts K, Bernstam EV. Automatic classification of scanned electronic health record documents. Int J Med Informatics. 2020;1(144):104302.CrossRef
21.
go back to reference Kumar A, Goodrum H, Kim A, Stender C, Roberts K, Bernstam EV. Closing the loop: automatically identifying abnormal imaging results in scanned documents. J Am Med Inform Assoc. 2022;29(5):831–40.CrossRefPubMedPubMedCentral Kumar A, Goodrum H, Kim A, Stender C, Roberts K, Bernstam EV. Closing the loop: automatically identifying abnormal imaging results in scanned documents. J Am Med Inform Assoc. 2022;29(5):831–40.CrossRefPubMedPubMedCentral
22.
go back to reference Wei Q, Zuo X, Anjum O, Hu Y, Denlinger R, Bernstam EV, Citardi MJ, Xu H. ClinicalLayoutLM: a pre-trained multi-modal model for understanding scanned document in electronic health records. In 2022 IEEE International Conference on Big Data (Big Data) 2022;2821–2827). IEEE. Wei Q, Zuo X, Anjum O, Hu Y, Denlinger R, Bernstam EV, Citardi MJ, Xu H. ClinicalLayoutLM: a pre-trained multi-modal model for understanding scanned document in electronic health records. In 2022 IEEE International Conference on Big Data (Big Data) 2022;2821–2827). IEEE.
23.
go back to reference Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:​1810.​04805. 2018.
24.
go back to reference Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323. 2019. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:​1904.​03323. 2019.
25.
go back to reference Chen L, Varoquaux G, Suchanek FM. A lightweight neural model for biomedical entity linking. Proc AAAI Confer Artif Intell. 2021;35(14):12657–65. Chen L, Varoquaux G, Suchanek FM. A lightweight neural model for biomedical entity linking. Proc AAAI Confer Artif Intell. 2021;35(14):12657–65.
26.
go back to reference Howard A, Sandler M, Chu G, Chen L-C, Chen B, Tan M, et al. Searching for MobileNetV3. 2019. Howard A, Sandler M, Chu G, Chen L-C, Chen B, Tan M, et al. Searching for MobileNetV3. 2019.
27.
go back to reference Rashid SF, Shafait F, Breuel TM. Scanning neural network for text line recognition. In: 2012 10th IAPR International Workshop on document analysis systems. 2012. p. 105–9. Rashid SF, Shafait F, Breuel TM. Scanning neural network for text line recognition. In: 2012 10th IAPR International Workshop on document analysis systems. 2012. p. 105–9.
28.
go back to reference Zheng H, Qin B, Xu M. Chinese Medical Named Entity Recognition using CRF-MT-Adapt and NER-MRC. In: 2021 2nd International Conference on Computing and Data Science (CDS). 2021. p. 362–5. Zheng H, Qin B, Xu M. Chinese Medical Named Entity Recognition using CRF-MT-Adapt and NER-MRC. In: 2021 2nd International Conference on Computing and Data Science (CDS). 2021. p. 362–5.
29.
go back to reference Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw. 1997;23:550–60.CrossRef Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw. 1997;23:550–60.CrossRef
30.
go back to reference Haynos AF, Wang SB, Lipson S, Peterson CB, Mitchell JE, Halmi KA, et al. Machine learning enhances prediction of illness course: a longitudinal study in eating disorders. Psychol Med. 2021;51(8):1392–402.CrossRefPubMed Haynos AF, Wang SB, Lipson S, Peterson CB, Mitchell JE, Halmi KA, et al. Machine learning enhances prediction of illness course: a longitudinal study in eating disorders. Psychol Med. 2021;51(8):1392–402.CrossRefPubMed
31.
go back to reference Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. 2019. Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:​1909.​10351. 2019.
32.
go back to reference Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984. 2020. Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:​2004.​02984. 2020.
33.
go back to reference Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. 2019. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:​1909.​11942. 2019.
Metadata
Title
Extracting laboratory test information from paper-based reports
Authors
Ming-Wei Ma
Xian-Shu Gao
Ze-Yu Zhang
Shi-Yu Shang
Ling Jin
Pei-Lin Liu
Feng Lv
Wei Ni
Yu-Chen Han
Hui Zong
Publication date
01-12-2023
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2023
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-023-02346-6

Other articles of this Issue 1/2023

BMC Medical Informatics and Decision Making 1/2023 Go to the issue