Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2023

Open Access 01-12-2023 | Research

Ontology-driven and weakly supervised rare disease identification from clinical notes

Authors: Hang Dong, Víctor Suárez-Paniagua, Huayu Zhang, Minhong Wang, Arlene Casey, Emma Davidson, Jiaoyan Chen, Beatrice Alex, William Whiteley, Honghan Wu

Published in: BMC Medical Informatics and Decision Making | Issue 1/2023

Login to get access

Abstract

Background

Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to be identified due to few cases available for machine learning and the need for data annotation from domain experts.

Methods

We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations.

Results

The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes).

Conclusion

The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weak supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
Appendix
Available only for authorised users
Footnotes
1
We focus on the identification of diseases instead of the associated phenotypic abnormalities, therefore we chose ORDO instead of Human Phenotype Ontology (HPO) [10]. The overall ontology based and weak supervision framework can potentially be applied to HPO, given it being aligned to ORDO [11]. We leave the phenotypic abnormalities (in HPO) for future studies.
 
3
The most 5 frequent UMLS (version 2020AB) semantic types of the 4,064 linked ORDO concepts: T047 (Disease or Syndrome, 3,245 concepts, 79.8%), T019 (Congenital Abnormality, 465 concepts, 11.4%), T191 (Neoplastic Process, 374 concepts, 9.2%), T049 (Cell or Molecular Dysfunction, 35 concepts, 0.9%), and T046 (Pathologic Function, 19 concepts, 0.5%); note that 160 concepts (3.9%) are associated with two semantic types.
 
9
We also benchmarked the performance on MIMIC-III discharge summaries with recent NER+L tools, MedCAT [26] and Google Healthcare Natural Language API [44]. However, given that the results (especially recall) may favour SemEHR-based methods, we do not formally report the results of the two NER+L tools but make them available at https://​github.​com/​acadTags/​Rare-disease-identification/​blob/​main/​supp-results.
 
11
Our two interpretations of acute liver failure differ most in the factors of drug use, alcohol abuse, virus infection, etc., that could contribute to the rarity of the disease but not specified in the definition from ORDO. We finally considered hepatitis virus or drugs as causes of acute liver failure as a rare disease, but removed cases of alcohol abuse.
 
Literature
1.
go back to reference Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020;28(2):165–73.CrossRefPubMed Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020;28(2):165–73.CrossRefPubMed
4.
go back to reference Richesson RL, Fung KW, Bodenreider O. Coverage of Rare Disease Names in Clinical Coding Systems and Ontologies and Implications for Electronic Health Records-Based Research. In: Proceedings of the 5th International Conference on Biomedical Ontology. Houston: CEUR Workshop Proceedings (CEUR-WS.org); 2014. p. 78–80. Richesson RL, Fung KW, Bodenreider O. Coverage of Rare Disease Names in Clinical Coding Systems and Ontologies and Implications for Electronic Health Records-Based Research. In: Proceedings of the 5th International Conference on Biomedical Ontology. Houston: CEUR Workshop Proceedings (CEUR-WS.org); 2014. p. 78–80.
7.
go back to reference Dong H, Suárez-Paniagua V, Zhang H, Wang M, Whitfield E, Wu H, Rare disease identification from clinical notes with ontologies and weak supervision. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). Online: IEEE; 2021. p. 2294–8. Dong H, Suárez-Paniagua V, Zhang H, Wang M, Whitfield E, Wu H, Rare disease identification from clinical notes with ontologies and weak supervision. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). Online: IEEE; 2021. p. 2294–8.
8.
go back to reference Kahn Jr CE. An Ontology-Based Approach to Estimate the Frequency of Rare Diseases in Narrative-Text Radiology Reports. Stud Health Technol Inf. 2017;245:896–900. MEDINFO 2017: Precision Healthcare through Informatics. Kahn Jr CE. An Ontology-Based Approach to Estimate the Frequency of Rare Diseases in Narrative-Text Radiology Reports. Stud Health Technol Inf. 2017;245:896–900. MEDINFO 2017: Precision Healthcare through Informatics.
11.
go back to reference Maiella S, Olry A, Hanauer M, Lanneau V, Lourghi H, Donadille B, et al. Harmonising phenomics information for a better interoperability in the rare disease field. European Journal of Medical Genetics. 2018;61(11):706–714. Focus on rare disease research projects supported by the E-Rare ERA-Net program. https://doi.org/10.1016/j.ejmg.2018.01.013. Maiella S, Olry A, Hanauer M, Lanneau V, Lourghi H, Donadille B, et al. Harmonising phenomics information for a better interoperability in the rare disease field. European Journal of Medical Genetics. 2018;61(11):706–714. Focus on rare disease research projects supported by the E-Rare ERA-Net program. https://​doi.​org/​10.​1016/​j.​ejmg.​2018.​01.​013.
16.
go back to reference Wu H, Hodgson K, Dyson S, Morley K, Ibrahim Z, Iqbal E, et al. Efficiently Reusing Natural Language Processing Models for Phenotype Identification in Free-text Electronic Medical Records: Methodological Study. JMIR Med Inf. 2019;7(4):e14782:1-14. Wu H, Hodgson K, Dyson S, Morley K, Ibrahim Z, Iqbal E, et al. Efficiently Reusing Natural Language Processing Models for Phenotype Identification in Free-text Electronic Medical Records: Methodological Study. JMIR Med Inf. 2019;7(4):e14782:1-14.
17.
go back to reference Gorinski PJ, Wu H, Grover C, Tobin R, Talbot C, Whalley H, et al. Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. arXiv preprint arXiv:1903.03985. 2019;Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/2019. Gorinski PJ, Wu H, Grover C, Tobin R, Talbot C, Whalley H, et al. Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. arXiv preprint arXiv:​1903.​03985. 2019;Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/2019.
18.
19.
go back to reference Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Florence: Association for Computational Linguistics; 2019. p. 58–65. Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Florence: Association for Computational Linguistics; 2019. p. 58–65.
28.
go back to reference Cusick M, Adekkanattu P, Campion TR, Sholle ET, Myers A, Banerjee S, et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res. 2021;136:95–102.CrossRefPubMedPubMedCentral Cusick M, Adekkanattu P, Campion TR, Sholle ET, Myers A, Banerjee S, et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res. 2021;136:95–102.CrossRefPubMedPubMedCentral
29.
go back to reference Shen Z, Schutte D, Yi Y, Bompelli A, Yu F, Wang Y, et al. Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision. BMC Med Inf Decis Making. 2022;22(1):1–11. Shen Z, Schutte D, Yi Y, Bompelli A, Yu F, Wang Y, et al. Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision. BMC Med Inf Decis Making. 2022;22(1):1–11.
30.
go back to reference Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. Minneapolis, Minnesota: Association for Computational Linguistics. 2019. p. 4171–4186. https://doi.org/10.18653/v1/N19-1423. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. Minneapolis, Minnesota: Association for Computational Linguistics. 2019. p. 4171–4186. https://​doi.​org/​10.​18653/​v1/​N19-1423.
31.
go back to reference Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach: NeurIPS Proceedings; 2017. p. 5998–6008. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach: NeurIPS Proceedings; 2017. p. 5998–6008.
32.
38.
go back to reference Ma X, Wang Z, Ng P, Nallapati R, Xiang B. Universal text representation from bert: An empirical study. arXiv preprint arXiv:1910.07973. 2019. Ma X, Wang Z, Ng P, Nallapati R, Xiang B. Universal text representation from bert: An empirical study. arXiv preprint arXiv:​1910.​07973. 2019.
43.
go back to reference Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.
45.
go back to reference Fries JA, Steinberg E, Khattar S, Fleming SL, Posada J, Callahan A, et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun. 2021;12(1):1–11.CrossRef Fries JA, Steinberg E, Khattar S, Fleming SL, Posada J, Callahan A, et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun. 2021;12(1):1–11.CrossRef
46.
go back to reference Ratner A, Bach SH, Ehrenberg H, Fries J, Wu S, Ré C. Snorkel: Rapid training data creation with weak supervision. VLDB J. 2020;29(2):709–30.CrossRefPubMed Ratner A, Bach SH, Ehrenberg H, Fries J, Wu S, Ré C. Snorkel: Rapid training data creation with weak supervision. VLDB J. 2020;29(2):709–30.CrossRefPubMed
48.
go back to reference Monarch RM. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Shelter Island, NY: Manning Publications Company; 2021. Version 11, MEAP Edition (Manning Early Access Program). Monarch RM. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Shelter Island, NY: Manning Publications Company; 2021. Version 11, MEAP Edition (Manning Early Access Program).
51.
go back to reference Kolyvakis P, Kalousis A, Smith B, Kiritsis D. Biomedical ontology alignment: an approach based on representation learning. J Biomed Semant. 2018;9(1):1–20.CrossRef Kolyvakis P, Kalousis A, Smith B, Kiritsis D. Biomedical ontology alignment: an approach based on representation learning. J Biomed Semant. 2018;9(1):1–20.CrossRef
54.
go back to reference Zhang H, Thygesen JH, Shi T, Gkoutos GV, Hemingway H, Guthrie B, et al. Increased COVID-19 mortality rate in rare disease patients: a retrospective cohort study in participants of the Genomics England 100,000 Genomes project. Orphanet J Rare Dis. 2022;17(1):1–7.CrossRef Zhang H, Thygesen JH, Shi T, Gkoutos GV, Hemingway H, Guthrie B, et al. Increased COVID-19 mortality rate in rare disease patients: a retrospective cohort study in participants of the Genomics England 100,000 Genomes project. Orphanet J Rare Dis. 2022;17(1):1–7.CrossRef
Metadata
Title
Ontology-driven and weakly supervised rare disease identification from clinical notes
Authors
Hang Dong
Víctor Suárez-Paniagua
Huayu Zhang
Minhong Wang
Arlene Casey
Emma Davidson
Jiaoyan Chen
Beatrice Alex
William Whiteley
Honghan Wu
Publication date
01-12-2023
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2023
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-023-02181-9

Other articles of this Issue 1/2023

BMC Medical Informatics and Decision Making 1/2023 Go to the issue