Skip to main content
Top
Published in: BMC Medical Research Methodology 1/2021

Open Access 01-12-2021 | Oseltamivir | Research

Machine learning in medicine: a practical introduction to natural language processing

Authors: Conrad J. Harrison, Chris J. Sidey-Gibbons

Published in: BMC Medical Research Methodology | Issue 1/2021

Login to get access

Abstract

Background

Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text-data, using freely-available software.

Methods

We performed three NLP experiments using publicly-available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.

Results

Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664,0.776] for the SVM.

Conclusions

In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.
Appendix
Available only for authorised users
Literature
14.
go back to reference Sidey-Gibbons J, Sidey-Gibbons C. Machine learning in medicine: a practical introduction. BMC Med Res Methodol. 2019;19(1):64.CrossRef Sidey-Gibbons J, Sidey-Gibbons C. Machine learning in medicine: a practical introduction. BMC Med Res Methodol. 2019;19(1):64.CrossRef
21.
go back to reference Bakliwal A, Arora P, Patil A, Varma V. Towards enhanced opinion classification using NLP techniques. In: Proceedings of the workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011). 2011. p. 101–107. Bakliwal A, Arora P, Patil A, Varma V. Towards enhanced opinion classification using NLP techniques. In: Proceedings of the workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011). 2011. p. 101–107.
28.
go back to reference Jelodar H, Wang Y, Rabbani M, Ayobi SVA. Natural language processing via LDA topic model in recommendation systems. arXiv. 2019. Jelodar H, Wang Y, Rabbani M, Ayobi SVA. Natural language processing via LDA topic model in recommendation systems. arXiv. 2019.
35.
go back to reference Ozgur C, Colliau T, Rogers G, Hughes Z, Myer-Tyson EB. MatLab vs Python vs. R. J Data Sci. 2017;15(3):355–71.CrossRef Ozgur C, Colliau T, Rogers G, Hughes Z, Myer-Tyson EB. MatLab vs Python vs. R. J Data Sci. 2017;15(3):355–71.CrossRef
48.
go back to reference Japkowicz N. The class imbalance problem: significance and strategies. In: Proc 2000 Int Conf Artif Intell. 2000. Japkowicz N. The class imbalance problem: significance and strategies. In: Proc 2000 Int Conf Artif Intell. 2000.
Metadata
Title
Machine learning in medicine: a practical introduction to natural language processing
Authors
Conrad J. Harrison
Chris J. Sidey-Gibbons
Publication date
01-12-2021
Publisher
BioMed Central
Published in
BMC Medical Research Methodology / Issue 1/2021
Electronic ISSN: 1471-2288
DOI
https://doi.org/10.1186/s12874-021-01347-1

Other articles of this Issue 1/2021

BMC Medical Research Methodology 1/2021 Go to the issue