Published in: BMC Medical Informatics and Decision Making 1/2019

Open Access 01-12-2019 | Research article

Latent Dirichlet Allocation in predicting clinical trial terminations

Authors: Simon Geletta, Lendie Follett, Marcia Laugerman

Abstract

Background

This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns in research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research has reported that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Because studies funded by major agencies are carefully planned and rigorously vetted through peer review, we found it somewhat surprising that study terminations are this prevalent. Moreover, our review of the literature suggested that the reasons for study terminations are not well understood. We therefore aimed to address that knowledge gap by identifying the factors that contribute to study failures.

Method

We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative descriptions of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with terminated trials or with completed trials. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 "topics" with corresponding sets of probabilities, which we then used as predictors of study termination in a random forest model. We fit two distinct models: one using only structured data as predictors, and another using both the structured data and the 25 text topics derived from the unstructured data.
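
As an illustration of how LDA turns narrative text into per-document topic probabilities that can serve as model predictors, the following is a minimal sketch and not the authors' implementation. It assumes a collapsed Gibbs sampler (the abstract does not say which inference algorithm was used), and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy token lists are all hypothetical. The study derived 25 topics; the toy example uses 2.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative assumption).

    docs is a list of token lists. Returns one list of topic
    probabilities per document; each list sums to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional weight of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed per-document topic probabilities.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


# Hypothetical tokenized trial descriptions, invented for illustration.
toy_docs = [
    ["enrollment", "slow", "recruitment", "funding"],
    ["recruitment", "funding", "sponsor", "enrollment"],
    ["dose", "placebo", "efficacy", "endpoint"],
    ["placebo", "endpoint", "dose", "efficacy"],
]
theta = lda_gibbs(toy_docs, n_topics=2, n_iter=100)
```

Each row of `theta` could then be appended to a trial's structured characteristics to form the predictor set of a combined model, while a structured-only model would omit those columns.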

Results

In this paper, we demonstrate the interpretive and predictive value of LDA for predicting clinical trial failure. The results also show that the combined modeling approach yields robust predictive probabilities, in terms of both sensitivity and specificity, relative to a model that uses the structured data alone.
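
For readers less familiar with the two metrics named above, the following minimal sketch shows how sensitivity and specificity are computed from predicted and observed outcomes. It assumes that "terminated" is treated as the positive class, which the abstract does not state; the function name and the example labels are hypothetical and do not reproduce the study's results.

```python
def sensitivity_specificity(y_true, y_pred, positive="terminated"):
    """Return (sensitivity, specificity) for paired outcome labels.

    Sensitivity is the share of truly positive cases predicted positive;
    specificity is the share of truly negative cases predicted negative.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented labels for eight hypothetical trials.
observed = ["terminated"] * 3 + ["completed"] * 5
predicted = ["terminated", "terminated", "completed",
             "completed", "completed", "completed", "completed", "terminated"]
sens, spec = sensitivity_specificity(observed, predicted)
```

Comparing these two numbers for a structured-only model and a combined model on the same held-out trials is one way to make the comparison described above concrete.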

Conclusions

Our study demonstrated that topic modeling with LDA significantly increases the utility of unstructured data for predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
Metadata
Title
Latent Dirichlet Allocation in predicting clinical trial terminations
Authors
Simon Geletta
Lendie Follett
Marcia Laugerman
Publication date
01-12-2019
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2019
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-019-0973-y
