
Open Access 01-12-2019 | Antipsychotics | Research

Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools

Authors: Allison Gates, Samantha Guitard, Jennifer Pillay, Sarah A. Elliott, Michele P. Dyson, Amanda S. Newton, Lisa Hartling

Published in: Systematic Reviews | Issue 1/2019


Abstract

Background

We explored the performance of three machine learning tools designed to facilitate title and abstract screening in systematic reviews (SRs) when used to (a) eliminate irrelevant records (automated simulation) and (b) complement the work of a single reviewer (semi-automated simulation). We evaluated user experiences for each tool.

Methods

We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and workload and time savings compared to dual independent screening. To test user experiences, eight research staff tried each tool and completed a survey.
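
The performance metrics reported below can be computed from the screening decisions with a few simple calculations. The following Python sketch is illustrative only: it assumes the proportion missed is taken over the records judged relevant by dual independent screening, that workload savings is measured against a dual-screening baseline, and that time savings uses a fixed per-record screening time. The function names and the 30-second estimate are assumptions for illustration, not the authors' exact formulas.

```python
# Hypothetical sketch of the screening-simulation metrics described above.
# Assumed definitions (not taken verbatim from the article):
#   proportion missed = relevant records the simulation would exclude / all relevant records
#   workload savings  = share of the dual-screening workload no longer done by humans
#   time savings      = records no longer screened * assumed seconds per record

def proportion_missed(relevant_ids, excluded_ids):
    """Share of truly relevant records that the tool's predictions would exclude."""
    relevant = set(relevant_ids)
    missed = relevant & set(excluded_ids)
    return len(missed) / len(relevant) if relevant else 0.0

def workload_savings(n_total, n_screened_by_humans):
    """Share of the baseline workload avoided (baseline: every record screened twice)."""
    baseline = 2 * n_total  # dual independent screening
    return 1 - n_screened_by_humans / baseline

def time_savings_hours(n_records_avoided, seconds_per_record=30):
    """Hours saved, assuming a fixed per-record screening time (30 s is an assumption)."""
    return n_records_avoided * seconds_per_record / 3600

if __name__ == "__main__":
    # Toy numbers, not data from the study
    print(proportion_missed(relevant_ids=[1, 2, 3, 4], excluded_ids=[3, 9]))  # 0.25
    print(workload_savings(n_total=1000, n_screened_by_humans=400))           # 0.8
    print(time_savings_hours(n_records_avoided=1600))                         # ~13.3 hours
```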

Results

Using Abstrackr, DistillerSR, and RobotAnalyst, respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent for the automated simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent for the semi-automated simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the automated simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the semi-automated simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the automated simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the semi-automated simulation. Abstrackr identified 33–90% of records missed by a single reviewer. RobotAnalyst performed less well and DistillerSR provided no relative advantage. User experiences depended on user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).

Conclusions

The workload savings afforded in the automated simulation came with increased risk of missing relevant records. Supplementing a single reviewer’s decisions with relevance predictions (semi-automated simulation) sometimes reduced the proportion missed, but performance varied by tool and SR. Designing tools based on reviewers’ self-identified preferences may improve their compatibility with present workflows.

Systematic review registration

Not applicable.
Metadata
Title
Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools
Authors
Allison Gates
Samantha Guitard
Jennifer Pillay
Sarah A. Elliott
Michele P. Dyson
Amanda S. Newton
Lisa Hartling
Publication date
01-12-2019
Publisher
BioMed Central
Published in
Systematic Reviews / Issue 1/2019
Electronic ISSN: 2046-4053
DOI
https://doi.org/10.1186/s13643-019-1222-2
