ABSTRACT
Background. Search and selection of primary studies in Systematic Literature Reviews (SLRs) is labour-intensive and hard to replicate and update. Aims. We explore a machine learning approach to support semi-automated search and selection in SLRs to address these weaknesses. Method. We 1) train a classifier on an initial set of papers, 2) extend this set of papers by automated search and snowballing, 3) have the researcher validate the top paper selected by the classifier, and 4) update the set of papers and iterate the process until a stopping criterion is met. Results. We demonstrate with a proof-of-concept tool that the proposed automated search and selection approach generates valid search strings and that, for subsets of primary studies, it can reduce the manual work by half. Conclusions. The approach is promising, and the demonstrated advantages include cost savings and replicability. The next steps include further tool development and evaluating the approach on a complete SLR.
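The four-step loop described under Method can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' tool: the toy `CORPUS`, the word-count scorer standing in for the classifier, the OR-of-terms "search string", the `is_relevant` callback standing in for the researcher, and the `patience` stopping criterion (stop after a number of consecutive rejected papers).

```python
from collections import Counter

# Hypothetical toy corpus: paper id -> (title words, ids of papers it cites).
CORPUS = {
    "p1": ("machine learning systematic review screening", ["p2", "p3"]),
    "p2": ("text mining citation screening systematic review", ["p4"]),
    "p3": ("compiler optimization register allocation", []),
    "p4": ("active learning systematic review workload", ["p5"]),
    "p5": ("database index tuning", []),
    "p6": ("snowballing systematic literature review search", ["p1"]),
}


def train(relevant, irrelevant):
    """Stand-in classifier: word weight = count in relevant minus count in irrelevant."""
    weights = Counter()
    for pid in relevant:
        weights.update(CORPUS[pid][0].split())
    for pid in irrelevant:
        weights.subtract(CORPUS[pid][0].split())
    return weights


def score(weights, pid):
    """Sum of the learned word weights over a paper's title."""
    return sum(weights[w] for w in CORPUS[pid][0].split())


def search(weights):
    """Assumed 'search string': OR over all positively weighted terms."""
    terms = {w for w, v in weights.items() if v > 0}
    return {pid for pid, (text, _) in CORPUS.items() if terms & set(text.split())}


def snowball(papers):
    """Backward (cited papers) and forward (citing papers) neighbours."""
    found = set()
    for pid in papers:
        found.update(CORPUS[pid][1])
        found.update(q for q, (_, refs) in CORPUS.items() if pid in refs)
    return found


def semi_automated_selection(seed, is_relevant, patience=2):
    relevant, irrelevant, misses = set(seed), set(), 0
    while misses < patience:
        weights = train(relevant, irrelevant)                      # step 1
        candidates = search(weights) | snowball(relevant)          # step 2
        candidates -= relevant | irrelevant
        if not candidates:
            break
        # Step 3: the researcher validates only the top-ranked candidate.
        top = max(sorted(candidates), key=lambda pid: score(weights, pid))
        if is_relevant(top):                                       # step 4
            relevant.add(top)
            misses = 0
        else:
            irrelevant.add(top)
            misses += 1
    return relevant, irrelevant


if __name__ == "__main__":
    # The lambda plays the researcher; here any title containing "review" is accepted.
    rel, irr = semi_automated_selection({"p1"}, lambda pid: "review" in CORPUS[pid][0])
    print(sorted(rel), sorted(irr))
```

In this sketch the researcher is asked about one paper per iteration, so the number of iterations equals the number of manual validations. A real setting would replace the word-count scorer with a trained classifier and the in-memory corpus with queries against bibliographic databases.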
Index Terms
- A Machine Learning Approach for Semi-Automated Search and Selection in Literature Studies