DOI: 10.1145/3084226.3084243
research-article

A Machine Learning Approach for Semi-Automated Search and Selection in Literature Studies

Published: 15 June 2017

ABSTRACT

Background. Search and selection of primary studies in Systematic Literature Reviews (SLRs) is labour-intensive and hard to replicate and update. Aims. We explore a machine learning approach to support semi-automated search and selection in SLRs to address these weaknesses. Method. We 1) train a classifier on an initial set of papers, 2) extend this set of papers by automated search and snowballing, 3) have the researcher validate the top paper selected by the classifier, and 4) update the set of papers and iterate the process until a stopping criterion is met. Results. We demonstrate with a proof-of-concept tool that the proposed automated search and selection approach generates valid search strings and that, for subsets of primary studies, it can reduce the manual work by half. Conclusions. The approach is promising, and the demonstrated advantages include cost savings and replicability. The next steps include further tool development and evaluating the approach on a complete SLR.
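
The four steps in the Method can be illustrated with a minimal Python sketch. It is not the authors' proof-of-concept tool. The abstract does not name the classifier, the features, or the stopping criterion, so everything concrete here is an assumption made for illustration: `KeywordScorer` is a toy word-weight scorer standing in for the classifier, papers are identified by their titles, `snowball` follows citation pairs in both directions, the `oracle` callback plays the researcher, and the loop stops after `max_misses` consecutive rejected papers. The automated search part of step 2 (search string generation) is left out.

```python
from collections import Counter


def tokenize(title):
    """Lower-case a title and keep purely alphabetic words."""
    return [w for w in title.lower().split() if w.isalpha()]


class KeywordScorer:
    """Toy stand-in for the classifier (assumption, not the paper's model)."""

    def __init__(self):
        self.weights = Counter()

    def fit(self, labelled):
        # labelled maps title -> True (relevant) or False (irrelevant).
        self.weights = Counter()
        for title, relevant in labelled.items():
            for word in set(tokenize(title)):
                self.weights[word] += 1 if relevant else -1

    def score(self, title):
        # Unknown words contribute 0 because Counter returns 0 for them.
        return sum(self.weights[w] for w in set(tokenize(title)))


def snowball(included, citations):
    """Return papers that cite, or are cited by, any included paper."""
    found = set()
    for citing, cited in citations:
        if citing in included:
            found.add(cited)
        if cited in included:
            found.add(citing)
    return found


def semi_automated_selection(seed, citations, oracle, max_misses=3):
    """Iterate train -> extend -> validate -> update until the stop rule."""
    labelled = {title: True for title in seed}
    model = KeywordScorer()
    misses = 0
    while misses < max_misses:  # hypothetical stopping criterion
        model.fit(labelled)  # step 1: train on the current set
        included = {t for t, rel in labelled.items() if rel}
        # Step 2: extend the candidate pool (snowballing only in this sketch).
        candidates = snowball(included, citations) - set(labelled)
        if not candidates:
            break
        # Step 3: the researcher validates the top-scoring candidate.
        top = max(sorted(candidates), key=model.score)
        labelled[top] = oracle(top)
        # Step 4: update the set and the miss counter, then iterate.
        misses = 0 if labelled[top] else misses + 1
    included = {t for t, rel in labelled.items() if rel}
    return included, len(labelled) - len(seed)
```

Calling `semi_automated_selection(seed, citations, oracle)` returns the set of included papers and the number of papers the researcher had to validate. Comparing that count with the size of the full candidate pool gives a rough measure of the manual work saved.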


        Published in

          ACM Other conferences
          EASE '17: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering
          June 2017
          405 pages
          ISBN: 9781450348041
          DOI: 10.1145/3084226

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 71 of 232 submissions, 31%
