1 Introduction

Throughout history, the threat of pandemics has concerned the healthcare community. The potential for major infectious diseases to spread around the world before anyone is aware of them is a controversial issue. The prevalence of Severe Acute Respiratory Syndrome (SARS) and various types of influenza in the past has shown the extent to which a pandemic disease can affect the health systems of countries [1, 2]. Coronavirus disease (COVID-19) is the latest in this series of pandemic diseases to affect the world powerfully. COVID-19, or novel coronavirus (2019-nCoV) disease, is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that began on December 8, 2019, in Wuhan, China [3, 4]. Since the novel coronavirus (nCoV) is a new strain of the coronavirus family that has not been seen before, the world faces serious challenges in controlling this outbreak [5, 6]. During this fierce outbreak, not only have clinical specialists been trying to develop novel treatments and vaccines, but scientists in the field of data science and technology have also been trying to understand the infection and help control it by applying information-based methods [7, 8].

Nowadays, an extensive amount of health data is collected during patient care from numerous different sources as a result of the digital health revolution [9, 10]. Hence, the modern world of medicine is rich in information but poor in knowledge [11, 12]. Therefore, striving to combat this new pandemic and possible future pandemics has become one of the notable concerns of scientists.

In recent decades, some valuable studies have been published regarding pandemics and data mining (DM) techniques [13]. Such studies were conducted with the aim of better understanding, controlling, and managing pandemics using various data mining methods. Given the importance of fighting the COVID-19 pandemic, a survey of the most popular and efficient data mining methods could have a significant impact on selecting the most effective techniques in pandemic studies. Thus, it can help to reveal the unknown character of the new pandemic and of possible future pandemics. Accordingly, the core objective of this review is to collect, summarize, and analyze the existing articles in order to help track and analyze the studies that have been published on pandemics and data mining methods. The specific research questions (RQ) of this review are: (RQ1) to determine how many studies were published over the past years and recent months regarding previous pandemics and the COVID-19 outbreak, (RQ2) to present an overview of published studies and their characteristics, (RQ3) to investigate the published studies with regard to data mining techniques, (RQ4) to identify the sources of data, (RQ5) to determine the most frequently used DM techniques in terms of their frequency and clinical domains, and (RQ6) to identify the main approaches of published studies.

2 Method

The present study was completed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist to ensure the inclusion of relevant studies [14]. Next, the eligible articles were synthesized and classified according to their main characteristics.

2.1 Literature search

A systematic search of the Web of Science, Scopus, and PubMed scientific databases, covering 2010 up to 16 Oct 2020, was completed using “data mining”, “prediction model”, “data mining techniques”, “data mining methods”, “pandemics”, “pandemic”, “COVID-19”, “SARS-CoV-2”, and “coronavirus disease” as keywords. Boolean search strategies were designed based on these keywords in each database.
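As an illustration of how such a Boolean strategy can be assembled, the following minimal sketch builds one query string from the keywords listed above. The grouping into method terms and disease terms, and the function names, are assumptions made for this example; the exact query submitted to each database is not reported here and may have differed.

```python
# Hypothetical grouping of the keywords from Section 2.1; the real
# per-database queries are not given in the text.
METHOD_TERMS = [
    "data mining",
    "prediction model",
    "data mining techniques",
    "data mining methods",
]
DISEASE_TERMS = [
    "pandemics",
    "pandemic",
    "COVID-19",
    "SARS-CoV-2",
    "coronavirus disease",
]


def or_group(terms):
    """Quote each term, join the terms with OR, and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(method_terms, disease_terms):
    """Require at least one method term AND at least one disease term."""
    return f"{or_group(method_terms)} AND {or_group(disease_terms)}"


query = build_query(METHOD_TERMS, DISEASE_TERMS)
print(query)
```

Each database has its own field tags and syntax, so a string like this would normally need to be adapted before use.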

2.2 Inclusion and exclusion criteria for study selection

Articles were included if they met the following criteria:

1) The study focused on pandemic diseases such as COVID-19.

2) The study used data mining techniques or knowledge discovery methods. Due to the variety of methods in this field, the methods considered were selected based on the study conducted by Patel and Patel [15].

3) Studies were limited to those published in the English language.

Articles were excluded if they met any of the following criteria: 1) the title, abstract, or full text of the article did not relate to any pandemic or to COVID-19; 2) the article was a book chapter, letter to the editor, short brief, report, commentary, technical report, review, or meta-analysis; 3) the article was not written in English; 4) the study used image processing methods; 5) the full text was not available. To reduce the bias from unavailable full texts, the full texts of non-open-access articles were obtained by contacting the authors. As a result, the full texts of all articles were retrieved by the researchers.

2.3 Data extraction phase

The search of the scientific databases (Web of Science, Scopus, and PubMed) retrieved 311 articles through the web interfaces of these databases. The inclusion and exclusion criteria were used to screen the papers. In the first phase, all titles and abstracts of retrieved articles were examined to select eligible studies. All of the titles and abstracts were screened by three reviewers (MT, SS, and SR) to find relevant articles. Another reviewer (MG) reviewed a random sample of studies. The quality of the individual papers was assessed using the Joanna Briggs Institute (JBI) checklist, which provides robust checklists for the appraisal and assessment of most types of studies [16, 17]. Since all types of studies were included in our review, we applied this checklist. Decisions on study eligibility and quality were made by two reviewers; any disagreements were resolved by discussion. The flow of article screening based on the PRISMA method [17] is illustrated in Fig. 1.

Fig. 1 The PRISMA diagram for the identification, screening, and eligibility of studies

The third phase involved full-text screening. In this phase, the full texts of relevant studies were screened thoroughly by four reviewers (MT, SS, SR, and MG). When the reviewers disagreed on the eligibility of a study during the full-text review, the final decision was made by RS.

Finally, 50 studies remained as eligible articles. A classification scheme was defined to categorize and analyze the included studies, and extraction forms were designed by the researchers to manage the reviewed articles. The classification comprises general information and specific information. General information includes author names, publication date, and publisher. Specific information includes the main objective, DM techniques, application of the DM method, health discipline, main outcomes, evaluation results, data sources, sample sizes, applied software, and country. Included articles were analyzed to extract their characteristics based on the predefined classification. All of the extracted information was re-examined by all authors to reach an agreement. Another reviewer (RS) then evaluated and validated the results. EndNote X9 was used for reference management, and all qualitative analysis was performed in SPSS v20.

3 Results

The initial searches in the scientific databases yielded 311 citations. First, 13 articles were excluded in the duplicate removal phase. Next, 82 articles were omitted as irrelevant in the full-text screening stage. All of the remaining articles could be included in our review according to the JBI checklist. After the last screening phase, 50 articles were identified as eligible studies.

3.1 Study characteristics

The eligible papers that met our inclusion criteria comprised 47 journal papers and three conference papers. The distribution of studies by year is described in Table 1. As is apparent, the majority of studies were published in 2020. Thus, the frequency of publication of these articles by month in 2020 was also examined. The monthly trend of articles published in 2020 is shown in Fig. 2. The “International Journal of Environmental Research and Public Health” ranked first among the journals, with six articles.

Fig. 2 The distribution of papers by month of publication in 2020

Table 1 Characteristics of papers based on publication years

A summary of the included articles based on predefined categories is given in Table 4 in the Appendix. To visualize the words that appeared most frequently in the reviewed articles, all articles are summarized in the word cloud in Fig. 3.

Fig. 3 Word cloud of the most frequent words in the reviewed articles
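A word cloud is driven by word frequencies. The tool used to produce Fig. 3 is not stated, so the following is only a minimal sketch of the counting step, with a hypothetical stop-word list and made-up sample titles rather than the reviewed articles themselves.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "for", "on", "with", "is"}


def word_frequencies(texts, top_n=5):
    """Count lowercase word tokens across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, and hyphens so "covid-19" stays one token.
        words = re.findall(r"[a-z0-9-]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample titles, not taken from the reviewed studies.
sample_texts = [
    "Prediction of COVID-19 outbreak with data mining",
    "Mining social media data in the COVID-19 pandemic",
]
top_words = word_frequencies(sample_texts, top_n=3)
print(top_words)
```

In a word cloud, each word would then be drawn at a size proportional to its count.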

3.2 Sample size and data sources of articles

Out of 50 studies, only 35 reported their sample size. Because of the variety of applied methods, the samples differed considerably and the range of sample sizes was very wide, from 53 cases to 1,413,297 posts. In total, 35 different data sources were cited in the eligible articles. Social media platforms (n = 10), hospital information sources (n = 7), and World Health Organization data sources (n = 4) were the three most common sources of information.

3.3 The distribution of articles based on the countries

In terms of country, the articles were published in 14 different countries. Global data on the pandemic were also used. The distribution of articles by country is shown on the world map in Fig. 4. China had the highest number of articles among the countries.

Fig. 4 The distribution of papers based on countries

3.4 The distribution of literature by main approaches

All of the articles in this study took a specific approach to fighting pandemic diseases and providing a better understanding of them by applying DM techniques. Based on the survey, we classified all of the articles by their main approaches into 11 categories, which are shown in Table 2. One of the main objectives of the eligible articles is infoveillance. The term infoveillance refers to a type of syndromic surveillance that uses information and online tools in public health domains. Regarding infoveillance, regression was applied to provide new insight into the origins of the outbreak based on the analysis of social media information [18].

Table 2 Frequency of main approaches

As can be seen from Table 2, the largest share of studies (22%) was devoted to disease characteristics. In terms of diseases, the most common use of data mining techniques to fight pandemics was related to the new pandemic, COVID-19 (n = 44). Other diseases such as H1N1 influenza (n = 2), other types of influenza pandemics (n = 2), and SARS (n = 2) were also considered.

3.5 The distribution of data mining techniques in reviewed articles

Since the main objective of this study was to determine to what extent data mining techniques are employed to fight pandemics, the frequency of applied methods was investigated in this section according to the study conducted by Patel and Patel [15]. Table 3 shows an overview of the distribution of applied data mining methods in the reviewed articles. The analysis showed that all of the applied methods could be classified into 14 main categories. The most frequently employed methods in the reviewed articles were natural language processing (NLP) techniques (22%). Logistic regression analysis, used in 20% of studies to determine the association of independent variables with one dichotomous dependent variable [68], ranked second. It should be noted that most studies used more than one data mining technique.

Table 3 Frequency of data mining techniques in reviewed studies
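Because most studies used more than one technique, per-technique percentages computed over the number of studies need not sum to 100%. The following minimal sketch shows one way such a tally could be computed; the study records and technique names are hypothetical and are not the data behind Table 3.

```python
from collections import Counter

# Hypothetical records: each study lists every DM technique it applied.
studies = {
    "study_a": ["NLP", "logistic regression"],
    "study_b": ["NLP"],
    "study_c": ["decision tree", "logistic regression"],
    "study_d": ["clustering"],
}


def technique_shares(records):
    """Return {technique: percentage of studies that used it}."""
    counts = Counter()
    for techniques in records.values():
        # set() ensures a study is counted at most once per technique.
        counts.update(set(techniques))
    total = len(records)
    return {name: 100 * n / total for name, n in counts.items()}


shares = technique_shares(studies)
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct:.0f}%")
```

With these made-up records the shares add up to 150%, which is expected when a single study contributes to several categories.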

Additionally, the distribution of employed DM techniques across the main approaches is illustrated in Fig. 5. The distribution and frequency of employed DM techniques by main approach can provide useful insight for researchers studying pandemics. The numbers in this figure indicate the number of studies per axis.

All of the DM techniques can be categorized as supervised or unsupervised. In a supervised learning method, the algorithm learns from a labeled dataset to provide an answer. In contrast, unsupervised learning techniques extract patterns from unlabeled input data [69]. Thus, all of the applied methods in the reviewed articles were divided into three categories: supervised techniques (90%), unsupervised techniques (4%), and a combination of supervised and unsupervised techniques (6%).
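The difference between the two families can be illustrated with a minimal sketch. The supervised part is a nearest-centroid classifier that learns from labels; the unsupervised part is a one-dimensional k-means with two clusters that never sees the labels. Both algorithms and the temperature values are chosen here only for illustration and are not taken from the reviewed studies.

```python
def fit_centroids(values, labels):
    """Supervised: compute the mean value of each labeled class."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def predict(centroids, value):
    """Return the label whose class mean is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means)."""
    low, high = min(values), max(values)
    if low == high:
        # Degenerate case: all values identical, so there is only one cluster.
        return sorted(values), []
    for _ in range(max(1, iterations)):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return sorted(near_low), sorted(near_high)


# Hypothetical body temperatures in degrees Celsius, for illustration only.
temps = [36.5, 36.8, 37.0, 38.6, 39.1, 39.4]
labels = ["healthy", "healthy", "healthy", "fever", "fever", "fever"]

centroids = fit_centroids(temps, labels)   # uses the labels
prediction = predict(centroids, 38.9)
clusters = two_means(temps)                # ignores the labels
print(prediction, clusters)
```

The classifier can name the group a new value belongs to because it was given labels, whereas the clustering step only reports that two groups exist.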

Fig. 5 Distribution of employed DM techniques regarding main approaches

3.6 The distribution of reviewed articles based on applied software

Special tools and a suitable platform are needed to apply data mining methods. In this section, we examine the frequency of the various tools used in these studies. SPSS had the highest share (22%), followed by R with 10 papers (20%) and Python with nine studies (18%). MATLAB and RapidMiner also accounted for one percent of the studies. Out of 50 studies, 13 (26%) did not specify the tools employed.

3.7 The characteristics of reviewed articles based on the main health domains

Based on the reviewed studies, all eligible articles in this review can be classified into eight categories according to their clinical discipline. The identified clinical and health disciplines, with their distribution and frequency, are described in Fig. 6. The chart shows that infectious disease was the most common discipline, with 18 papers (36%), followed by epidemiology, with 13 studies (26%). This analysis can be highly useful for identifying literature gaps in terms of health domains.

Fig. 6 The frequency of main health disciplines in reviewed articles

4 Discussion

The main objective of this review was to summarize the studies carried out on the application of data-driven DM methods in pandemics. To this end, 50 articles were selected and analyzed from 311 retrieved studies. The findings are discussed in this section. The data sources used in the included studies were very diverse. In terms of country, most studies were conducted in China. This may be explained by the fact that the COVID-19 pandemic, the subject of most of the reviewed studies, began in this country.

Nowadays, social media has become a new source of data [70], and it can generate more information in a short period than other resources. Since these kinds of data are easier to access than other sources of data, the largest group of studies was devoted to applying text mining techniques for infoveillance. The qualitative analysis revealed that researchers preferred to use supervised techniques such as regression to produce predictive models for a better understanding of unknown pandemics. All of these methods have been used efficiently in practice in different fields of medicine [71]. Additionally, classification methods were used more often than anticipated in the studies. By selecting the best method for implementing accurate prediction models, researchers can discover certain biomarkers in unknown diseases, which can allow them to forecast important outcomes [72, 73]. Therefore, developing prediction models can help not only physicians but also health policymakers and societies.

Since the majority of studies were conducted in China, these models may be prone to overfitting. None of the studies recommended applying the developed models in real practice; nevertheless, most authors were optimistic about the development of predictive models. Shamsuddin's opinion regarding the development of forecasting models is in line with our study [74]. Wynants et al. conducted a systematic review of predictive models of COVID-19 and concluded that the proposed models are poorly reported and at high risk of bias [75].

The results showed that controlling the transmission of infectious disease is the main concern in a pandemic [76]. Usually, the nature of a new disease in a pandemic is unknown, and identifying the characteristics of the new disease is one of the most important concerns for scientists. That is why the largest share of studies is devoted to revealing disease characteristics. This can be explained by the fact that scientists need to pay more attention to diagnosis than to other tasks in a pandemic [77]. The next important issue in pandemic diseases is how the disease spreads. Hence, almost 10% of the studies were dedicated to predicting the prevalence of the disease.

The sample sizes of the datasets were very diverse due to the variety of applied methods. The results showed that most of the studies used various data sources with a limited number of data sets. Using large data sets can strengthen the results and improve the accuracy of a model's predictions [78], which in turn can help scientists better fight this new disease. Accordingly, researchers are recommended to use large datasets, including international ones, in their studies to support better diagnostic and therapeutic decisions.

In terms of diseases, most efforts concerned COVID-19, followed by influenza pandemics. This result is expected due to the high prevalence of these two diseases. Retrieving and using the large amounts of data provided by electronic systems as a data source can improve access to data [79]. As a result, conducting data-driven studies has become easier in recent years than ever before. The fact that diseases related to other pandemics did not appear in this search may be because the authors of those articles considered these diseases to be epidemics.

In this study, we encountered some limitations. Nowadays, a vast number of studies regarding COVID-19 are published daily. We investigated the literature up to 16 Oct 2020. Therefore, some studies published by the time this article appears may have been missed, and further research is needed to complete our results. Another limitation of this research is that the electronic search was performed in only three databases, and the remaining databases were skipped when assessing the quality of journal articles, which can be addressed in future research. The present study gives researchers a useful background for future work on the general context of data mining techniques in pandemics and their applications. Further studies could cover data mining applications in a broader context, or could extend the search strategies to larger databases. Analyzing and incorporating papers written in languages other than English with automatic translation tools could be the subject of a future article. At the very least, it would be interesting to compare the number of non-English papers with English ones.

5 Conclusion

This review could help scientists more easily find published research on DM techniques and major pandemics. In this study, we surveyed the data mining techniques utilized in global pandemics; however, most of these techniques were developed in the current context to prevent and predict the COVID-19 epidemic. According to our survey, the foremost objective of DM applications is related to disease characteristics. This review can also help policymakers and decision-makers make better decisions about managing and preventing major pandemics in their countries.