Covid-19 Global Spread Analyzer: An ML-Based Attempt

Corresponding Author: Rana Husni Al Mahmoud, Department of Computer Science, University of Jordan, Amman, Jordan. Email: rana.husni@gmail.com
Abstract: The novel Coronavirus 2019 (COVID-19) has caused a pandemic across more than 200 countries, affecting billions of people. Consequently, it is essential to identify the factors that correlate with the spread of this virus. The detection of coronavirus spread factors opens up new challenges to the research community. Artificial Intelligence (AI) driven methods can be useful to predict the parameters, risks and effects of such an epidemic. Such predictions can be helpful to control and prevent the spread of such diseases. In this study, we introduce two datasets, each of which consists of 25 country-level factors and covers 137 countries, summarizing different domains. COVID-19STC aims to detect the increase of the total cases, whereas COVID-19STD aims to detect the increase of the total deaths. For each dataset, we applied three feature selection algorithms (viz. correlation coefficient, information gain and gain ratio). We also applied feature selection by the Wrapper method using four classifiers, namely, NaiveBayes, SMO, J48 and Random Forest. The GDP, GDP Per Capita, E-Government Index and Smoking Habit factors were found to be the main factors for total cases detection, with an accuracy of 73% using the J48 classifier. The GDP and E-Government Index were found to be the main factors for total deaths detection, with an accuracy of 71% using the J48 classifier.


Introduction
The word epidemic, derived from Greek, means the rapid spread of a disease to a large number of people in a short period of time within a community, population, or region, whereas a pandemic is an epidemic that spreads over multiple countries or continents, such as the H1N1 outbreak in 2009 and Coronavirus Disease 2019 (Hays, 2005). The history of epidemics goes back at least as far as the Middle Ages. The world has suffered from several epidemics (Pyne et al., 2015), which brings the need for a scientific and systematic means to study the distribution, determinant causes and risk factors of health-related states and events in specified populations, also known as Epidemiology (Diekmann and Heesterbeek, 2000). During an epidemic, nowcasting and forecasting are of crucial importance for public health planning and control, domestically and internationally, especially when the number of infections increases exponentially.
COVID-19 has raised serious concerns as its spread has become a global threat. The virus began to spread widely in China at the end of 2019, before spreading rapidly to other parts of the world, despite the large-scale containment efforts of the Chinese government (Fanelli and Piazza, 2020). It was declared a global pandemic by the World Health Organization (WHO) on March 11, 2020.
The prediction of new infections, deaths and recoveries is essential for health policy makers in order to estimate the capacity of a health system to cope with the stress caused by a pandemic (Fanelli and Piazza, 2020). Since December 2019, many scholars have been trying to understand the spread dynamics of COVID-19 and to propose effective prevention and control strategies (Fanelli and Piazza, 2020; Zhang et al., 2020; Jung et al., 2020).
Epidemiological modeling allows for the estimation of epidemiological parameters from data, identification of patterns, assessment of the relative merits of alternative control strategies and prediction of epidemiological or evolutionary dynamics. It helps in gaining insights into infectious diseases as well as in designing control strategies (Bauch et al., 2005). In this study, we introduce two datasets, each of which consists of 25 country-level factors and covers 137 countries, summarizing Geographic, Demographic, Economic, Healthcare System, Transportation, Technological, Social, Cultural, Religious and Political domains. COVID-19STC aims to detect the increase of the total cases, whereas COVID-19STD aims to detect the increase of the total deaths. We then analyzed the two datasets using different machine learning algorithms with various feature selection schemes. Four of these factors were found to be sufficient to build models comparable to those built on all twenty-five potential factors analyzed.
The paper is organized as follows. Related work is presented next, followed by a description of the data sources used and the proposed method. Experiments and the analysis of results are given in Section 4, followed by a discussion of the research results. Conclusions and directions for future work are presented in the last section.

Related works
Epidemics of viruses have been studied with the aid of graphs and random graphs for decades (Kephart and White, 1992). A directed random graph was used in (Khelil et al., 2002) to extend epidemiological models to investigate the spreading of computer viruses. Cellular automata were used to model the spread of diseases in small-world networks in (Verdasca et al., 2005). The small-world network model was found to be better than the classical Susceptible, Infected and Recovered (SIR) model for describing local variability. A 4-state model was used to simulate SARS transmission under a small-world topology in (Anghel et al., 2007) using the Hong Kong SARS data. The model takes into account those who were infected but not yet infectious. The research suggests that outbreaks could be prevented if patients with symptoms were isolated as soon as possible (Costa et al., 2011).
Classical mathematical epidemiology has been found to be successful in informing public health policy makers. Such models focus on rate-based differential equations, where the population is partitioned into subgroups based on various criteria and differential equations describe the disease dynamics across these groups (Marathe and Ramakrishnan, 2013). A mathematical model for the spread of Ebola fever epidemics was built in (Legrand et al., 2007) based on data from the Democratic Republic of Congo (DRC) in 1995 and from Uganda in 2000, concentrating on the rapid institution of control measures. Other researchers analyzed Ebola outbreaks with different strategies, including (Sau, 2017; Pandey and Karthikeyan, 2011; Pigott et al., 2014). Nevertheless, a potential weakness of such an approach is its inability to capture the complexity of human interactions and behaviors (Marathe and Ramakrishnan, 2013).
Recently, advances in machine learning, data mining and data science have made it possible to develop indispensable data-driven solutions. Such methods are used to predict epidemiological characteristics and to control the spatiotemporal transmission of disease throughout the world (Hamer et al., 2020). The use of machine learning and reasoning methods in support of computational epidemiology is a rich area with many significant research challenges (Marathe and Ramakrishnan, 2013).
Machine learning was used in (Forna et al., 2019) to study the epidemiological characteristics of the Ebola virus outbreak in West Africa. The research presented in (Sadilek et al., 2012) explores how individuals contribute to the global spread of disease. Using the Support Vector Machine (SVM) learning algorithm, the authors predicted whether users were sick based on their tweets. Geo-tagged tweets were used to infer user locations and the movement of individuals between cities, and the timelines of target users were used to infer their interactions with others. Machine learning techniques have also been used to evaluate the performance of time series forecasting of casualties in the case of the Ebola outbreak.
Recently, many scholars have sought ways to predict the course of, and recovery from, the COVID-19 epidemic, based either on data analysis or on health models (Peng et al., 2020; Zhao et al., 2020a; 2020b; Chen et al., 2020b; Li et al., 2020b; Wu et al., 2020; Imai et al., 2020; Hilton and Keeling, 2020; Kastner et al., 2020; Jia et al., 2020; Zeng et al., 2020; Buizza, 2020). A method to identify COVID-19 patients using a mobile phone was presented in (Rao and Vazquez, 2020), and a prognostic prediction model based on the XGBoost machine learning algorithm was built in (Yan et al., 2020) to identify high-risk patients early, before they progress from mildly to critically ill. The work in (Ivanov, 2020) shows that epidemic outbreaks represent one specific case of supply chain disruption. This type of supply chain risk is distinctively characterized by long-term disruption and unpredictable scaling, simultaneous propagation of the disruption and of the epidemic outbreak, and simultaneous disruptions in supply, demand and logistics infrastructure. Online/mobile GIS and mapping dashboards and applications for tracking the COVID-19 epidemic and associated events are described in (Boulos and Geraghty, 2020).
The work in (Killeen et al., 2020) presents aggregated out-of-home activity information for various points of interest for each county of the US, and provides tools to read the data, to help researchers investigating how the disease spreads. The metrics they are working on include demographics, ethnicity, housing, education, employment, income, climate, transit scores and healthcare-system-related metrics. To predict the country-specific risk of COVID-19, a shallow Long Short-Term Memory (LSTM) based neural network optimized using Bayesian optimization was presented in (Pal et al., 2020). The observed spread of COVID-19 was found to be correlated with climatological temperatures, latitude, travel, population density and sociological trends, as pointed out in (Poole, 2020). Similar findings were presented in (Sajadi et al., 2020).
An attempt to forecast the number of deaths in China due to COVID-19 is presented in (Gao et al., 2020), based on the official cumulative number of deaths and using the Boltzmann function and the Richards function. A generalized additive model was used in a recent study on death rates in Wuhan. The study suggested that temperature variation and humidity are factors affecting death rates due to COVID-19. Based on a statistical analysis of data from 54 countries, it was suggested in (Chen et al., 2020a) that temperature, wind speed and relative humidity combined could predict the epidemic situation, which could help decision makers with COVID-19 outbreak control.
An attempt to statistically analyze COVID-19 infections based on data obtained from WHO is presented in (Kumar and Hembram, 2020), which found that the infection curves of China and the Republic of Korea had almost saturated. No solid reasoning was provided for these findings, as the aim of the work was to provide a statistical analysis.
The work in (Jia et al., 2020) also attempted to predict the epidemic curve in China. However, they adopted three mathematical models: the Logistic model, the Bertalanffy model and the Gompertz model. They based their work on SARS data and found that the Logistic model outperforms the other two models and that the cumulative number of infections in China would be between 80261 and 85140 (Pandey and Karthikeyan, 2011).
Machine learning modeling was used in (DeCaprio et al., 2020) to identify individuals who are at the greatest risk due to COVID-19, based on data for complications due to other upper respiratory infections in order to compensate for the limited COVID-19 specific information. They used a feature set derived from medical insurance claims. A variant of the Susceptible-Exposed-Infectious-Removed (SEIR) model was used together with a Long Short-Term Memory (LSTM) recurrent neural network to derive the epidemic curve in China based on SARS data. The authors emphasized that the control measures adopted in January 2020 in China were necessary for reducing the spread of COVID-19. The work of (Fong et al., 2020b), in turn, proposed a Polynomial Neural Network with corrective feedback (PNN + cf) to help forecast the number of infections even with a small dataset. The method was found to be useful in generating an acceptable forecast for a novel disease such as COVID-19. The work was further elaborated in (Fong et al., 2020a), where a deep learning-based Composite Monte-Carlo (CMC) method is used in conjunction with fuzzy rule induction techniques and validated on COVID-19 data from the Chinese Center for Disease Control and Prevention (CDCP), considering factors such as infection rates and death rates. The work focuses on fusing deterministic and non-deterministic data series into a Monte-Carlo (MC) simulation for fuzzy decision making, to help with early decision making for a novel disease, as such decisions can be critical in the initial stage of an epidemic, especially when the available data is scarce.
A Modified Auto-Encoder (MAE) method for real-time forecasting of new and cumulative COVID-19 cases, based on WHO data under various intervention strategies in various countries, was presented in (Hu et al., 2020), which concluded that public health intervention was extremely necessary. In that analysis, a delay of one month in Italy increased the maximum number of cases from 29,475 to 1,493,498, and a delay of one month in Germany increased the maximum number of cases from 8,795 to 144,542.
Unlike most existing related works, which were based on data from China or a limited set of countries, our study is based on publicly available data related to most countries, when such data could be retrieved. In this study we attempt to consider a large number of macro-level factors such as GDP, rather than considering micro-level factors such as respiratory system complications or considering a limited number of factors.

Methodology
We start by analyzing potential factors that may be used to model the spread of the disease and group them into categories to simplify their analysis. Factors analyzed in this study were divided into Geographic, Demographic, Economic, Healthcare System, Transportation, Technological, Social, Cultural, Religious and Political categories, as indicated in Table 1. For the purpose of this study, we rely on publicly available data only; processing privacy-protected data can be done in a separate study, given the time needed to obtain such data. Following data extraction, we create a dataset for further analysis. To avoid noise and find the most appropriate set of features that can be used to model the spread of the disease, we apply some well-known feature selection schemes. Figure 1 shows the flow of these stages.

Data Extraction and Dataset Creation
We tried to extract publicly available data related to all factors in Table 1. Unfortunately, for several factors, we could not find such data for most countries. Table 2 lists the potential factors and the number of countries for which we could extract relevant data. All data were extracted on 15/4/2020. Because of missing values in features for some countries, we limit ourselves to the 25 factors that each cover at least 137 countries, such as GDP, population density and air traffic out. The list of these factors is presented in Table 3, followed by the corresponding countries in Table 4.
Following data extraction, typical data normalization was applied to address differences in value ranges across the factors in the dataset.
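The text does not state which normalization was applied. As a minimal sketch, assuming min-max scaling to the range [0, 1] (an assumption, not confirmed by the study), the step could look as follows. The function name and the sample values are hypothetical and serve only as illustration.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the [0, 1] range (assumed scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical values for two factors measured on very different scales.
gdp = [40.0, 400.0, 2000.0, 20000.0]
density = [3.0, 120.0, 35.0, 500.0]

gdp_scaled = min_max_normalize(gdp)
density_scaled = min_max_normalize(density)
```

After scaling, both factors lie in the same range, so a factor with large raw values does not dominate distance-based or margin-based classifiers.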

Selection of Best Subset of Features
Feature selection schemes can be used for removing noisy, irrelevant and redundant features, which results in a smaller subset of relevant features from the original ones (Miao and Niu, 2016), with the aim of obtaining more accurate models. Information gain and wrapper selection are among the well-known feature selection schemes. A survey of feature selection schemes can be found in (Miao and Niu, 2016; Chandrashekar and Sahin, 2014; Molina et al., 2002).
The correlation coefficient indicates the strength and direction of a relationship between two random variables. The most common use refers to a linear relationship. Two variables have a strong dependency when their correlation coefficient is close to 1 or -1. When the value is 0, the two variables are not linearly related at all (Hsu and Hsieh, 2010). Information Gain (IG) is an entropy-based feature evaluation method, widely used in the field of machine learning. IG measures the number of bits of information obtained for category prediction by knowing the presence or absence of a feature in an instance (Lei, 2012). IG evaluates features individually, scores each feature without considering the redundancy between them, and selects a predefined number of the highest-scoring features (Quinlan, 1986).
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values (Han et al., 2011). Gain Ratio (GR) is a modification of the information gain that reduces this bias. Gain ratio takes the number and size of branches into account when choosing an attribute (Priyadarsini et al., 2011). Wrappers are feedback methods that incorporate the ML algorithm into the feature selection process, i.e., they depend on a specific classifier's performance to assess the quality of a set of features. Wrapper methods search through the space of feature subsets and calculate the accuracy of one classifier for each feature that can be added to or removed from the feature subset (Janecek et al., 2008).
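The bias of information gain toward many-valued attributes, and the way gain ratio reduces it, can be illustrated with a small sketch. It assumes discrete feature values (continuous country-level factors would first need to be discretized) and uses hypothetical toy data. It is not the implementation used in the study.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(feature, labels):
    """Information gain divided by the split information of the feature."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a single-valued feature cannot split the data
    return information_gain(feature, labels) / split_info


# Hypothetical toy data: four instances, two classes.
labels = ["low", "low", "high", "high"]
two_valued = ["a", "a", "b", "b"]   # separates the classes with 2 values
unique_id = ["w", "x", "y", "z"]    # separates them only by being unique

ig_two, ig_id = information_gain(two_valued, labels), information_gain(unique_id, labels)
gr_two, gr_id = gain_ratio(two_valued, labels), gain_ratio(unique_id, labels)
```

Both features receive the same information gain, although `unique_id` only memorizes the instances. Gain ratio penalizes it because its split information is larger.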

Evaluation Measures
We compare the performance of all these methods based on a set of standard evaluation measures, described next:
Accuracy: a metric that estimates how well a classifier correctly predicts low, intermediate and high instances for each class. It is calculated as the ratio of correctly classified instances to the total number of instances (Sokolova and Lapalme, 2009).
Area Under Curve (AUC): commonly used as a summary measure of the Receiver Operating Characteristic (ROC) curve, which captures the trade-off between sensitivity and specificity. The higher the area, the better the decision rule (Metz, 1978).
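The two measures described above can be sketched in a few lines. The AUC is computed here for one class against the rest, using its equivalent rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. How the per-class values are averaged in the reported results is not reproduced here, and the labels and scores below are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly classified instances to all instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc_one_vs_rest(y_true, scores, positive):
    """AUC for one class against all others; tied scores count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels: 1 = low, 2 = intermediate, 3 = high.
y_true = [3, 3, 2, 1, 1, 2]
y_pred = [3, 2, 2, 1, 1, 3]
# Hypothetical classifier scores for the "high" class (label 3).
score_high = [0.9, 0.4, 0.3, 0.1, 0.2, 0.6]

acc = accuracy(y_true, y_pred)
auc_high = auc_one_vs_rest(y_true, score_high, positive=3)
```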

Experiments and Evaluation Results
In this section, we present the conducted experiments to test the performance of the proposed classification models and discuss their evaluation results.

Experiments Setup
All experiments were conducted using a personal computer with an Intel(R) Core(TM) i5-5500U CPU @ 2.53 GHz and 4 GB RAM. We experimented with different algorithms, namely: (1) Sequential Minimal Optimization (SMO), (2) Random Forest, (3) J48 and (4) Naive Bayes. In all experiments, all algorithms were implemented using Weka. All classification algorithms were trained and evaluated using 10-fold cross-validation. In 10-fold cross-validation, the available data is randomly divided into 10 disjoint subsets of approximately the same size. Nine subsets are used for building the classifier and the remaining subset is used as the test set to determine the accuracy. This is done ten times so that every subset is used once as the test set. The reported accuracy is the mean of the accuracy values of the ten resulting classifiers.
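The cross-validation procedure described above can be sketched as follows. The experiments themselves were run in Weka; this stand-alone version only illustrates the splitting and averaging logic. The callback `evaluate_fold`, the function names and the fixed seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition sample indices into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate_fold, k=10, seed=0):
    """Mean accuracy over k folds.

    evaluate_fold(train_idx, test_idx) is a user-supplied callback that trains
    a classifier on train_idx and returns its accuracy on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        accuracies.append(evaluate_fold(train_idx, test_idx))
    return sum(accuracies) / k


# With 137 countries, each fold holds 13 or 14 countries.
folds = k_fold_indices(137, k=10)
fold_sizes = sorted(len(f) for f in folds)
```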

Creating Datasets
We illustrate the methods presented in this study using two datasets:
Predicting the spread of COVID-19 with respect to the total cases (COVID-19STC)
Predicting the spread of COVID-19 with respect to the total deaths (COVID-19STD)
Each dataset contains 25 features for 137 countries. In the COVID-19STC dataset, the target class is Total Cases, whereas in the COVID-19STD dataset, the target class is Total Deaths.
The COVID-19STC dataset is sorted in ascending order according to the total cases feature. We assign '1' (low) as a label to the countries in the first third of the sorted dataset, '2' (intermediate) to the countries in the second third, and '3' (high) to the remaining countries in the last third. The same process is applied to the COVID-19STD dataset, which is sorted in ascending order according to the total deaths feature and labelled in the same way.
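The labelling step can be sketched as below. The study does not state how the 137 countries are divided when the count is not a multiple of three, so the integer-division rule used here is an assumption. The country names and totals are hypothetical.

```python
def tercile_labels(totals):
    """Label each country 1 (low), 2 (intermediate) or 3 (high) by rank.

    totals maps a country name to its total cases (or total deaths).
    """
    ranked = sorted(totals, key=totals.get)  # ascending by total
    n = len(ranked)
    # Rank r falls into third number (3 * r) // n, which is 0, 1 or 2.
    return {country: 3 * rank // n + 1 for rank, country in enumerate(ranked)}


# Hypothetical totals for six countries.
totals = {"A": 10, "B": 5000, "C": 120, "D": 90000, "E": 45, "F": 800}
labels = tercile_labels(totals)
```

Because the classes are defined by rank rather than by fixed thresholds, the three classes are balanced by construction.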

Building a Model based on all Factors
Initially, we use all extracted factors presented in Table 3. We tried several machine learning algorithms and report the best results in Table 5 for the COVID-19STC dataset and in Table 6 for the COVID-19STD dataset, where Random Forest outperformed the other algorithms on both datasets. In terms of AUC, NaiveBayes came next for both datasets. These results set a baseline for later comparisons.

Selection of the Best Subset of Features
A reductionist view assumes that the prediction of virus spread relies on the sum of risk features, as is the case with most scoring systems. We believe that such a reductionist approach is limited in its ability to successfully predict the spread of the virus. Most levels of virus spread (e.g., low, intermediate, or high) do not arise from a linear interaction between isolated factors, but from non-linear interactions among a web of determinants (Geographic, Demographic, Economic, etc.). For each dataset, we run three feature selection algorithms: correlation coefficient, information gain and gain ratio. The values from these methods are recorded in Tables 7 and 8. For COVID-19STC, Individuals Using Internet and E-Government Index have the highest correlation coefficient, GDP has the highest information gain and CO2 Emissions has the highest gain ratio. For COVID-19STD, UNDP Index has the highest correlation coefficient, CO2 Emissions has the highest information gain and GDP has the highest gain ratio.
For each method, larger values indicate greater importance of the corresponding features to the prediction. The wrapper model techniques evaluate the features using the learning algorithm that will ultimately be employed. Thus, they "wrap" the selection process around the learning algorithm. We apply the Wrapper method on the original datasets using four classifiers: NaiveBayes, SMO, J48 and Random Forest. A comparison of the features selected by wrapper feature selection is presented in Table 9 for the COVID-19STC dataset and in Table 10 for the COVID-19STD dataset. Each newly obtained dataset contains only the features selected by the corresponding algorithm; for each, we calculate the overall accuracy, F-Measure, RMSE and AUC by 10-fold cross-validation, as presented in Tables 11 and 12 for the COVID-19STC and COVID-19STD datasets respectively.
Let us take a closer look at how well the feature selection methods chose the best subset of features for better prediction. Figure 2 summarizes the essential features that result from the various feature selection methods.
We select features that have a value greater than 0.3 for correlation coefficient, information gain and gain ratio, as shown in Table 7, in addition to the features selected using wrappers (NaiveBayes, SMO, J48 and Random Forest), which are presented in Table 11, with respect to the COVID-19STC dataset. There are four factors, GDP, GDP Per Capita, E-Government Index and Smoking Habit, that are most frequently selected by the selection methods. This gives an indication of the factors that affect Covid19 spread in terms of the total cases.
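The combination of the selection methods described above can be sketched as a simple vote count. Each filter method votes for the features scoring above 0.3, each wrapper votes for the features it selected, and features with enough votes are kept. The minimum of four votes follows the remark in the Discussion that each final factor appeared among the top results of at least 4 methods; treating that as a selection rule is an assumption. All feature names and scores below are made up and are not results from the study.

```python
from collections import Counter


def vote_features(filter_scores, wrapper_selections, threshold=0.3, min_votes=4):
    """Return features chosen by at least min_votes selection methods."""
    votes = Counter()
    for scores in filter_scores.values():
        # A filter method votes for every feature scoring above the threshold.
        votes.update(f for f, s in scores.items() if s > threshold)
    for selected in wrapper_selections.values():
        # A wrapper votes once for every feature in its selected subset.
        votes.update(set(selected))
    return sorted(f for f, v in votes.items() if v >= min_votes)


# Hypothetical scores and selections for three placeholder features.
filter_scores = {
    "correlation": {"F1": 0.50, "F2": 0.20, "F3": 0.40},
    "info_gain": {"F1": 0.35, "F2": 0.10, "F3": 0.31},
    "gain_ratio": {"F1": 0.40, "F2": 0.32, "F3": 0.10},
}
wrapper_selections = {
    "NaiveBayes": ["F1", "F2"],
    "SMO": ["F1"],
    "J48": ["F1", "F3"],
    "RandomForest": ["F3"],
}

selected = vote_features(filter_scores, wrapper_selections)
```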
We have evaluated the performance of our feature selection processes using accuracy, F-Measure, RMSE and AUC. These metrics help us to examine whether the methods can correctly and efficiently recognize the optimized features and show us the effect of feature selection in the classification stage. Table 13 presents the classification results using these four final selected features for the COVID-19STC dataset.
Table 13 shows that J48 gives the highest accuracy (73%), with NaiveBayes and Random Forest in second place (72%).
Figure 3 summarizes the essential features that result from the various feature selection methods for the COVID-19STD dataset. We select features that have a value greater than 0.3 for correlation coefficient, information gain and gain ratio, as shown in Table 8, in addition to the features selected using wrappers (NaiveBayes, SMO, J48 and Random Forest), which are presented in Table 12.
There are two factors (viz. GDP and E-Government Index) that are most frequently selected by the selection methods. This gives an indication of the factors that affect the number of Covid19 deaths. Table 14 presents the classification results using these two final selected features for the COVID-19STD dataset. One can notice that J48 gives the highest accuracy (71%), with Random Forest in second place (68%).

Discussion
The results indicate that four factors gained sufficient weight to be considered strongly correlated with Covid19 spread. The weight of each factor is inferred from how often that factor appears among the top results of the different selection methods used. Each of these four factors appeared among the top results of at least 4 of the methods used.
The four factors are GDP, GDP Per Capita (PPP), E-Government Development Index and Smoking.

Finding of GDP as a Factor of Strong Correlation with Covid19 Spread
Although the result may not sound intuitive at first glance, it makes sense from different perspectives:
GDP is an economic indicator that represents a broad measure of overall domestic production. It functions as a comprehensive scorecard of the country's economic health. Thus, the higher the GDP, the higher the national production activity. Production activities would be expected to require a massive level of interaction at all tiers and throughout the entire value chain, from sourcing raw materials to finished products, involving processing, manufacturing and/or exchanging goods or services. The more human interactions there are, the higher the opportunity for Covid19 to spread.
On another dimension, GDP is directly connected to exports and foreign trade (in industrial economies) and to tourism (in service economies); both tie GDP to the influx of air traffic into and out of the country carrying visitors, labour and tourists, which is a major factor in spreading the disease, as potential carriers from abroad continuously mix with the population and increase the likelihood of an outbreak.
On a different level, GDP symbolizes the economic health of a country, which directly connects to that country's ability to spend sizably on vital sectors like the healthcare system. The better and more prepared the healthcare system, the more Covid19 testing takes place. The higher the testing, the higher the number of positive cases registered.

Finding of E-Government Development Index as a Factor of Strong Correlation with Covid19 Spread
The higher the availability of online government services, the more time is saved that would otherwise have been spent obtaining the services offline, which leaves more leisure time for people to spend on social activities, which in turn increases the overall susceptibility to infection spread through human interactions.
The better the telecommunication infrastructure, the higher the internet penetration in the society, leading to greater and faster exposure to misinformation and fake news, which is associated with (and further intensifies) disease spread.

Finding of Smoking as a Factor of Strong Correlation with Covid19 Spread
This result makes sense despite some unproven claims otherwise. WHO officially announced that "There is currently insufficient information to confirm any link between tobacco or nicotine in the prevention or treatment of COVID-19" and stressed that "there are no peer-reviewed studies that have evaluated the risk of SARS-CoV-2 infection associated with smoking".
However, WHO envisioned that "Tobacco smokers (cigarettes, waterpipes, bidis, cigars, heated tobacco products) may be more vulnerable to contracting COVID-19, as the act of smoking involves contact of fingers (and possibly contaminated cigarettes) with the lips, which increases the possibility of transmission of viruses from hand to mouth. Smoking waterpipes, also known as shisha or hookah, often involves the sharing of mouth pieces and hoses, which could facilitate the transmission of the COVID-19 virus in communal and social settings".

Conclusion
In this study, we introduce two datasets, each of which consists of 25 country-level factors and covers 137 countries, summarizing Geographic, Demographic, Economic, Healthcare System, Transportation, Technological, Social, Cultural, Religious and Political metrics. One of them (COVID-19STC) aims to detect the increase of the total cases, whereas the other (COVID-19STD) aims to detect the increase of the total deaths. We then analyzed the two datasets using different machine learning algorithms with various feature selection schemes. Four of these factors were found to be sufficient to build models comparable to those built on all twenty-five potential factors analyzed.
In the COVID-19STC dataset, the features most frequently selected by the selection methods are GDP, GDP Per Capita, E-Government Index and Smoking Habit. This gives an indication of the factors that affect Covid19 spread in terms of the total cases. In the COVID-19STD dataset, GDP and E-Government Index were most frequently selected by the selection methods, which gives an indication of the factors that affect the number of Covid19 deaths. GDP and GDP Per Capita are economic indicators that provide a good measure of a country's standard of living, as well as of its production and export activities. Production and export activities would be expected to require a massive level of interaction. The more human interactions there are, the higher the opportunity for Covid19 to spread. Smoking Habit increases the possibility of transmission of viruses from hand to mouth. It often involves the sharing of mouth pieces and hoses, which could facilitate the transmission of the COVID-19 virus in communal and social settings. A natural future step would involve analyzing further factors based on either proprietary or privacy-protected data, such as patients' medical data, geographical location(s) and travel habits.