Robust Modelling of Coronavirus Cases in Egypt: Poisson INARCH and Negative Binomial INARCH

E-mail: hanan167@yahoo.com Abstract: This study compares robust Poisson INARCH(P) models (more briefly: RP-INARCH) and robust negative binomial INARCH(p) models (more briefly: RNB-INARCH) to fit the new daily confirmed cases for the first wave of COVID 19 in Egypt. The robust estimation of these models is based on some modifications of the Conditional Maximum Likelihood Estimates (CMLE). The simulation results show that RNB-INARCH is more robust than RP-INARCH, but less efficient if the data contain isolated or patched additive outliers in terms of the bias calculation, whereby the low-bias model is more robust. These results are confirmed by the application study on COVID-19 data. The Akaike Information Criterion (AIC) is also compared for these models .


Introduction
In 2020, Corona virus or "SARS-COV-2" (abbreviation of Corona-2 virus, severe active respiratory syndrome) also known as "COVID 19" spread globally. In 11 March 2020, the World Health Organization declared this virus as a pandemic. By the end of 2020, the virus has spread in at least one case in 218 countries. In Egypt, the primary case was reported on 14/02/2020. At the end of 2020, it confirmed its higher new cases of COVID-19 that occurred every day and it's become essential from the statistical point of view to analyze the behavior of this disease. Biostatistics as an important branch of statistic provides us with models which help to solve most problems of human health and diseases. Elsaied (2021) proposed some examples in the medical field, which used different statistical models: Heinen (2003) used a conditional autoregressive Poisson model (CAP) to fit the time series of monthly polio cases in the United States. Promprou et al. (2006) used ARIMA model to predict dengue haemorrhagic fever cases in southern Thailand. Martinez et al. (2011) used SARIMA model to predict dengue fever cases in Campinas, São Paulo State, Brazil. Ahdika and Lusiyana (2017) compared the INAR (1) Poisson model and Markov prediction model in West Java, Indonesia to predict the number of Dengue Hemorrhagic fever (DH) patients. On the other hand, Elsaied (2021) expanded other models that were applied to COVID-19 data: Giuliani et al. (2020) used GLM model class to model and predict the spatio-temporal spread of Corona virus disease 2019  in Italy. Fahmy et al. (2020) used a generalized and modified version of classical SIR/SEIR" Susceptible-Exposed-Infectious-Recovered" models to analyze the dynamics outbreak of COVID-19 in Egypt, Qatar and Saudi Arabia. Asamoah et al. (2020) extended the generalized SEIR epidemic model to capture the dynamics of COVID-19 in Ghana and Egypt. Lauer et al. (2020) assumed totally different distribution: Log normal, gamma, Weibull and Erlang to estimate the period of time of COVID-19 from publicly reportable confirmed cases. Babaei et al. (2021a) investigated a stochastic SEIAQHR model for transmission of Corona-virus disease. Their model was established due to several safety protocols, for instance social-distancing, mask and quarantine. Also, Babaei et al. (2021b) introduced a stochastic model to describe the spread of coronavirus with considering several disease compartments related to different age groups. Khan et al. (2021), proposed a stochastic model to discuss the dynamics of novel corona virus disease. They formulated the model to study the long run behavior in varying population environment. Additionally, Elsaied (2021) recommended to use the RP-INARCH(p) models to fit the number of new daily confirmed cases of "COVID 19" in Egypt, wherein those models proved its reasonability than the non robust ones to model such data. These models are considered to be the simplest subclasses of the INGARCH (p, q) models. The INGARCH (p, q) were developed early by Ferland et al. (2006) andFokianos et al. (2009) among others. This study suggests to compare between RP-INARCH(p) and RNB-INARCH(p) models to fit "COVID 19" count data. In (2014), Elsaied and Fried analyzed the robust estimation of the Poisson-INARCH model for counting time series data. Also, negative binomial INARCH(P) (NB-INARCH) models and their comparison with the Poisson-INGARCH(P) (P-INARCH) models were proposed for time series data. For non-robust data: Davis and Wu (2009) studied the negative binomial model of time series counting and provided a performance estimator, which is an efficiency estimator, very close to the ML estimator. Zhu (2010) compared between Yule-Walker, the conditional least squares estimator and the maximum likelihood estimator (MLE) in terms of the mean absolute deviation error (MADE) under the shape parameter of NB-INGARCH model is known. Jamaludin et al. (2020) used P-INGARCH and NB-INGARCH models to investigate the behavior of asthma disease in Johor Bahru. In keeping with their results NB-INGARCH with identity and log link function is adequate in representing the asthma data than P-INGARCH model. As far as we know that there are few researches applied for robust time series data: Xiong and Zhu (2019) conferred the robust quasi-likelihood estimation for the NB-INGARCH (1,1) model within the presence of transient shifts and additive outliers with an application to transaction counts data. Recently, Elsaied and Fried (2021) discussed robust M-estimation for NB-INARCH(p) models. In their study, they mixed the Mestimation approaches developed by Elsaied and Fried (2014) for the restricting Poisson INARCH model with Aeberhard et al. (2014) for the negative binomial regression model. The remainder of this study, through its sections, will compare between RP-INARCH (1) and RNB-INARCH (1) to fit the new daily confirmed cases for the first wave of COVID 19 in Egypt. Section 2 provides the theoretical basis for modelling and estimating these models. Section 3 uses some built-in functions in the R program to calculate the parameter estimates for our model. The simulation comparison between our functions is given in section 4. The use of COVID-19 count data is described in section 5. Section 6 contains some conclusions based on the analysis results in sections 4 and 5.

Modeling and Estimation
Modeling Let {Yt :t ℕ} denotes a count time series and Yt| Ft−1 with Ft−1 stands for the history of the process generated by Yt−1 up to time t-1. When the process allows equaldispersion, we assume the observations at each time point conditionally on the past to follow a Poisson (Yt| Ft−1 ∼Pois(µt)). But when the process allows overdispersion, we assume the observations at each time point conditionally on the past to follow the negative binomial Yt| Ft−1 ∼NB(r,pt). Here, r ℕ is a constant number of successes and pt is a probability of success that varies with time. Aeberhard et al. (2014) represented the density function for the linear NBmodel in terms of the inverse probability µt = (1 − pt)/pt with κ = r −1 ≥ 0 is used as a measure of the overdispersion parameter, as shown below: The conditional Poisson model can be obtained from (1) when κ converges to zero.
The INARCH(p) models of orders p can be defined in terms of the condition mean for the both models as follows: (2) Here, t ≥ 1, α0 >0 is an intercept and αi, i = 1,...,p, are positive regression parameters. A stationary process that satisfies (2) with mean exists if 1. The conditional mean and the conditional variance for the NB-INARCH(p) are given by: E (Yt|Ft−1) = µt and Var (Yt| Ft−1) = . For the P-INARCH(p), the conditional mean is equal to the conditional variance if κ  0.
For κ, we use the moment estimation defined in (Breslow, 1984;Lawless, 1987;Christou and Fokianos, 2014) in the following: (4) here p + 1 is that the dimension of θ. Following to Elsaied and Fried (2021), an Mestimation approach estimation to estimate θ given in (3) has the following form: with related conditional Pearson-residuals and σt 2 as given above. dt,i: i = 0,...,p denotes the bias correction terms. The bias corrections can be calculated in the same way as in Elsaied and Fried (2014) by: To estimate κ, we use the M-estimator of Aeberhard et al. (2014) as follows: with: For computation, we can fix κ and drive a new estimate of κ iteratively as follows: Initialize σ from Poisson and then retrieve κ in an iterative process until the numerical convergence is obtained. Here, we use the ψ function for Tukey, which can be abbreviated as: with IA is the indicator function of a real set A. The tuning constant C determines the robustness and the efficiency of the estimators. Choosing a higher value of C can improve efficiency, but will reduce the robustness to outliers.
The asymptotic properties of the consistency and normal convergence of the M-estimators for the RP-INARCH and RNB-INARCH models are discussed in Elsaied and Fried (2014) and Elsaied and Fried (2021), respectively. Elsaied (2021) proposed two functions implemented in the R program to compute the estimates of the P-

53
INARCH (1) parameters in a robust and non-robust way. We recall the following notation for using those functions: "CML-Pois" to compute the conditional maximum likelihood estimator and "Tukeycor-Pois" to compute Mestimation estimator with bias correction term. Similarly, we provide two new functions to calculate the parameter estimates of the NB-INARCH model (1) given in (3) and (5), by using "CML-NB" to denote the first function and"Tukeycor-NB" to call the second one. We follow the steps below to calculate the "CML-NB" algorithm in the R-program: 1. Define a function in data(y), the parameter vector θ = (α0,α1) to calculate the score equation St,θ(z) with κ = 0 defined in (3) to get the Poisson likelihood estimator of θ 2. Define the function "CML-NB" as a function on θ and data (y) in order to solve the estimation score equation St,θ(z) defined in (3) to obtain the NB quasi-likelihood estimator of θ. These estimates are obtained using the "constrOptim" R function with initial values obtained under Poisson case resulted from Step 1 3. In order to simplify the calculation of κ, assume that κ = 0.3 is a known value, as given by Xiong and Zhu (2019) While the steps to calculate "Tukeycor-NB" algorithm in R program are as follows: 1. Define a function in data (y), the parameter vector θ = (α0,α1) and in the tuning constant (C). Then solve the score equation St,θ(z) with κ = 0 given in (5) to obtain Mestimation for Poisson quasi-likelihood estimator of θ 2. Define the function "Tukeycor-NB" as a function of y and C. This function solves the estimation equation St,θ(z) defined in (5) to calculate the NB M-estimation of θ with bias correction term . As above again, we use the "constrOptim" function with initialization resulted from step 1 3. Here, also κ is assumed to be known as described above In order to determine which of the two models is the best, we suggest to use Akaike Information Criterion (AIC). The general AIC formula for a given modal using the CMLE of the parameter vector θ is as follows: (6) with LML is the log likelihood of the fitted model and N is the number of parameters in the model. Different modified robust versions of (6) are given in Ronchetti (1997) and Tharmaratnam and Claeskens (2013) among others. In order to simplify the calculation of AIC, we call the "robmixglm" function in the package robmixglm in Rprogram. The "robmixglm" is a function in: [Formula(yt ∼yt−1 with t ≥ 2),family = c("Poisson","nbinom",....)]. It returns AIC as a "robmixglm" object by using the following command: AIC (robmixglm(yt ∼yt−1)). We note that this package offers the possibility to calculate AIC criteria which are not available under "glmrob" function in the R package robustbase, for more details see Beath (2017).

For Clean Data
We start by comparing the efficiency for data without outliers and generate 500 time series with the length n = 1000 from P-INARCH (1) and NB-INARCH (1) models as a function in the tuning constant C. The efficiencies for "Tukeycor-Pois" and "Tukeycor-NB" are measured relatively for "CML-Pois" and "CML-NB", respectively. As shown in Fig. 1 (both top), the "Tukeycor-Pois" estimators achieve about 85% level of efficiency for α0 and about 90% level of efficiency for α1. While the "Tukeycor-NB estimator improves this level to reach about 95% with tuning constant C >9. Figure 1 (both below) shows the biases also as function of the tuning constant. There are slight differences in biases for all estimators with tuning constant C >7. These differences become lager when C <7 especially for α0.
According to Fig. 1 and since the suitable selection of the tuning constants depends in principle on the parameters of the model, we suggest to use C  {7,9,11} where for: In the next two subsections, we will compare the estimated values in cases of presence isolated or patch additive outliers and increase or fix their sizes using the previously suggested values of the tuning constant C to control the efficiency level.

Isolated Additive Outliers
In this section, we will compare the biases between our estimators to increase 4, 8,..., 40 isolated outliers of size 5 or 10 at random selected positions. We use 100 time series, each with a sample size of 500. For size 5 in Fig. 2, the loss of robustness of the "CML-Pois" or "CML-NB" is expressed as an increase in positive (negative) bias for α0 (α1), especially for the larger number of outliers. For α0, "Tukeycor-NB", with C = 7 has the smallest bias. Whereas for α1, "Tukeycor-Pois" with C = 7 has the smallest bias. Both followed by the biases with C = 9. But CMLEs give the largest biases for both is " " 9," " is=" " 11," " is<" " Tukeycor pois efficiency Tukeycor NB efficiency C Tukeycor pois efficiency Tukeycor NB efficiency Tukeycor pois efficiency Tukeycor NB efficiency α0 and α1. For size 10 in Fig. 3, all estimators show similar biases behavior however the CMLEs give the largest biases. Here, the differences in biases between "CML-Pois", "CMLNB" in one side and "Tukeycor-Pois", "Tukeycor-NB" on the other hand, it becomes larger than in case of size 5 for both for both α0 and α1.

Patch Additive Outliers
Here, we consider a scenario that has a consecutive patch outlier of the same size. This type may occur due to temporary interruptions or temporary level shifts. Figure  4 compares our estimated biases of 20 patch outliers of fixed size 5 at positions 101: 120. We use 100 time series for each with sample size = 500. All estimators show small biases differences. "Tukeycor-NB" with C = 7 and C = 9 give smaller biases than with C = 11 for both α0 and α1. Figure 5 compares the biases in case of 20 patch of additive outliers. The size increases from 1 to 20. For α0, all estimators show small biases differences, except for "CML-Pois" with a large number of outliers >10. But for α1, "Tukeycor-NB" with C = 7 and C = 9 give smaller biases than those for "Tukeycor-Pois". Followed by the biases with C=11. Whereas "CML-Pois" and "CML-NB" have the largest biases, especially for number of outliers >6.
Altogether, we conclude that "Tukeycor-Pois" and "Tukeycor-NB" with C = 7 and C = 9 give better results than those with C=11. Also, they are better than the CMLEs under the both scenarios of outliers: Isolated or patch additive. Bias for alpha_0, n = 500 Bias for alpha_1, n = 500

Using COVID-19 Data
This section compares the performance of our best function resulted above, which are: Tukeycor-Pois and Tukeycor-NB with tuning constants 7 and 9 through fitting RP-INARCH(1) and RNB-INARCH(1) for the variable of the number of new "COVID-19" confirmed cases in Egypt. Figure 6 shows 265 observations of each day new confirmed cases of "COVID 19" from March 7, 2020 to November 26, 2020. This period corresponds to the time until the end of the first wave of COVID 19 in Egypt as it is declared. The source of this data and its description are given in Elsaied (2021). Here, we follow the steps below to reach which one of the two models is the best choice: 1. Fit RP-INARCH (1) and RNB-INARCH (1) to the real confirmed COVID-19 cases using the functions "Tukeycor-Pois" and "Tukeycor-NB" with tuning constants 7 and 9 2. Use the modified version of the "interv-multiple" function in the R-package "tscount" to detect and remove the outlier. This function returns the data after each step of the correction process using the same setting parameter as given in Elsaied (2021) 3. Repeat Step 1 again, but after removing the outliers according to Step 2 4. Compare the values of AIC criteria for our functions (the less AIC is the better) depending on the function "robmixglm" as defined before.
Table 1 (right) shows the results of the final estimates by RP-INARCH(1) and RNB-INARCH(1), which depend on the data obtained from the last step of the correction procedure. Here is step 24, Fig. 6. By using of the "intervmultiple" function, we detect 10 essential outliers without repeating and without including the small level shifts at the beginning of the series with 7 of 10 (70% of the determined outliers) were level shift outliers. These outliers at point times: 45, 69, 117, 119, 133, 237 and 255. As excepted for clean data, "Tukeycor-Pois" and "Tukeycor-NB" with tuning constants 7 and 9 give similar results for and . Also, the differences in "RMSEs" are negligible. In Table 1 (left), however, all estimators give different values for both for and . "Tukeycor-NB" with C=7 and C=9 give smaller RMSEs" than for "Tukeycor-Pois". For both types of data, the "RMSEs" are computed by simulation using 100 data sets of sizes 265 from P-INARCH (1) and RP-INARCH (1) models with true parameter values computed by the CMLE.
For both data types, "Tukeycor-NB" has smaller AICs than that of "Tukeycor-Pois" and we find that the values of AIC do not affect the values of the tuning constant C. This is because that the function "robmixglm" does not depend on the tuning parameter values to be determined in advance, Beath (2017).
We note that for the dependence parameter α1, there's nearly no serial relation and α0 absorbs all outliers' effects. An explanation of this problem is that the presence of seven level shifts can be explained as the long sequence of outliers, that is, the high degree contamination.
According to our situation here, we deduce that: At tuning constants 9: Both estimators have the same efficiency and RNB-INARCH(1) is more robust. At tuning constants 7: RP-INARCH(1) is more efficiency but less robust than RNB-INARCH(1) and the choice of any one depends on the aim of the study or the desire properties of the estimator. Covid 19 data (real data) data after Step 2 outliers removal data after Step 10 outliers removal data after Step 24 outliers removal

Conclusion
We have compared the robust parameter estimates of INARCH (1) model with conditional Poisson or negative binomial distributions. Our proposal of negative binomial is more robust with the tuning constant = 7, but has a lower efficiency compared to Poisson. For the tuning constant = 9, both have similar efficiencies, but the negative binomial is more robust. Obviously, it is more difficult to choose a tuning constant in order to achieve high efficiency and high robustness for different outlier scenarios (isolated or patched). Therefore, when robustness is important, we prefer to use negative binomial. When efficiency is important, we prefer Poisson. Alternatively for future work, we suggest to use the concept of numerical quadrature, ordinary differential equation with block method, non-linear partial differential equation with B-Spline collocation and differential transformation technique instead of Poisson process.

Ethics
The author confirms that this article is original and contains unpublished material and the author has read and approved the manuscript and no ethical issues involved.