Statistical Size and Power of Eight Normality Tests in the Presence of GARCH(1, 1) Errors

Corresponding Author: Julio César Alonso, Department of Economics, Universidad Icesi, Street 18 Num. 122135, Cali, Colombia. Email: jcalonso@icesi.edu.co

Abstract: In this work, we assess the power and size of eight normality tests under the assumption that errors follow a GARCH(1, 1) process by using Monte Carlo simulations. Four results stand out. First, the presence of a GARCH(1, 1) process increases the probability of making a type I error. Second, the Pearson normality test is recommended if it is assumed that errors follow a GARCH(1, 1) process. Third, statistical power varies depending on the type of heteroscedasticity and distribution considered. Fourth, normality tests have low statistical power and size (less than or equal to the nominal level) for small and homoscedastic samples.


Introduction
Autoregressive conditional heteroscedasticity (ARCH) models, which describe heteroscedastic behavior in the errors of time series, were introduced by Engle (1982) more than 36 years ago. Four years later, Bollerslev (1986) generalized the ARCH model by introducing the generalized autoregressive conditional heteroscedasticity (GARCH) model.
Nowadays, GARCH models are widely used and, in some contexts, fit better than an ARCH model (Enders, 2003). GARCH models have proven very useful for modeling the behavior of financial time series because they address the inefficiency of the Ordinary Least Squares (OLS) estimator caused by heteroscedastic errors. This result makes it possible to use standard errors and t and F statistics to make inferences (Green, 2012).
It also provides a measure of volatility, on which financial decisions related to risk analysis and portfolio selection are based, and it can be useful in the analysis of changes in exchange and interest rates (Bollerslev et al., 1992).
GARCH models can be estimated by the method of Maximum Likelihood (ML), which assumes that the errors follow a normal distribution. This assumption is essential for some estimation methods, such as ML, but it is also necessary for making inferences from small samples and for constructing prediction intervals for any sample size.
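For reference, and not quoted from the original text, the Gaussian log-likelihood that ML maximizes for a GARCH(1, 1) model takes the standard textbook form below (the notation h_t for the conditional variance and ω, δ, β for the parameters matches the data generating process described in the Materials and Methods section):

```latex
% Gaussian (conditional) log-likelihood for T residuals \varepsilon_1,\dots,\varepsilon_T,
% where the conditional variance follows
% h_t = \omega + \delta\,\varepsilon_{t-1}^{2} + \beta\,h_{t-1}.
\ell(\omega,\delta,\beta)
  = -\frac{1}{2}\sum_{t=1}^{T}
    \left[\ln(2\pi) + \ln h_t + \frac{\varepsilon_t^{2}}{h_t}\right]
```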
If innovations of GARCH models are expected to follow a distribution different from the normal distribution, the literature suggests a Quasi-Maximum Likelihood (QML) method. However, Engle and Gonzalez-Rivera (1991) showed that estimators lose efficiency if the density function of the error term is not adequately specified.
Other authors have arrived at similar conclusions. Bellini and Bottolo (2008), through Monte Carlo simulations, found that ML and QML estimators underestimate or overestimate volatilities depending on the misspecification assumed; that variability can often generate a spurious "IGARCH effect" when estimating under a weak stationarity constraint. Similarly, Klar et al. (2012) stated that QML estimators associated with an incorrect specification of the error term might imply a loss of efficiency, which could lead to a wrong assessment of Value-at-Risk (VaR) and inaccurate forecasts of option prices. Therefore, the normality assumption is crucial for a practitioner when estimating GARCH models. However, little is known about the statistical power and size of formal normality tests in the presence of errors that do not follow an independent and homoscedastic data generating process.
As far as the authors are aware, Vavra (2011) and Fiorentini et al. (2004) are the only studies that have examined the performance of normality tests in the presence of errors that follow a GARCH process. The first article evaluated three normality tests (Jarque-Bera (JB), the J test based on the generalized method of moments and a third based on quantiles); its findings showed that the quantile test performs better than JB and produces consistent results for all samples and innovation distributions studied. The second study found that the Jarque-Bera (JB) test can be safely applied to a broad class of GARCH-M models; however, it did not examine the GARCH (1, 1) model.
Given the literature, we have several hypotheses about the effect that heteroscedastic errors will have on the statistical power and size of the normality tests: (i) statistical power will improve as the sample size grows; (ii) the behavior of the statistical power will vary among the distributions chosen in the Monte Carlo study, and those that are similar to a normal distribution (such as the Student's t) will have better statistical power; and (iii) in small samples, Shapiro-Wilk will be the most powerful test.
The remainder of the paper is structured as follows. In the next section, we describe the method and the data generating process. In section three, we present our findings from the Monte Carlo simulations. The paper continues with a discussion of the results in section four. Finally, we present the conclusions and contributions of this study.

Materials and Methods
Following Alonso and Montenegro (2015), we test the behavior of eight of the most popular normality tests in the literature in the presence of heteroscedastic errors that follow a GARCH (1, 1) process. The tests we study (from now on we will refer to them by the names in brackets) are: (i) Shapiro-Wilk (SP), (ii) Jarque-Bera (JB), (iii) D'Agostino-Pearson (K), (iv) Pearson (PCHI), (v) Shapiro-Francia (SF), (vi) Anderson-Darling (WCM), (vii) Lilliefors (LKS) and (viii) Cramér-von Mises (CM). We present the statistic for each test in Appendix 1.
Following Bera and Ng (1993), it is possible to classify these eight normality tests into two categories: distance tests and goodness of fit tests. The distance tests are CM, WCM and LKS. CM is an Empirical Distribution Function (EDF) test that compares the cumulative distribution function (CDF) of a normal distribution with the distribution function estimated from the sample data and evaluates how similar they are (Razali and Wah, 2011). WCM, also an EDF test, uses the Cramér-von Mises statistic weighted by its cumulative distribution function, so that the tails of the estimated distribution receive more weight than in the CM test. LKS is a modification of the Kolmogorov-Smirnov (KS) test; while its statistic is computed in the same way as KS's, the critical values are not the same and, therefore, LKS can lead to different conclusions (Razali and Wah, 2011).
On the other hand, the normality tests considered that belong to the goodness of fit category are JB, K, PCHI, SP and SF. JB and K are moment tests, since the detection of departures from normality comes from evaluating two sample moments: skewness and kurtosis.
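As a point of reference (this expression does not appear in the original text but is the standard form of the statistic), the JB statistic combines the sample skewness S and sample kurtosis K of the n residuals as:

```latex
% Jarque-Bera statistic for a sample of size n with sample skewness S
% and sample kurtosis K; under normality it is asymptotically
% distributed as \chi^2 with 2 degrees of freedom.
JB = n\left[\frac{S^{2}}{6} + \frac{(K-3)^{2}}{24}\right]
```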
The main difference between these two tests is the transformation applied to the sample moments. Both tests compare their statistic with a critical value from a Chi-Square distribution with two degrees of freedom (Singh and Masuku, 2014). The PCHI test statistic is the sum, over classes, of the squared difference between the observed frequency of data of type i and its expected frequency, divided by the expected frequency (Mbah and Paothong, 2014). SP evaluates whether a random sample comes from a normal distribution: it sums the squares of the ordered sample values weighted by constants generated from the means, variances and covariances of the corresponding order statistics and divides the result by the sum of the squared deviations from the sample mean (Mbah and Paothong, 2014). The SF test is similar to SP, but it is designed for large samples.
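For readers who want to reproduce individual tests, the following sketch applies several of them to one sample using scipy and statsmodels. It is only an illustration with our own variable names, not the code used in this study; the Shapiro-Francia and Pearson chi-square tests are omitted because they lack a single canonical implementation in these libraries.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(42)
x = rng.standard_t(df=5, size=200)        # illustrative sample (Student's t, 5 df)

# Shapiro-Wilk (SP)
sw_stat, sw_p = stats.shapiro(x)

# Jarque-Bera (JB)
jb_stat, jb_p = stats.jarque_bera(x)

# D'Agostino-Pearson K^2 (K)
k_stat, k_p = stats.normaltest(x)

# Lilliefors (LKS): KS statistic with corrected critical values
lks_stat, lks_p = lilliefors(x, dist="norm")

# Cramér-von Mises (CM), against a normal with estimated mean and st. deviation
cm_res = stats.cramervonmises(x, "norm", args=(x.mean(), x.std(ddof=1)))

# Anderson-Darling (WCM): returns a statistic and critical values, not a p-value
ad_res = stats.anderson(x, dist="norm")

print(f"SP p = {sw_p:.3f}, JB p = {jb_p:.3f}, K p = {k_p:.3f}, "
      f"LKS p = {lks_p:.3f}, CM p = {cm_res.pvalue:.3f}")
print("WCM statistic:", ad_res.statistic, "critical values:", ad_res.critical_values)
```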
In this Monte Carlo experiment, we consider the effect on the normality tests' size and power of: (i) the sample size, (ii) the distribution of the innovations and (iii) the parameter values of a GARCH (1, 1) process. Specifically, the experiment considers sample sizes of 25, 50, 100, 500, 1000 and 3000 observations; innovations drawn from the standard normal, Student's t (with three, five and 10 degrees of freedom), Laplace and uniform distributions; and, besides the homoscedastic case, three GARCH (1, 1) parameterizations: δ = 0.1 and β = 0.8, δ = 0.4 and β = 0.4 and δ = 0.8 and β = 0.1.

The Monte Carlo experiment implies the following steps:

1. Generate data for the vector y of dimension (T × 1) using the following data generating process:

   y_t = α_0 + α_1 x_t + ε_t,

   where x_t corresponds to a non-stochastic variable generated a priori (and only once) from a uniform distribution between zero and one and α_0 and α_1 are constants. The random vector ε_t is a not autocorrelated but heteroscedastic error term that follows a GARCH (1, 1) process:

   ε_t = v_t √h_t,    h_t = ω + δ ε_{t-1}² + β h_{t-1},

   where h_t is the conditional variance of the error and v_t is a white noise process; ω, δ and β are constants. For all cases, ω is set to 0.000001.

2. Regress y_t on x_t and a constant by the ordinary least squares method, which minimizes the sum of the squared differences between y and the fitted values ŷ.

3. Obtain the estimated residuals: ε̂_t = y_t − ŷ_t.

4. Apply the normality tests to the estimated residuals and record whether the null hypothesis is rejected (significance level of 0.05) or not.

5. Repeat steps one through four 10,000 times.

6. Calculate the observed size or power, depending on the case, as the proportion of rejections.
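The following Python sketch illustrates the simulation steps above under simplifying assumptions of our own (only the Shapiro-Wilk and Jarque-Bera tests, a reduced number of replications and illustrative values for the regression coefficients); it is not the code used to produce the tables in this paper.

```python
import numpy as np
from scipy import stats

def simulate_garch_errors(T, omega, delta, beta, rng, burn=500):
    """Generate T errors following a GARCH(1, 1) process with N(0, 1) innovations."""
    v = rng.standard_normal(T + burn)
    eps = np.zeros(T + burn)
    h = np.zeros(T + burn)
    h[0] = omega / max(1e-12, 1.0 - delta - beta)   # start near the unconditional variance
    eps[0] = np.sqrt(h[0]) * v[0]
    for t in range(1, T + burn):
        h[t] = omega + delta * eps[t - 1] ** 2 + beta * h[t - 1]
        eps[t] = np.sqrt(h[t]) * v[t]
    return eps[burn:]                                # drop the burn-in period

def rejection_rate(T, omega, delta, beta, reps=1000, alpha=0.05, seed=123):
    """Proportion of replications in which each test rejects normality of the OLS residuals."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=T)                          # non-stochastic regressor, drawn once
    X = np.column_stack([np.ones(T), x])
    rejects = {"SP": 0, "JB": 0}
    for _ in range(reps):
        eps = simulate_garch_errors(T, omega, delta, beta, rng)
        y = 1.0 + 2.0 * x + eps                      # illustrative constant and slope
        b = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS estimates
        resid = y - X @ b
        rejects["SP"] += stats.shapiro(resid).pvalue < alpha
        rejects["JB"] += stats.jarque_bera(resid).pvalue < alpha
    return {k: v / reps for k, v in rejects.items()}

# Observed size under normal innovations, delta = 0.1 and beta = 0.8, T = 500
print(rejection_rate(T=500, omega=1e-6, delta=0.1, beta=0.8))
```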

Results

Size with Error Term from a Standard Normal Distribution
When the error is homoscedastic, the normality tests show a statistical size close to the nominal level (α = 0.05) for all sample sizes (Table 1). In small samples, the CM test has the value closest to the nominal; for samples of size 1000, the best test is WCM. An interesting result arises when considering a heteroscedastic error term. In general, the observed statistical size of the tests becomes distorted regardless of the values of δ and β. There are a few unexpected exceptions for small samples. For example, all tests continue to show a size close to the nominal (0.05) for an error term following a GARCH(1, 1) model with δ = 0.1 and β = 0.8 and sample sizes of 25 and 50. Moreover, for δ = 0.4 and β = 0.4, only for a sample size of 25 is the statistical size of all eight tests relatively close to 0.05. In all other cases, the observed size is far from the theoretical one (see Table 1). On the other hand, for the GARCH (1, 1) model with δ = 0.1 and β = 0.8, JB has the lowest empirical size among the eight tests for samples of 25 and 50 observations; however, for large samples (500, 1000 and 3000) JB presents the greatest distortion. For samples between 100 and 3000 observations, PCHI presents the lowest statistical size compared to the other tests; despite that, as the number of observations grows the statistical size also grows and reaches 0.103, twice the nominal level. For the GARCH (1, 1) model with δ = 0.4 and β = 0.4 we obtain similar results. One important difference from the previous case, for samples of size 1000, is that all tests (disregarding PCHI, which has the smallest size) exhibit on average a probability of 96.6% of making the mistake of rejecting the null hypothesis when it is true. Finally, all tests except CM present the greatest distortions under errors from a GARCH (1, 1) model with δ = 0.8 and β = 0.1, since for samples of 500 or more observations the probability of making a type I error is 100%.

Power with Error Term from Student's t-Distribution

Tables 2, 3 and 4 present results for the Student's t distribution with three, five and 10 degrees of freedom, respectively. In general terms, the power of the normality tests is about one when the sample size is 1000 or 3000, except for the CM test, which has the lowest power for those two sample sizes. However, the power decreases as we increase the number of degrees of freedom, because the t distribution approaches a normal distribution. This phenomenon intensifies in samples of size 25 and 50, but it is almost imperceptible in large samples. For example, for homoscedastic residuals, the SF test has the greatest power under the distribution with three degrees of freedom; for five degrees of freedom its power falls to 0.239 and for 10 degrees of freedom to 0.124. For heteroscedastic residuals, the power of the same test is 0.387, 0.199 and 0.096, respectively. On the other hand, the SF test has the highest empirical power for all samples considered when the error has a constant variance and follows a distribution with three degrees of freedom. The same applies to the Student's t-distribution with five degrees of freedom, excluding the sample of 25 observations. Instead, for a Student's t-distribution with 10 degrees of freedom, JB shows the best power for samples of 100 to 3000 observations. The SF is the most powerful test under the three types of heteroscedasticity and for the three Student's t-distributions considered.
It is interesting to note in Tables 3 and 4 that the GARCH (1, 1) models with δ = 0.4 and β = 0.4 and with δ = 0.8 and β = 0.1 have a positive effect on the tests' power when compared with the power obtained from applying the tests to the homoscedastic case. However, the above does not hold for the CM test when it is applied to samples of 1000 and 3000 observations, since the probability of rejecting a false hypothesis is significantly reduced; for example, it becomes 0% in the case of a distribution with 10 degrees of freedom and a sample of 3000 observations.

Power with Error Term from a Laplace Distribution

Results for the Laplace distribution are similar to those of the Student's t distribution. The tests show a statistical power close to or equal to one for homoscedastic and heteroscedastic residuals in samples of 500, 1000 and 3000 observations (see Table 5). For small sample sizes, all tests have relatively low power, which improves when errors come from GARCH(1, 1) models with δ = 0.4 and β = 0.4 and with δ = 0.8 and β = 0.1. Moreover, when δ = 0.1 and β = 0.8 the empirical power of all tests is worse. SF has the highest power for samples of size 25, 50 and 100 under both heteroscedastic and homoscedastic errors. For those same cases, PCHI has the lowest power.

Power with Error Term from a Uniform Distribution
Results for this distribution are similar to those found with the Student's t and Laplace distributions (see Table 6). All tests have power equal or close to one in large samples (500, 1000 and 3000 observations). CM is the only test that shows a statistical power of 0% for a sample of 3000 when the error has a constant variance over time or comes from a GARCH(1, 1) model with δ = 0.1 and β = 0.8. For those two cases, the K test shows the best power in samples of 50 and 100 observations. For δ = β = 0.4, WCM presents the greatest power in samples of size 25, 500, 1000 and 3000; however, only for the last two sample sizes is the statistical power above 0.9. Finally, for δ = 0.8 and β = 0.1, SF is the test with the best power in samples of 50 and 100, as with the Laplace distribution, and the power is greater than 0.9 in large samples. Furthermore, when δ = 0.1 and β = 0.8 the normality tests' statistical power increases slightly in samples of 25 observations, but it decreases in samples of 50 and 100 observations; for large samples there is no distortion.
When δ = β = 0.4, statistical power decreases for all sample sizes (except 3000) in comparison with homoscedastic errors. Finally, when δ = 0.8 and β = 0.1 the statistical power improves in relation to the case when δ = β = 0.4. However, the power is still less than the one obtained when δ = 0.1 and β = 0.8 and in the homoscedastic case.

Discussion
Concerning the hypotheses presented in the introduction, the results show that almost everything we stated occurs in the data. First, the statistical power improves as the sample size grows. For samples between 500 and 3000 observations, any test can be used because they all exhibit good power, while in small samples it is preferable to use Shapiro-Francia because it has the highest power compared to the other tests.
Second, the behavior of the statistical power varied among the distributions chosen in the Monte Carlo study. The power of the tests increases or decreases depending on the type of heteroscedasticity and distribution considered. This effect is greater in medium-size samples (50, 100 and 500 observations). Contrary to what we expected, the statistical power of the normality tests under a Student's t distribution was not the least affected by the types of heteroscedasticity considered; it was under the Laplace distribution that the normality tests were not substantially affected.
Besides that, the power of all tests was low for small samples. This result has also been found in other studies where the error term does not meet all the assumptions. For example, Alonso and Montenegro (2015) evaluated normality tests in the presence of errors that follow an AR(1) process. Their results showed that the effect of autocorrelation on the power of the tests is asymmetrical: the statistical size is distorted in the presence of strong autocorrelation and all tests have a similar power, which tends to be low for small samples.
Third, the Shapiro-Wilk test was not always the most powerful test in small samples; instead, Shapiro-Francia had better power in small samples. Similar results have been found in simulations of the performance of normality tests under other conditions. Razali and Wah (2011) studied the power of four normality tests (Shapiro and Wilk (1965), Kolmogorov (1933), Lilliefors (1967) and Anderson and Darling (1952)) for symmetric and asymmetric distributions and 15 different sample sizes. They concluded that Shapiro-Wilk is the most powerful test, followed by Anderson-Darling, Lilliefors and finally Kolmogorov-Smirnov; the last two tests required a sample size close to 2000 observations to obtain a power like that of the Shapiro-Wilk test. Mbah and Paothong (2014) compared the performance of the Shapiro-Francia test with eight other normality tests by studying the distribution of their p-values. They found that Shapiro-Francia is the best of the tests analyzed for detecting deviations from normality. Their Monte Carlo setup consisted of 12 sample sizes and eight distributions for the error term (uncorrelated and homoscedastic), among them the standard normal, uniform, Laplace and Student's t distributions with different degrees of freedom. Future research will include other normality tests, such as QH*, proposed by Chen and Shapiro (1995); this test has shown better performance than other regression-based normality tests under diverse combinations of symmetric, asymmetric, contaminated and balanced distributions and sample sizes of 20, 50 and 100 observations (Seier, 2002). It will also involve characterizing, for each normality test and under each variation of the GARCH (1, 1) model and distribution analyzed, the sample features required to attain a theoretical size of 0.05 while maximizing the power of the test.

Conclusion
This paper has presented a Monte Carlo study that describes how the power and size of eight normality tests behave under three variations of the GARCH (1, 1) model for different sample sizes and distributions. Three important results are obtained from these simulations. First, the probability of making a type I error, especially in samples of size 500, 1000 and 3000, increased in the presence of heteroscedastic error terms. Our results imply that Pearson's test (1900) is a suitable choice for samples of size 25, 50 and 100. For larger samples, our results suggest being cautious and complementing the validation of the normality assumption with other tools, since using any of the eight tests studied could lead to wrong conclusions.
Second, in the homoscedastic case, we should apply the Cramér-von Mises test; in the presence of GARCH (1, 1) errors, the Pearson (1900) test should be used. Third, the normality tests have low statistical power and size (less than or equal to the nominal level (α = 0.05)) in small and homoscedastic samples.
Fourth, the recommendation of our experimental work to the scientific community is to use other tools besides the normality tests when making inferences from small samples and when constructing prediction intervals for any sample size. Nonparametric approaches should be considered.
Future research should focus on designing appropriate normality tests when the error term follows a GARCH (1, 1) process. An idea worth evaluating is the statistic proposed by Jarque and Bera (1980). Unlike Jarque and Bera (1987), Jarque and Bera (1980) proposed a test for normality that accounts for heteroscedasticity and serial correlation; this statistic has not been adapted for the problems addressed in this paper. For simplicity, practitioners use the Jarque and Bera (1987) test, a simplified version of Jarque and Bera (1980) that does not account for heteroscedasticity and autocorrelation. The Jarque and Bera (1980) test requires the researcher to specify the form of the covariance matrix of the error term and is not implemented in commercial software. This may be a good starting point for designing a better test of normality under GARCH (1, 1) behavior.

Funding Information
No financial support or technical assistance was received.

Author's Contributions
All authors contributed equally to the preparation and development of the manuscript, since we operated as a group.

Ethics
This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and there are no ethical issues involved.
where Z²(·) is the squared standard normal distribution. For the following normality tests, let us define the following vectors and matrices:

• X: vector of dimensions (1 × n) containing the order statistics of the sample (X_(i))
• σ²V: the covariance matrix of the vector of all X_(i)
• c: vector of expected values of the n order statistics from a standard normal distribution
• a: vector of dimensions (1 × n) such that:

```latex
\mathbf{a} = \frac{\mathbf{c}^{\top}\mathbf{V}^{-1}}
                  {\left(\mathbf{c}^{\top}\mathbf{V}^{-1}\mathbf{V}^{-1}\mathbf{c}\right)^{1/2}}
```
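Using this notation, the Shapiro-Wilk statistic can be written in its standard textbook form (shown here for completeness; this expression is not quoted from the original appendix):

```latex
% Shapiro-Wilk statistic for the ordered sample X_{(1)} \le \dots \le X_{(n)},
% with weights a_i taken from the vector a defined above.
W = \frac{\left(\sum_{i=1}^{n} a_i X_{(i)}\right)^{2}}
         {\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^{2}}
```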