A Monte Carlo Study of Seven Homogeneity of Variance Tests

Problem statement: The decision by SPSS (now PASW) to use the unmodified Levene test to test homogeneity of variance was questioned. It was compared to six other tests. In total, seven homogeneity of variance tests used in Analysis of Variance (ANOVA) were compared on robustness and power using Monte Carlo studies. The homogeneity of variance tests were (1) Levene, (2) modified Levene, (3) Z-variance, (4) Overall-Woodward Modified Z-variance, (5) O’Brien, (6) Samiuddin Cube Root and (7) F-Max. Approach: Each test was subjected to Monte Carlo analysis with differently shaped distributions: (1) normal, (2) platykurtic, (3) leptokurtic, (4) moderately skewed and (5) highly skewed. The Levene Test is the one used in all of the latest versions of SPSS. Results: The results from these studies showed that the Levene Test is neither the best nor the worst in terms of robustness and power. However, the modified Levene Test showed very good robustness when compared to the other tests but lower power than other tests. The Samiuddin test is at its best in terms of robustness and power when the distribution is normal. The results of this study showed the strengths and weaknesses of the seven tests. Conclusion/Recommendations: No single test outperformed the others in terms of robustness and power. The authors recommend that kurtosis and skewness indices be presented in statistical computer program packages such as SPSS to guide the data analyst in choosing which test would provide the highest robustness and power.


INTRODUCTION
A very popular statistical package, the Statistical Package for the Social Sciences (SPSS, now called PASW), uses the Levene Test to test for homogeneity of variance prior to conducting tests of the equality of means in the t-test and One-way ANOVA (Oladejo and Adetunde, 2009; Zeng et al., 2010; Mazahreh et al., 2009). A question arose as to whether the designers at SPSS chose the "best" test for homogeneity of variance, since there are many others available. A subsequent literature search produced some research on the Levene test (Gastwirth et al., 2009; Carroll and Schneider, 1985; Tomarken and Serlin, 1986). A further search found other tests of homogeneity of variance in studies by Overall and Woodward (1974), O'Brien (1981) and Levy (1975) that may have been a better choice than the Levene test. Carroll and Schneider (1985) described the Brown and Forsythe comparison of the original Levene Test to two modifications of it. The original Levene Test used sample means. The modified Levene tests used the median and the trimmed mean. They demonstrated through Monte Carlo studies that the median and trimmed-mean versions outperformed the original test when the homogeneity of variance assumption was violated.
The modified Z-variance test is presented by Overall and Woodward (1976). Overall and Woodward (1976) had compared the robustness and power of this modification against four other homogeneity of variance tests: (1) Z-variance unmodified, (2) Wilson-Hilferty (3) Bartlett and (4) Box. Using a series of Monte Carlo studies, Overall and Woodward (1976) demonstrated the superiority of the modified Z-variance test over the other four tests. Unfortunately, the Overall-Woodward modification of the Z-variance test is not well known. This modification appears only in a technical report that may no longer be available or easily accessible from the original source. However, a copy of this report can be obtained from the corresponding author of this article.
The O'Brien Test is mentioned by Howell (2001) but little research could be found on it. From O'Brien (1981) this test appears promising. Alongside these more complicated formulas developed to attack the problem of homogeneity of variance, there is a simple one that will also be used in this study. The Fmax test developed by Hartley in 1950 (Pardo et al., 1997) is very simple, involving no more than computing the ratio of the greatest subgroup variance to the smallest subgroup variance.
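As a minimal sketch of how little computation the Fmax ratio requires, the following Python fragment computes it for three small made-up subgroups. The function name `f_max` and the data are illustrative assumptions, not taken from the study; in practice the ratio would still have to be referred to Hartley's tabled critical values.

```python
from statistics import variance


def f_max(groups):
    """Hartley's Fmax: largest subgroup variance divided by the smallest."""
    variances = [variance(g) for g in groups]  # unbiased sample variances
    return max(variances) / min(variances)


# Hypothetical data, chosen so the variances are easy to check by hand.
groups = [
    [1.0, 2.0, 3.0, 4.0, 5.0],    # variance 2.5
    [2.0, 4.0, 6.0, 8.0, 10.0],   # variance 10.0
    [1.0, 1.0, 2.0, 3.0, 3.0],    # variance 1.0
]
print(f_max(groups))  # 10.0
```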
In this study, seven homogeneity of variance tests will be compared using a Monte Carlo approach. The seven tests are (1) the original Levene Test, (2) the modified Levene Test using the median, (3) the Z-variance Test, (4) the Overall-Woodward modified Z-variance test, (5) the O'Brien Test, (6) the Samiuddin Cube-root Test and (7) the Fmax test. A major goal of the study is to evaluate just how good the original Levene test is when compared to these other alternatives. "Goodness" of each test is determined by examining the robustness and power for each test. Should SPSS and other statistical packages consider using other tests along with the Levene?
The Levene test: In 1960, Levene proposed an alternative method to the Bartlett Test (Klotz and Johnson, 1993) for testing the assumption of homogeneity of variance for the independent-samples t-test and ANOVA designs. The Bartlett test works well for data that are normally or approximately normally distributed. The Bartlett test does not fare well for data that follow a leptokurtic or skewed distribution (Overall and Woodward, 1974). According to Levene (Gastwirth et al., 2009), the test he proposed was less sensitive to departures from normality. That is, the Levene Test had fewer Type 1 errors than the Bartlett Test for distributions that departed from normality.
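The Levene Test is defined as follows. The null hypothesis $H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ is tested against the alternative $H_a: \sigma_i^2 \neq \sigma_j^2$ for at least one pair $(i, j)$. The test statistic is:
$$W = \frac{(N - k)\sum_{i=1}^{k} n_i(\bar{Z}_{i.} - \bar{Z}_{..})^2}{(k - 1)\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i.})^2}$$
where $k$ is the number of subgroups, $n_i$ is the size of the ith subgroup, $N$ is the total sample size, $Z_{ij} = |Y_{ij} - \bar{Y}_{i.}|$ with $\bar{Y}_{i.}$ the mean of the ith subgroup, $\bar{Z}_{i.}$ is the mean of the $Z_{ij}$ in the ith subgroup and $\bar{Z}_{..}$ is the mean of all $Z_{ij}$. The null hypothesis is rejected when $W$ exceeds the upper critical value of the F-distribution with $k - 1$ and $N - k$ degrees of freedom.
The modified Levene test: The modified Levene test is nearly identical to the original Levene test. The difference is that the median is used instead of the mean in computing $Z_{ij}$. That is, $Z_{ij} = |Y_{ij} - \tilde{Y}_{i.}|$, where $\tilde{Y}_{i.}$ is the median of the ith subgroup. This is the modification studied earlier by Brown and Forsythe and referenced in Carroll and Schneider (1985).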
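The two Levene variants can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard form of the statistic (a one-way ANOVA F computed on the absolute deviations from the subgroup mean or median); the function name `levene_w`, its `center` parameter and the data are illustrative and not part of the original study.

```python
from statistics import mean, median


def levene_w(groups, center=mean):
    """Levene statistic W.

    center=mean gives the original test; center=median gives the
    modified version studied by Brown and Forsythe.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - center of subgroup i|
    z = [[abs(y - center(g)) for y in g] for g in groups]
    z_bar_i = [mean(zi) for zi in z]                # subgroup means of Z
    z_bar = sum(sum(zi) for zi in z) / n_total      # grand mean of Z
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical data: the second group has twice the spread of the first.
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(levene_w(groups, center=mean))
print(levene_w(groups, center=median))
```

For these symmetric groups the mean and median coincide, so both variants return the same value (0.8 by hand calculation). The two statistics diverge only when a subgroup is skewed or contains outliers, which is the situation the median-based version is meant to handle.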

The Z-variance test:
The large sample normal deviate transformation of chi-square proposed by Fisher (1995) formed the basis of the Z-variance test. Fisher's transformation is:
$$Z = \sqrt{2\chi^2} - \sqrt{2\nu - 1}$$
where $\nu$ is the degrees of freedom of the chi-square variate. It is well known that sample variances tend to have a chi-square distribution (Overall and Woodward, 1974). A normal deviate transformation is used to obtain Z-score equivalents of the sample variances. The sample variance is related to chi-square by the following formula:
$$\chi^2 = \frac{(n_i - 1)s_i^2}{\sigma^2}$$
where $s_i^2$ is the variance of the ith subgroup, $n_i$ is its sample size and $\sigma^2$ is the common population variance. The resulting Z-scores are then used in an F-test to determine if the variances are different. This F-test is presented in a number of elementary statistics textbooks (Comrey, 2009; Mendenhall and Beaver, 1991). Overall and Woodward (1974) found this test to perform very well for data that are normally distributed. Their Monte Carlo studies showed that this test produced too many Type 1 errors when samples are drawn from leptokurtic or skewed distributions.
The null and alternative hypotheses for the Z-variance test are the same as those for the Levene Test. The test statistic as written by Overall and Woodward (1974) is:
$$F = \frac{\sum_{i=1}^{k} Z_i^2}{k - 1}$$
The $Z_i$'s are assumed to be approximately unit normal with zero mean. The null hypothesis is rejected if $F > F_{\alpha;\,k-1,\,\infty}$, where $F_{\alpha;\,k-1,\,\infty}$ is the upper critical value of the F-distribution with k-1 and ∞ degrees of freedom at a significance level of α.
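The Z-variance computation can be illustrated with a short Python sketch. Several details here are assumptions rather than facts from this article: the unknown population variance is estimated by the pooled within-group mean square, Fisher's constant 2 is used as the scaling coefficient `c`, and the statistic is taken to be the sum of squared $Z_i$ divided by k-1. The function name `z_variance_f` and the data are hypothetical.

```python
from math import sqrt
from statistics import variance


def z_variance_f(groups, c=2.0):
    """Z-variance F statistic under the assumptions stated above."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [variance(g) for g in groups]
    # Pooled within-group mean square as the estimate of the common variance.
    ms_within = sum(df * s2 for df, s2 in zip(dfs, variances)) / sum(dfs)
    # Fisher-type normal deviate for each subgroup variance.
    z = [sqrt(c * df * s2 / ms_within) - sqrt(c * df - 1)
         for df, s2 in zip(dfs, variances)]
    return sum(zi ** 2 for zi in z) / (k - 1)


# Hypothetical data: identical spread versus clearly different spread.
equal = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [3, 4, 5, 6, 7]]
unequal = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [1, 1, 2, 3, 3]]
print(z_variance_f(equal), z_variance_f(unequal))
```

Because the reference distribution has infinite denominator degrees of freedom, the critical value equals the upper chi-square point with k-1 degrees of freedom divided by k-1. For four groups at α = 0.05 this is 7.815/3, or about 2.605.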

The Overall-Woodward modified Z-variance test:
To counter the distortions of the original Z-variance test, Overall and Woodward (1976) conducted a series of studies to determine a c value so that the variances of the $Z_i$ would remain stable when the sample data deviate moderately from normality. Using regression, Overall and Woodward (1976) found a c value based on sample size, skewness and kurtosis. They determined c to be a scaling coefficient that affects the variability of the $Z_i$ values.
The new formula for c is given in Overall and Woodward (1976). The index of kurtosis used by Overall and Woodward (1976) is the sum of the 4th powers of the Z-scores within each sample (subgroup) divided by $n_i - 2$ degrees of freedom:
$$K_i = \frac{\sum_{j=1}^{n_i} z_{ij}^4}{n_i - 2}$$
where $z_{ij}$ is the Z-score of the jth observation in the ith subgroup.
O'Brien test: As a fifth comparison for this study, the O'Brien Test (O'Brien, 1978; 1981) was used. O'Brien (1981) has claimed that his test is a general method that does fairly well for behavioral science data. O'Brien (1981) states that the test is robust to data that depart from normality. It is also easy to program into statistical packages like SPSS, it is competitive with other tests in terms of power and it can be easily used in different ANOVA designs with equal or unequal sample sizes. O'Brien (1981) stated that not much research has been done on this statistic. The computational operations for this test are straightforward. Every raw score $Y_{ij}$ in the study is transformed using the following formula:
$$V_{ij} = \frac{(n_i - 1.5)\,n_i\,(Y_{ij} - \bar{Y}_{i.})^2 - 0.5\,s_i^2\,(n_i - 1)}{(n_i - 1)(n_i - 2)}$$
where $\bar{Y}_{i.}$ and $s_i^2$ are the mean and variance of the ith subgroup. The mean of the V-values per subgroup will be equal to the variance computed for each subgroup, i.e., $\bar{V}_{i.} = s_i^2$. The test statistic for the O'Brien Test will be the F-value computed on applying the usual ANOVA procedure on the transformed scores $V_{ij}$.
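The O'Brien procedure can be sketched in Python as a transformation followed by an ordinary one-way ANOVA. The sketch assumes the usual weight of 0.5 in O'Brien's transformation; the function names `obrien_transform` and `one_way_f` and the data are illustrative and not taken from the study.

```python
from statistics import mean, variance


def obrien_transform(group):
    """Transform raw scores Y_ij into O'Brien's V_ij (weight 0.5 assumed)."""
    n = len(group)
    m = mean(group)
    s2 = variance(group)
    return [((n - 1.5) * n * (y - m) ** 2 - 0.5 * s2 * (n - 1))
            / ((n - 1) * (n - 2)) for y in group]


def one_way_f(groups):
    """Usual fixed-effects one-way ANOVA F ratio."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical data with sample variances 2.5, 10.0 and 1.0.
groups = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [1, 1, 2, 3, 3]]
v = [obrien_transform(g) for g in groups]
print([round(mean(vi), 6) for vi in v])  # subgroup means of V equal the variances
print(one_way_f(v))                      # O'Brien test statistic
```

The first printed line is a convenient check on the transformation: each subgroup mean of the V-values should reproduce that subgroup's variance.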
Samiuddin cube-root test: Samiuddin and Atiqullah (1976) developed a homogeneity of variance test which they refer to as the Bayesian test of homogeneity. Samiuddin and Atiqullah (1976) show that the "cube-root" test is superior to some other tests such as the Bartlett test when the sample distributions are not homogeneous. However, Levy (1978) has shown Samiuddin and Atiqullah's findings to be flawed or misleading. This study will re-examine the Samiuddin Test in terms of robustness and power and in comparison to some other homogeneity of variance tests not tested by Samiuddin and Atiqullah (1976) and Levy (1978).

Test comparisons:
To compare the Levene, modified Levene, O'Brien, Fmax, Z-variance and modified Z-variance tests, a series of Monte Carlo studies were performed (Agunbiade and Iyaniwura, 2010; Alabi et al., 2008; Rana et al., 2008). Each statistic is evaluated in terms of robustness and power. For robustness, the fewer Type 1 errors a test makes (falsely claiming unequal variances when in fact the variances are equal), the greater the robustness (Abu-Shawiesh, 2008; Vrbanek and Wang, 2007). For power, the more often a test correctly detects unequal variances when they are in fact unequal, the greater the power of the test.

Monte Carlo study of robustness:
In this section, the first analyses were done to determine how well each test performs when there is no bias. That is, the samples are drawn from a normal distribution with equal variances. For all tests in this study, there are four groups arranged as a fixed effects completely randomized design ANOVA. With four groups, two different sample sizes were used: n = 10 and n = 30. This was the same arrangement used by Overall and Woodward (1976).
A computer program was written to carry out the analyses. In selecting a random number generator, the one described by Overall and Rhoades (1981) was used. This algorithm produced random numbers that follow a normal distribution. Three thousand simulated four-group experiments were analyzed for each sample size. The Levene, modified Levene, O'Brien, Fmax, Samiuddin, original Z-variance and modified Z-variance test statistics were computed for each simulated experiment. The computer program counted the number of times the null hypothesis was rejected at the α = 0.05 level for 3000 experiments. The probability associated with each test statistic was computed using a subprogram developed by Jaspen (1965) and Veldman (1967). The results of these tests are given in the first column of Table 1.
The analyses were repeated for the same sample sizes with non-normal distributions that still had homogeneous variances. Following the descriptions provided by Overall and Woodward (1976) simulated experimental data were created for leptokurtic, platykurtic, chi-square (df = 6) and chi-square (df = 5) distributions. The chi-square distributions were used to approximate skewed distributions where the chi-square distribution with 6 degrees of freedom is less skewed than the one with 5 degrees of freedom. Three thousand simulated experiments were analyzed by the seven methods for each of the non-normal samples for the two sample sizes. The frequency that the null hypothesis was rejected at the α = 0.05 level for each method (Levene, modified Levene, Fmax, Z-variance, Modified Z-variance, Samiuddin and O'Brien) for each distribution-type (normal, leptokurtic, platykurtic, moderately skewed, highly skewed) for each sample size (n = 10, n = 30) is given in Table 1.
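As a hedged illustration of how skewed samples with equal variances might be generated, the Python sketch below draws chi-square values as sums of squared standard normals and then standardizes them. The standardization step, the seed, the sample size of 30000 and the function names are assumptions made for this example; the article does not describe how the original program scaled its chi-square data.

```python
import random
from math import sqrt
from statistics import mean, pvariance


def chi_square_sample(df, n, rng):
    """Draw n chi-square(df) values and rescale them to mean 0, variance 1.

    A chi-square variate with df degrees of freedom has mean df and
    variance 2*df, so (x - df) / sqrt(2*df) has mean 0 and variance 1.
    """
    raw = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n)]
    return [(x - df) / sqrt(2 * df) for x in raw]


def chi_square_skewness(df):
    """Theoretical skewness of a chi-square distribution: sqrt(8 / df)."""
    return sqrt(8 / df)


rng = random.Random(2024)
moderate = chi_square_sample(6, 30000, rng)   # moderately skewed case
high = chi_square_sample(5, 30000, rng)       # highly skewed case
print(round(chi_square_skewness(6), 3), round(chi_square_skewness(5), 3))  # 1.155 1.265
```

The theoretical skewness values show why the distribution with 6 degrees of freedom serves as the moderately skewed case: its skewness is smaller than that of the distribution with 5 degrees of freedom, although the difference is modest.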
Monte Carlo study of power: The next major consideration is the power of each test. Will the test accurately detect real differences between the subgroups with heterogeneous variances? Overall and Woodward (1976) found the modified Z-variance test to outperform the original Z-variance test as well as several others when the true underlying distribution is normal or platykurtic. Overall and Woodward (1976) did not make any comparison between the different homogeneity of variance tests for the other types of distribution because of the large number of Type 1 errors found during the robustness phase. However, such high levels of Type 1 errors were not found for the Levene Test or the modified Levene Test in the current study. Hence in the study reported here, the two Levene tests are compared to the two Z-variance tests, the O'Brien test, the Samiuddin Cube Root test and the Fmax test across the five different distributions. Overall and Woodward (1976) found the modified Z-variance test to be slightly less powerful than the original Z-variance test.
A series of 3000 simulated 4-group experiments were created where the group means were equal, but the sample variances were different. Using the same setup as found in Overall and Woodward (1976), the group variances followed the ratio of 1:2:3:4.
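The overall simulation design (3000 four-group experiments, rejection counted at α = 0.05, equal variances for robustness and a 1:2:3:4 variance ratio for power) can be sketched as follows. This is an illustration under stated assumptions, not the original program: it uses Python's built-in generator rather than the Overall and Rhoades (1981) algorithm, covers only normal data, and uses a Z-variance statistic with Fisher's constant 2 and a pooled within-group mean square. The names `z_variance_f`, `rejection_rate` and `CRITICAL_F` are hypothetical.

```python
import random
from math import sqrt
from statistics import variance

# Upper 5% point of chi-square with 3 df, divided by 3: the F(3, infinity) cutoff.
CRITICAL_F = 7.815 / 3


def z_variance_f(groups, c=2.0):
    """Z-variance F statistic (pooled mean square and c = 2 are assumptions)."""
    dfs = [len(g) - 1 for g in groups]
    variances = [variance(g) for g in groups]
    ms_within = sum(df * s2 for df, s2 in zip(dfs, variances)) / sum(dfs)
    z = [sqrt(c * df * s2 / ms_within) - sqrt(c * df - 1)
         for df, s2 in zip(dfs, variances)]
    return sum(zi ** 2 for zi in z) / (len(groups) - 1)


def rejection_rate(sds, n, reps=3000, seed=1):
    """Proportion of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Equal group means (all zero); spread set by the given standard deviations.
        groups = [[rng.gauss(0.0, sd) for _ in range(n)] for sd in sds]
        if z_variance_f(groups) > CRITICAL_F:
            rejections += 1
    return rejections / reps


# Robustness: equal variances, so every rejection is a Type 1 error.
type1 = rejection_rate([1.0, 1.0, 1.0, 1.0], n=30)
# Power: group variances in the ratio 1:2:3:4.
power = rejection_rate([sqrt(v) for v in (1, 2, 3, 4)], n=30)
print(type1, power)
```

Swapping in another statistic and its critical value, or another data generator, would extend the same loop to the other tests and distributions.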

Monte Carlo study of robustness:
The results from these Monte Carlo studies demonstrate that the modified Levene test performed the best, producing the fewest Type 1 errors overall across all distributions. In every distribution and sample size studied, the modified Levene had values below 0.05. The next best in an overall sense is the O'Brien test. Except for the small sample case (n = 10) combined with a skewed distribution, the O'Brien test also had more values below 0.05 than the other remaining methods when both sample sizes are taken into consideration. The Overall-Woodward Modified Z-variance test was the next best and matched the O'Brien test very well for the larger sample size. The unmodified Levene Test could be rated fourth, with the Fmax test and the Samiuddin test tied for fifth in the comparison. The original Z-variance test fared the worst in the comparison. Overall and Woodward (1976) had previously demonstrated that the modified Z-variance test was superior to the original Z-variance test in terms of robustness. The comparisons of interest here are how the two versions of the Z-variance test and the O'Brien, Samiuddin and Fmax tests fared against the highly popular Levene Test. On every comparison, the modified Z-variance test outperformed the Levene Test. In cases involving leptokurtic and skewed distributions, the Levene Test did better than the original Z-variance test. So, when there are no differences between the sample variances (null hypothesis is true), the modified Z-variance test did better than the original Z-variance and Levene tests. The O'Brien test did better than the modified Z-variance test. In almost all tests the modified Z-variance test and the O'Brien test did better for larger samples than for smaller samples. The very simple Fmax test did as well as or better than the original Z-variance test. As stated by Carroll and Schneider (1985), the modified Levene test using the median outperformed the original Levene test.
The Samiuddin Cube Root test was at its best when the distribution was normal. It also performed well for distributions that were platykurtic. For other distributions it was not as good as the original Levene test. The "best" values in Table 1 are in bold print.
Monte Carlo study of power: For normal and platykurtic distributions with the larger sample, n = 30, the modified Z-variance test correctly rejected the null hypothesis more often than the Levene Test. The original Z-variance test was either the best or second best when the underlying distribution was normal, leptokurtic and skewed for both n = 10 and n = 30 sample sizes. The Samiuddin test did better than the original Z-variance test for both sample sizes when the distribution was normal. The Samiuddin and Fmax tests were either as good as or slightly better than the original Z-variance test when the distribution was leptokurtic. The O'Brien test did its best when the distribution was normal or platykurtic and the sample size was 30. In general, the O'Brien test was less powerful than the other tests.
In terms of power, it appears that the original Z-variance test fared the best for n = 10 distributions of varying kurtosis and skew. The Levene test, however, consistently showed more power than the Overall and Woodward modified Z-variance test. It should be noted that nearly all of the n = 10 analyses showed minimal power, as expected with this small sample size. Notable exceptions to this were the original Z-variance test and Fmax test for leptokurtic distributions. The original Levene Test outperformed the modified Levene test on both sample sizes and for all five types of distributions. The data show the modified Levene Test to be the worst amongst the 7 tests in terms of power except when the distribution was leptokurtic or platykurtic for n = 10.
For large samples, the original Z-variance test, Samiuddin test and the Fmax test exhibited the greatest power except when the distribution was platykurtic. For the platykurtic distribution for n = 30, the modified Z-variance and the O'Brien tests were the best. The "best" values are printed in bold in Table 2.

DISCUSSION
When considering robustness, the modified Z-variance test appears to be superior to the Levene and original Z-variance tests. The O'Brien test did better than the modified Z-variance test. However, when looking at power, the original Z-variance test was better than the other tests for four of the distributions. Although the Levene test was not a standout in terms of robustness or power across sample sizes and distributional shapes, it did not have the peaks and valleys demonstrated by the two Z-variance tests. On robustness, the Levene test never attained the α = 0.05 mark on any of the tests. It outperformed the modified Z-variance test in terms of power only for the small sample (n = 10) case for the platykurtic distribution. The O'Brien test appears to be the weakest in power. The Fmax test resembled the original Z-variance test. The simple Fmax test results were surprisingly good. This simple test seemed to do quite well in terms of robustness and power. The modified Levene Test outperformed the original Levene Test in terms of robustness. However, the reverse was true when considering power. For a normal distribution and large samples, the Samiuddin test was found to be the best of the 7 tests. The Samiuddin test did fairly well with both sample sizes except for the platykurtic distribution in terms of robustness and power.
The utility of the modified Z-variance test and the Samiuddin Cube-Root test is high, especially in those cases when the researcher has a priori evidence that the data approximate a normal distribution with a wider spread. Overall and Woodward (1976) recommended that a better c value be found through empirical means. The authors of this study agree that a highly robust and powerful test still needs to be found.

CONCLUSION
In light of these findings, perhaps computer programs should not be limited to only one statistical test for homogeneity of variance. It would be highly useful for the researcher to have several of these tests available in a computer output. The user of statistical packages such as SPSS (aka PASW) should check descriptive statistics concerning the kurtosis and skewness of the data. If the index for kurtosis is negative then the distribution is platykurtic. If it is zero or near zero it is mesokurtic (possibly normal) and if the kurtosis index is positive, the distribution is leptokurtic. Likewise, the computed index of skewness can also be evaluated. A zero or near zero value indicates a symmetric distribution. If the index value is negative or positive then the distribution is skewed. By using these simple guidelines along with the robustness and power values, the data analyst can decide on which test of homogeneity of variance should be used and interpreted. For those computer programs that do not provide a measure of skewness and/or kurtosis, the following formulas can be used:
$$\text{Skewness} = \frac{\sum (X - \bar{X})^3}{n S^3} \qquad \text{Kurtosis} = \frac{\sum (X - \bar{X})^4}{n S^4} - 3$$
where $\bar{X}$ is the sample mean, $S$ is the standard deviation and $n$ is the sample size.
The modified Levene test was evaluated in this study and was found to be superior to the original Levene test in terms of robustness, although not in terms of power. In any case, researchers need to be aware that when a homogeneity of variance test fails to reject the null hypothesis of equal variance, the probability of a Type 2 error may be high depending on the test used. With SPSS (now PASW), the Levene test appears to have mediocre power that gets better with larger sample sizes.
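The skewness and kurtosis guidelines described above can be sketched as a small Python helper. This is an illustration only: it assumes that S is the standard deviation computed with divisor n, which the text does not specify, and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def skewness_kurtosis(xs):
    """Moment-based skewness and kurtosis indices (S computed with divisor n)."""
    n = len(xs)
    m = mean(xs)
    s = pstdev(xs)  # assumption: population-style standard deviation
    skew = sum((x - m) ** 3 for x in xs) / (n * s ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3
    return skew, kurt


# Hypothetical symmetric, flat data: skewness 0, negative kurtosis (platykurtic).
skew, kurt = skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(skew, 6), round(kurt, 6))
```

For this evenly spaced sample the skewness index is zero and the kurtosis index is negative (about -1.3), so under the guidelines above the data would be treated as symmetric and platykurtic.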
In conclusion, the original Levene test (used by SPSS) is not the best choice. There are better tests that can be used and some are preferable depending on the distributional shape of the data. The mere substitution of the median for the mean in the original Levene Test computations will alleviate some of the problems. It appears that a combination of skewness and kurtosis indices should give the data analyst better information for selecting an appropriate homogeneity of variance test. None of the tests presented here is difficult to compute. Hence the computer overhead in including some or all of these tests is small. Relying upon defaults set in some statistical analytic software may lead to an increase in Type 1 errors associated with homogeneity of variance testing and/or a loss of power that can be avoided or minimized.