An Alpha-Exhaustive Multiple Testing Procedure

Corresponding Author: Mark Chang, Boston University, Boston MA, USA and Veristat, Southborough MA, USA. Email: mark.chang@veristat.com

Abstract: A multiple testing procedure can be a single-step procedure such as Bonferroni’s method or a stepwise procedure such as Hochberg’s stepup method and Hommel’s method. It can be an α-exhaustive or α-conservative approach. We develop a single α-exhaustive procedure that can improve power 2-5% over Hochberg’s and Hommel’s methods in common situations when the test statistics are mutually independent. The method can also be generalized to dependent test statistics. The idea behind our method is to construct the rejection rules using the product of marginal p-values and by controlling the upper bounds of the kth order terms so that α is controlled for any configuration of k null hypotheses. Such upper bounds or critical values are determined progressively from k = 1 towards k = K, the number of null hypotheses in the problem.


Introduction
Multiple testing problems are common in pharmaceutical statistics and the life sciences in general. The main goal of Multiple Testing Procedures (MTPs) is to (strongly) control the familywise type-I error rate. An MTP can be a single-step data-independent procedure such as Bonferroni's method or a data-dependent stepwise procedure such as Hochberg's stepup method and Hommel's stepup method. It can be an α-exhaustive or α-conservative approach. Conceptually, stepwise procedures are usually more powerful than single-step procedures and α-exhaustive procedures are usually more powerful than α-conservative approaches. However, when these two aspects are considered together, comparisons of testing procedures are not that simple, often depending on the configuration of the alternative "hypotheses" or, more precisely, the truths. For example, the power of a fallback procedure depends on the weights and the "effect sizes" in the alternative hypotheses. A fixed sequence testing procedure is a special case of the fallback procedure and its power depends heavily on the order of the hypotheses in the test sequence.
We develop a simple single α-exhaustive procedure that can improve power 2-5% over Hochberg's and Hommel's methods in common situations when the test statistics are independent. The method can also be generalized to dependent test statistics. The idea behind our method is to construct the rejection rules using the product of marginal p-values and by controlling the upper bounds of the kth order terms so that α is exhausted for any configuration of k null hypotheses. Such upper bounds or critical values are determined progressively from k = 1 towards k = K, the number of null hypotheses in the problem. Unlike common stepwise test procedures, in which every step in the decision rule will only involve one critical value for decision-making, the proposed α-exhaustive approach is a single-step method with multiple critical values involved in the decision rules.
The paper is organized as follows. In section 2, we review several important stepwise test procedures that will be used in our power comparisons. In section 3, we elaborate our progressive α-exhaustive procedure for two-hypothesis testing, outline the idea, derive the formulations for critical values and provide examples of using this procedure in comparison with other methods. We also provide the power formulation for the α-exhaustive procedure for two-hypothesis testing. In section 4, we provide power comparisons among several different methods using simulation. In section 5, we extend the α-exhaustive procedure to three-hypothesis testing and compare its power with that of Hommel's procedure under broad conditions. In section 6, we further describe the α-exhaustive procedure for general K-hypothesis testing and simulation algorithms for determining the critical values. In the last section, discussion and summary are provided. We place the mathematical derivations in the Appendix. To make the procedure ready for practical use, we have included the SAS code in the Appendix.

Multiple Testing Procedures
Stepwise procedures are different from single-step procedures, in the sense that a stepwise procedure must follow a specific order to test each hypothesis. In general, stepwise procedures are more powerful than single-step procedures. There are three categories of stepwise procedures, depending on how the stepwise tests proceed: stepup, stepdown and fallback procedures. The commonly used stepwise procedures include the Bonferroni-Holm stepdown method (Holm, 1979), the Holm stepdown method (Dmitrienko et al., 2009, p. 65), Hommel's stepup procedure (Hommel, 1988), Hochberg's stepup method (Hochberg, 1988), the fallback procedure (Wiens, 2003) and the sequential test with fixed sequences (Westfall et al., 1999).

Stepdown Procedure
A stepdown procedure starts with the most significant p-value and ends with the least significant p-value. In the procedure, the p-values are arranged in ascending order, i.e., from the smallest to the largest:
$$p_{(1)} \le p_{(2)} \le \cdots \le p_{(K)},$$
with the corresponding hypotheses:
$$H_{(1)}, H_{(2)}, \ldots, H_{(K)}.$$
The test proceeds from $H_{(1)}$ to $H_{(K)}$. If $p_{(k)} > C_k \alpha$ ($k = 1, \ldots, K$), retain all $H_{(i)}$ ($i \ge k$); otherwise, reject $H_{(k)}$ and continue to test $H_{(k+1)}$. The constants $C_k$ are different for different procedures.
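The adjusted p-values are:
$$\tilde{p}_{(k)} = \min\left\{1, \max_{i \le k} \frac{p_{(i)}}{C_i}\right\}, \quad k = 1, \ldots, K.$$
Therefore, an alternative test procedure is to compare the adjusted p-values against the unadjusted α. After adjusting the p-values, one can test the hypotheses in any order.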
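As an illustration of the stepdown rule, the following Python sketch computes adjusted p-values for arbitrary constants $C_k$ and applies it with Holm's constants $C_k = 1/(K - k + 1)$. The function names `stepdown_adjusted` and `holm_constants` are hypothetical and introduced only for this example; the adjusted p-value is taken as the running maximum of $p_{(i)}/C_i$, capped at 1, which is what the retain/reject rule above implies.

```python
def stepdown_adjusted(pvalues, constants):
    """Adjusted p-values for a stepdown procedure.

    pvalues   : raw p-values, in any order
    constants : C_k applied to the k-th smallest p-value (k = 1..K)
    Returns the adjusted p-values in the original order.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_max = 0.0
    for k, idx in enumerate(order):
        # H_(k) is rejected only if every earlier step was also a rejection,
        # hence the running maximum of p_(i) / C_i.
        running_max = max(running_max, pvalues[idx] / constants[k])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def holm_constants(K):
    # Holm's stepdown constants: C_k = 1 / (K - k + 1) for k = 1..K
    return [1.0 / (K - k) for k in range(K)]


p = [0.030, 0.010, 0.020]
adj = stepdown_adjusted(p, holm_constants(len(p)))  # approximately [0.04, 0.03, 0.04]
reject = [a <= 0.05 for a in adj]
```

Comparing `adj` with the unadjusted α gives the same decisions as walking through the ordered hypotheses step by step.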

Fallback Procedure (Wiens, 2003)
The Holm procedure is based on a data-driven order of testing, while the fixed-sequence procedure is based on a prefixed order of testing. A compromise between them is the so-called fallback procedure. The fallback procedure was introduced by Wiens (2003) and was further studied by Dmitrienko et al. (2006) and Hommel and Bretz (2008). The test procedure can be outlined as follows.
In the fallback procedure, we allocate the overall error rate α among the hypotheses according to their weights $w_k$, where $w_k \ge 0$ and $\sum_k w_k = 1$. For the fixed sequence test, $w_1 = 1$ and $w_2 = \cdots = w_K = 0$:
• Step 1: Test $H_1$ at $\alpha_1 = \alpha w_1$. If $p_1 \le \alpha_1$, reject this hypothesis; otherwise retain it. Go to the next step
• Step k (k = 2, ..., K): Test $H_k$ at $\alpha_k = \alpha_{k-1} + \alpha w_k$ if $H_{k-1}$ is rejected and at $\alpha_k = \alpha w_k$ if $H_{k-1}$ is retained. If $p_k \le \alpha_k$, reject $H_k$; otherwise retain it. If k < K, go to the next step
The formula for the adjusted p-value is too complicated to be written explicitly.
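The fallback steps can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `fallback_test` is hypothetical, and the sketch assumes the standard fallback rule in which the level of a rejected hypothesis is carried forward to the next hypothesis in the prespecified order.

```python
def fallback_test(pvalues, weights, alpha):
    """Fallback procedure; hypotheses are tested in the given (prespecified) order."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    decisions = []
    carried = 0.0  # level passed on by a rejected predecessor
    for p, w in zip(pvalues, weights):
        level = carried + alpha * w
        rejected = p <= level
        decisions.append(rejected)
        carried = level if rejected else 0.0
    return decisions


# Equal weights: the level of H1 is reused for H2 once H1 is rejected.
print(fallback_test([0.01, 0.02], [0.5, 0.5], 0.025))   # [True, True]
# Fixed sequence (w1 = 1, w2 = 0): nothing can be rejected after H1 is retained.
print(fallback_test([0.03, 0.001], [1.0, 0.0], 0.025))  # [False, False]
```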

Stepup Procedure
A stepup procedure starts with the least significant p-value and ends with the most significant p-value. The procedure proceeds from $H_{(K)}$ to $H_{(1)}$. If $p_{(k)} \le C_k \alpha$ ($k = 1, \ldots, K$), reject all $H_{(i)}$ ($i \le k$); otherwise, retain $H_{(k)}$ and continue to test $H_{(k-1)}$.
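A short Python sketch of the stepup rule follows, using Hochberg's constants $C_k = 1/(K - k + 1)$ as the example. The names `stepup_reject` and `hochberg_constants` are hypothetical helpers introduced for illustration only.

```python
def stepup_reject(pvalues, constants, alpha):
    """Stepup procedure: returns rejection decisions in the original order."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    reject = [False] * len(pvalues)
    # Walk from the largest p-value, p_(K), down to the smallest, p_(1).
    for k in range(len(order) - 1, -1, -1):
        if pvalues[order[k]] <= constants[k] * alpha:
            # First success: reject H_(k) and every hypothesis with a smaller p-value.
            for idx in order[: k + 1]:
                reject[idx] = True
            break
    return reject


def hochberg_constants(K):
    # Hochberg's stepup constants: C_k = 1 / (K - k + 1) for k = 1..K
    return [1.0 / (K - k) for k in range(K)]


# Both p-values are below alpha, so the first step already rejects everything.
print(stepup_reject([0.02, 0.024], hochberg_constants(2), 0.025))  # [True, True]
```

With the same p-values, a stepdown test using the same constants would stop at the first step because 0.02 exceeds α/2, which illustrates why the direction of the walk matters.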
The adjusted p-values are:
$$\tilde{p}_{(k)} = \min\left\{1, \min_{i \ge k} \frac{p_{(i)}}{C_i}\right\}, \quad k = 1, \ldots, K.$$

Progressive α-Exhaustive Testing Procedure
An α-exhaustive procedure is a closed testing procedure based on intersection hypothesis tests, the size of each of which is exactly α. In other words, $\Pr(\text{Reject } H_I) = \alpha$ for any intersection hypothesis $H_I$, $I \subseteq \{1, \ldots, K\}$. Put simply, in an α-exhaustive procedure, the supremum of the probability of false rejection in any null hypothesis configuration is equal to α.
Many stepwise test procedures have been developed, which are not necessarily α-exhaustive. Therefore, there is room for improvement. However, an α-exhaustive procedure is not necessarily a powerful test. A fixed sequence test is an α-exhaustive test, but it is often the least powerful test if the sequence of tests is inappropriately chosen.
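Let's discuss the situation of two-hypothesis testing:
$$H_0: H_1 \cap H_2 \quad \text{versus} \quad H_a: \bar{H}_1 \cup \bar{H}_2.$$
Here $\bar{H}_k$ is the negation of $H_k$, $k = 1, 2$. In this setting, if either $H_1$ or $H_2$ is rejected, the null hypothesis $H_0$ is rejected. Let $p_1$ and $p_2$ be the marginal p-values for testing $H_1$ and $H_2$, respectively. The procedure rejects $H_1$ if $p_1 \le \alpha$ and $p_1 p_2 \le \alpha_1$, and rejects $H_2$ if $p_2 \le \alpha$ and $p_1 p_2 \le \alpha_2$.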
The idea behind this procedure is to borrow strength among marginal p-values. In plain language, the procedure says that we don't have to make an α adjustment, as long as p 1 ≤α and the other p-value p 2 is small. For example, if p 1 = α and p 2 = 0.01α, we can reject H 1 . The α 1 and α 2 are so determined that when both H 1 and H 2 are true, the type-I error will not exceed α.
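The decision rule can be written as a small Python function. This is a sketch under assumptions: the name `alpha_ex_two` is hypothetical, and the common critical value 0.004855 is borrowed from the α = 0.025 example quoted later in the paper, so it should be treated as an illustrative input rather than a tabulated constant.

```python
def alpha_ex_two(p1, p2, alpha, a1, a2):
    """Two-hypothesis product rule: returns (reject_H1, reject_H2)."""
    product = p1 * p2
    reject_h1 = p1 <= alpha and product <= a1
    reject_h2 = p2 <= alpha and product <= a2
    return reject_h1, reject_h2


alpha = 0.025
c = 0.004855  # assumed common critical value a1 = a2 for alpha = 0.025

# The example from the text: p1 = alpha and p2 = 0.01 * alpha.
print(alpha_ex_two(alpha, 0.01 * alpha, alpha, c, c))  # (True, True)
# A small p1 alone is not enough when p2 offers no supporting evidence.
print(alpha_ex_two(0.02, 0.5, alpha, c, c))            # (False, False)
```

The second call shows the "borrowing strength" idea in reverse: $p_1 = 0.02 \le \alpha$ would be significant without adjustment, but the product 0.01 exceeds the critical value, so $H_1$ is retained.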
The procedure can control the Familywise Error Rate (FWER) strongly but at the same time exhaust all α under all the null hypothesis configurations: H 1 , H 2 and H 1 ∩H 2 . This is done progressively as described below.
Step 1: Control the familywise error rate at level α when only $H_1$ is true and $H_2$ is not true. In this configuration, $p_1 p_2 \le \alpha_1$ can be satisfied with probability 1 (if, for example, the test drug is very effective, $p_2$ will virtually always be 0). Therefore, to control FWER, a necessary condition is $\sup \Pr(p_1 p_2 \le \alpha_1 \cap p_1 \le \alpha \mid H_1 \cap \bar{H}_2) = \sup \Pr(p_1 \le \alpha \mid H_1) = \alpha$. Therefore, the type-I error is strongly controlled and exhausted when $H_1 \cap \bar{H}_2$ is true. By the same token, we can reach the same conclusion when $\bar{H}_1 \cap H_2$ is true.
Step 2: Now we need to determine α 1 and α 2 to exhaust α when H 1 ∩H 2 is true. In this study, we only consider the case when the two test statistics, p 1 and p 2 , are independent.
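Under the global null hypothesis $H_0$, $p_1$ and $p_2$ are iid $U(0, 1)$, which is equivalent to the two standard normal test statistics $z_1 = \Phi^{-1}(1 - p_1)$ and $z_2 = \Phi^{-1}(1 - p_2)$ being independent under $H_0$. However, working on the p-scale, the testing procedure can be used for different endpoints (normal, binary, survival), as long as $p_1$ and $p_2$ are independent and stochastically equal to or larger than uniform. Thus, for $\alpha_1 < \alpha$:
$$\Pr(p_1 \le \alpha \cap p_1 p_2 \le \alpha_1 \mid H_1 \cap H_2) = \int_0^{\alpha} \min\left(1, \frac{\alpha_1}{p_1}\right) dp_1 = \alpha_1 + \alpha_1 \ln\frac{\alpha}{\alpha_1},$$
and similarly for the rejection of $H_2$ with $\alpha_2$. Consequently, the type-I error rate under $H_1 \cap H_2$ is given by (for simplicity, we just use FWER($H_1 \cap H_2$)):
$$\mathrm{FWER}(H_1 \cap H_2) = \alpha_1 + \alpha_1 \ln\frac{\alpha}{\alpha_1} + \alpha_2 + \alpha_2 \ln\frac{\alpha}{\alpha_2} - \alpha^2. \quad (9)$$
In (9), we have used the following result: if $\alpha^2 \le \min(\alpha_1, \alpha_2)$, then:
$$\Pr(p_1 \le \alpha \cap p_2 \le \alpha \cap p_1 p_2 \le \alpha_{\min}) = \alpha^2.$$
However, if $\alpha^2 > \min(\alpha_1, \alpha_2)$, then the probability becomes (Fig. 2):
$$\Pr(p_1 \le \alpha \cap p_2 \le \alpha \cap p_1 p_2 \le \alpha_{\min}) = \alpha_{\min} + \alpha_{\min} \ln\frac{\alpha^2}{\alpha_{\min}}.$$
To summarize the type-I error rates under various null configurations, we have:
$$\mathrm{FWER}(H_1 \cap \bar{H}_2) = \mathrm{FWER}(\bar{H}_1 \cap H_2) = \alpha,$$
$$\mathrm{FWER}(H_1 \cap H_2) = \begin{cases} \sum_{i=1}^{2} \left(\alpha_i + \alpha_i \ln\frac{\alpha}{\alpha_i}\right) - \alpha^2, & \alpha^2 \le \alpha_{\min}, \\ \sum_{i=1}^{2} \left(\alpha_i + \alpha_i \ln\frac{\alpha}{\alpha_i}\right) - \alpha_{\min} - \alpha_{\min} \ln\frac{\alpha^2}{\alpha_{\min}}, & \alpha^2 > \alpha_{\min}, \end{cases}$$
where $\alpha_{\min} = \min(\alpha_1, \alpha_2)$ and $\alpha_1, \alpha_2 < \alpha$.
The $\alpha_1$ and $\alpha_2$ are determined so that FWER($H_1 \cap H_2$) = α. We are not interested in $\alpha_1 \ge \alpha$, because $p_1 p_2 \le \alpha_1$ in the rejection criteria then has no effect. In fact, FWER($H_1 \cap H_2$) = $2\alpha - \alpha^2 = \alpha$ has no solution for any α between 0 and 1. We are not interested in $\alpha_1 < \alpha^2$ either, because it makes the conditions $p_1 \le \alpha$ and $p_2 \le \alpha$ have no effect in determining the rejection boundary. In fact:
$$\sum_{i=1}^{2} \left(\alpha_i + \alpha_i \ln\frac{\alpha}{\alpha_i}\right) - \alpha_{\min} - \alpha_{\min} \ln\frac{\alpha^2}{\alpha_{\min}} = \alpha$$
has no solution for $\alpha_1 < \alpha^2$ and $\alpha_2 < \alpha^2$. Therefore, the only scenario that we are interested in is:
$$\alpha^2 \le \alpha_1, \alpha_2 < \alpha, \qquad \alpha_1 + \alpha_1 \ln\frac{\alpha}{\alpha_1} + \alpha_2 + \alpha_2 \ln\frac{\alpha}{\alpha_2} - \alpha^2 = \alpha. \quad (14)$$
With (14), we can determine the rejection boundaries $\alpha_1, \alpha_2$ for given α. Here are the steps: choose $\alpha_1$ with $\alpha^2 \le \alpha_1 < \alpha$, then solve (14) for $\alpha_2$. That is, $\alpha_2$ is the solution of:
$$\alpha_2 + \alpha_2 \ln\frac{\alpha}{\alpha_2} = \alpha + \alpha^2 - \alpha_1 - \alpha_1 \ln\frac{\alpha}{\alpha_1}. \quad (15)$$
Examples of critical values from (15) are presented in Tables 1 and 2.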
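To make the determination of the critical values concrete, the following Python sketch evaluates the type-I error rate under $H_1 \cap H_2$ for independent uniform p-values and solves for a common boundary $\alpha_1 = \alpha_2$ by bisection. The closed form coded here is obtained by integrating the rejection region over the unit square for boundaries below α; it is our reading of the derivation, and the function names `fwer_global_null` and `solve_equal_boundary` are hypothetical.

```python
import math


def fwer_global_null(alpha, a1, a2):
    """Type-I error rate under H1 and H2 both true, independent U(0,1) p-values.

    Assumes 0 < a1, a2 <= alpha.
    """
    def marginal(a):
        # Pr(p_i <= alpha and p1 * p2 <= a)
        return a * (1.0 + math.log(alpha / a))

    a_min = min(a1, a2)
    if alpha ** 2 <= a_min:
        both = alpha ** 2  # the square [0, alpha]^2 lies inside the product region
    else:
        both = a_min * (1.0 + math.log(alpha ** 2 / a_min))
    return marginal(a1) + marginal(a2) - both


def solve_equal_boundary(alpha, tol=1e-12):
    """Bisection for the common boundary c = a1 = a2 with FWER = alpha."""
    lo, hi = alpha ** 2, alpha  # the only range of interest
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fwer_global_null(alpha, mid, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


c = solve_equal_boundary(0.025)
print(c, fwer_global_null(0.025, c, c))
```

For α = 0.025 the solution is close to the value 0.004855 quoted later in the text. Bisection is used because the error rate is increasing in the boundary on the interval of interest; any root finder would serve equally well.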
The rejection boundaries in Tables 1-3 have each been verified by 10,000,000 simulations.
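A verification of this kind can be sketched in Python as follows. The sketch is not the simulation code used for the tables: it uses a far smaller number of replications (200,000) so that it runs quickly, the seed is arbitrary, the function name `simulate_fwer` is hypothetical, and the boundary 0.004855 for α = 0.025 is taken from the value quoted later in the text.

```python
import random


def simulate_fwer(alpha, a1, a2, n_sim=200_000, seed=12345):
    """Monte Carlo estimate of the FWER when both H1 and H2 are true."""
    rng = random.Random(seed)
    false_rejections = 0
    for _ in range(n_sim):
        p1, p2 = rng.random(), rng.random()  # independent U(0,1) p-values
        product = p1 * p2
        reject_h1 = p1 <= alpha and product <= a1
        reject_h2 = p2 <= alpha and product <= a2
        if reject_h1 or reject_h2:
            false_rejections += 1
    return false_rejections / n_sim


estimate = simulate_fwer(0.025, 0.004855, 0.004855)
print(estimate)  # should be near 0.025, up to Monte Carlo error
```

With 200,000 replications the standard error of the estimate is roughly 0.00035, so agreement to the third decimal place is all that should be expected at this sample size.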
Using the test procedures described in the previous section, we summarize the rejection status in Table 4. The α-exhaustive procedure can reject at least one hypothesis except in scenario 5. The reason that the α-Ex method (with α 1 = α 2 ) cannot reject any hypothesis in scenario 5 is that the method emphasizes the consistency of the evidence against all the hypotheses, and such consistency is obviously not present in scenario 5.

Power Comparison of Two Hypotheses
There seems to be a general impression that, whatever test procedure is used, the power of rejection cannot be improved significantly for two-hypothesis testing. However, this is not necessarily true. We have compared the power of seven different testing methods described in section 2 and present the results in Table 5, where Power 1 is the probability of simultaneously rejecting $H_1: \delta_1 \le 0$ and $H_2: \delta_2 \le 0$ and Power is the probability of rejecting either $H_1$ or $H_2$. For the fallback procedure, the weights $w_1 = w_2 = 0.5$ are used. The fixed sequence method is equivalent to the fallback procedure with $w_1 = 1$ and $w_2 = 0$.
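A power comparison of this type can be sketched with a small simulation. Everything about the data model below is an assumption made for illustration: each test statistic is taken to be normal with unit variance and mean `theta_i` (a hypothetical noncentrality parameter, not the δ of Table 5), p-values are one-sided, and the common α-Ex boundary 0.004855 for α = 0.025 is the value quoted later in the text. Only the probability of rejecting at least one hypothesis is estimated, for the α-Ex rule and for Hochberg's method.

```python
import random
from statistics import NormalDist


def power_any(theta1, theta2, alpha=0.025, c=0.004855, n_sim=100_000, seed=2024):
    """Estimated probability of rejecting H1 or H2: (alpha-Ex, Hochberg)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits_aex = 0
    hits_hochberg = 0
    for _ in range(n_sim):
        # One-sided p-values from independent N(theta_i, 1) statistics.
        p1 = 1.0 - std_normal.cdf(rng.gauss(theta1, 1.0))
        p2 = 1.0 - std_normal.cdf(rng.gauss(theta2, 1.0))
        # alpha-Ex with a1 = a2 = c: some p_i <= alpha and the product <= c.
        if (p1 <= alpha or p2 <= alpha) and p1 * p2 <= c:
            hits_aex += 1
        # Hochberg for K = 2: both <= alpha, or the smaller one <= alpha / 2.
        if max(p1, p2) <= alpha or min(p1, p2) <= alpha / 2:
            hits_hochberg += 1
    return hits_aex / n_sim, hits_hochberg / n_sim


print(power_any(0.0, 0.0))  # both near alpha under the global null
print(power_any(2.5, 2.5))  # power when both effects are present
```

Because both rules are evaluated on the same simulated p-values, differences between the two estimates are less noisy than the estimates themselves.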
The progressive α-exhaustive procedure performs the best overall, while the Hommel method performs the second best. In general, the Holm procedure is uniformly more powerful than the Bonferroni procedure. Hochberg's procedure is uniformly more powerful than Holm's procedure and Hommel's procedure is uniformly more powerful than Hochberg's procedure. The Holm, fixed-sequence and fallback procedures are nonparametric and control FWER for any joint distribution of test statistics. The Hommel and Hochberg procedures are semiparametric and control FWER only for some joint distributions, including positively dependent test statistics such as multivariate normal test statistics. Nonparametric procedures make no assumptions about the joint distribution of test statistics, which results in power loss (Dmitrienko, 2013). For two-hypothesis testing, Hochberg's method is equivalent to Hommel's method. The power of the fallback method depends on the weights $w_i$ and the order of the hypotheses.
The critical values can also be determined through simulations. Especially when dimension is high, simulation is a convenient way to obtain the results: For given (α 1 , α 2 ,α 3 ), we can use simulation by trying different α 4 until FWER(H 1 ∩H 2 ∩H 3 ) = α. We have verified the critical values through simulations: For α = 0.025 and α 1 = α 2 = α 3 = 0.004855, α 4 = 0.002677; the type-I error rate is 0.025003 under H 1 ∩H 2 ∩H 3 through 10,000,000 simulations. This progressive method to determine the critical values can be generalized to K-hypothesis testing.
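The trial-and-error search by simulation can be organized as a bisection on a fixed set of simulated p-values. Since the three-hypothesis rejection rule is not restated in this section, the Python sketch below illustrates the idea on the two-hypothesis rule with a common boundary, where the answer can be checked against the value 0.004855 quoted above. The function name `critical_value_by_simulation`, the number of replications and the seed are all hypothetical choices.

```python
import random


def critical_value_by_simulation(alpha, n_sim=400_000, seed=7, n_iter=30):
    """Search for c = a1 = a2 such that the simulated FWER under H1, H2 equals alpha."""
    rng = random.Random(seed)
    # A rejection needs min(p1, p2) <= alpha, so only those draws are stored.
    # The same draws are reused for every trial value of c.
    products = []
    for _ in range(n_sim):
        p1, p2 = rng.random(), rng.random()
        if min(p1, p2) <= alpha:
            products.append(p1 * p2)

    def estimated_fwer(c):
        return sum(prod <= c for prod in products) / n_sim

    lo, hi = alpha ** 2, alpha
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if estimated_fwer(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


c_sim = critical_value_by_simulation(0.025)
print(c_sim)  # close to 0.004855, up to Monte Carlo error
```

Reusing the same draws for every trial value keeps the estimated error rate monotone in the boundary, so the bisection is well defined. The same pattern extends to a higher-order critical value once the lower-order ones are fixed, by replacing the rejection indicator.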

Power Comparison
Let $H_i: \delta_i \le 0$, $i = 1, 2, 3$. Using the rejection boundaries in Table 7, we can easily obtain the power of the α-Ex method through simulations. To assess the performance of α-Ex, we use the best competing method, Hommel's method, as the standard.
Since there are three hypotheses, it is meaningful to compare our approach to other more recently developed approaches. However, the gatekeeping procedure is difficult to communicate to nonstatisticians and requires a large set of tests when the number of individual hypotheses increases. An iterative graphical approach deals with those weaknesses and constructs Bonferroni-type tests with a simple updating algorithm that fully describes a sequentially rejective test procedure. The graphical approach was then extended by dissociating the underlying weighting strategy and applied using weighted Bonferroni tests, weighted parametric tests and weighted Simes tests (Bretz et al., 2011). The existing methods control the FWER; however, their power is not addressed or compared with that of other methods. From Table 8, we can see that the α-Ex procedure provides more power in all cases except the case when $\delta_1 = \delta_2 = 0$ and $\delta_3 = 0.3$.
Again, as in the case of two-hypothesis testing, when the parameters in the alternative hypotheses (e.g., effects of the different endpoints) are very different, we should use different α 1 , α 2 and α 3 such that their trend is opposite to the trend of the parameters in the alternative hypotheses.

K-Hypothesis Testing Procedure
We now discuss the progressive α-exhaustive procedure for K-hypothesis testing. To avoid the rejection boundary being too small, causing inconvenience, we use term in the decision rules for a general K-hypothesis testing. It is obvious that these two test statistics are equivalent in terms of power.
For K-hypothesis testing, the K rejection rules are specified as: We will reject H i if and only if: We did not include higher-order terms of the p-product because, through simulations, we find that even if we set p i p j p k p m ≤ 1, the FWER is controlled. This means that for a multiple testing problem with more than four hypotheses, the proposed procedure may not be an α-exhaustive one.
The rejection boundaries α 1 , α 2 , ..., α K are progressively determined: determine α 1 = α based on one-hypothesis testing; then, given α 1 , determine α 2 based on two-hypothesis testing; and given α 1 and α 2 , determine α 3 based on three-hypothesis testing. The process continues until α K is determined. For high-dimensional hypothesis testing problems, the Partition Principle for multiple testing (Hsu, 1996) can be used to reduce the number of null configurations to be tested. Simulation is usually more convenient than numerical integration when the dimension is high.

Summary and Discussion
To construct an MTP, we need to consider at least three things to ensure the power: (1) being α-exhaustive, (2) synergizing strengths among data for local hypotheses or marginal p-values and (3) being able to use correlations between local test statistics or local p-values. In principle, the proposed α-exhaustive procedure has considered all three aspects. To achieve α-exhaustiveness, we use the marginal p-value product corresponding to each null hypothesis configuration and enforce an upper bound on it in the rejection rules. Such p-value product terms in the rejection rules also ensure the synergy between the marginal p-values. The K-hypothesis testing algorithm can be applied to correlated test statistics with modifications of the critical regions for rejection (the critical values in Tables 1-3 are applicable for independent test statistics), but due to its complexity and larger applications in clinical trials (dose-finding, subgroup analysis, adaptive design), we are developing separate manuscripts to address that.
Unlike traditional stepwise procedures, the decision rule in the progressive α-exhaustive procedure explicitly uses a set of statistics (p 1 , p 1 p 2 , p 1 p 3 , p 1 p 2 p 3 , etc.) with a set of critical values for rejecting a single H k (k = 1, 2, ..., K). In this sense, the decision rules in the α-exhaustive procedure are expressed in the form of "adjusted p-values" and hence the order of the tests is irrelevant. Many stepwise testing procedures also have the feature of borrowing strengths among data for local hypotheses, but such dependencies are realized through a discrete function. The α-exhaustive procedure uses a continuous dependency function of marginal p-values, i.e., the product of p-values, which is more effective. We have also tried other functions, such as the average of the p-values or a linear combination of normal-inverse p-values; the results are similar.
In summary, the proposed progressive α-exhaustive procedure is not only statistically powerful, but it also stresses the importance of clinical/practical meaningfulness, since the method emphasizes the consistency of the evidence coming from different endpoints, different doses and different populations, that is, the totality of the evidence. The test procedure is simple and performs well in broad situations. When the true "standardized" effect size (value of the parameter) is very different for different hypotheses, the critical values do not have to be the same for rejecting all the hypotheses. Instead, the critical values can be different and optimized based on the prior information on effect size or considering the importance of different endpoints. From (A1), we can see that among α 1 , α 2 and α 3 , at least two should be the same, either α 1 = α 2 or α 2 = α 3 (α 1 ≤ α 2 ).

Appendices
The type-I error rate under $H_1 \cap H_2 \cap H_3$ can be expressed as: [...]
For $\pi_{123}$, we just give the result for the case when $\alpha_1 = \alpha_2 = \alpha_3$. Assume [...] ≥ [...]; otherwise only $p_1 p_2 p_3 \le \alpha_4$ has an effect in the joint probabilities, while $p_1 p_2 \le \alpha_1$ and the other pairwise conditions will have no effect. In fact, $\alpha_1$ is so small that the three curves $p_i p_j = \alpha_1$ cut the cube into a smaller cube $\alpha_1 \times \alpha_1 \times \alpha_1$, as shown in Fig. 5. Therefore, we have:
$$\pi_{123} = \int_{\Omega} dp_1\, dp_2\, dp_3 = \alpha_1^3 \quad (A8)$$
To summarize, we have: