Adaptive Superiority and Noninferiority Trial Design with Paired Binary Data

Corresponding Author: Jing Wang, Boston University, Boston, MA, USA. Email: jwa222@bu.edu

Abstract: Noninferiority of a diagnostic test to a standard is a common issue in medical research. For instance, we may want to determine whether a new diagnostic test is noninferior to the standard reference test because the new test might be so inexpensive that a small inferiority margin in sensitivity or specificity is acceptable. Noninferiority trials are also useful in clinical settings, such as imaging studies, in which the data are collected in pairs. Conventional noninferiority trials for paired binary data are designed with a fixed sample size, and no interim analysis is allowed. Adaptive design, which allows for interim modifications of a trial, has become very popular in recent years and is widely used in clinical trials because of its efficiency. However, to our knowledge no adaptive design method is available for noninferiority trials with paired binary data. In this study, we develop an adaptive design method for noninferiority trials with paired binary data, which can also be used for superiority trials when the noninferiority margin is set to zero. We include a trial example and provide the SAS program for the design simulations.


Noninferiority Design
As the European regulatory agency, the Committee for Medicinal Products for Human Use (CHMP, 2005), stated, "Many clinical trials comparing a test product with an active comparator are designed as noninferiority trials. The term 'noninferiority' is now well established, but if taken literally could be misleading. The objective of a noninferiority trial is sometimes stated as being to demonstrate that the test product is not inferior to the comparator. However, only a superiority trial can demonstrate this. In fact a noninferiority trial aims to demonstrate that the test product is not worse than the comparator by more than a pre-specified, small amount. This amount is known as the noninferiority margin, or delta." Until recent years, the majority of clinical trials were designed for superiority to a comparative drug (the control group). Statistics show that only 23% of all NDAs from 1998 to 2002 were innovative drugs and the rest were accounted for as "me-too" drugs (Chang, 2010). The "me-too" drugs are judged based on noninferiority criteria. The increasing popularity of noninferiority trials is a reflection of regulatory and industry adjustments in response to increasing challenges in drug development.
From a methodological perspective, Chan (2001) derived power and sample size formulations for noninferiority trials using an exact method. Kong et al. (2004) studied a noninferiority test for diagnostic accuracy using a bivariate normal distribution. Wiens and Heyes (2003) proposed an analysis strategy that allows interactions to be considered in noninferiority trials. Liu et al. (2002) investigated two asymptotic test statistics, a Wald-type (sample-based) test statistic and a restricted maximum likelihood estimation (RMLE-based) test statistic, to assess noninferiority based on paired binary endpoints. They found that the RMLE-based test controls the type-I error better than the sample-based test. Lu and Bean (1995) and Nam (1997) proposed test statistics and sample size determination for comparing two diagnostic methods in a noninferiority test of sensitivity. Lu et al. (2003) discussed simultaneous comparisons of sensitivity and specificity. However, all these methods are applicable only to classical designs with a fixed sample size. In this study we develop an adaptive design method for noninferiority trials with a paired binary endpoint and discuss its application to diagnostic tests.
There are three major sources of uncertainty about the conclusions from a non-inferiority (NI) study: (1) the uncertainty of the active-control effect over a placebo, which is estimated from historical data, (2) the possibility that the control effect may change over time, violating the "constancy assumption" and (3) the risk of making a wrong decision from the test of the noninferiority hypothesis in the NI study, i.e., the type-I error. These three uncertainties have to be considered in developing a noninferiority design method.

Commonly Used Noninferiority Design Methods
Most commonly used noninferiority trials are based on parallel, two-group designs. Three-group designs with a placebo may sometimes be used, but they are not very cost-effective and often face ethical challenges when including a placebo group, especially in the United States.
There are three commonly used methods of noninferiority design: the fixed-margin method, the λ-portion method and the synthesis method (in original and log scales). We denote the test and the active-control groups by subscripts T and C, respectively. Where there is no confusion, the letter T will also be used for test statistics. We will use the hat "^" to represent an estimate of the corresponding parameter, e.g., θ̂ is an estimate of θ.

Fixed-Margin Method
The null hypothesis for the fixed-margin method can be defined as:

Ho: θT - θC ≤ δNI (1)

where θ can be the mean, hazard rate, adverse event rate, recurrent event rate, or mean number of events.
The constant noninferiority margin δNI ≤ 0 (assuming a larger value of the parameter is desirable; otherwise, δNI should be larger than zero) is usually determined based on a historical placebo-control study (see more discussion later). When δNI = 0, (1) becomes a null hypothesis for superiority. The rejection of (1) can be expressed in a simple way: The test drug T is not inferior to C by δNI or more.

λ-Portion Method
The null hypothesis for the λ-portion method can be defined as:

Ho: θT ≤ λNI θC (2)

where 0 < λNI < 1. The rejection of (2) can be interpreted in layman's terms: Drug T is at least 100λNI% as effective as drug C.

Synthesis Method
The null hypothesis for the synthesis method is given by:

Ho: θT - θP ≤ λNI (θC - θP) (3)

Assuming we have established θC - θP > 0, (3) is then equivalent to:

Ho: (θT - θP)/(θC - θP) ≤ λNI (4)

where 0 < λNI < 1. For the superiority test, λNI = 1.
The rejection of (3) can be summed up in these terms: The test drug T is at least 100λNI% as effective as C after subtracting the placebo effect. When λNI = 0, (3) represents a null hypothesis for a putative placebo-control trial.
The rejection rule is specified as follows (assuming a larger θ is preferred): reject Ho if the test statistic

T = √n ε̂/σ̂ ≥ z1-α (9)

Equivalently, we can use the one-sided confidence interval of ε: reject Ho if the lower confidence limit ε̂ - z1-α σ̂/√n is greater than zero. The power of the test statistic T under a particular Ha can be expressed as:

power = Φ(√n ε/σ - z1-α) (12)

where ε and σ are estimated by (7).
Solving (12) for the sample size, we obtain:

n = (z1-α + z1-β)² σ²/ε² (13)

Equation 13 is a general sample-size formulation for a trial with a normal, binary, or survival endpoint (Chang, 2007a).
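Equation 13 is easy to sanity-check numerically. The sketch below is in Python (the paper's own simulations use SAS); the effect size ε, standard deviation σ, one-sided α and power are all user-supplied assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size(eps: float, sigma: float, alpha: float = 0.025,
                power: float = 0.9) -> int:
    """Equation 13: n = (z_{1-alpha} + z_{1-beta})^2 * sigma^2 / eps^2,
    rounded up to the next whole subject (or pair)."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha) + z(power)) * sigma / eps) ** 2)

# One-sided alpha = 0.025, 90% power, effect size 0.1, standard deviation 0.5
print(sample_size(0.1, 0.5))  # 263
```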
For the test statistic given by (9), the p-value is given by:

p = 1 - Φ(T) (14)

where Φ is the standard normal cdf.

Remark
A common misconception is that for an NI trial the sample size calculation must assume θT = θC or p01 = p10, which is not true. One may choose an NI design because the difference θT - θC is positive but too small to demonstrate superiority with reasonable power and a reasonable sample size. The treatment difference can be positive or negative, depending on the particular situation. The power and sample size calculation should be based on the best knowledge of the value of θT - θC, and this knowledge should not change with the choice of hypothesis test. Therefore, for a given value of θT - θC and power, superiority testing always requires a larger sample size than noninferiority testing.
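The remark can be illustrated numerically with Equation 13 (a Python sketch; the values of θT - θC, σ and δNI below are hypothetical): since δNI < 0, the effective effect size of the NI test, θT - θC - δNI, exceeds the superiority effect size θT - θC, so the NI test needs fewer subjects.

```python
from math import ceil
from statistics import NormalDist

def n_required(eps: float, sigma: float, alpha: float = 0.025,
               power: float = 0.9) -> int:
    """Equation 13 with effect size eps."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha) + z(power)) * sigma / eps) ** 2)

theta_diff = 0.02    # assumed true theta_T - theta_C (hypothetical)
delta_ni = -0.075    # NI margin, delta_NI <= 0 (hypothetical)
sigma = 0.45         # hypothetical standard deviation

n_sup = n_required(theta_diff, sigma)             # superiority: eps = theta_diff
n_ni = n_required(theta_diff - delta_ni, sigma)   # NI: eps = theta_diff - delta_NI
print(n_sup, n_ni)  # 5320 236
```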

Adaptive Design
We now discuss how to incorporate Nam's formulation (Nam, 1997) into group sequential and adaptive designs. Let Tk be a test statistic on the p-value scale at the kth stage. The stopping rules are given by:

stop for efficacy if Tk ≤ αk,
stop for futility if Tk > βk,
otherwise continue to stage k + 1,

where αk < βk (k = 1, ..., K-1) and αK = βK. For convenience, αk and βk are called the efficacy and futility boundaries, respectively. The adaptations can include changes in the timing and number of interim analyses, sample-size re-estimation, etc.
To reach the kth stage, a trial has to pass the 1st to (k-1)th stages.
Therefore the c.d.f. of Tk is given by:

ψk(t) = P(α1 < T1 ≤ β1, ..., αk-1 < Tk-1 ≤ βk-1, Tk ≤ t)

In a classic group sequential or adaptive design, the test statistic on the p-value scale can be expressed as (Chang, 2007b):

Tk = 1 - Φ(Σi=1..k wki zi) (15)

where zi is the stagewise normal statistic from the subsample at stage i and the weights wki satisfy Σi=1..k wki² = 1. For the error-spending approach,

wki = √(ni / Σj=1..k nj), i = 1, ..., k, k = 1, ..., K,

where ni is the subsample size (not the cumulative sample size) at stage i.
Using the error-spending approach, the timing (information time) of the interim analyses and the total number of analyses can be changed after the initiation of the trial, as long as the change is independent of the treatment difference. For sample-size re-estimation, we use fixed weights wki, i.e., the weights do not change even when the sample size is modified. Two other commonly used test statistics for adaptive designs are the product of the stagewise p-values:

Tk = Πi=1..k pi (18)

and the linear combination of the stagewise p-values:

Tk = Σi=1..k wi pi (19)

The two-stage stopping boundaries for (15) can be calculated using numerical integration or simulation, whereas the stopping boundaries for (18) and (19) can be obtained analytically for two-stage designs. Specifically, for the test statistic defined by (18), after choosing the efficacy stopping boundary α1 and the futility boundary (β1 = 1), the efficacy stopping boundary for the 2nd stage is given by (Chang, 2007b):

α2 = (α - α1)/ln(1/α1) (20)

Similarly, for the test statistic defined by (19) with w1 = w2 = 1, the stopping boundary is given by:

α2 = α1 + √(2(α - α1)) (21)

For the error-spending approach, numerical integration gives the OF-like boundary, Pocock-like boundary and power-function boundary (with ρ = 2) as follows: α1 = 0.00260 and α2 = 0.0240 (OF), α1 = 0.0147 and α2 = 0.0147 (Pocock), and α1 = 0.00625 and α2 = 0.02173 (PF). These stopping boundaries will be used later in our trial example.
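The quoted two-stage boundaries can be checked by Monte Carlo. This Python sketch (the paper uses SAS) draws stagewise z-statistics under Ho, combines them with equal weights w1 = w2 = 1/√2 as in (15) for equal stagewise sample sizes, and estimates the overall type-I error for the OF-like boundaries α1 = 0.00260, α2 = 0.0240; futility stopping is ignored here, which is fine because nonbinding futility cannot inflate the error.

```python
import random
from statistics import NormalDist

def simulate_type1(alpha1: float, alpha2: float, runs: int = 200_000,
                   seed: int = 1) -> float:
    """Monte Carlo overall type-I error of a two-stage group sequential
    design with the equal-weight inverse-normal statistic (15)."""
    rng = random.Random(seed)
    inv = NormalDist().inv_cdf
    z1_crit = inv(1 - alpha1)        # stage-1 boundary on the z scale
    z2_crit = inv(1 - alpha2)        # final boundary on the combined z scale
    w = 0.5 ** 0.5                   # w1 = w2 = 1/sqrt(2): equal stage sizes
    rejections = 0
    for _ in range(runs):
        z1 = rng.gauss(0.0, 1.0)
        if z1 > z1_crit or w * z1 + w * rng.gauss(0.0, 1.0) > z2_crit:
            rejections += 1
    return rejections / runs

print(simulate_type1(0.00260, 0.0240))  # close to the nominal 0.025
```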
If we use Nam's test statistic defined by (6)-(8) for the subsample at the ith stage, we can then calculate the "stagewise" p-value for the ith stage based on (14), that is:

pi = 1 - Φ(√ni ε̂i/σ̂i) (22)

where ε̂i and σ̂i are the corresponding quantities in (14), calculated from the subsample at the ith stage. (22) is valid as long as ni is large.
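To make (22) concrete, here is a Python sketch with hypothetical interim counts. For simplicity it substitutes the sample-based variance of Liu et al. (2002) for Nam's RMLE-based variance, so it is an approximation to the paper's statistic, not a reproduction of it.

```python
from statistics import NormalDist

def stagewise_p(n10: int, n01: int, n: int, delta_ni: float) -> float:
    """Stagewise p-value per (22), with the sample-based variance of
    Liu et al. (2002) in place of Nam's RMLE-based variance.
    n10, n01: discordant-pair counts among the n pairs of the stage-i
    subsample; delta_ni <= 0 is the noninferiority margin."""
    p10, p01 = n10 / n, n01 / n
    eps_hat = p10 - p01 - delta_ni                  # estimate of epsilon_i
    se = ((p10 + p01 - (p10 - p01) ** 2) / n) ** 0.5
    return 1 - NormalDist().cdf(eps_hat / se)

# Hypothetical interim data: 200 pairs, 20 discordant pairs each way, 7.5% margin
print(stagewise_p(20, 20, 200, -0.075))  # about 0.0089
```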

Conditional Power and Sample-Size Reestimation
The general expression of conditional power at the interim analysis for a two-stage adaptive design can be written as (Chang, 2007b):

cP(p1, n2) = 1 - Φ(B(α2, p1) - ε√n2/σ) (23)

where B(α2, p1) depends on the combination test, e.g., B(α2, p1) = Φ⁻¹(1 - α2 + p1) for the linear combination (19) with w1 = w2 = 1 and B(α2, p1) = Φ⁻¹(1 - α2/p1) for the product (18). If the trial continues, i.e., α1 < p1 ≤ β1, then for a given conditional power cP we can solve (23) for the adjusted sample size for the second stage:

n2 = (σ/ε)² (B(α2, p1) + Φ⁻¹(cP))² (24)
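The conditional-power relation (23) and its inversion for the stage-2 sample size can be sketched in Python as follows. B(α2, p1) is passed in directly, and the values of α2, p1, ε and σ used in the example are illustrative assumptions only; the round trip (solve for n2, then plug back in) should recover at least the targeted conditional power.

```python
from math import ceil
from statistics import NormalDist

nd = NormalDist()

def conditional_power(eps: float, sigma: float, n2: int, b: float) -> float:
    """Conditional power per (23): cP = 1 - Phi(B(alpha2, p1) - eps*sqrt(n2)/sigma)."""
    return 1 - nd.cdf(b - eps * n2 ** 0.5 / sigma)

def n2_for_conditional_power(cp: float, eps: float, sigma: float, b: float) -> int:
    """Invert (23) for the stage-2 size: n2 = (sigma/eps)^2 (B + Phi^{-1}(cP))^2."""
    return ceil((sigma / eps) ** 2 * (b + nd.inv_cdf(cp)) ** 2)

# Illustrative values only; B here is for the sum-of-p test, B = Phi^{-1}(1 - alpha2 + p1)
b = nd.inv_cdf(1 - 0.2236 + 0.05)
n2 = n2_for_conditional_power(0.90, 0.075, 0.45, b)
cp = conditional_power(0.075, 0.45, n2, b)
print(n2, round(cp, 3))  # the round trip recovers at least the 90% target
```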

Type-I Error Control
We have used a normal approximation for z given by (6) and (22) for the classic and adaptive designs, respectively. We want to check how well these approximations work in terms of type-I error control. Various scenarios have been checked, with 1,000,000 simulation runs for each scenario. The scenarios with the larger type-I errors are presented in Table 2 (sample size = 3000 pairs). For the classic design, we use 3000 pairs. For the adaptive design with sample size re-estimation, we use 1500 pairs for the interim analysis and the maximum sample size allowed is Nmax = 6000. We can see from the table that the type-I error is well controlled when the proportion p10 ≥ 2%. When p10 < 2%, there is a slight inflation of the error.
When we run the same set of simulations with a smaller sample size of 300 pairs and N max = 600 pairs, the type-I error is far below 2.5% for all cases. For p 10 ≥2%, smaller sample sizes give smaller error but the difference is small; for p 10 <2%, the error is much smaller than 2.5% with 300 pairs. Therefore, we can say the method can be applied to NI adaptive designs.
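The type-I error check can be sketched as follows (Python with NumPy instead of the paper's SAS, a Wald-type sample-based statistic standing in for Nam's RMLE-based one, and far fewer runs than the paper's 1,000,000). The null configuration below sits on the Ho boundary p10 - p01 = δNI.

```python
import numpy as np
from statistics import NormalDist

def type1_error(p10, p01, delta_ni, n, runs=100_000, alpha=0.025, seed=7):
    """Empirical type-I error of the fixed-sample NI test for paired binary
    data, using the Wald-type (sample-based) statistic; the paper uses
    Nam's RMLE-based test, which controls the error slightly better."""
    rng = np.random.default_rng(seed)
    # discordant-pair counts (n10, n01) plus the concordant remainder, n pairs
    counts = rng.multinomial(n, [p10, p01, 1.0 - p10 - p01], size=runs)
    ph10, ph01 = counts[:, 0] / n, counts[:, 1] / n
    var = (ph10 + ph01 - (ph10 - ph01) ** 2) / n       # Var(ph10 - ph01)
    z = (ph10 - ph01 - delta_ni) / np.sqrt(np.where(var > 0, var, np.inf))
    return float(np.mean(z > NormalDist().inv_cdf(1 - alpha)))

# Ho boundary: p10 - p01 = delta_ni = -0.075 (7.5% margin, delta_NI <= 0)
print(type1_error(0.10, 0.175, -0.075, n=3000))  # near the nominal 0.025
```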

Preliminary Data for Trial Design
The adaptive design considerations will be oriented toward comparisons of the diagnostic performance of two scanning methods, separately for sensitivity (using data from positive patients) and specificity (using data from negative patients).
The two methods (Method 1 is the gold standard) are used for the detection of metastatic disease in a group of subjects with known prostate cancer, with disease status established by standardized clinical end-points of documented disease, including clinical outcome, serial PSA levels, contrast-enhanced CT scans and radionuclide bone scans. A small study was conducted on a group of matched patients. The sensitivities are 63 and 84% for Method 1 and Method 2, respectively. The specificity is 80% for both methods.
The patients per CT/bone scan data are presented in Table 3.

The Effectiveness Requirements
The requirements for gaining regulatory approval are defined as follows:
• Superiority on sensitivity with a 10% margin (point estimate) and NI on specificity with a 7.5% margin (CI); the hypothesis testing is based on the results from 2 out of 3 image readers
• Statistical methods: McNemar's test with and without cluster adjustment. However, since we have no data about the clustering, our sample size calculation will be based on testing without adjustment for clustering
The effectiveness claim will be based primarily on subject-level results, that is, a diagnosis of whether or not the patient has any evidence of metastatic prostate cancer, disregarding the number of sites of disease. The analyses of lesions will provide additional information on the ability of the diagnostic tests to determine localization and staging of the disease. For this reason, the sample size will be based on analysis results at the subject level. It is required that Method 2 have at least a 10% improvement (based on a point estimate) over Method 1 in sensitivity and be noninferior to Method 1 in specificity with a margin of 7.5%.
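For reference, McNemar's test without cluster adjustment depends only on the discordant pairs. A minimal Python sketch with hypothetical counts:

```python
from statistics import NormalDist

def mcnemar(n10: int, n01: int) -> tuple[float, float]:
    """McNemar's test without continuity or cluster adjustment: the
    chi-square statistic on the discordant pairs, plus the one-sided
    normal-approximation p-value for Method 2 being better."""
    chi2 = (n10 - n01) ** 2 / (n10 + n01)
    z = (n10 - n01) / (n10 + n01) ** 0.5
    return chi2, 1 - NormalDist().cdf(z)

# Hypothetical counts: 25 pairs detected only by Method 2,
# 10 pairs detected only by Method 1
chi2, p = mcnemar(25, 10)
print(round(chi2, 3), round(p, 4))  # 6.429 and a one-sided p near 0.006
```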

Design for Sensitivity
For the sensitivity requirement, we use a group sequential design to handle the uncertain information, with a high power of 95%. The simulation is done by setting the noninferiority margin to zero in the SAS program in the appendix, and was also verified using the commercial software package ExpDesign Studio 5.0.

Table 2. Type-I error rate control (%) against α = 2.5%
For group sequential designs (GSD), three different error-spending functions are considered: (1) the O'Brien-Fleming-like error-spending function (OF), (2) the power-function with ρ = 2 (PF) and (3) the Pocock-like error-spending function (Pocock). Given the data in Table 3 and a 95% power, we design the group sequential trial with one interim analysis at 50% information time. The simulation results are presented in Table 4. To choose an "optimal" design, we perform the following comparisons: • Comparing the results from the OF and the PF designs, we can see that the latter requires a smaller expected sample size (Na), a 7.5% reduction (73 versus 67.5 pairs), because the PF design has a larger early efficacy stopping probability (EEP = 0.429) than the OF design (EEP = 0.263). The maximum sample size is almost the same for the two designs.

Design for Specificity
For specificity, due to the large uncertainty in the information (the rates in Table 3), our design starts with a lower power of 85%, then uses sample-size re-estimation at an interim analysis at 50% information time with a targeted conditional power of 90%.
Like the GSD for sensitivity, we start with a classical design for specificity. Given the data in Table 3, i.e., p10 = 0.1 and p01 = 0.1, the calculation indicates that 322 pairs are required for 85% power at a significance level of 2.5% (one-sided), based on Nam's test (1997) and the sample size calculation method presented earlier.
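The 322-pair figure can be approximated from Equation 13: under Ha, ε = p10 - p01 - δNI = 0.075 and the per-pair variance of p̂10 - p̂01 is σ² = p10 + p01 - (p10 - p01)² = 0.2. Using the simpler sample-based variance (Nam's RMLE-based variance yields the slightly larger 322 reported above):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf
p10, p01, delta_ni = 0.10, 0.10, -0.075      # Table 3 rates, 7.5% NI margin
eps = p10 - p01 - delta_ni                   # effect size under Ha: 0.075
sigma2 = p10 + p01 - (p10 - p01) ** 2        # per-pair variance: 0.2
n = ceil((z(1 - 0.025) + z(0.85)) ** 2 * sigma2 / eps ** 2)
print(n)  # 320
```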
We use the same three error-spending functions for the adaptive trial for specificity: (1) OF, (2) PF with ρ = 2 and (3) Pocock. All designs have two stages and the interim analysis is performed at 50% information time, with a sample size of 161 pairs. The sample size adjustment is based on a targeted conditional power of 90% and the maximum sample size Nmax is 500 pairs. In all designs we use the futility boundary β1 = 0.5, which means, approximately, that if at the interim analysis we observe p̂10 - p̂01 - δNI ≤ 0, we will stop the trial for futility. The simulation results are presented in Table 5, where EEP and Na are the early efficacy stopping probability and the expected sample size, respectively, when Ha (p10 = p01 = 0.1) is true (note: β1 = 0.5; the proportions after shifting are p10 = 0.2, p01 = 0.03). Following the same steps used to compare the adaptive designs for sensitivity, we find the PF design better than the OF design. To evaluate the PF design against the Pocock design, we need to perform the simulations under Ho: p10 - p01 - δNI = 0 (p10 = 0.1, p01 = 0.175 and δNI = -0.075, a 7.5% margin under the convention δNI ≤ 0). Under this null hypothesis, the OF, PF and Pocock designs have almost the same expected sample size (No ≈ 335 pairs) with a futility stopping probability of 47%. This is because they use the same futility boundary and the same sample size at the interim analysis, while the efficacy stopping boundary has virtually no effect on the sample size.
We also studied the effect of SSR. We assume there is a small difference in proportions but within the noninferiority margin: p 10 = 0.1, p 01 = 0.11. We want to know if the power is reasonably preserved in this case.
The simulation results (Table 6) show that the GSD cannot preserve power well in this case. The effect of the sample size adjustment on power is greater for the OF and PF designs than for the Pocock design because the OF and PF designs spend more alpha at stage 2. The Pocock design has already spent 50% of the alpha before the interim analysis; therefore, the sample-size adjustment at stage 2 has less effect on the power. Compared with the OF design with SSR, the PF design with SSR has a smaller expected sample size (374 versus 386 pairs). We also noticed that the expected sample size under Ho is high. Therefore, we ran simulations with a more aggressive futility boundary β1 = 0.25 (less than the original 0.5). The expected sample size under Ho is reduced from 335 to 265. However, the reduction comes at the cost of power: The power is reduced from 84 to 79% when p10 = 0.1, p01 = 0.11. Therefore we still recommend using β1 = 0.5, which means that if at the interim analysis the observed difference is at the noninferiority margin, we will stop the trial for futility.
Through these comparisons, we can conclude that the PF design with SSR is most preferable for the specificity design.

Summary of Design
For sensitivity, a total of 86 positive patients with one interim analysis will provide 95% power for the superiority test. The error-spending function for the stopping boundary is αt², where t is the information time (sample-size fraction), and the futility stopping rule is p1 > β1 = 0.5. The design features a 43% early efficacy stopping probability if the alternative hypothesis is true and a 45% early futility stopping probability if the null hypothesis is true. The expected sample size is 68 and 67 under Ha and Ho, respectively, an 18% savings in comparison to the 82 pairs required by the classic design.
For specificity, we use a two-stage design featuring sample size re-estimation at an interim analysis with 161 pairs. The sample size re-estimation will be based on a 90% conditional power with a cap of 500 pairs. The two-stage adaptive design has 94% power for the noninferiority test with an NI margin of 7.5%. The error-spending function for the stopping boundary is αt², where t is the information time, and the futility stopping rule is p1 > β1 = 0.5.
The design features a 33% early efficacy stopping probability when the alternative hypothesis is true and a 47% early futility stopping probability when the null hypothesis is true. The expected sample size is 336 and 335 under Ha and Ho, respectively, a 23% savings compared to the classical design (N = 438) with the same 94% power.
Given a 95% power for the sensitivity test and a 94% power for the specificity test, which are assumed to be independent, the overall probability of obtaining an effectiveness claim for the diagnostic test (Method 2) is about 90%.
The stopping rules for sensitivity and specificity are the same, but sample size re-estimation is allowed only for the specificity design: If the interim p-value for the sensitivity (specificity) test is p1 ≤ 0.00625, the null hypothesis for sensitivity (specificity) will be rejected. If the p-value for the sensitivity (specificity) test is p1 > 0.5, stop recruiting positive (negative) patients. If 0.5 ≥ p1 > 0.00625, we continue to recruit positive (negative) patients and the sample size will be re-estimated for negative patients based on a 90% conditional power. At the final analysis, if the p-value for sensitivity (specificity) is ≤ 0.02173, the null hypothesis for sensitivity (specificity) will be rejected. In the end, if both null hypotheses, for sensitivity and for specificity, are rejected, the new diagnostic test (Method 2) will be claimed effective.
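The decision rules above can be collected into a small helper (a Python sketch; the thresholds are the PF boundaries α1 = 0.00625, β1 = 0.5 and α2 = 0.02173 quoted in the design):

```python
def interim_decision(p1: float) -> str:
    """Stage-1 decision for either endpoint (sensitivity or specificity)."""
    if p1 <= 0.00625:
        return "reject H0 (early efficacy stop)"
    if p1 > 0.5:
        return "stop for futility"
    return "continue (re-estimate sample size for specificity)"

def final_decision(p2: float) -> str:
    """Stage-2 decision after continuing past the interim analysis."""
    return "reject H0" if p2 <= 0.02173 else "fail to reject H0"

print(interim_decision(0.004))   # reject H0 (early efficacy stop)
print(interim_decision(0.2))     # continue (re-estimate sample size for specificity)
print(final_decision(0.015))     # reject H0
```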