A Permutation Test for Comparing Two Correlated Receiver Operating Characteristic Curves

Corresponding Author: Okeh Uchechukwu Marius, Department of Industrial Mathematics and Applied Statistics, Ebonyi State University, Abakaliki, Nigeria. Email: uzomaokey@ymail.com
Abstract: The area under the Receiver Operating Characteristic (ROC) curve (AUC) is a summary measure used when comparing two ROC curves. However, this summary measure is less informative when two ROC curves cross and have the same AUC. To detect differences between ROC curves, and to address the problem of exchangeability of the labels between two diagnostic tests within subject, an alternative permutation test based on between-subject permutations of the labels of the subjects within each diagnostic test is proposed. The test assesses a change in the AUCs for continuous matched-pair data from two diagnostic test procedures, each applied to both non-diseased and diseased subjects. The Wilcoxon signed rank test statistic was modified as a permutation test under the null hypothesis of equality of AUCs. An algorithm that carries out complete enumeration of all the distinct permutations of the paired test results was developed, which provides exact p-values. Using simulated data, the proposed test is comparable in statistical power to the modified sign test proposed by Braun and Alonzo, but it has better operating characteristics: greater statistical power to detect a crossing alternative and a less conservative test size when the average AUC is at least 0.8, the correlation is at least 0.4 and the sample size is small to moderately large. Similarly, when applied to real-life data, the proposed test is more likely to reject the null hypothesis of equality of AUC1 and AUC2 at the nominal level of 0.05, with a p-value of 0.0312 against a p-value of 0.0387 for Braun and Alonzo's test. This is because the proposed test is modified to adjust for the presence of zero differences and considers the signs of the values as well as their absolute ranks. The estimates of AUC1 and AUC2 for the two diagnostic tests are 0.668 and 0.887 respectively, showing that the 2-hour 100 g Oral Glucose Tolerance Test (OGTT), with AUC2, is superior to the 2-hour 75 g OGTT, with AUC1, when the specificity is greater than 0.7.


Introduction
Nonparametric inference for a difference in Areas Under the Curve (AUCs) in paired studies was first proposed by DeLong et al. (1988). They developed a fully nonparametric approach for comparing the correlated AUCs of two diagnostic tests on paired samples of subjects, using the asymptotic theory of generalized U-statistics (Hoeffding, 1948) and the jackknife to estimate the covariance of the two U-statistics, which together lead to an asymptotically normal test statistic. The test by DeLong et al. is limited by the fact that the unbiased nonparametric estimator of the AUC is built on an indicator variable that requires comparing every subject responding positive with every subject responding negative. Other nonparametric inference procedures include those based on an analysis of variance of jackknife pseudo-values (Dorfman et al., 1992; Song, 1997) and bootstrap-based methods (Campbell, 1994; Moise, 1988). However, these methods are valid only for large sample sizes, where computational time can be long, and their tests of a difference in AUC are not valid in small samples. A competing nonparametric approach that is valid for small sample sizes is the permutation test. Permutation-based procedures are specific to hypothesis testing. A permutation procedure constructs a permutation sample space, which consists of equally likely permutation samples. The permutation samples are created by interchanging the units of the data that are assumed to be "exchangeable" under the null hypothesis. The permutation sample space is the exact probability space of the possible arrangements of the data under the null hypothesis, given the original sample. When two diagnostic test procedures are compared on paired data, the natural permutation test exchanges the paired units.
Three permutation tests for paired Receiver Operating Characteristic (ROC) studies currently exist: one proposed by Venkatraman and Begg (1996), one by Bandos et al. (2005) and one by Braun and Alonzo (2008). The test of Bandos et al. (2005) directly tests for equality of AUCs; the test of Venkatraman and Begg (1996) is more general and tests for equality of the underlying ROC curves; the test of Braun and Alonzo (2008) compares the ROC curves but is designed to have increased power to detect a difference in AUCs. As a result, the test of Venkatraman and Begg, which is designed to detect any difference between two ROC curves at every operating point, is less powerful for testing equality of AUCs. Bandos et al. (2005) compared the performance of their test to that of DeLong et al. (1988) via simulation and found that the permutation test had greater power when there was moderate correlation between diagnostic tests, large AUCs and small sample sizes. Meanwhile, the estimator proposed by DeLong et al. (1988) possesses an upward bias which, compared to the unbiased estimator, improves the type I error of the test for equality of the AUCs when AUCs are small but results in a loss of statistical power when AUCs are large (Bandos, 2005; Bandos et al., 2005). The permutation tests by Bandos et al. and by Venkatraman and Begg require exchangeability of the two diagnostic test procedures within the non-diseased and diseased labels of subjects. That is, both diagnostic tests must be exchangeable within subject, which calls for an appropriate transformation, such as ranks, because the measurements of the test results are on different scales. Both tests therefore assume the same condition of exchangeability of the diagnostic results under the null hypothesis, but differ with respect to their sensitivity to specific alternatives and the availability of an asymptotic version.
We propose an alternative permutation test that does not require data transformation. Zero differences or tied absolute values of differences mean that the test results are measured on at most the ordinal scale, and instead of exchangeability of the two diagnostic test procedures within subject, the test requires between-subject permutation of the non-diseased and diseased labels within a given diagnostic test procedure. This permutation test builds on the work of Braun and Alonzo (2008), who used the sign test as their permutation test. While the sign test considers only the direction of the units measured, our test considers both their direction and their magnitude. To assess a difference in the AUCs of the two diagnostic tests, an algorithm for computing the exact permutation distribution of the test statistic is implemented. In the next section, we propose our permutation test and show that it is equal to a modified Wilcoxon signed rank test statistic. In Section 3, we present an algorithm for computing the exact distribution of the permutation test. In Section 4, we describe the simulated and real-life data, apply the proposed test to the data and present the results. In Section 5, we discuss the simulation results in terms of the operating characteristics of the proposed test, compare the test size and power of our test with those of a competing test, and compare the two tests on real-life data. In Section 6, we give our summary and conclusions.

Proposed Permutation Test
The proposed method is a permutation test designed to compare the AUCs of two diagnostic test procedures, AUC1 and AUC2, measured on a total of n subjects, where subject labels are exchangeable within each diagnostic test under the null hypothesis. A key issue in a permutation test is the choice of a test statistic that discriminates between the null and alternative hypotheses, and a popular choice is a statistic developed in asymptotic theory. We therefore modify the Wilcoxon signed rank test statistic for this use.
A total of N non-diseased subjects and M diseased subjects each received both diagnostic tests. Let the results of diagnostic tests 1 and 2 for the non-diseased subjects be Xi1 and Xi2, where i = 1,…,N, and let the results of diagnostic tests 1 and 2 for the diseased subjects be Yj1 and Yj2, where j = 1,…,M. Let X = {(X11, X12), (X21, X22),…, (XN1, XN2)} denote the pairs of measurements on non-diseased subjects and Y = {(Y11, Y12), (Y21, Y22),…, (YM1, YM2)} the pairs of measurements on diseased subjects. The difference in AUCs, AUC = AUC2 - AUC1, is estimated non-parametrically as:
According to Hanley and McNeil (1982), the indicator function is:
In order to test the null hypothesis H0: AUC2 - AUC1 = 0, we combine the N and M subjects into a total of n subjects. Let S1 = {S11, S12,…, S1N, S1,N+1, S1,N+2,…, S1n} be the n measurements arising from diagnostic test 1, where the subscripts p = 1,2,…,N index test results for the non-diseased subjects and q = N+1, N+2,…,n index test results for the diseased subjects. Within diagnostic test 1, we compare every subject's test result to every other subject's test result. Thus:
This implies that every diseased subject is compared to all non-diseased subjects and to all (M-1) other diseased subjects. Similarly, every non-diseased subject is compared to all diseased subjects and to all (N-1) other non-diseased subjects. Let S2 = {S21, S22,…, S2N, S2,N+1, S2,N+2,…, S2n} be the n measurements arising from diagnostic test 2, with the subscripts p and q defined as for diagnostic test 1. Similarly, within diagnostic test 2, we compare every subject's test result to every other subject's test result, that is:
Given the above definitions, Rpq = 1-Rpqm; m = 1,2.
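As an illustration, the nonparametric estimate of the difference in AUCs described above can be sketched in Python. The sketch assumes the Mann-Whitney form of the estimator, with an indicator in the style of Hanley and McNeil (1982) that scores 1, 0.5 or 0, and it assumes that larger test results are more indicative of disease. The function names psi, empirical_auc and auc_difference are hypothetical and chosen only for this illustration.

```python
def psi(x, y):
    """Indicator comparing a non-diseased result x with a diseased result y.

    Returns 1 if the diseased result is larger, 0.5 on a tie, 0 otherwise.
    """
    if y > x:
        return 1.0
    if y == x:
        return 0.5
    return 0.0


def empirical_auc(non_diseased, diseased):
    """Average of psi over all N*M (non-diseased, diseased) pairs."""
    total = sum(psi(x, y) for x in non_diseased for y in diseased)
    return total / (len(non_diseased) * len(diseased))


def auc_difference(x_pairs, y_pairs):
    """Estimate AUC2 - AUC1 from paired test results.

    x_pairs: list of (X_i1, X_i2) for the N non-diseased subjects.
    y_pairs: list of (Y_j1, Y_j2) for the M diseased subjects.
    """
    auc1 = empirical_auc([x[0] for x in x_pairs], [y[0] for y in y_pairs])
    auc2 = empirical_auc([x[1] for x in x_pairs], [y[1] for y in y_pairs])
    return auc2 - auc1
```

For example, with non-diseased pairs [(1, 1), (2, 2)] and diseased pairs [(1.5, 3), (0.5, 4)], test 1 orders only one of the four pairs correctly (AUC1 = 0.25) while test 2 orders all four correctly (AUC2 = 1), so the estimated difference is 0.75.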
To test the null hypothesis that AUC = 0, which is similar to testing the null hypothesis that the difference between paired samples is a distribution that is symmetric around zero, we adopt the transformation in Equation 2 whose indicator function is [1,0.5,0] and adjust for the presence of ties (zero difference) from the diagnostic pairs and disease status[0,1] and map to [1,0,-1]. Given the specifications above, we generalize the estimate of AUC as: Note that Qpq is the difference between the sample pairs of S1 being measurements arising from diagnostic test 1 and S2 being measurements arising from diagnostic test 2. This is based on the exchangeability of the diseased and non-diseased labels of the subjects within each diagnostic test. The indicator function Tpq takes value 1 at the calibrated cut-off point c of a given diagnostic test if subject test result p is non-diseased and subject test result q is diseased. It takes -1 if subject test result p is diseased and subject test result q is non-diseased. Values of 0 represents cut-offs at which both subject test results p and q are diseased or non-diseased. Recall that the AUC is equivalent to two-sample Wilcoxon test statistic (Pardo and Franco-Pereira, 2017) and can be used to carry out test of symmetry around zero for paired samples. Based on that finding, the Equation 5 above which is the modified Wilcoxon Signed rank test statistic is equivalent to difference in AUCs and can be used as a test statistic for the test of symmetry around zero. This proposed test statistic is more powerful than the modified sign test statistic (Oyeka, 2009) proposed by Braun and Alonzo (2008) for comparing correlated ROC curves as it utilizes both the signs, Tpq and the absolute ranks of Qpq.
When both diagnostic test results are measured continuously, testing the hypothesis that AUC = 0 is equivalent to testing the null hypothesis that r(Qpq) has a distribution that is symmetric around zero. We therefore test the null hypothesis that AUC = 0 by computing AUC for every permutation of Tpq, the signs of the ranks of |Qpq|. Because our permutation of Tpq requires exchanging the labels of the non-diseased subject's test result p and the diseased subject's test result q, it is the same as permuting, among the subjects, the vector of diseased/non-diseased labels. The link between the true disease status of a given subject and its test results from diagnostic tests 1 and 2 is therefore broken under this type of permutation arrangement. This permutation test is therefore valid if the AUC of either diagnostic test is equal to t, where t is a number between 0.5 and 1 inclusive.
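The between-subject relabelling described above can be sketched as follows. This is a Monte Carlo illustration under stated assumptions, not the exact procedure of this paper: it uses the plain difference of empirical AUCs as the statistic in place of the modified signed rank statistic, it draws random permutations rather than enumerating all of them, and the names auc and label_permutation_test are hypothetical.

```python
import random


def auc(scores, labels):
    """Empirical AUC: share of (diseased, non-diseased) pairs ordered
    correctly, counting ties as one half."""
    pos = [s for s, d in zip(scores, labels) if d == 1]
    neg = [s for s, d in zip(scores, labels) if d == 0]
    total = 0.0
    for y in pos:
        for x in neg:
            total += 1.0 if y > x else 0.5 if y == x else 0.0
    return total / (len(pos) * len(neg))


def label_permutation_test(test1, test2, labels, n_perm=2000, seed=0):
    """Two-sided Monte Carlo p-value for H0: AUC2 - AUC1 = 0.

    Disease labels (1 = diseased, 0 = non-diseased) are shuffled between
    subjects; each subject's pair (test1[i], test2[i]) stays together.
    """
    rng = random.Random(seed)
    observed = auc(test2, labels) - auc(test1, labels)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = auc(test2, shuffled) - auc(test1, shuffled)
        # Small tolerance guards against floating-point rounding.
        if abs(stat) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction counts the observed labelling as one permutation.
    return observed, (extreme + 1) / (n_perm + 1)
```

Shuffling the label vector leaves the numbers of diseased and non-diseased subjects unchanged, so every permuted data set has the same N and M as the original.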

An Algorithm for Computing the Exact Distribution of the Permutation Test, Ŵ(AÛC)
To ensure that the probability of a type I error is exactly α, and thus to obtain exact p-values, an algorithm for obtaining the exact permutation distribution of the test statistic, AÛC, is presented and implemented in Intel Visual FORTRAN. This software package is used because it can carry out sampling without replacement, which increases the power of the permutation test. For a complete enumeration of all the paired permutations of the two diagnostic test results, the required number of permutations is given by:
Therefore a paired sample design with n pairs has 2^(N+M) possible permutations of the variates, with each permutation occurring with probability 2^-(N+M).
Since S1 = {S11, S12,…, S1N, S1,N+1, S1,N+2,…, S1n} and S2 = {S21, S22,…, S2N, S2,N+1, S2,N+2,…, S2n} are the n measurements arising from diagnostic tests 1 and 2 respectively, where the subscripts p = 1,2,…,N index test results for the non-diseased subjects and q = N+1, N+2,…,n index test results for the diseased subjects, we consider AUC given in (5) where, fl is the frequency of occurrence of Wl. Given a particular value of n and significance level α, the critical value c corresponds to the probability closest to α. The distinct values of W are arranged in increasing order of size. If the position occupied by the observed value of W is h, then the left and right tails of the probability distribution of W have levels of significance given as:
And:
Since the alternative hypothesis is two-sided, the left and right tails are added. Therefore, for a distribution of W that is symmetric around zero:
Since the permuted subject labels are represented by S1 and S2 from diagnostic tests 1 and 2 respectively, consider the set of all distinct permutations resulting from the S1 and S2 pairs, indexed by s so that s denotes the s-th permutation.
The steps involved in the permutation test are defined as follows:
6. Given the empirical cumulative probability distribution p, if p0 ≤ α, we reject H0.
These steps compute the empirical cumulative probability distribution of W under the null hypothesis.

An Algorithm for Calculating the Exact Distribution of Ŵ
The test statistic Ŵ is computed for each permutation in the complete enumeration of the distinct permutations. The distribution of the test statistic is obtained by tabulating the distinct values of the statistic against their probabilities of occurrence in the complete enumeration, bearing in mind that all the permutations are equally likely. The paired permutation is constructed by letting Ssm represent the paired test results of subjects in the two diagnostic tests 1 and 2, where s = 60; m = 1,2. See appendix A1 for the algorithm.
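The tabulation described above can be illustrated for the conventional Wilcoxon signed rank statistic. The sketch below is not the FORTRAN algorithm of appendix A1. It assumes that zero differences are dropped and that tied absolute differences receive midranks, which may differ from the modification used in this paper, and it enumerates all 2^n sign assignments, which is practical only for small n. The function names midranks and exact_signed_rank are hypothetical.

```python
from collections import Counter
from itertools import product


def midranks(values):
    """Ranks 1..n of values, averaging the ranks of tied entries."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def exact_signed_rank(differences):
    """Exact null distribution and two-sided p-value of W = sum(sign * rank)."""
    d = [x for x in differences if x != 0]  # assumption: zeros are dropped
    ranks = midranks([abs(x) for x in d])
    observed = sum(r if x > 0 else -r for x, r in zip(d, ranks))
    # Enumerate all 2^n equally likely sign assignments and tabulate W.
    counts = Counter(
        sum(s * r for s, r in zip(signs, ranks))
        for signs in product((-1, 1), repeat=len(d))
    )
    total = 2 ** len(d)
    distribution = {w: c / total for w, c in sorted(counts.items())}
    # Two-sided p-value: add the probabilities in both tails.
    p_value = sum(p for w, p in distribution.items() if abs(w) >= abs(observed))
    return observed, distribution, p_value
```

For the differences [1, 2, 3], the eight sign assignments give W values of -6, -4, -2, 0, 0, 2, 4 and 6; the observed W is 6 and the two-sided p-value is 2/8 = 0.25.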

Examples a. Simulation Description and Implementation
Test results from two diagnostic test procedures were simulated in order to compare the test size and statistical power of the proposed permutation test for various underlying AUC differences, sample sizes and correlations between the two diagnostic test procedures. In order to generate data, we assumed and drew two continuous measurements for … The main purpose of the simulation is to evaluate how well the proposed permutation test controls the test size (type I error) and achieves higher statistical power compared with other tests. We examine the test size and statistical power of the normal approximation (asymptotic) and exact versions for various AUCs, degrees of correlation between subjects' test results across diagnostic tests, and sample sizes. Equal correlation is assumed for non-diseased and diseased subjects across the two continuous diagnostic test results, and both non-crossing and crossing ROC curves are considered. We compared the size and power of the permutation test with those of another method in terms of their exact permutation and normal approximation versions. Because of the enormous time required to implement the exact permutation procedure, the comparisons done here are limited to small sample sizes. In comparing the test size and statistical power of the proposed test with a competing method, six tables were obtained, as well as four scenarios showing ROC curves with varying AUCs.
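A minimal sketch of how correlated continuous test results could be simulated is given below. It assumes a binormal model with unit variances and a common correlation rho in both disease groups, and it uses the standard binormal relation AUC = Φ(δ/√2) to choose the mean shift δ between diseased and non-diseased subjects. These modelling choices and the function names are assumptions made for illustration, not details taken from the simulation reported here. With equal variances the two ROC curves do not cross; crossing curves would need unequal variances, which this sketch does not cover.

```python
import math
import random
from statistics import NormalDist


def binormal_mean_shift(target_auc):
    """Mean shift delta such that Phi(delta / sqrt(2)) equals target_auc."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(target_auc)


def simulate_pairs(n_subjects, mean1, mean2, rho, rng):
    """Draw (test 1, test 2) pairs from a bivariate normal distribution
    with unit variances, means (mean1, mean2) and correlation rho."""
    pairs = []
    for _ in range(n_subjects):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        t1 = mean1 + z1
        t2 = mean2 + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        pairs.append((t1, t2))
    return pairs


def simulate_study(n_non_diseased, n_diseased, auc1, auc2, rho, seed=0):
    """Simulate one paired study: non-diseased pairs X and diseased pairs Y."""
    rng = random.Random(seed)
    x = simulate_pairs(n_non_diseased, 0.0, 0.0, rho, rng)
    y = simulate_pairs(
        n_diseased, binormal_mean_shift(auc1), binormal_mean_shift(auc2), rho, rng
    )
    return x, y
```

Setting auc1 equal to auc2 generates data under the null hypothesis, so the rejection rate estimates the test size; unequal values generate data under an alternative, so the rejection rate estimates the power.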
These are presented below. Table 1 (Fig. 1) and Table 2 (Fig. 2) compare the test size of the proposed permutation test and Braun and Alonzo's permutation test, in terms of their exact and asymptotic methods for assessing a difference in AUC for two continuous diagnostic test procedures, when the areas are different for non-crossing and crossing ROC curves respectively. Since the exact permutation computation required a large amount of computational time, the comparisons shown in Table 1 (Fig. 1) and Table 2 (Fig. 2) are limited to small sample sizes, where the results indicate good agreement between the exact test and the normal approximation test. Table 1 (Fig. 1) and Table 2 (Fig. 2) show that, even with a small sample size of 10 for each of the non-diseased and diseased groups, the normal approximation test is adequate, while the exact permutation test required a little computer time to conduct. Tables 3 to 6 therefore simulate the operating characteristics of the normal approximation test for large sample sizes, since the exact permutation test results are essentially equivalent.

In Table 3 (Fig. 3), we present estimates, for continuous data, of the test size of the proposed asymptotic normal approximation test and of the normal approximation test proposed by Braun and Alonzo (2008). In Table 4 (Fig. 4), where the areas are the same but the ROC curves cross, the test size is the statistical power, since the proposed method is designed to detect a difference in AUCs but formally tests the null hypothesis of equality of AUCs subject to exchangeability. In Table 3 (Fig. 3) and Table 4 (Fig. 4), where the AUCs are the same, for moderately large sample sizes such as 40 to 60 with non-crossing ROC curves and at least moderately high correlation between diagnostic tests, the proposed test showed a less conservative test size than Braun and Alonzo's test. This effect is especially evident with smaller sample sizes. In Table 4 (Fig. 4), when the AUCs are the same with crossing ROC curves, the test size of the proposed test is very close to that of Braun and Alonzo's test, since both tests are designed to detect a difference in AUCs. Therefore neither method is advisable for detecting crossing ROC curves when the AUCs are the same. The closeness of the test size to the nominal level of significance suggests that the two permutation tests (the proposed test and that of Braun and Alonzo, 2008), both of which provide an asymptotic normal approximation to the test of equality of AUCs, are comparable in statistical power.
In Tables 5 and 6, when the AUCs differ and are at least 0.8 with a correlation of ρ ≥ 0.4, for crossing and non-crossing ROC curves respectively, the proposed permutation test has greater statistical power than the test proposed by Braun and Alonzo (2008). This is because the proposed permutation test is less conservative in the stated range of parameters. When the correlation is less than 0.4 and the differing AUCs are less than 0.8, Braun and Alonzo's test has slightly greater statistical power because in this region its test size is slightly elevated. As the sample size increases, the operating characteristics of the two permutation tests approach one another.
In summary, our simulations showed that, for the proposed permutation test, the test size and the nominal level of significance are in close agreement for reasonably small sample sizes. For small sample sizes with large AUCs and moderate correlation between diagnostic tests, the proposed test has better operating characteristics than the permutation test proposed by Braun and Alonzo (2008). Finally, the statistical power of the proposed permutation test to detect crossing ROC curves with the same AUCs is close to the nominal level of significance. This means that, for crossing ROC curves to be detected, the AUCs of the two curves must differ under the range of parameters considered. The test size and statistical power of each test were computed as the percentage of 10,000 simulations in which the null hypothesis of AUC = 0 was rejected at a nominal significance level of 0.05. We generated the empirical permutation distribution of AÛC in each simulation from 10,000 random permutations of the diseased and non-diseased labels.
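The rejection rates described above can be computed from simulated p-values as sketched here. The function name rejection_rate and the accompanying Monte Carlo standard error are additions for illustration and are not taken from the reported analysis.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Percentage of simulated data sets with p <= alpha, together with
    the binomial Monte Carlo standard error of that percentage."""
    n = len(p_values)
    rate = sum(1 for p in p_values if p <= alpha) / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return 100.0 * rate, 100.0 * se
```

With 10,000 simulations and a true rejection probability of 0.05, the Monte Carlo standard error is about 0.22 percentage points, which gives a sense of how close an estimated test size can be expected to lie to the nominal level by chance alone.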

b. Real Life Data Example
By simple random sampling, a total of 60 pregnant women underwent two types of diagnostic tests for the in-depth confirmation of Gestational Diabetes Mellitus (GDM), so that their test results were paired. These diagnostic tests are a 75 g Oral Glucose Tolerance Test (OGTT) and a 100 g OGTT. The data are used to evaluate the feasibility of the proposed permutation test at a nominal level of 0.05. For antenatal mothers who underwent either the 75 g or the 100 g OGTT, the 2 h OGTT value was used: a result ≥ 155 mg/dl was considered diseased/positive (coded 1) for GDM, while a result < 155 mg/dl was considered non-diseased/negative (coded 0). Exchangeability of the measured test results is a vital condition, given that these results are paired. If the null hypothesis is true, the subjects' test results in diagnostic tests 1 and 2 are exchangeable, so the permutation test is applied to the raw scores, which are not ranked. The data contain a number of pairs with tied test results, even though the test results are continuous. The null hypothesis is that the 2 h 75 g OGTT contributes the same diagnostic information or accuracy as the 2 h 100 g OGTT, that is, AUC1 and AUC2 of the two diagnostic tests are equal. Analysis of the real data evaluates the performance of the proposed estimates. It compares the two diagnostic tests in terms of their ROC curves, which cross (Fig. 5), and gives the areas for the two diagnostic test procedures (Fig. 5). In applying the data, the diagnostic test results need to have a bivariate bi-normal distribution. However, according to Wang (2015), a most powerful test does not exist for testing bivariate normality. Therefore, only univariate normality was checked for each test result.
The Shapiro-Wilk test of univariate normality gives p-values of 0.6124 and 0.8975 for diagnostic tests 1 and 2 among the non-diseased subjects, and 0.6345 and 0.8765 for diagnostic tests 1 and 2 among the diseased subjects. The estimates of AUC1 and AUC2 for the diagnostic tests are 0.668 and 0.887 respectively. Using the proposed permutation test, the null hypothesis is rejected at a nominal level of 0.05, with a p-value of 0.0312. Using Braun and Alonzo's permutation test, the null hypothesis is also rejected, with a p-value of 0.0387.

Discussion
The proposed permutation test can be used to compare the performance of diagnostic tests in a paired sample design. It allows an exact permutation test to be conducted and provides an easy-to-implement approximation when the sample size is large. Our test, which tests the null hypothesis about paired ROC curves (in other words, the equality of AUCs), is designed to have increased power to detect a difference in the AUC. An alternative permutation test based on between-subject permutations of the labels of the subjects within each diagnostic test was needed to address a problem shared by the few existing methods, namely their reliance on exchangeability of the labels between two diagnostic tests within subject. The proposed test assesses a change in the AUCs for continuous matched-pair data from two diagnostic test procedures, each with both diseased and non-diseased subjects. Permutations are made between subjects by shuffling the diseased and non-diseased labels of the subjects within each diagnostic test procedure. According to DeLong et al. (1988), the conditions for appropriate test size and increased statistical power are that the sample size for both the non-diseased and diseased subjects must not be more than 60, the average of the two correlated AUCs must be at least 0.80 and the correlation within subjects' test results must be at least 0.4. At small average AUC, low correlation between diagnostic tests and sample sizes above 60, the method of DeLong et al. (1988) has improved test size and greater power than our test, but this does not apply here, where diagnostic tests are being evaluated and a permutation test is required.
For small sample sizes, the proposed permutation test and that of Braun and Alonzo have similar test size and statistical power. According to the simulation conducted by Venkatraman and Begg (1996), for non-crossing ROC curves the test of DeLong et al. has higher power than that of Venkatraman and Begg. This is because the procedure of Venkatraman and Begg is designed to detect differences in ROC curves rather than differences only in AUCs. In other words, when ROC curves cross, the power of their test is higher because it detects a difference in ROC curves, but when ROC curves do not cross, the test of DeLong et al., which compares AUCs only, has higher power. Our test is therefore comparable to those of Venkatraman and Begg (1996), Bandos (2005) and Braun and Alonzo (2008) but has superior operating characteristics in some ranges of parameters, because it considers the signs as well as the absolute ranks of the values, whereas the test of Braun and Alonzo considers only the signs. However, the test of Venkatraman and Begg would have been a better option had our primary interest been to detect a difference in ROC curves at every operating point. Overall, our simulation results show that our permutation test is slightly conservative but has excellent power to detect a crossing alternative. The test size of the permutation test for small sample sizes was investigated using simulations. The algorithm for calculating the exact permutation distribution of AÛC enabled us to obtain a normal approximation to the exact procedure, and this is suitable when the sample size is small. The availability of an asymptotic method provides a simple and close approximation to the permutation test, since exact permutation tests can become computationally burdensome as the sample size increases.

Summary and Conclusion
The test size and statistical power of each test were computed as the percentage of 10,000 simulations in which the null hypothesis of AUC = 0 was rejected at a nominal level of 0.05. Because the proposed permutation test formally tests the null hypothesis of equality of AUCs, the rejection rate becomes the statistical power when the ROC curves cross each other. For moderate and especially for small sample sizes, with non-crossing ROC curves having equal and large AUCs and moderate correlation between the diagnostic tests, the test size of the proposed test is less conservative than that of the Braun and Alonzo test. In practical terms, it is not advisable to employ the proposed test to detect crossing ROC curves with equal AUCs, because its rejection rate, that is, its power, is very close to the type I error of the Braun and Alonzo test. The proposed test provides an approximate test of equality of AUCs, since its rejection rate is very close to the given level of significance. The power of the proposed test is greater than that of the tests of Bandos et al. and of Braun and Alonzo if the correlation is at least 0.4 and the average AUC is at least 0.80 for non-crossing ROC curves, since the proposed test is less conservative in this range of parameters. The power of Braun and Alonzo's test is greater when the correlation is lower and the average AUC is smaller than this, a region where the test size of this competing test is slightly elevated. As the sample size increases, the operating characteristics of the tests being compared get closer to each other. In particular, when the ROC curves cross, the rejection rate of the proposed test is higher when the correlations and the average AUCs are higher.
Our simulations therefore show that the test size of the proposed test and the nominal value are in close agreement when the sample size is reasonably small. In addition, the proposed permutation test has better operating characteristics than the tests of Bandos et al. and of Braun and Alonzo when the correlation between diagnostic tests is moderate, the average AUC is large and the sample size is small. The proposed test has power close to the significance level for detecting crossing ROC curves with equal AUCs within the range of parameters considered. This means that, for the null hypothesis to be rejected, the AUCs of the two ROC curves must differ. We presented tables comparing the test size and statistical power of the proposed permutation test and of the competing test for assessing a difference in the AUCs of two diagnostic tests. In applying the proposed test to real data, the ROC curves in Fig. 5 show that the 2 h 100 g OGTT diagnostic test is superior when the specificity is greater than 0.7. As the specificity decreases, the disparity between the two diagnostic test procedures reduces. In applying the proposed permutation test, the diagnostic test results need to have a bivariate bi-normal distribution. However, according to Wang (2015), a most powerful test does not exist for testing bivariate normality. Therefore, only univariate normality was checked for each test result. The Shapiro-Wilk test gives P-values of 0.6124 and 0.8975 for diagnostic tests 1 and 2 among the non-diseased subjects, and 0.6345 and 0.8765 for diagnostic tests 1 and 2 among the diseased subjects. Therefore, the null hypothesis of univariate normality is not rejected for any of the four samples.
Using the proposed permutation test, the null hypothesis of AUC = 0 is rejected at a nominal level of 0.05, with a P-value of 0.0312, indicating that the two diagnostic test procedures do not contribute similar information, that is, their accuracies are not the same. Using Braun and Alonzo's permutation test, the null hypothesis of AUC = 0 is also rejected, with a P-value of 0.0387. Comparing the two tests in terms of their P-values, the proposed test appears more powerful, since it is more likely to reject the null hypothesis. These results are consistent with the findings obtained with the permutation test proposed by Bandos et al. (2005). We therefore recommend the use of permutation tests for comparing two correlated diagnostic tests, as they provide more exact results with small sample sizes, which clinical practice demands. We suggest using the proposed permutation test to generate a confidence interval for AUC as a complement to the hypothesis test, and investigating how the permutation method can be applied when the test statistic is the McNemar test. It is important to consider a test statistic that uses absolute magnitudes as well as absolute ranks and that discriminates between the null and alternative hypotheses. In the present setting, the Wilcoxon signed rank test, which is our permutation test equivalent to AUC, uses only the absolute rank of Qpq and not its absolute magnitude. Future work includes extending the proposed test to accommodate the "multiple-reader" setting, a commonly used design in which many readers evaluate selected cases using different diagnostic tests.