One-Sided Multivariate Tests for High Dimensional Data

Problem statement: For a multivariate normal population with sample size smaller than the dimension, $n<p$, the likelihood ratio tests of the null hypothesis that the mean vector is zero against a one-sided alternative are no longer valid because they involve the sample covariance matrix, which is singular. Approach: Test statistics for one-sided multivariate hypotheses with $n<p$ were proposed. Results: The simulation study showed that the proposed tests provided reasonable type I error rates for the one-sided covariance structures considered. They also gave good power. The tests were applied to one-sided hypotheses on DNA microarray data. Conclusion: Since no other tests are currently available for this kind of hypothesis testing with $n<p$, the proposed tests are a reasonable choice. Moreover, the methodology is valid for any one-sided hypothesis application that involves high-dimensional data.


INTRODUCTION
Suppose one uses a matched-pair design to compare the multivariate responses of two treatments. If the responses are $p$-dimensional and $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ is the difference, treatment one minus treatment two, of the mean responses, then one may test the null hypothesis $H_0: \theta_1 = \theta_2 = \cdots = \theta_p = 0$ to determine if there is a difference between the two treatments. Furthermore, if one believes that, for each coordinate, the mean response for treatment one is at least as large as that for treatment two, then the alternative can be constrained to $H_1: \theta_i \geq 0$ for $i = 1, 2, \ldots, p$.
Based on a random sample with $n>p$ from the normal distribution with mean $\theta$ and covariance matrix $V$, Kudo (1963), Shorack (1967) and Perlman (1969) derived the likelihood ratio test of $H_0$ versus $H_1 - H_0$ for the cases in which $V$ is known, known up to a multiplicative constant and completely unknown, respectively. Because the likelihood ratio tests with restricted alternatives are complicated to use, Tang et al. (1989) proposed an approximate likelihood ratio test and Follmann (1996) proposed one-sided modifications of the usual $\chi^2$ and Hotelling's $T^2$ tests of $H_0$ versus $\sim H_0$ that are easier to implement. Using exact computations and Monte Carlo methods, Chongcharoen et al. (2002) compared the performance of Kudo's test, Follmann's test, a new test that modifies Follmann's test, the permutation test of Boyett and Shuster (1977) and the Tang-Gnecco-Geller test for a known covariance matrix. For a partially known covariance matrix, they compared the powers of these tests with Kudo's test replaced by Shorack's test. For a completely unknown covariance matrix, Chongcharoen (2009) studied the power of these one-sided tests under equal and unequal variances, as well as tests obtained by combining the Boyett and Shuster (1977) technique with Follmann's test, the new test, Perlman's test and the Tang-Gnecco-Geller test.
In some situations, however, data with $n>p$ are not available; that is, the number $n$ of available observations is smaller than the dimension $p$ of the observed vectors. For example, in DNA microarray data, thousands of gene expression levels are measured on relatively few subjects. The one-sided multivariate tests above are no longer valid for such data because the $p \times p$ sample covariance matrix $S$ is singular when $n<p$ (its rank is smaller than $p$), so $S^{-1}$ does not exist.
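The singularity described above is easy to check numerically. The following is a minimal sketch and not part of the original study: the sizes n = 10 and p = 50, the seed and the standard normal data are illustrative assumptions. It shows that the sample covariance matrix has rank at most n - 1 when n < p, so it cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, for illustration only
n, p = 10, 50                     # hypothetical sizes with n < p
X = rng.standard_normal((n, p))   # n observations of a p-dimensional vector

S = np.cov(X, rowvar=False)       # p x p sample covariance (divisor n - 1)
rank_S = np.linalg.matrix_rank(S)

# Centering at the sample mean removes one degree of freedom,
# so rank(S) <= n - 1 < p and S has no inverse.
print(S.shape, rank_S)
```

Any statistic that needs the inverse of S, such as Hotelling's T^2, is therefore undefined in this setting, which is why the tests discussed here rely on traces of S instead.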
Since no one-sided multivariate tests are currently available for data in which the number $n$ of available observations is smaller than the dimension $p$, this study proposes one-sided multivariate tests for data with $n<p$.
Throughout this study, suppose $X_1, X_2, \ldots, X_n$ is a random sample from a $p$-dimensional multivariate normal distribution with mean $\theta$ and covariance matrix $V$; when $n<p$, $S$ is a singular matrix.
The hypotheses $H_0$ and $H_1$ also arise in the one-way analysis of variance when the means are known to satisfy an order restriction. For observations which come from $k$ normal populations whose means are known to satisfy a simple ordering, i.e., Bartholomew (1959; 1961) where $I(A)$ is the indicator of $A$. Also, Bartholomew (1959; 1961) considered an arbitrary partial order restriction, which includes the simple tree order, i.e., and with $p = k-1$ and $w_i$ as above, the correlation matrix of $X = (X_1, X_2, \ldots, X_p)$ for the simple tree order restriction is Eq. 3: where $I(A)$ is the indicator of $A$ as above. In this study, we are mainly interested in one-sided multivariate tests that involve both $R_S$ and $R_T$, so the powers of the proposed tests are compared for $R_S$ and $R_T$ as well as for several other correlation matrices.
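As an illustration of how correlation matrices of this kind arise, the sketch below derives them from contrasts of group means rather than from Eq. 2 and Eq. 3 directly. It rests on stated assumptions: equal weights, the simple order represented by successive differences of the k sample means, the simple tree order represented by differences from the first mean, and the variance of each sample mean proportional to 1/w_i. The function name contrast_correlation and the choice k = 5 are hypothetical.

```python
import numpy as np

def contrast_correlation(C, w):
    """Correlation matrix of C @ ybar when Var(ybar_i) is proportional to 1 / w_i."""
    cov = C @ np.diag(1.0 / np.asarray(w, dtype=float)) @ C.T
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

k = 5              # hypothetical number of populations
p = k - 1          # number of contrasts
w = np.ones(k)     # equal weights (assumption)

# Simple order: successive differences ybar_{i+1} - ybar_i
C_simple = np.zeros((p, k))
for i in range(p):
    C_simple[i, i] = -1.0
    C_simple[i, i + 1] = 1.0

# Simple tree order: differences from the first mean, ybar_{i+1} - ybar_1
C_tree = np.zeros((p, k))
C_tree[:, 0] = -1.0
C_tree[np.arange(p), np.arange(1, k)] = 1.0

R_S = contrast_correlation(C_simple, w)
R_T = contrast_correlation(C_tree, w)
print(R_S)
print(R_T)
```

Under these assumptions the simple order matrix has -1/2 between adjacent contrasts and zero elsewhere off the diagonal, while the simple tree matrix has 1/2 in every off-diagonal entry.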

MATERIALS AND METHODS
The unrestricted alternative test for high dimensional multivariate tests: Unrestricted-alternative tests for the mean of a random sample $X_1, X_2, \ldots, X_n$ with $X_i \overset{iid}{\sim} N(\theta, V)$ when $n<p$, that is, tests of the hypothesis in Eq. 4 ($H_0$ versus the unrestricted alternative $\sim H_0$), have been proposed by several researchers. Dempster (1958; 1960) proposed a test for the mean difference of two independent samples and developed its approximate distribution. Srivastava (2007) and Srivastava and Du (2008) gave a one-sample version of Dempster's test, in which $[a]$ denotes the largest integer less than or equal to $a$ and:
$$\hat{r} = \frac{p\,\hat{a}_1^2}{\hat{a}_2}, \quad \hat{a}_1 = \frac{\mathrm{tr}(S)}{p} \quad \text{and} \quad \hat{a}_2 = \frac{(n-1)^2}{p(n-2)(n+1)}\left[\mathrm{tr}(S^2) - \frac{(\mathrm{tr}(S))^2}{n-1}\right]$$
They compared this version of Dempster's test with Bai-Saranadasa's test and with their own test, which is discussed after Bai-Saranadasa's test. Bai and Saranadasa (1996) also proposed a test for the mean difference of two independent samples. They derived the asymptotic power of their test, of the classical Hotelling's $T^2$ test and of Dempster's non-exact test for a two-sample problem.
Their simulation used covariance structures of the form $\Sigma = I_p$ and $\Sigma = (1-\rho)I_p + \rho J_p$, where $J_p$ is a $p \times p$ matrix with all entries equal to one, with $\rho = 0.5$ for the normal case and $\rho = 0, 0.3, 0.6$ and $0.9$ for the non-normal case. If the test statistic of a proposed test has the correct distribution, then the proportion of rejections of the null hypothesis when the null hypothesis is true, called the attained significance level, should be close in simulation to the probability of rejecting the null hypothesis when it is true, called the significance level $\alpha$; in their study the target significance level was $\alpha = 0.05$. In Table 1 of their study, for the non-normal case, the attained significance level of their test is close to $\alpha = 0.05$, with a maximum difference from the target of 0.003.
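The trace-based quantities used by the one-sample version of Dempster's test can be computed without inverting S. The sketch below is a hedged illustration: it assumes the estimators take the form a1 = tr(S)/p, a2 = (n-1)^2 / (p(n-2)(n+1)) * [tr(S^2) - (tr(S))^2/(n-1)] and r = p * a1^2 / a2, with S computed using divisor n - 1. The function name trace_estimators and the simulated data are hypothetical, and the test statistic itself is not reproduced here.

```python
import numpy as np

def trace_estimators(X):
    """Return (a1, a2, r) from an n x p data matrix X, assuming the forms
    a1 = tr(S)/p,
    a2 = (n-1)^2 / (p (n-2)(n+1)) * [tr(S^2) - tr(S)^2 / (n-1)],
    r  = p * a1^2 / a2."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)      # sample covariance, divisor n - 1
    tr_S = np.trace(S)
    tr_S2 = np.sum(S * S)            # equals tr(S^2) because S is symmetric
    a1 = tr_S / p
    a2 = (n - 1) ** 2 / (p * (n - 2) * (n + 1)) * (tr_S2 - tr_S ** 2 / (n - 1))
    r = p * a1 ** 2 / a2
    return a1, a2, r

# Illustrative data with identity covariance, so tr(V)/p = tr(V^2)/p = 1.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 120))   # hypothetical n = 40 < p = 120
a1, a2, r = trace_estimators(X)
print(a1, a2, r)
```

With identity covariance both a1 and a2 should be near 1, which gives a quick sanity check on the constants in the assumed form of a2.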
For the normal case, in Table 2 of their study, the attained significance level of their test is not close to 0.05, with a minimum difference from the target of 0.012. Meanwhile, the attained significance level of Dempster's test is also close to the target for the non-normal case, but for the normal case it equals 0.05 only with $\Sigma = I_p$. It is true that all the attained significance levels shown in their study are within the range 0.0316 mentioned in their study, but one may look for a better test whose attained significance level is closer to the target significance level. These simulation results suggest that both their test and Dempster's test may be usable only for some covariance structures and that they need to be studied further. Srivastava (2007) and Srivastava and Du (2008) also gave the one-sample version of Bai and Saranadasa's test statistic, which rejects $H_0$ in (4) for large values of a standardized statistic whose normalizing term is built from $\mathrm{tr}(S^2)$ and $(\mathrm{tr}(S))^2$ with constants depending on $n$.
Srivastava (2007) and Srivastava and Du (2008) also proposed a one-sample test based on a test statistic involving $T = \mathrm{diag}(t_1, t_2, \ldots, t_p)$, where $t_1, t_2, \ldots, t_p$ are nonzero constants. They claimed, on the basis of simulation, that when all the components of the random vector are independent, that is, when the covariance matrix is diagonal, the attained significance level of their test, given in Table 1 of their study, is reasonably good in all cases. However, Table 1 of their study shows that the attained significance levels vary from 0.035 to 0.065, and a number of them differ from 0.05. They also claimed that their test has substantially better power than Dempster's test and Bai-Saranadasa's test. Bai and Saranadasa (1996) showed that their test (BS's test) has the same asymptotic power as Dempster's test (D's test); thus, D's test and BS's test are investigated further here for one-sided alternatives. The one-sided versions follow Follmann (1996), where $Z_\alpha$ is the $(1-\alpha)$th quantile of the standard normal distribution.
It is also noted that, following Theorem 2.1 of Follmann (1996), the significance level of this test is $\alpha$.
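The decision rule behind a Follmann-type one-sided modification can be sketched as follows. Follmann (1996) refers a two-sided-type statistic to its $2\alpha$ critical value and additionally requires the sum of the sample means to be positive. The sketch assumes a statistic that is approximately standard normal under $H_0$ and unchanged when every observation changes sign; the function name follmann_type_reject is hypothetical, and this is not necessarily the exact form of the tests proposed here.

```python
import numpy as np
from statistics import NormalDist

def follmann_type_reject(statistic, xbar, alpha=0.05):
    """Reject H0 when the statistic exceeds the upper 2*alpha standard normal
    quantile AND the sum of the sample means is positive (assumed rule)."""
    z_2alpha = NormalDist().inv_cdf(1.0 - 2.0 * alpha)
    return bool(statistic > z_2alpha and np.sum(xbar) > 0)

# Hypothetical inputs: a standardized statistic and a vector of sample means.
xbar = np.array([0.3, -0.1, 0.2])
print(follmann_type_reject(2.0, xbar))    # large statistic, positive sum
print(follmann_type_reject(2.0, -xbar))   # same statistic, negative sum
```

The level stays at $\alpha$ because, under $H_0$ and the sign-symmetry assumption, the event that the sum is positive has probability one half and is independent of the statistic, so the $2\alpha$ tail is halved.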
In Table 2, for every $p$ and $n > 5$ considered, DF gives estimated significance levels ranging from 0.047 to 0.057 for $R_S$, from 0.044 to 0.058 for $R_T$ and from 0.047 to 0.055 for the other correlation matrix. The estimated significance levels of both tests are reasonably close to the nominal level in all cases considered.
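To judge how close an estimated significance level is to the nominal value, it helps to know the Monte Carlo error of the estimate. The short calculation below is a sketch that assumes 10,000 independent iterations at a nominal level of 0.05 and treats each iteration as a Bernoulli trial; the two-standard-error band is a common rule of thumb, not a criterion taken from the study.

```python
import math

alpha = 0.05          # nominal significance level
iterations = 10_000   # assumed number of Monte Carlo iterations

# Standard error of a rejection proportion under H0 (binomial sampling).
se = math.sqrt(alpha * (1 - alpha) / iterations)
lower, upper = alpha - 2 * se, alpha + 2 * se

print(round(se, 5), round(lower, 4), round(upper, 4))
# prints: 0.00218 0.0456 0.0544
```

Estimated levels can then be compared with the band from about 0.0456 to 0.0544 to see which deviations exceed simulation noise alone.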

RESULTS
To compare these two tests, their performance is studied by Monte Carlo techniques for multivariate normal distributions with the correlation matrices $R_S$ and $R_T$, that is, the simple order and simple tree order correlations with equal weights, as well as other correlation structures such as $\Re = R_1$ and $\Re = R_2$. Recall that $R_S$ and $R_T$ are given in (2) and (3), respectively. The mean vector for the alternative hypothesis is chosen in the non-negative orthant as $\theta = (v_1, v_2, \ldots, v_p)'$ with $v_{2k-1} = 0$ and $v_{2k} \overset{iid}{\sim} \mathrm{Unif}(0,1)$, $k = 1, 2, \ldots, p/2$, so that the null hypothesis should be rejected. As before, 10,000 iterations are used. In each iteration, $n$ multivariate normal $X$'s with the chosen mean vector and covariance of the form $\Re$ are generated, and the proportion of rejections for each test is recorded. All tests are conducted at the level of significance $\alpha = 0.05$. The Monte Carlo estimated powers of the tests are given in Tables 3-5.
Table 3 gives the powers of D's test and DF's test in all cases considered. DF's test gives power at least as high as that of D's test. DF's test also gives its highest power for $R_S$ when $p$ and $n$ are large and $p \gg n$. The comparison between BS's test and BSF's test, shown in Table 4, is similar: BSF's test shows substantially higher power than BS's test in all cases considered and also gives its highest power for $R_S$ for each $p$ and $n > 5$. Table 5 compares the powers of BSF's test and DF's test in all cases considered. For the correlation matrices $R_S$ and $R_T$, BSF's test and DF's test give almost the same powers; for the correlation matrices $\Re = R_1$ and $\Re = R_2$, DF's test gives higher power than BSF's test for each $p$ when $n > 10$. Overall, we conclude that BSF's test and DF's test give almost the same powers for every $p$, every $n$ and every covariance structure considered.
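The simulation design described above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the study's program: the function names alternative_mean and estimate_power are hypothetical, the rejection rule passed in is a placeholder rather than DF's or BSF's test, the identity matrix stands in for the correlation structure, and only 200 iterations are run to keep the example fast.

```python
import numpy as np

def alternative_mean(p, rng):
    """Mean vector with v_{2k-1} = 0 and v_{2k} ~ Unif(0, 1) (1-based indices)."""
    theta = np.zeros(p)
    theta[1::2] = rng.uniform(0.0, 1.0, size=p // 2)
    return theta

def estimate_power(reject, theta, R, n, iterations, rng):
    """Proportion of simulated samples from N(theta, R) on which reject(X) is True."""
    L = np.linalg.cholesky(R)          # R must be positive definite
    p = len(theta)
    hits = 0
    for _ in range(iterations):
        X = theta + rng.standard_normal((n, p)) @ L.T   # n draws from N(theta, R)
        hits += bool(reject(X))
    return hits / iterations

rng = np.random.default_rng(0)
p, n = 20, 10                           # hypothetical sizes with n < p
theta = alternative_mean(p, rng)
R = np.eye(p)                           # placeholder correlation matrix

# Placeholder rule (NOT one of the proposed tests): positive sum of sample means.
placeholder_rule = lambda X: X.mean(axis=0).sum() > 0
print(estimate_power(placeholder_rule, theta, R, n, 200, rng))
```

Replacing placeholder_rule with an actual test function, R with the correlation matrix of interest and 200 with 10,000 iterations would mirror the design described in the text; setting theta to the zero vector gives the estimated significance level instead of the power.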
Therefore, to secure the power gains offered by these two tests, we recommend them for high-dimensional data with $p \geq 20$ and $n > 10$ under one-sided alternatives.
An example: The proposed tests are applied to DNA microarray data containing 8280 ($p$) gene expression measurements on 110 children suffering from acute lymphoblastic leukemia. To examine the changes in gene expression after treatment, the data were cleaned and the differences in gene expression before and after treatment were obtained for 50 children on 254 ($p$) genes (http://www.ailab.si/supp/bi-cancer/projections/info).
The results of these two tests are shown in Table 6. The p-values of DF's test and BSF's test are 0.0129 and 0.0000, respectively. Thus, both tests lead to rejection of the hypothesis that gene expression after treatment is at the same level as before treatment.

DISCUSSION
At present, no other tests are available for this kind of hypothesis testing on high-dimensional data, so the proposed tests are the best available, although they have been shown to work well only under the conditions considered in this study. Hopefully, other researchers will take an interest in this problem.

CONCLUSION
For data in which the number $n$ of available observations is smaller than the dimension $p$ ($n < p$), the proposed one-sided multivariate tests, DF's test and BSF's test, have larger power than the tests with unrestricted multivariate alternatives. Thus, for comparing two treatments with data of dimension $p$ larger than the number $n$ of available observations, when one believes that for each coordinate the mean response for treatment one is at least as large as that for treatment two, a setting for which no other tests are currently available, we recommend the proposed tests, DF's test and BSF's test, for $p \geq 20$ and $n > 10$ under the circumstances considered in this study.

ACKNOWLEDGEMENT
The researchers would like to express sincere thanks to the Research Center, National Institute of Development Administration, Bangkok, Thailand, for financial support.