Inversion of Covariance Matrix for High Dimension Data

Problem statement: In testing the mean vector of independent and identically distributed multivariate normal random vectors with an unknown covariance matrix, when the sample size does not exceed the dimension, $n \le p$ (for example, data from DNA microarrays, where a large number of gene expression levels are measured on relatively few subjects), the $p \times p$ sample covariance matrix $S$ does not have an inverse. Hence any statistic involving the inverse of $S$ does not exist. Approach: In this study, we considered a modification of $S$ of the form $S + cI$ and looked for the smallest real value $c \ne 0$ for which $(S + cI)^{-1}$ exists. Results: The study showed that, as the dimension $p$ tends to infinity and with the smallest change in $S$, $(S + cI)^{-1}$ exists when $c = 1$. Conclusion: In statistical analysis of high dimensional data for which the inverse of the sample covariance matrix does not exist, one way to modify the sample covariance matrix $S$ so that it has an inverse is to use the form $S + cI$, and we recommend choosing $c = 1$.


INTRODUCTION
Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a $p$-dimensional multivariate normal distribution with unknown mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$ and unknown positive definite covariance matrix $V$, with $n \le p$. The sample mean and the $p \times p$ sample covariance matrix $S$ are

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'.$$

For $n \le p$, Johnson and Wichern (2002) showed that the determinant of the sample covariance matrix is zero for all samples, that is, $S$ is singular. Hotelling's $T^2$ statistic therefore does not exist, so Dempster (1958; 1960), Bai and Saranadasa (1996) and Srivastava and Du (2008) developed test statistics based on other forms of $S$ rather than on its inverse. Knowledge of the work of Polymenis (2011), Girgis et al. (2010), George and Kibria (2010), Yahya et al. (2011) and Nassiry et al. (2009) may also help to transform these ideas.

A covariance matrix is a real symmetric matrix. Harville (1997) showed that every symmetric matrix has an eigenvalue, and Searle (1982) showed that the eigenvalues of every real symmetric matrix are real. Now consider a $p \times p$ covariance matrix $A$ with an eigenvalue $\lambda$ and a corresponding real vector $v \ne 0$, so that, by the definition of an eigenvalue, $Av = \lambda v$. The matrix $A$ is positive semidefinite if and only if every eigenvalue satisfies $\lambda \ge 0$, and positive definite if and only if every eigenvalue satisfies $\lambda > 0$. Since $v'Sv = \frac{1}{n-1}\sum_{i=1}^{n} \left((X_i - \bar{X})'v\right)^2 \ge 0$ for every real vector $v$, the sample covariance matrix $S$ is at least positive semidefinite. Searle (1982) also showed that for any $p \times p$ matrix $A$, the determinant of $A$ is equal to the product of its eigenvalues, that is,

$$|A| = \prod_{i=1}^{p} \lambda_i.$$

Hence, because $|S| = 0$, the sample covariance matrix $S$ must have at least one eigenvalue equal to zero.

Since every positive definite matrix is nonsingular and has a positive determinant, the easiest way to make the covariance matrix $S$ from high dimensional data nonsingular is to modify it into a positive definite matrix. Suppose that $S$ has $r$ nonzero eigenvalues, that is, it has exactly $r$ positive eigenvalues and $p - r$ zero eigenvalues. We are interested in making $S$ nonsingular with the smallest change in $S$, so we consider the form $S + cI$ for a real number $c \ne 0$ and look for the smallest such $c$ for which $(S + cI)^{-1}$ exists.
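The singularity of $S$ for $n \le p$ can be checked numerically. The following is a minimal sketch, assuming NumPy, simulated standard normal data, the divisor $n - 1$ in $S$, and illustrative values of `n`, `p`, the seed and the zero tolerance `1e-8`; none of these choices come from the study itself.

```python
import numpy as np


def sample_covariance(X):
    """Return the p x p sample covariance (divisor n - 1) of an n x p data matrix."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (n - 1)


rng = np.random.default_rng(0)      # illustrative seed (assumption)
n, p = 5, 20                        # illustrative sizes with n <= p (assumption)
X = rng.standard_normal((n, p))     # simulated data, not data from the study
S = sample_covariance(X)

# S is real symmetric, so eigvalsh returns real eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(S)

# Count eigenvalues that are zero up to a numerical tolerance (assumption: 1e-8).
num_zero = int(np.sum(np.abs(eigenvalues) < 1e-8))
rank = int(np.linalg.matrix_rank(S, tol=1e-8))

print("rank of S:", rank)
print("numerically zero eigenvalues:", num_zero)
print("smallest eigenvalue:", eigenvalues[0])
```

Because the centered data matrix has at most $n - 1$ linearly independent rows, the rank of $S$ cannot exceed $n - 1$, so at least $p - n + 1$ eigenvalues are zero and the product of the eigenvalues vanishes.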

MATERIALS AND METHODS
Suppose $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_r > 0$ and $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_p = 0$ are all the eigenvalues of $S$. By the definition of an eigenvalue, for any eigenvalue $\lambda_i$, $i = 1, 2, \ldots, p$, of $S$ with a corresponding real vector $v \ne 0$, we have $Sv = \lambda_i v$, and then $(S + cI)v = Sv + cv = \lambda_i v + cv = (\lambda_i + c)v$. So $\lambda_i + c$ is an eigenvalue of $S + cI$. Thus the $p$ eigenvalues of $S + cI$ are $\lambda_1 + c \ge \lambda_2 + c \ge \lambda_3 + c \ge \cdots \ge \lambda_r + c > c = \cdots = c$, where the value $c$ is repeated $p - r$ times. We can see that $c$ cannot be negative, because then $S + cI$ cannot be a positive definite matrix. Now if $0 < c < 1$, the determinant of $S + cI$ is

$$|S + cI| = \prod_{i=1}^{p} (\lambda_i + c) = c^{\,p-r} \prod_{i=1}^{r} (\lambda_i + c),$$

which approaches zero as $p$ tends to infinity, so that in the limit $(S + cI)^{-1}$ does not exist. Therefore the only possible case is $c \ge 1$, and since we are looking for the smallest value of $c$ for which $(S + cI)^{-1}$ exists, we pick $c = 1$. This completes the proof.
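The eigenvalue shift used in this argument can be illustrated numerically. The sketch below assumes NumPy and simulated data; the helper names `shifted_eigenvalues` and `zero_eigenvalue_factor`, the seed, the sizes and the tolerance are hypothetical choices made only for illustration.

```python
import numpy as np


def shifted_eigenvalues(S, c):
    """Eigenvalues of S + cI, computed directly from the shifted matrix."""
    p = S.shape[0]
    return np.linalg.eigvalsh(S + c * np.eye(p))


def zero_eigenvalue_factor(c, p, r):
    """Factor c**(p - r) contributed to |S + cI| by the p - r zero eigenvalues of S."""
    return c ** (p - r)


rng = np.random.default_rng(1)       # illustrative seed (assumption)
n, p, c = 6, 30, 1.0                 # illustrative sizes; c = 1 as recommended
X = rng.standard_normal((n, p))      # simulated data, not data from the study
S = np.cov(X, rowvar=False)          # p x p sample covariance, divisor n - 1

lam = np.linalg.eigvalsh(S)          # eigenvalues of S, ascending
r = int(np.sum(lam > 1e-8))          # number of positive eigenvalues (tolerance assumed)

# Each eigenvalue of S + cI should equal the matching eigenvalue of S plus c.
max_shift_error = float(np.max(np.abs(shifted_eigenvalues(S, c) - (lam + c))))

# With c = 1 every eigenvalue of S + cI is at least about 1, so the inverse exists.
shifted = S + c * np.eye(p)
inverse = np.linalg.inv(shifted)
sign, logdet = np.linalg.slogdet(shifted)
logdet_from_eigenvalues = float(np.sum(np.log(lam + c)))

print("positive eigenvalues of S:", r)
print("largest deviation from lambda_i + c:", max_shift_error)
print("log-determinant of S + cI:", logdet)
print("factor c**(p - r) for c = 0.5:", zero_eigenvalue_factor(0.5, p, r))
```

The last line shows how the factor $c^{\,p-r}$ shrinks when $0 < c < 1$ and $p - r$ is large, whereas for $c = 1$ the same factor equals 1 for every $p$.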

RESULTS
This study provides a way to modify a sample covariance matrix $S$ computed from data whose dimension $p$ is at least as large as the number of available observations $n$, $n \le p$, so that the inverse of the sample covariance matrix does not exist. The matrix $S$ is replaced by $S + cI$, and with the smallest change in $S$, $(S + cI)^{-1}$ exists when $c = 1$.
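As an illustration of how the modified matrix might be used, the sketch below evaluates a Hotelling-type quadratic form with $S$ replaced by $S + cI$. This is an assumption made for illustration only: the function name `regularized_quadratic_form`, the simulated data and the hypothesized mean `mu0` are hypothetical, and the study does not give a null distribution for this quantity.

```python
import numpy as np


def regularized_quadratic_form(X, mu0, c=1.0):
    """Compute n * (xbar - mu0)' (S + cI)^{-1} (xbar - mu0) for an n x p data matrix X."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)      # p x p sample covariance, divisor n - 1
    diff = xbar - mu0
    # Solve (S + cI) z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(S + c * np.eye(p), diff)
    return float(n * diff @ z)


rng = np.random.default_rng(2)       # illustrative seed (assumption)
n, p = 8, 40                         # illustrative sizes with n <= p (assumption)
X = rng.standard_normal((n, p))      # simulated data, not data from the study
value = regularized_quadratic_form(X, mu0=np.zeros(p), c=1.0)

print("quadratic form with S + I:", value)
```

With $c = 0$ the call to `np.linalg.solve` would be applied to the singular matrix $S$, which is why the unmodified statistic cannot be computed when $n \le p$.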

DISCUSSION
At present, data sets in which the dimension $p$ is at least as large as the number of available observations $n$, $n \le p$, arise in many diverse applied fields, e.g., medical, pharmaceutical, agricultural, psychological, educational, social, behavioral, political, criminal, industrial, meteorological, zoological and biological sciences, but there are few statistical techniques for analyzing this kind of data. The technique presented here may help researchers to develop new statistical techniques for analyzing high dimensional data.

CONCLUSION
In statistical analysis of high dimensional data, where the sample size is less than the dimension (the number of variables), any statistic involving the inverse of the sample covariance matrix does not exist, because that inverse does not exist. One way to modify a sample covariance matrix $S$ so that it has an inverse is to use the form $S + cI$. For this form, we showed that $(S + cI)^{-1}$ exists when $c \ge 1$, and for the smallest change in $S$ we recommend choosing $c = 1$.