Optimization of Penalty Parameter in Penalized Nonlinear Canonical Correlation Analysis by using Cross-Validation

Abstract: Canonical Correlation Analysis (CCA) is a method for finding a linear relationship between a pair of random vectors. However, CCA cannot find a nonlinear relationship between them, since the method maximizes the correlation between linear combinations of the vectors. In order to find a nonlinear relationship, we convert the vectors through known conversion functions such as a kernel function; CCA applied to the converted vectors then reveals a nonlinear relationship in the original vectors. However, this method has a critical issue: the maximized correlation sometimes becomes 1 even if there is no relationship between the random vectors. A penalized method with a penalty parameter has been proposed that avoids this issue when kernel functions are used for the conversion. For this method, however, no procedure has been proposed for optimizing the penalty and other hyperparameters in the conversion function, even though the results heavily depend on these parameters. In this study, we propose an optimization method for the penalty and other parameters, based on a simple cross-validation method.


Introduction
Let y and x be q_0- and p_0-dimensional random vectors, respectively. Without loss of generality, we assume E[y] = 0_{q_0} and E[x] = 0_{p_0}, where 0_ℓ is an ℓ-dimensional vector of zeros. Moreover, let

Σ = E[(y′, x′)′(y′, x′)] = [ Σ_yy  Σ_yx ]
                          [ Σ_yx′ Σ_xx ]

be a (q_0 + p_0) × (q_0 + p_0) unknown matrix, where Σ_yy is a q_0 × q_0 matrix, Σ_yx is a q_0 × p_0 matrix and Σ_xx is a p_0 × p_0 matrix, and we assume det(Σ_yy) ≠ 0 and det(Σ_xx) ≠ 0. Note that Σ_yy = Var(y), Σ_yx = Cov(y, x) and Σ_xx = Var(x), since E[y] = 0_{q_0} and E[x] = 0_{p_0}. As a method for finding the linear relationship between y and x, Hotelling (1936) proposed Canonical Correlation Analysis (CCA). This method is formulated as follows:

(a, b) = arg max_{a, b} a′Σ_yx b  subject to  a′Σ_yy a = 1 and b′Σ_xx b = 1.   (1.1)

Usually, the solutions for a and b are derived by using the Lagrange method of undetermined multipliers. More details of CCA can be found in Muirhead (1982), Gittins (1985), Srivastava (2002) and Weenink (2003). This method is currently being used for data analysis (see, e.g., Doeswijk et al., 2011). CCA, however, cannot find a nonlinear relationship between y and x, since the maximization term in (1.1) is equivalent to Cov(a′y, b′x), which evaluates only the linear relationship between the linear combinations a′y and b′x.
In order to find a nonlinear relationship between y and x, we consider converting them by using known functions such as a kernel function. CCA can then find a nonlinear relationship between y and x through the conversion functions. This method is referred to as Nonlinear Canonical Correlation Analysis (NCCA) and is described in section 2. Hardoon et al. (2004) pointed out that NCCA has a critical issue, which is also shown in section 2.
Using the same idea as in penalized nonlinear regression models, Akaho (2000) proposed a penalized NCCA for the case in which kernel functions are used as the conversion functions. We refer to the penalized NCCA as PNCCA, even when it uses conversion functions other than the kernel function.
In PNCCA, no criteria have yet been developed for optimizing the penalty and other hyperparameters in the conversion function. The reason for this is that it is difficult to evaluate the result of PNCCA. In particular, determining how to optimize the penalty and other hyperparameters in the conversion function is important, since the result of PNCCA heavily depends on these parameters. Hence, in this study, we create an evaluation function for the estimated value. Based on this function and the ordinary Cross-Validation (CV) method, we propose a simple form of the CV method for optimizing these parameters in PNCCA. Details of the proposed function and CV method are presented in section 3.
The remainder of the present paper is organized as follows: In section 2, we present more details of CCA, NCCA and PNCCA. In section 3, we propose the simple CV method for optimizing several parameters in PNCCA.
In section 4, we use numerical studies to compare CCA, NCCA and PNCCA based on the optimized parameters. In section 5, we present our conclusions. Using the proposed CV method, we can select the variables in y and x; we illustrate this method in the Appendix.

CCA, NCCA and PNCCA
In this section, we illustrate CCA, NCCA and PNCCA. We first illustrate CCA, which is expressed as (1.1). Using the Lagrange method of undetermined multipliers, since det(Σ_xx) ≠ 0 and det(Σ_yy) ≠ 0, CCA is equivalent to solving the following eigenvalue problem:

Σ_yy^{-1} Σ_yx Σ_xx^{-1} Σ_yx′ a = θ² a.   (2.1)

Hence, solving the eigenvalue problem in (2.1) and using the largest eigenvalue and the corresponding eigenvector, we can solve the constrained maximization problem in (1.1). More details of CCA can be found in, e.g., Muirhead (1982).
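As a numerical illustration, the eigenvalue problem in (2.1) can be solved with sample covariance matrices substituted for the unknown Σ blocks. The following is a minimal sketch; the function name and the use of sample covariances are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def cca_first_pair(Y, X):
    """First canonical pair via the eigenvalue problem (2.1),
    with sample covariances substituted for the Sigma blocks.
    Y: (n, q0) and X: (n, p0), columns assumed centered."""
    n = Y.shape[0]
    Syy = Y.T @ Y / (n - 1)
    Syx = Y.T @ X / (n - 1)
    Sxx = X.T @ X / (n - 1)
    # Syy^{-1} Syx Sxx^{-1} Syx' a = theta^2 a
    M = np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Syx.T)
    vals, vecs = np.linalg.eig(M)
    k = int(np.argmax(vals.real))
    theta2 = vals[k].real
    a = vecs[:, k].real
    b = np.linalg.solve(Sxx, Syx.T @ a)   # b is proportional to Sxx^{-1} Syx' a
    a = a / np.sqrt(a @ Syy @ a)          # enforce a'Syy a = 1
    b = b / np.sqrt(b @ Sxx @ b)          # enforce b'Sxx b = 1
    return np.sqrt(max(theta2, 0.0)), a, b
```

For example, when Y and X are identical centered matrices, the estimated first canonical correlation equals 1, which illustrates the spurious-correlation issue discussed below.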
However, CCA cannot find a nonlinear relationship between y and x. In order to find one, we convert x as w = ϕ(x), where ϕ(·): ℝ^{p_0} → ℝ^{p_1} is a known conversion function. Without loss of generality, we also assume E[w] = 0_{p_1}, and we assume det(Σ_ww) ≠ 0 where Σ_ww = Var(w). When we apply CCA to y and w, we can find the nonlinear relationship between y and x through ϕ(·). This is NCCA. However, Hardoon et al. (2004) pointed out that, even if there is no relationship between y and x, the result of NCCA can indicate a strong relationship between them.
In order to avoid this problem, Akaho (2000) proposed PNCCA for the case in which kernel functions are used as the conversion functions. This is the primary method we consider in this study. Since, in our setting, only x is converted, PNCCA is expressed as follows:

(a, d) = arg max_{a, d} a′Σ_yw d  subject to  a′Σ_yy a = 1 and d′(Σ_ww + λP)d = 1,   (2.2)

where Σ_yw = Cov(y, w), λ is a nonnegative penalty parameter and P is a known p_1 × p_1 nonnegative definite penalty matrix. Note that λd′Pd is the penalty term in (2.2), since λd′Pd ≥ 0 for any d ∈ ℝ^{p_1}. Furthermore, we note that det(Σ_ww + λP) ≥ det(Σ_ww) > 0, since λ ≥ 0 and P is nonnegative definite (see, e.g., Lütkepohl (1996), section 4.2.6, (11)). As with CCA in (1.1), in order to solve the constrained maximization problem in (2.2), we use the Lagrange method of undetermined multipliers, which leads to the eigenvalue problem

Σ_yy^{-1} Σ_yw (Σ_ww + λP)^{-1} Σ_yw′ a = η² a.

Hence, when the penalty parameter λ is given, we can solve (2.2) by using the largest eigenvalue and the corresponding eigenvector of the above eigenvalue problem.
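For a fixed λ, the penalized eigenvalue problem can likewise be solved numerically. The sketch below assumes an identity penalty matrix P and sample covariance estimates; these are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def pncca_first_pair(Y, W, lam, P=None):
    """First PNCCA pair for a fixed penalty parameter lam, solving
    Syy^{-1} Syw (Sww + lam P)^{-1} Syw' a = eta^2 a.
    Y: (n, q0); W: (n, p1) converted variables; columns centered."""
    n, p1 = Y.shape[0], W.shape[1]
    if P is None:
        P = np.eye(p1)                    # illustrative default penalty matrix
    Syy = Y.T @ Y / (n - 1)
    Syw = Y.T @ W / (n - 1)
    Swwp = W.T @ W / (n - 1) + lam * P    # penalized covariance of w
    M = np.linalg.solve(Syy, Syw) @ np.linalg.solve(Swwp, Syw.T)
    vals, vecs = np.linalg.eig(M)
    k = int(np.argmax(vals.real))
    eta2 = vals[k].real
    a = vecs[:, k].real
    d = np.linalg.solve(Swwp, Syw.T @ a)
    a = a / np.sqrt(a @ Syy @ a)          # enforce a'Syy a = 1
    d = d / np.sqrt(d @ Swwp @ d)         # enforce d'(Sww + lam P)d = 1
    return np.sqrt(max(eta2, 0.0)), a, d
```

Setting lam = 0 recovers NCCA, while increasing lam shrinks the maximized correlation; this is how the penalty prevents the spurious correlation of 1.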
However, despite its importance, no method has been proposed for optimizing λ and the other parameters in the conversion function ϕ(·). In the next section, we propose a simple CV method for optimizing λ and some of the parameters in the known conversion function ϕ(·).

Proposed Method
In this section, we propose a simple CV method for optimizing the penalty and other hyperparameters in the conversion function ϕ(·) used in PNCCA. To construct the CV method, we first consider an evaluation function for the results of PNCCA.
Firstly, since Σ_ww, Σ_yw and Σ_yy are unknown matrices, we estimate them by their unbiased estimators

S = [ S_yy  S_yw ]
    [ S_yw′ S_ww ],

where S_yy is a q_0 × q_0 matrix, S_yw is a q_0 × p_1 matrix and S_ww is a p_1 × p_1 matrix. Let â_λ and d̂_λ denote the solutions of PNCCA in (2.2) obtained from these estimators. We consider creating an objective function for evaluating â_λ and d̂_λ in order to optimize the parameters in PNCCA, namely the penalty parameter and the other parameters in the conversion function. Since the purpose of PNCCA is maximizing a′Σ_yw d under several conditions, we consider the following evaluation function:

R* = â_λ′ Σ_yw d̂_λ.   (3.1)

By maximizing this function, we can optimize the parameters in PNCCA. Here, we note that â_λ and d̂_λ are derived from {y_i, x_i}_{i=1,...,n}. However, Σ_yw is an unknown covariance matrix. We therefore consider using an estimator for Σ_yw that does not depend on {y_i, x_i}_{i=1,...,n} in order to estimate R* in (3.1), since {y_i, x_i}_{i=1,...,n} have already been used for deriving â_λ and d̂_λ. Let y* and x* be new variables obtained independently of {y_i, x_i}_{i=1,...,n}, and let S* be the variance-covariance matrix of y* and w* = ϕ(x*). Then, letting S_{y*w*} be the upper-right q_0 × p_1 block of S*, we can regard S_{y*w*} as an estimator for Σ_yw. Based on S_{y*w*}, the evaluation function R* in (3.1) is estimated by the average of the following value:

R̂* = â_λ′ S_{y*w*} d̂_λ.   (3.2)

Nevertheless, the evaluation function R̂* in (3.2) cannot be used directly for optimizing the parameters in PNCCA, since y* and w* are not available. We thus use a simple CV method to optimize the penalty parameter and the other hyperparameters in the conversion function. As with the ordinary CV method for regression models, we divide {y_i, x_i}_{i=1,...,n} into two subsets: one is used for estimation, and the other for evaluating the estimated value. Let V = (v_1, ..., v_n)′ with v_i = (y_i′, w_i′)′. The essence of the proposed method is to obtain a matrix that serves as an alternative to S_{y*w*}.
Such an alternative matrix cannot be derived from only one sample.
We now use v_i and v_j (i ≠ j) to derive an alternative matrix to S_{y*w*}, defined as

S_ij = (y_i − y_j)(w_i − w_j)′ / 2,

since (y_i + y_j)/2 and (w_i + w_j)/2 are the sample means based on v_i and v_j (i ≠ j), and the sample covariance matrix between y and w based on these two observations is derived as above for any i and j (i ≠ j). Let â_λ^{(−i,−j)} and d̂_λ^{(−i,−j)} (i = 1, ..., n; j = 1, ..., n; i ≠ j) be obtained from the data with v_i′ and v_j′ deleted, based on the ordinary estimation method for covariance matrices. Then, if λ is fixed, d̂_λ^{(−i,−j)} is derived as the eigenvector corresponding to the largest eigenvalue of the eigenvalue problem in section 2 with the deleted-sample covariance matrices. Each pair is then evaluated by

c_ij = â_λ^{(−i,−j)}′ S_ij d̂_λ^{(−i,−j)}.   (3.3)

Thus, for example, the penalty parameter λ and a hyperparameter ζ of PNCCA and the conversion function can be optimized as

(λ̂, ζ̂) = arg max_{λ, ζ} Σ_{i≠j} c_ij.

When more rows of V are used to construct the alternative matrix for S_{y*w*}, this simple CV method can be extended to a subset CV method; however, we focus on the simple CV method in order to save space.
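The simple CV procedure above can be sketched as follows, using adjacent pairs (i, i+1) as in the numerical study of section 4. The fitting routine, the identity penalty matrix and the candidate grid for λ are all illustrative assumptions:

```python
import numpy as np

def pncca_fit(Y, W, lam):
    """PNCCA directions for a fixed lam (identity penalty matrix assumed)."""
    n = Y.shape[0]
    Syy = Y.T @ Y / (n - 1)
    Syw = Y.T @ W / (n - 1)
    Swwp = W.T @ W / (n - 1) + lam * np.eye(W.shape[1])
    M = np.linalg.solve(Syy, Syw) @ np.linalg.solve(Swwp, Syw.T)
    vals, vecs = np.linalg.eig(M)
    a = vecs[:, int(np.argmax(vals.real))].real
    d = np.linalg.solve(Swwp, Syw.T @ a)
    a = a / np.sqrt(a @ Syy @ a)
    d = d / np.sqrt(d @ Swwp @ d)
    return a, d

def cv_score(Y, W, lam):
    """Sum of c_ij in (3.3) over adjacent pairs (i, i+1): fit on the
    remaining n-2 rows, evaluate on S_ij = (y_i - y_j)(w_i - w_j)'/2."""
    n = Y.shape[0]
    total = 0.0
    for i in range(n - 1):
        j = i + 1
        keep = np.ones(n, dtype=bool)
        keep[[i, j]] = False
        Yk = Y[keep] - Y[keep].mean(axis=0)   # re-center the retained sample
        Wk = W[keep] - W[keep].mean(axis=0)
        a, d = pncca_fit(Yk, Wk, lam)
        Sij = np.outer(Y[i] - Y[j], W[i] - W[j]) / 2.0
        total += a @ Sij @ d
    return total
```

A candidate λ is then chosen by maximizing cv_score over a grid (or a bounded scalar search, as in section 4), e.g. `max(grid, key=lambda l: cv_score(Y, W, l))`. Note that the sign of c_ij is well defined: d is computed from a, so a simultaneous sign flip of the eigenvector leaves a′S_ij d unchanged.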

Numerical Study
In this section, we compare CCA, NCCA and PNCCA optimized with the proposed CV method through a numerical study. Note that NCCA can be defined by the same form as PNCCA in (2.2) when we fix λ = 0. Let ∆_r(ρ) be an r × r matrix whose (i, j)th element is ρ^{|i−j|}. The n × p_0 matrix X is generated as X = U∆_{p_0}(ρ_x)^{1/2}, where U is an n × p_0 matrix whose elements were generated independently from the standard normal distribution. Then, Y = (y_1, ..., y_n)′ is derived from each transformation function (A), (B) and (C), where 1_r is an r-dimensional vector of ones and ε_{i·r} is generated independently from N_r(0_r, ∆_r(0.5)), the r-dimensional multivariate normal distribution with mean 0_r and covariance matrix ∆_r(0.5). Here, δ controls the scale of the nonlinear part of the relationship.
NCCA and PNCCA require the converted values, which are expressed through the matrix K. (More details of K can be found in Green and Silverman (1994).) Since the 'arg max' operator is equivalent to the 'arg min' operator applied to the criterion with reversed sign, we select λ by using the 'fminbnd' function in Matlab, which performs a bounded scalar minimization over a specified interval; we restrict the interval to [1, exp(20)] in order to shorten the computation time. Furthermore, in order to reduce the computational burden, we calculate c_ij in (3.3) only for i = 1, ..., n−1 and j = i + 1. In order to derive R* in (3.1), since we need Σ_{y*y*}, Σ_{y*x*}, Σ_{x*x*}, Σ_{y*w*} and Σ_{w*w*}, we set n = 10,000 and generate X for each p_0 and ρ_x and standardize it. Then, for each transformation function (A), (B) and (C) and each parameter δ and q_0, we obtain Y, which we also standardize.
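As an analogue of the Matlab setup described above, the following sketch shows a Gaussian-kernel conversion and a bounded scalar search (scipy's counterpart of 'fminbnd'). The feature map, the toy criterion and all names are our own illustrative choices, not the paper's exact K or criterion:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_features(X, centers, h):
    """Gaussian kernel conversion w = phi(x): one feature per center,
    with bandwidth h (an illustrative stand-in for the matrix K)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h ** 2))

# 'arg max' over lambda in [1, exp(20)] becomes a bounded minimization of
# the negated criterion on the log scale, mirroring the use of 'fminbnd'.
crit = lambda t: -np.exp(-(t - 3.0) ** 2)   # toy criterion maximized at t = 3
res = minimize_scalar(crit, bounds=(0.0, 20.0), method='bounded')
lam_hat = np.exp(res.x)                      # back-transform from the log scale
```

In practice, crit would be replaced by the negated CV score of section 3 evaluated at λ = exp(t).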
Note that q_0 = 3 when the transformation function is (B) and q_0 = 4 when it is (C). In CCA, Σ_{y*y*}, Σ_{y*x*} and Σ_{x*x*} are derived as the sample variance matrix of the standardized Y, the sample covariance matrix of the standardized Y and X, and the sample variance matrix of the standardized X, respectively. In NCCA and PNCCA, we convert X as above for each h and standardize the converted values; the result of the conversion is denoted by W. Then, Σ_{y*w*} and Σ_{w*w*} are derived as the sample covariance matrix of the standardized Y and W and the sample variance matrix of the standardized W, respectively. Using these matrices, we evaluated the results of each method.
In order to evaluate these methods, we fixed X and generated Y for 1, 000 repetitions. We used the standardized X, Y and W in each repetition. For each repetition with CCA, we obtain S yx , S yy and S xx . On the other hand, for each repetition with NCCA and PNCCA, we obtain S yw , S yy and S ww .
For each repetition with CCA in (1.1), we calculated the maximized correlation under the constraints by using S_yx, S_yy and S_xx instead of Σ_yx, Σ_yy and Σ_xx, respectively. We denote the maximized squared correlation by θ̂², which is the largest eigenvalue of S_xx^{-1} S_yx′ S_yy^{-1} S_yx, and use the corresponding eigenvector. For each repetition with NCCA, which is defined as (2.2) with λ = 0, we calculated the maximized correlation under the constraints and the optimized h, for which we used S_yw, S_yy and S_ww instead of Σ_yw, Σ_yy and Σ_ww, respectively. We denote the maximized squared correlation by η̂_0², which is the largest eigenvalue of S_ww^{-1} S_yw′ S_yy^{-1} S_yw. The evaluation values R*_N for NCCA and R*_P for PNCCA are reported in Tables 1-5. The reason for using R*_N and R*_P is that the purpose of the corresponding methods is to find the nonlinear relationship. In Tables 1 to 5, boldface and italics indicate the largest and second-largest values, respectively, in each situation.
First, we consider the results when using pattern (A), which are presented in Table 1-3. When ρ x becomes large, the result values of CCA become small, the results of PNCCA become large and the result values of NCCA also become large except when (n, p 0 ) = (30, 5) and (n, p 0 ) = (30, 8).
The result values of each method become large in almost all cases when δ becomes large, except when (n, p_0) = (100, 3). In this pattern, we can change q_0; thus, we next consider the result values when q_0 changes. When q_0 changes from 3 to 8, the result values of NCCA become large in almost all cases. When q_0 becomes large, the result values of PNCCA become large in most situations for (n, p_0, δ) = (30, 3, 1), (50, 3, 1) and (100, 3, 1), and become small in most situations for (n, p_0, δ) = (30, 3, 3), (n, p_0) = (30, 5) and (n, p_0) = (30, 8). Next, we consider the results when p_0 becomes large. When n = 50 or n = 100, the result values of NCCA and PNCCA become large as p_0 becomes large. The result values of PNCCA also become large when n = 30 and p_0 becomes large. In this connection, we next focus on the results when n becomes large.
When n changes from 30 to 50, the result values of CCA almost all become small when (p_0, q_0) = (3, 5) and (p_0, q_0) = (3, 8), the result values of NCCA become large, and the result values of PNCCA become large when p_0 = 3 and p_0 = 5. When (p_0, q_0) = (8, 5) or (8, 8) and n changes from 30 to 50, the result values of PNCCA almost all become small. The result values of NCCA also become small when n changes from 30 to 50 in almost all situations with (p_0, ρ_x) = (8, 0.8). When n changes from 50 to 100 and p_0 = 8, the result values of PNCCA almost always become small. The result values of PNCCA also become small when n changes from 30 to 100, except when (p_0, ρ_x) = (8, 0.95). The result values of NCCA become large when n changes from 30 to 100. Next, we consider the results when using pattern (B), which are presented in Table 4. When p_0 = 5 and δ becomes large, the result values of CCA also almost all become small, except when ρ_x = 0.8. Next, we compare the result values when p_0 and n both become large. When p_0 becomes large, the result values of PNCCA become large. The result values of CCA become small when p_0 changes from 3 to 5, except when (δ, ρ_x) = (3, 0.8). When p_0 changes from 3 to 8, the result values of CCA almost all become small. The result values of NCCA become large when p_0 changes from 3 to 5 for (n, δ) = (50, 3) and (100, 3), and when p_0 changes from 3 to 8 for n = 50 and n = 100. When p_0 changes from 5 to 8 and δ = 3, the result values of NCCA almost all become large. In contrast, the result values of CCA become small when p_0 changes from 3 to 5 for n = 30. Moreover, when n changes from 50 to 100 with p_0 = 3, and from 30 to 100 with p_0 = 8, the result values of CCA become small. The result values of NCCA become large when n changes from 50 to 100, except when (p_0, ρ_x) = (3, 0.95) or (p_0, δ) = (8, 3).
The result values of PNCCA almost always become large when n becomes large and p_0 = 3 or p_0 = 5. When (p_0, δ) = (8, 1) and n changes from 30 to 50 or from 30 to 100, the result values of PNCCA also become large. When n changes from 50 to 100 and p_0 = 8, the result values of PNCCA become small.
Finally, we consider the results with pattern (C), which are in Table 5. When ρ x becomes large, the result values of CCA and PNCCA become small and large, respectively. When ρ x becomes large, the result values of NCCA almost always become large in p 0 = 3, p 0 = 5, (n, p 0 , δ) = (50, 8, 3) and (n, p 0 ) = (100, 8). When δ becomes large, the result values of NCCA and PNCCA become large. When δ becomes large, the result values of CCA become small when (n, p 0 ) = (100, 8) and p 0 = 3, but not when (p 0 , ρ x ) = (3, 0.5). Next, we compare the results when p 0 becomes large and n becomes large. When ρ x becomes large, the result values of PNCCA almost always become large. The result values of NCCA become small when p 0 changes from 3 to 5 and n = 30. Moreover, when n becomes large, the result values of NCCA and PNCCA almost always become large.
Based on these results, we recommend using PNCCA with the proposed optimization method in order to find a nonlinear relationship.

Conclusion
In the present paper, we considered finding a nonlinear relationship between random vectors. CCA (Hotelling, 1936) can find only a linear relationship between random vectors, based on the correlation between linear combinations of them. The use of conversion functions allows a nonlinear relationship to be found by applying CCA to the converted variables. Hardoon et al. (2004) pointed out that this method has a critical issue and, to avoid it, Akaho (2000) proposed PNCCA for the case in which the conversion functions are kernel functions. Although the result of PNCCA heavily depends on the penalty and other hyperparameters in the conversion function, no optimization method had been proposed for them before the present paper. The reason for this is that no evaluation method based on the covariance matrix had been defined.
In order to optimize the penalty and other parameters in PNCCA, we considered the evaluation function in (3.1) and proposed the simple CV method in section 3. Using the two samples {y_i, w_i} and {y_j, w_j}, where i ≠ j, we defined the alternative covariance matrix S_ij and, using the results of PNCCA for each i and j (i ≠ j), proposed an optimization method for the penalty and other hyperparameters in the conversion function based on the sum of the evaluation values in (3.3). This method can easily be extended to a subset CV method. Our numerical studies showed that PNCCA was almost always the best of the three methods we tested. Thus, we recommend using PNCCA, optimized by the proposed simple CV method.

Acknowledgment
I would like to express my deepest gratitude to Prof. Hirokazu Yanagihara of Hiroshima University for his valuable comments.

Ethics
The author declares that there are no ethical issues associated with this study.
Appendix: Using the Proposed CV Method to Select Variables in y and x
Using the optimized penalty parameter λ̂, the maximized value of a′Σ_yw d in (2.2) is estimated by η̂_λ̂, which coincides with the square root of the largest eigenvalue of the eigenvalue problem in section 2. By comparing these estimated values between candidate subsets of the variables, we can select the variables in y and x; the variables in x[2] are selected if the corresponding inequality does not hold. Since we evaluate the covariance matrix by using the subset CV method, we conjecture that another statistical estimation method based on the covariance matrix may be selected in the same way.