HYPERPARAMETER SELECTION IN KERNEL PRINCIPAL COMPONENT ANALYSIS

In kernel methods, choosing a suitable kernel is indispensable for favorable results. No well-founded methods, however, have been established in general for unsupervised learning. We focus on kernel Principal Component Analysis (kernel PCA), which is a nonlinear extension of principal component analysis and has been used effectively for extracting nonlinear features and reducing dimensionality. As a kernel method, kernel PCA also suffers from the problem of kernel choice. Although cross-validation is a popular method for choosing hyperparameters, it is not straightforwardly applicable to choosing a kernel in kernel PCA because the norms given by different kernels are not comparable. It is thus important to develop a well-founded method for choosing a kernel in kernel PCA. This study proposes a method for choosing the hyperparameters of kernel PCA (the kernel and the number of components) based on cross-validation of the reconstruction errors of pre-images, which are comparable in the original space. Experimental results on synthesized and real-world datasets demonstrate that the proposed method successfully selects an appropriate kernel and number of components for kernel PCA in terms of visualization and classification errors on the principal components. The results imply that the proposed method enables automatic design of hyperparameters in kernel PCA.


INTRODUCTION
Dimension reduction is an essential part of modern data analysis, where we often need to handle high-dimensional data. The purpose of dimension reduction may be visualization, noise reduction or pre-processing for further analysis. Among others, Principal Component Analysis (PCA) (Pearson, 1901) is one of the most famous methods; it reduces the dimensionality by projecting data onto a low-dimensional subspace with the largest variance.
Kernel Principal Component Analysis (kernel PCA) (Scholkopf et al., 1998) has been proposed as a nonlinear extension of the standard PCA and has been applied to various purposes including feature extraction, denoising and pre-processing of regression. Kernel PCA is an example of the so-called kernel methods (Scholkopf and Smola, 2002), which aim to extract nonlinear features of the original data by mapping them into a high-dimensional feature space, a Reproducing Kernel Hilbert Space (RKHS). This mapping is called the feature map. A number of methods have been proposed as kernel methods, including the Support Vector Machine (SVM) (Boser et al., 1992), kernel ridge regression (Saunders et al., 1998), kernel canonical correlation analysis (Akaho, 2001; Bach and Jordan, 2002; Alam et al., 2010), a multiclass SVM using mean reversion and coefficient of variance (Premanode et al., 2013) and so on.
It is well known that the performance of a kernel method depends highly on the choice of kernel. For supervised learning such as SVM and kernel ridge regression, cross-validation with the objective function of learning is popularly used for choosing the hyperparameters of a kernel algorithm, such as the parameters in a kernel (e.g., the bandwidth of the Gaussian RBF kernel). On the other hand, no well-founded methods have been proposed in general for unsupervised learning such as kernel PCA and kernel canonical correlation analysis.

Science Publications
JCS
This study focuses on kernel PCA and proposes a method for choosing its hyperparameters: the parameters in a kernel and the number of kernel principal components. In the case of standard linear PCA, the algorithm can be formulated as a minimization of self-regression with reduced rank, and cross-validation approaches have been proposed for choosing the number of components (Krzanowski, 1987; Wold, 1978). In contrast, while a similar regression formulation is possible for kernel PCA, the cross-validation approach is not straightforwardly applicable to choosing a kernel in kernel PCA: the error of the regression is given by the RKHS norm of the feature space associated with the kernel and thus the cross-validation errors are not comparable across different kernels.
As detailed in section 2, the proposed method for choosing the hyperparameters of kernel PCA uses cross-validation on the reconstruction errors of pre-images in the original space. The pre-image of a feature vector is defined as an approximate inverse image of the feature map (Mika et al., 1999). Various methods have been proposed to calculate the pre-image of a feature vector, as explained in section 2.1 (Mika et al., 1999; Kwok and Tsang, 2003; Bakir et al., 2004; Rathi et al., 2006; Arias et al., 2007; Zheng et al., 2010). In the proposed method, given an evaluation point in the cross-validation, we compute the pre-image of the corresponding feature vector projected onto the subspace given by kernel PCA and then evaluate the reconstruction error of the evaluation point. The kernel and the number of components attaining the minimum average reconstruction error are chosen as the optimum ones. We demonstrate the effectiveness of this method experimentally with various synthesized and real-world datasets.

Kernel PCA
In kernel methods, the nonlinear feature map is given by a positive definite kernel, which provides nonlinear methods for data analysis with efficient computation. A symmetric kernel k(·,·) defined on a space 𝒳 is called positive definite if for an arbitrary number of points x_1, …, x_n ∈ 𝒳 the Gram matrix (k(x_i, x_j))_{ij} is positive semidefinite. It is known (Aronszajn, 1950) that a positive definite kernel k is associated with a Hilbert space H, called the Reproducing Kernel Hilbert Space (RKHS), consisting of functions on 𝒳 so that the function value is reproduced by the kernel; namely, for any function f ∈ H and point x ∈ 𝒳, the function value f(x) is given by:

f(x) = ⟨f, k(·, x)⟩_H    (1)

where ⟨·,·⟩_H is the inner product of H. Equation 1 is called the reproducing property. Replacing f with k(·, y) yields k(x, y) = ⟨k(·, x), k(·, y)⟩_H. To transform data for extracting nonlinear features, the mapping Φ: 𝒳 → H is defined by Φ(x) = k(·, x), which is regarded as a function of the first argument. This map is called the feature map and the vector Φ(x) in H is called the feature vector. The inner product of two feature vectors is then given by ⟨Φ(x), Φ(y)⟩_H = k(x, y). This is known as the kernel trick, serving as a central equation in kernel methods. By this trick the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of either Φ(·) or H. With this computation of inner products, many linear methods of classical data analysis can be extended to nonlinear ones with efficient computation based on Gram matrices. Once Gram matrices are computed, the computational cost does not depend on the dimensionality of the original space.
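As a small illustration (a hypothetical Python sketch of ours, not part of the paper, whose experiments use R's kernlab), the Gram matrix of the Gaussian RBF kernel k(x, y) = exp(−s‖x − y‖²) can be computed directly from the definition and checked for symmetry and unit diagonal:

```python
# Illustrative sketch: Gram matrix of a Gaussian RBF kernel with inverse
# bandwidth s, checked for the basic properties of a positive definite kernel.
import math

def rbf_kernel(x, y, s=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-s * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-s * sq_dist)

def gram_matrix(points, s=1.0):
    """Gram matrix (k(x_i, x_j))_ij for a list of points."""
    return [[rbf_kernel(xi, xj, s) for xj in points] for xi in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, s=0.5)

# Symmetry and unit diagonal follow directly from the definition.
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(K[i][i] - 1.0) < 1e-12 for i in range(3))
```

Positive semidefiniteness of such matrices is guaranteed by Aronszajn's theorem; the snippet only checks the properties that are immediate from the formula.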
Kernel PCA (Scholkopf et al., 1998) conducts principal component analysis for the feature vectors. More precisely, given data points X_i ∈ 𝒳, i = 1, 2, …, n, kernel PCA outputs a set of principal functions by the following two-step procedure: (i) transform the data nonlinearly into the feature space H, i.e., X_i → Φ(X_i); (ii) solve the linear PCA problem for the feature vectors, i.e., find the directions in H along which the variance of {Φ(X_i)} is maximized. The algorithm of kernel PCA is described as follows (Scholkopf et al., 1998). Let Φ̃(X_i) = Φ(X_i) − (1/n) Σ_{j=1}^n Φ(X_j) be the centered feature vector. The estimated covariance operator is given by Ĉ = (1/n) Σ_{i=1}^n Φ̃(X_i) ⟨Φ̃(X_i), ·⟩_H with the centered feature vectors. The principal directions g ∈ H are given by the unit eigenvectors corresponding to the largest eigenvalues and thus the problem is converted to solving the eigenequation:

Ĉ g = λ g

By using the kernel trick, this problem is reduced to the generalized eigenproblem that finds α ∈ ℝ^n such that:

M α = nλ α,    α^T M α = 1

where M is the n×n centered Gram matrix defined by M = CKC with K_{ij} = k(X_i, X_j) and C = I_n − (1/n) 1_n 1_n^T. Here I_n is the identity matrix of size n and 1_n is the vector of n ones. The principal direction is then expressed as g = Σ_{i=1}^n α_i Φ̃(X_i) and the constraint α^T M α = 1 corresponds to the condition ⟨g_j, g_h⟩ = δ_{jh}, where δ_{jh} is the Kronecker delta.
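The procedure can be sketched numerically. The following Python/numpy implementation is our own illustration (function names are ours; the paper's experiments use R's kernlab): it forms the Gram matrix, centers it and solves the eigenproblem for the leading components.

```python
# Illustrative sketch of kernel PCA: center the Gram matrix as M = C K C and
# take the leading eigenvectors, scaled so that alpha^T M alpha = 1.
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Return the first n_components kernel PC scores of the training data."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    M = C @ K @ C                                 # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(M)          # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lams = eigvecs[:, idx], eigvals[idx]
    # For a unit eigenvector alpha, alpha^T M alpha = lambda; rescale by
    # 1/sqrt(lambda) so each principal direction g has unit RKHS norm.
    alphas = alphas / np.sqrt(np.maximum(lams, 1e-12))
    scores = M @ alphas                           # <centered Phi(X_i), g_j>
    return scores, alphas

rbf = lambda x, y, s=1.0: np.exp(-s * np.sum((x - y) ** 2))
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores, _ = kernel_pca(X, rbf, n_components=2)
```

Because the Gram matrix is double-centered, each column of `scores` sums to zero, mirroring the centering of the feature vectors.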

Choice of Kernel
The result of kernel PCA obviously depends on the choice of kernel. It is often the case that the kernel has some parameters, like the popular examples shown in Table 1. In such a case, these parameters may have a strong influence on the results. To illustrate the influence, using the Wine data (see section 3) we show the plots of the first two kernel principal components with different values of the inverse bandwidth parameter s in the Gaussian RBF kernel and of the degree d and constant c in the polynomial kernel (Fig. 1). From the figure, we see that in both kernels the results of kernel PCA depend strongly on the parameters and an appropriate choice is indispensable for the method to give a reasonable low-dimensional representation of data.
It is known that the standard PCA can be formulated as a self-regression or reconstruction problem; namely, the first r principal components of centered data {X_i}_{i=1}^n ⊂ ℝ^d are equal to the projections B X_i given by the reduced rank regression:

min_{A,B} Σ_{i=1}^n ‖X_i − A B X_i‖²

where A and B are d×r and r×d matrices, respectively. Based on this regression formulation, the cross-validation approach (Stone, 1974) has been used for the standard PCA to choose the number of components (Wold, 1978; Krzanowski, 1987) by minimizing the above self-regression errors.
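This reduced rank regression view can be checked numerically. The following Python/numpy snippet (an illustrative sketch with synthetic data, not from the paper) verifies that projecting onto the top-r eigenvectors of the covariance attains a self-regression error equal to n times the sum of the discarded eigenvalues:

```python
# Check that the PCA reconstruction A B X_i, with A = V_r and B = V_r^T
# (V_r: top-r eigenvectors of the covariance), attains a residual equal to
# n times the sum of the discarded eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                      # center the data

cov = X.T @ X / X.shape[0]                  # covariance with divisor n
eigvals, V = np.linalg.eigh(cov)            # ascending eigenvalues
eigvals, V = eigvals[::-1], V[:, ::-1]      # reorder to descending

r = 2
Vr = V[:, :r]                                # d x r, so A = Vr and B = Vr.T
reconstruction = X @ Vr @ Vr.T
error = np.sum((X - reconstruction) ** 2)

# Residual variance lives in the discarded eigendirections.
assert np.isclose(error, X.shape[0] * eigvals[r:].sum())
```

The identity holds because the squared residual of each point decomposes along the discarded eigendirections, whose average squared projections are exactly the discarded eigenvalues.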
In a similar manner, kernel PCA can also be formulated as the self-regression of the centered feature vectors. In fact, it is easy to see that the first r principal directions are given by:

min Σ_{i=1}^n ‖Φ̃(X_i) − Σ_{j=1}^r f_j ⟨g_j, Φ̃(X_i)⟩_H‖²_H

where f_j, g_j ∈ H with ⟨g_j, g_l⟩ = δ_{jl}. One might expect that this self-regression formulation could be applied to the cross-validation method for choosing a kernel in kernel PCA. This is not possible, however, because the above regression error is measured by the RKHS norm given by the kernel and thus the errors are not comparable among different kernels. The goal of this study is thus to propose a method of choosing a kernel (and the number of components) in kernel PCA by introducing a criterion that is comparable for different kernels.

MATERIALS AND METHODS
The proposed method for choosing a kernel and the number of components uses cross-validation by the comparable reconstruction errors in the original space. To evaluate the errors, we need to solve the pre-image of the feature vectors projected on the subspace given by the principal directions. We first give a brief review of pre-image methods.

Pre-Image of Kernel PCA
While many kernel methods provide their output in the form of feature vectors in the RKHS, in some problems we want to find a point in the original space. In Mika et al. (1999), kernel PCA is applied to a denoising task, in which an image corresponding to the RKHS vector obtained by kernel PCA is used as a denoised version of the original image.
Given a vector f in the RKHS H, it is in general not possible to find a rigorous pre-image, that is, a point X in the original space such that Φ(X) = f holds exactly. We thus define an (approximate) pre-image of f by the minimizer of:

‖Φ(X) − f‖²_H    (4)

In the original paper, Mika et al. (1999) have used the fixed-point iterative method. Many other approaches have also been proposed to solve the pre-image problem. A non-iterative approach based on distance constraints has been proposed by Kwok and Tsang (2003), although it depends on the choice of neighborhood. An approach of learning a pre-image map was developed by Bakir et al. (2004); to apply this technique, we need an additional regularization parameter. Some authors have extended these approaches in different ways (Rathi et al., 2006; Arias et al., 2007; Zheng et al., 2010). More recently, a two-stage closed-form approach has also been proposed (Honeine and Richard, 2011). These advanced methods, however, usually require some tuning parameters. We use the fixed-point method in our proposed method, since it has a simple form for the Gaussian RBF kernel.
We here explain the fixed-point method for solving the pre-image problem in the kernel PCA setting. Let X_1, X_2, …, X_n ∈ ℝ^m be the training data for kernel PCA and let f = Σ_{i=1}^n γ_i Φ(X_i) be the expansion of the feature vector whose pre-image is sought. For the Gaussian RBF kernel k(x, y) = exp(−s‖x − y‖²), minimizing Equation 4 with respect to Z and setting the derivative to zero yields the fixed-point algorithm:

Z_{t+1} = Σ_{i=1}^n γ_i exp(−s‖Z_t − X_i‖²) X_i / Σ_{i=1}^n γ_i exp(−s‖Z_t − X_i‖²)    (5)

In the case of polynomial kernels, the fixed-point condition does not yield such an iterative form as for the Gaussian RBF kernel. We thus use the steepest descent method for Equation 4 in our experiments on polynomial kernels in section 3.4.
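The fixed-point update for the Gaussian RBF kernel can be sketched as follows; this is an illustrative Python implementation of ours in which the coefficients γ_i are simply taken as given, whereas in the full method they come from the kernel PCA projection.

```python
# Sketch of the fixed-point pre-image iteration for the Gaussian RBF kernel:
# Z_{t+1} = sum_i gamma_i k(Z_t, X_i) X_i / sum_i gamma_i k(Z_t, X_i).
import numpy as np

def fixed_point_preimage(X, gamma, s, z0, n_iter=200, tol=1e-10):
    """Iterate the fixed-point update until convergence or n_iter steps."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        w = gamma * np.exp(-s * np.sum((X - z) ** 2, axis=1))
        denom = w.sum()
        if abs(denom) < 1e-300:              # degenerate weights; stop early
            break
        z_new = (w[:, None] * X).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:
            z = z_new
            break
        z = z_new
    return z

# With nonnegative coefficients, each iterate is a weighted mean of the
# training points and therefore stays inside their convex hull.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
gamma = np.array([1.0, 1.0, 1.0])
z = fixed_point_preimage(X, gamma, s=0.1, z0=[0.5, 0.5])
```

Note that for general (possibly negative) γ_i coming from a kernel PCA projection, the denominator can become small, which is one reason the method can be trapped or stall, as discussed below.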

Method for Hyperparameter Choice
For the objective function of cross-validation, we use the reconstruction error between a test point X and the pre-image Z of the corresponding projected feature vector P_l Φ̃(X) given by kernel PCA. The reconstruction errors are measured by the distance of the original space 𝒳 = ℝ^m. By this approach, unlike the regression error in the RKHS, we obtain errors that are comparable for different kernels. The architecture of the proposed method is given in Fig. 2 and the algorithm of the kernel choice in kernel PCA is given in Fig. 3. We describe the Leave-One-Out Cross-Validation (LOOCV) for simplicity, but the extension to the general K-fold cross-validation is straightforward. By a similar algorithm we are able to select the number of principal components or any other hyperparameters.
In solving approximate pre-images, the fixed-point or the steepest descent method may be trapped by local minima. To avoid this problem, we use five initial points for the optimization algorithm and choose the best one. As shown in the next section, the obtained pre-images give appropriate results.
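Putting the pieces together, the following Python sketch is our own simplified illustration of the selection procedure (names such as `loocv_error` are hypothetical and the paper's implementation uses R's kernlab): for each candidate inverse bandwidth s, it computes the LOOCV reconstruction error of pre-images and picks the minimizer. For brevity it uses a single initial point (the held-out point itself) rather than the five random restarts described above.

```python
# Simplified LOOCV reconstruction-error selection over candidate bandwidths.
import numpy as np

def rbf_gram(A, B, s):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-s * d2)

def preimage(Xtr, gamma, s, z0, n_iter=100):
    z = z0.copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-s * ((Xtr - z) ** 2).sum(1))
        if abs(w.sum()) < 1e-300:
            break
        z = (w[:, None] * Xtr).sum(0) / w.sum()
    return z

def reconstruction(Xtr, x, s, l):
    """Pre-image of the projection of the test point x onto the first l PCs."""
    n = Xtr.shape[0]
    K = rbf_gram(Xtr, Xtr, s)
    C = np.eye(n) - 1.0 / n
    M = C @ K @ C
    lam, A = np.linalg.eigh(M)
    idx = np.argsort(lam)[::-1][:l]
    A = A[:, idx] / np.sqrt(np.maximum(lam[idx], 1e-12))
    kx = rbf_gram(Xtr, x[None, :], s)[:, 0]
    mx = kx - kx.mean() - K.mean(1) + K.mean()    # centered cross-kernel
    gamma_t = A @ (A.T @ mx)                      # coefficients, centered features
    gamma = gamma_t - gamma_t.mean() + 1.0 / n    # add the mean feature back
    return preimage(Xtr, gamma, s, z0=x.copy())

def loocv_error(X, s, l):
    errs = []
    for i in range(X.shape[0]):
        Xtr = np.delete(X, i, axis=0)
        z = reconstruction(Xtr, X[i], s, l)
        errs.append(((X[i] - z) ** 2).sum())
    return float(np.mean(errs))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(6, 2)) for c in [(0, 0), (2, 0), (0, 2)]])
candidates = [0.1, 1.0, 10.0]
errors = {s: loocv_error(X, s, l=2) for s in candidates}
best_s = min(errors, key=errors.get)
```

The same loop, run over a grid of (s, l) pairs, realizes the selection of both the kernel parameter and the number of components.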
Note also that the fixed-point method may not work well for a very large inverse bandwidth s, since the term of the nearest X_i is dominant in the right-hand side of Equation 5, so that Z_t may stay at X_i. In the experiments, we set a reasonable parameter range of s by checking the kernel PCA results with two components.

RESULTS
We apply the proposed method for choosing the parameters in a kernel and the number of principal components in kernel PCA to various datasets. The Gaussian RBF kernel is used except in section 3.4, where the polynomial kernel is discussed. We use two synthesized and seven real-world datasets, which are summarized in Table 2. For the real-world datasets, we standardize each variable of the data before applying kernel PCA. In solving pre-images, we take initial values from the uniform distribution on the interval [-1, 1]. Detailed discussions on the results will be given in section 4.

Synthesized Data
We use two synthesized datasets to illustrate the effectiveness of the proposed method. Each dataset is two-dimensional and has three clusters.
We prepare the inverse bandwidth parameters s ∈ {0.05, 1, 5, 10, 25, 50} and s ∈ {1, 5, 10, 20, 50, 100, 200} for Synthesized data-1 and Synthesized data-2, respectively, and calculate the LOOCV reconstruction errors by pre-images. To see the variation over sampling, we generate 100 samples for each case of data 1 and 2 and make boxplots. Figure 4 shows (a) scatter plots of a sample of the original datasets, (b) the boxplots and (c, d) the scatter plots of the first two kernel principal components with the best kernel bandwidths (c) and with other ones (d). We can see by comparing (c) and (d) that the proposed method chooses a hyperparameter that separates the three clusters clearly, which suggests the effectiveness of the method. Note that kernel PCA does not use explicit information about the three clusters; they are displayed with different colors and markers only for visualization purposes.

Computational Cost
To illustrate the computational cost of the proposed method, the CPU times (in seconds) for six different data sizes (n) and five numbers of components (l) using Synthesized data-2 are shown in Table 3. The CPU time increases with the sample size, since the computation of the LOOCV and the optimization of pre-images is heavier for larger samples. The configuration of the computer is an Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz, 12.00 GB memory and a 64-bit operating system. We have used the 'kernlab' package in R for the implementation of kernel PCA. The Gaussian RBF kernel is used with inverse bandwidth s = 50.

Real World Problems
We first apply the proposed method to five datasets: Wine, Diabetes, BUPA liver disorders, Fertility and Zoo. The former three are taken from Izenman (2008) and are available at the website of the book; the latter two are taken from the UCI Machine Learning Repository (Bache and Lichman, 2013).
As kernel PCA is an unsupervised method, the evaluation of results is not straightforward. Since kernel PCA is often used as a pre-processing technique for regression and classification, we evaluate the LOOCV classification errors with the k-NN classifier (k = 5) to see the appropriateness of the hyperparameters chosen by the proposed method. Note that we do not use the class labels for kernel PCA, but use them only for evaluating the classification errors.
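The evaluation protocol can be sketched in plain Python; this is an illustrative toy example of ours, using k = 3 instead of the paper's k = 5 only to keep the data tiny.

```python
# Leave-one-out classification error of a k-NN classifier, as used here to
# evaluate the chosen hyperparameters on the extracted features.
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

def loocv_knn_error(X, y, k):
    """Fraction of points misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(X)):
        tX = X[:i] + X[i + 1:]
        ty = y[:i] + y[i + 1:]
        errors += knn_predict(tX, ty, X[i], k) != y[i]
    return errors / len(X)

# Two well-separated one-dimensional clusters: LOOCV error should be zero.
X = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
y = [0, 0, 0, 1, 1, 1]
```

In the paper's setting, `X` would hold the kernel principal component scores for a given (kernel, l) pair, so that lower LOOCV error indicates hyperparameters preserving the cluster structure.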
We consider a set of inverse bandwidths s∈{0.05, 0.10, 0.25, 0.50, 0.75, 1.00, 10.00} and six numbers of kernel principal components l ∈ {2, 3, 4, 5, 8, 10} for each dataset. The LOOCV reconstruction errors used in the proposed method and the LOOCV classification errors for all the hyperparameters are shown in Table 4, from which we see that the selected hyperparameters attain the minimum or close to the minimum classification error for all the datasets. This suggests that the proposed method provides appropriate hyperparameters that maintain the cluster structure effective for the classification tasks.
We next apply the proposed method to datasets that are larger in dimensionality and sample size. The USPS data (Song et al., 2008) consists of 16×16 gray-scale images of handwritten digits, so the dimensionality is 256. The original dataset has 2007 images, but we draw 100 images from each of the five digits 1, 2, 3, 4, 5 and add Gaussian noise with mean 0 and standard deviation 0.01. This dataset is referred to as USPSG-500. We take seven inverse bandwidths s ∈ {0.0001, 0.001, 0.0025, 0.005, 0.0075, 0.01, 0.025} and eight numbers of kernel principal components l ∈ {2, 4, 8, 16, 32, 64, 128, 256}. The LOOCV reconstruction errors in the proposed method are shown in Table 5, in which the minimum is attained at s = 0.01 and l = 64. The kNN (k = 5) misclassification rates estimated with LOOCV are also listed in the table.
We next apply the proposed method to the nutritional value of food, a dataset that is not for classification. The dataset has 961 food items with six nutritional components as attributes (Izenman, 2008). We consider nine values of inverse bandwidths s ∈ {0.001, 0.1, 0.5, 0.75, 1, 5, 10, 100, 200} and six numbers of components l ∈ {1, 2, 3, 4, 5, 6}. The results are displayed in Table 6. The smallest LOOCV reconstruction error is attained at s = 0.5 and l = 2. Since, unlike classification tasks, it is not straightforward to evaluate the performance of the proposed method, we show the scatter plots of the first two kernel principal components using three values of inverse bandwidths s ∈ {0.001, 0.5, 200} in Fig. 5.

Polynomial Kernel
We use the proposed method for choosing the hyperparameters of the polynomial kernel. Using the Wine dataset, we consider seven values of the offset parameter c ∈ {0.1, 0.5, 1, 5, 10, 25, 50}, two values of the degree d ∈ {2, 3} and four numbers of kernel principal components l ∈ {2, 3, 4, 5}. The results are given in Table 7. We observe that the smallest LOOCV reconstruction error is attained in the area close to the minimum classification error.

DISCUSSION
While kernel PCA has been applied in various areas of machine learning, such as dimensionality reduction, feature extraction, denoising and so on (Scholkopf and Smola, 2002; Rathi et al., 2006; Hofmann, 2007; Zheng et al., 2010; Feng and Liu, 2013), in most cases the kernel and the number of features are chosen in a heuristic way. Recently, multi-kernel PCA (Ren et al., 2013) has also been proposed, which applies a combination of multiple kernels instead of choosing one. It is well known, however, that the multi-kernel approach results in a computationally heavy algorithm, which may need advanced optimization techniques. The method proposed in this study, in contrast, is based on the reconstruction errors in the original space, which can be regarded as a natural extension of the aim of the standard linear PCA. The required computation is simply cross-validation with a basic optimization algorithm such as the fixed-point or gradient method.
We provide detailed discussions on the experimental results for the real-world datasets in section 3. For the classification datasets, we can see from Tables 4 and 5 that the chosen hyperparameters (the bandwidth parameter in the Gaussian RBF kernel and the number of principal components) give the best or close to the best LOOCV classification error: the best for the Wine data and the second or third best for the other five datasets. In all cases, we observe that the chosen hyperparameters are close to the best parameters for the classification error. These experimental observations imply that the proposed method gives appropriate hyperparameters, with which the low-dimensional features obtained by kernel PCA represent effective information of the data.
From Table 6 and Fig. 5, we can see that the hyperparameter chosen by the proposed method provides features with a clearer structure than the other two hyperparameters used in (a) and (c). For this dataset, Izenman (2008) provides a detailed analysis of the results of kernel PCA with a hand-tuned bandwidth parameter: a meaningful "curve" structure is observed in the result of two-dimensional kernel PCA. As shown in Fig. 5, our method automatically chooses a hyperparameter that accords with the observation in Izenman (2008).
We can also observe from Table 7 that the proposed method chooses the hyperparameters for kernel PCA with the polynomial kernel such that the corresponding LOOCV classification error attains the third best. This accords with the observations on the other cases with the Gaussian RBF kernel and further demonstrates the appropriateness of the proposed method.
Regarding the computational cost, the proposed method needs to solve the pre-image problem for each data point, which may cause a computational issue for large datasets. Table 3 shows that the computational time increases roughly quadratically with respect to the sample size. To reduce the computational cost, it may be possible to use only a part of the data for evaluating reconstruction errors in choosing hyperparameters.

CONCLUSION
We have discussed kernel PCA and proposed a method for choosing its hyperparameters, the optimal kernel (parameters in a kernel) and the number of kernel principal components, through the LOOCV of the reconstruction errors of pre-images. We have made empirical studies using synthesized examples and real-world datasets. For the evaluation of the proposed method, in addition to visualization, we used classification errors for the data projected onto the subspace chosen by the method when the dataset is provided for a classification task. We have observed that for all the datasets the classification performance of the kernel PCA chosen by the proposed method is the best or close to the best among the candidates of hyperparameters. The experimental results imply that the proposed method successfully provides an automatic way of finding hyperparameters that give an appropriate low-dimensional representation of data by kernel PCA.
There are also limitations of the proposed method. First, the optimization methods for computing the pre-image, such as the fixed-point and steepest descent methods, may be trapped in local optima. Applying other pre-image methods to alleviate this problem will be important future research. Second, since our method uses cross-validation with pre-image optimization, it may be time-consuming for large datasets. One possible approach is to use a part of the data for evaluating reconstruction errors; it is also an interesting future direction to develop a more efficient way of hyperparameter choice for kernel PCA. Third, the reconstruction errors in the proposed method assume that the original space admits a metric, while kernel PCA can be applied to more general data spaces including non-metric spaces. It is also among our future studies to consider hyperparameter choice applicable to kernel PCA for non-metric spaces.

ACKNOWLEDGEMENT
This study has been supported by JSPS Grant-in-Aid for Scientific Research 22300098 and MEXT Grant-in-Aid for Scientific Research on Innovative Areas 25120012.