On Multiresponse Semiparametric Regression Model

Statistical modeling of the relationship between predictors and responses sometimes involves more than one response, the multiresponse situation. Furthermore, it can happen that some predictors have a linear relationship with the responses while the other predictors have an unknown relationship. To overcome this modeling problem we propose the multiresponse semiparametric regression model. This model has more than one response and contains both parametric and nonparametric components. This study focuses on how to estimate the parameters of the multiresponse semiparametric regression model. The weighted penalized least squares method is used to fit the model. This method produces a partial spline estimator for the nonparametric component and, under some additional assumptions, the estimator is a polynomial natural spline. The performance of this estimator depends on the smoothing parameters, so we also propose the G criterion, a modification of generalized cross validation for the multiresponse semiparametric setting, to choose the optimal smoothing parameters. Using simulated data, it is shown that this model can describe the relationship between several predictors and several responses well.


INTRODUCTION
Statistical modeling, especially regression analysis, sometimes deals with two or more response variables. For example, a bank manager wants to know the impact of credit size on company performance, which can be measured by hours of activity, labor, assets, revenue and profit. Another example comes from sport: a coach wants to measure the impact of exercise duration on fitness, which can be measured using endurance, speed, power, strength, flexibility and balance. In addition, in the field of emission gas analysis, an engineer wants to know the impact of vehicle age on exhaust emissions, which can be measured using the concentrations of CO2, CO, NOx and SOx.
Several authors have developed multiresponse statistical modeling. Wegman (1981), Miller and Wegman (1987) and Fessler (1991) developed algorithms for spline smoothing. Yee and Wild (1996) used an additive model for multiresponse data from the exponential family. Soo and Bates (1996) developed multiresponse spline regression using the Gauss-Newton algorithm to estimate the model's parameters. Wang et al. (2000) developed bivariate spline nonparametric regression under the assumption that the covariance matrix is unknown, and proved that the joint estimates have smaller posterior variances than function-by-function estimates and are therefore more efficient. You et al. (2007) used two-stage estimation for nonparametric seemingly unrelated regression.

JMSS
The semiparametric regression model is also called the partially linear model. It is a fusion of parametric and nonparametric regression and is more flexible than a fully parametric or fully nonparametric model, because some of the predictors enter linearly while the rest are modeled nonparametrically. Engle et al. (1986) first used this model; over several decades it has been developed extensively and its applications have spread to many fields. Further details about applications of the semiparametric model can be found in Hardle et al. (2000). Lin and Carroll (2001) used the semiparametric model for clustered data by developing a semiparametric partially generalized linear model. Next, Qin and Zhu (2008) and Qin et al. (2009) used semiparametric models for longitudinal data. You and Zhou (2009) used polynomial splines for semiparametric regression with panel data. In addition, Ruppert et al. (2009) reviewed the development of semiparametric regression and its applications during 2003-2007.

The estimator in semiparametric regression can be obtained by the penalized least squares method. It is the solution of the optimization problem in Equation 1:

min_{β,f} { R(f) + λ J(f) }   (1)

where R(f) and J(f) indicate the goodness of fit and the smoothness level, respectively. The quantity λ, called the smoothing parameter, controls the tradeoff between goodness of fit R(f) and smoothness J(f). Wahba (1990) showed that if the goodness of fit is the residual sum of squares and the smoothness level is the integral of the squared second derivative, then the estimator which minimizes (1) in the semiparametric regression context is a partial spline. However, this estimator still depends on the smoothing parameter λ. If λ is small (λ→0), the estimator tends to interpolate the data, fitting every data point exactly. Conversely, if λ is large (λ→∞), the estimator tends to be forced toward the linear regression fit. So the optimal smoothing parameter must be chosen to obtain the best estimator.
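The effect of λ described above can be illustrated with a small penalized least squares sketch. This is not the paper's partial spline estimator: a truncated-power cubic spline basis, the knot placement and the λ values are illustrative assumptions, chosen so that the penalty acts only on the spline coefficients and λ→∞ therefore forces a straight-line fit.

```python
import numpy as np

def penalized_spline_fit(x, y, knots, lam):
    """Penalized least squares on a truncated-power cubic spline basis.

    The penalty acts only on the spline coefficients, so a small lam gives a
    rough fit that follows the data and a very large lam forces a linear fit.
    """
    B = np.column_stack([np.ones_like(x), x] +
                        [np.clip(x - k, 0.0, None) ** 3 for k in knots])
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))  # intercept/slope unpenalized
    coef = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return B @ coef

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2.0, 50))
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 50)
knots = np.linspace(0.2, 1.8, 10)

rough = penalized_spline_fit(x, y, knots, 1e-3)   # small lambda: follows the data
smooth = penalized_spline_fit(x, y, knots, 1e8)   # large lambda: forced linear fit
```

Comparing the two fits shows the tradeoff: the small-λ fit tracks the sine pattern closely, while the large-λ fit is essentially the ordinary least squares line.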
Several authors have developed methods for choosing the optimal smoothing parameter: Wahba (1990) proposed the generalized cross validation method, and Wang (1998) proposed the Unbiased Risk (UBR) criterion.
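The generalized cross validation idea can be sketched for a single response: for a linear smoother with fitted values ŷ = A(λ)y, GCV(λ) = n‖(I − A(λ))y‖² / [tr(I − A(λ))]², and λ is chosen to minimize this score. A minimal sketch follows (the truncated-power basis, knots and search grid are illustrative assumptions; the paper's G criterion generalizes this idea to several responses):

```python
import numpy as np

def gcv_score(y, B, D, lam):
    # Influence matrix A = B (B'B + lam*D)^{-1} B'; GCV = n*RSS / tr(I - A)^2
    n = len(y)
    A = B @ np.linalg.solve(B.T @ B + lam * D, B.T)
    resid = y - A @ y
    return n * np.sum(resid ** 2) / (n - np.trace(A)) ** 2

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 2.0, 60))
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, 60)
knots = np.linspace(0.2, 1.8, 8)
B = np.column_stack([np.ones_like(x), x] +
                    [np.clip(x - k, 0.0, None) ** 3 for k in knots])
D = np.diag([0.0, 0.0] + [1.0] * len(knots))  # penalize only spline terms

grid = 10.0 ** np.arange(-8, 9)               # candidate smoothing parameters
best = min(grid, key=lambda lam: gcv_score(y, B, D, lam))
```

A grid search as above is the simplest selector; in practice the score can also be minimized numerically over log λ.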
Motivated by applications of multiresponse statistical modeling and the partial spline result of Wahba (1990), in this study we propose the multiresponse semiparametric regression model. For this model, the study focuses on how to estimate the model's parameters. After that, we consider the G criterion as a method to choose the optimal smoothing parameters; this criterion is a modification of generalized cross validation for the multiresponse semiparametric setting. Parameter estimation and selection of the optimal smoothing parameters are then demonstrated using simulated data.

Statistical Model
Suppose there are n random samples and each sample is observed for r responses and several predictors. Let y_ji denote the jth response at the ith observation, with j = 1, 2, …, r and i = 1, 2, …, n. It is assumed that the response y_ji has a linear relationship with some predictors and an unknown relationship with one predictor. Denote by x̃_ji = (x_ji1, x_ji2, …, x_jip) the 1×p vector of predictor variables that have a linear relationship with the response y_ji, and by t_ji ∈ [a_j, b_j] the predictor variable that has an unknown relationship with the response.
Based on the above data construction, the multiresponse regression model is constructed as in Equation 2:

y_ji = x̃_ji β̃_j + f_j(t_ji) + ε_ji, i = 1, 2, …, n; j = 1, 2, …, r   (2)

where the ε_ji, i = 1, 2, …, n, are independent random errors with mean zero and variance σ_j², while for a fixed observation i the errors ε_ji and ε_ki of different responses are correlated. The vector β̃_j = (β_j1, β_j2, …, β_jp)^T ∈ R^p is the p×1 parameter of the parametric component. Because the relationship between the predictor t_ji and the response y_ji is unknown, f_j(t_ji) is an unknown function forming the nonparametric component, assumed only to be smooth and continuous on the interval [a_j, b_j]. Model (2) can be stated in matrix notation as in Equation 3:

ỹ_j = X_j β̃_j + f̃_j + ε̃_j, j = 1, 2, …, r   (3)

with ỹ_j = (y_j1, y_j2, …, y_jn)^T and ε̃_j = (ε_j1, ε_j2, …, ε_jn)^T.

Estimation Method
The model's parameters are estimated by the weighted penalized least squares method. Based on this method and model (3), a goodness of fit R(f̃) and a smoothness penalty J(f̃) are defined, in which the weight for the goodness of fit comes from the variance-covariance matrix between the responses, whose elements are defined accordingly.
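The displayed formulas here did not survive extraction. A plausible reconstruction of the weighted criterion, following the standard partial spline setup in Wahba (1990) (the weight matrix W, the penalty order m and the factor n are assumptions filled in from the surrounding context), is:

```latex
R(\tilde f) = \bigl(\tilde y - X\tilde\beta - \tilde f\bigr)^{T} W
              \bigl(\tilde y - X\tilde\beta - \tilde f\bigr),
\qquad
J(\tilde f) = \sum_{j=1}^{r} \lambda_{j}
              \int_{a_j}^{b_j} \bigl(f_j^{(m)}(t)\bigr)^{2}\,dt,
```

with the estimators obtained as the minimizers of R(f̃) + n J(f̃), which is the role played by optimization problem (5) below.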

Parameter Estimation
It will be shown that the solutions of optimization (5) are the estimators for β̃ and f̃, given in Equations 6 and 7. To obtain (6) and (7), we start from properties of elements of a Hilbert space. If f_j ∈ H, j = 1, 2, …, r, for some Hilbert space H, then f_j can be stated as f_j = f_j0 + f_j1, where f_j0 ∈ H_0 and f_j1 ∈ H_1. The component f_j1 is the orthogonal projection of f_j onto H_1 in the space H. Suppose that φ_j1, φ_j2, …, φ_jm form a basis for the subspace H_0, so that each function f_j0 ∈ H_0 can be stated as in Equation 8:

f_j0 = Σ_{v=1}^{m} α_jv φ_jv   (8)

Next, an n×m matrix T_j is defined, with entries (T_j)_iv = φ_jv(t_ji). Meanwhile, consider ψ_j1, ψ_j2, …, ψ_jn as a basis for H_1, so that each f_j1 ∈ H_1 can be expressed as in Equation 9:

f_j1 = Σ_{k=1}^{n} γ_jk ψ_jk   (9)

Because f_j = f_j0 + f_j1, every f_j ∈ H can be stated as the sum of these two expansions (Equation 10). In matrix notation, Equation (10) can be stated as f̃_j = T_j α̃_j + B_j γ̃_j, with (B_j)_ik = ψ_jk(t_ji). Generally, for all j, the nonparametric component can be presented as in Equation 11:

f̃ = T α̃ + B γ̃   (11)

It should be noted that the quantity J(f̃), for j = 1, 2, …, r, can be presented in terms of γ̃ (Equation 12). Next, based on Equations (5), (11) and (12), Q(β̃, α̃, γ̃) is defined as in Equation 13, where:

Λ = diag(λ_1 I_n, λ_2 I_n, …, λ_r I_n)

Minimizing (13) can be done by differentiating with respect to each parameter to be estimated. First, the estimator for f̃ is derived; for this purpose, z̃ = ỹ − Xβ̃ is treated as constant, so Equation (13) becomes a function of the nonparametric coefficients only.

Minimizing of Equation (13)
Defining M = BΣ + nΛ gives Equation 14, and the estimator of f̃ can then be stated as in Equation 15. On the other side, substituting Equation (11) yields Equation 16. From (15) and (16), the fitted multiresponse semiparametric regression can be written as in Equation 17. Next, consider a special condition for the nonparametric estimator f̃: when f satisfies the assumption stated below, the function f that minimizes the weighted penalized least squares criterion is a Natural Spline (NS) of degree 2m−1, denoted NS_2m−1. From Wahba (1990), a natural spline of order m is a real-valued function on [a, b] defined with the aid of n knots, −∞ ≤ a < t_1 < t_2 < ⋯ < t_n < b ≤ ∞, with the following properties:
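The displayed formulas for (14)-(16) were lost in extraction. For orientation, the analogous single-response smoothing spline system from Wahba (1990), with f = Tα + Bγ and B the Gram matrix of the H_1 representers, is sketched below; this is a plausible template only, since the multiresponse version here works with the block matrix M = BΣ + nΛ defined above rather than the single-response B + nλI:

```latex
% Normal equations for minimizing
% (z - T\alpha - B\gamma)^{T}(z - T\alpha - B\gamma) + n\lambda\,\gamma^{T} B \gamma:
(B + n\lambda I)\,\gamma + T\alpha = z,
\qquad
T^{T}\gamma = 0,
\qquad
\hat f = T\hat\alpha + B\hat\gamma .
```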

Science Publications
where π_k denotes the class of polynomials of degree k or less and C^k the class of functions with k continuous derivatives. For more detail the reader may consult that reference.
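Using this notation, Wahba's (1990) definition of the natural polynomial spline referenced above can be written out as follows (a standard restatement of the three defining properties):

```latex
f \in NS_{2m-1} \iff
\begin{cases}
f \in \pi_{2m-1} & \text{on each } [t_i, t_{i+1}],\ i = 1, \ldots, n-1,\\[2pt]
f \in \pi_{m-1}  & \text{on } [a, t_1] \text{ and } [t_n, b],\\[2pt]
f \in C^{2m-2}   & \text{on } [a, b].
\end{cases}
```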
The following derivation shows that if f satisfies the above assumption, then the estimator of f is a natural spline of degree 2m−1. For t ∈ [a_j, t_j1], the required polynomial form is verified; as a result, based on (i), (ii) and (iii), it can be inferred that f_j(t) ∈ NS_2m−1.

Numerical Example
As an example of multiresponse semiparametric regression, simulation data were generated using the following equation. This equation generates data for multiresponse semiparametric regression with three responses, a linear parametric component and a sine pattern for the nonparametric component. The parametric predictor variables x_ji, j = 1, 2, 3, were generated from Normal(0, 1). The nonparametric predictor variables t_ji, j = 1, 2, 3, were generated from Uniform(0, 2). The random error ε̃ = (ε̃_1^T, ε̃_2^T, ε̃_3^T)^T was drawn from a Multivariate Normal(0, Ω) distribution; ε_1i, ε_2i and ε_3i are correlated with corr(ε_1i, ε_2i) = corr(ε_1i, ε_3i) = corr(ε_2i, ε_3i) = ρ = 0.5. The error variances σ_j², j = 1, 2, 3, were generated from Uniform(0.1, 0.2). Figure 1 shows partial scatter plots of this simulation, used to confirm the relationship between each response and the parametric and nonparametric predictors. The scatter plots show that the relationship between the parametric predictor (x) and the response is linear, while the relationship with the nonparametric predictor (t) relatively follows the pattern of a sine function. Figure 2 shows a three-dimensional surface plot describing the relation between the predictors and the response in one picture, composed by combining predictor x and predictor t on the horizontal axes and the response y on the vertical axis. Due to space limitations we present only the surface plot of response 1; the surface plots of responses 2 and 3 have similar patterns.
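The generating scheme described above can be sketched as follows. The regression coefficients and the exact sine function are not printed in the text, so unit slopes and sin(πt) are illustrative assumptions; the predictor distributions, error variances and correlation structure follow the description.

```python
import numpy as np

rng = np.random.default_rng(42)
n, r, rho = 200, 3, 0.5

# Parametric predictors x_ji ~ Normal(0, 1); nonparametric t_ji ~ Uniform(0, 2)
x = rng.normal(0.0, 1.0, size=(n, r))
t = rng.uniform(0.0, 2.0, size=(n, r))

# Error variances sigma_j^2 ~ Uniform(0.1, 0.2); pairwise correlation rho = 0.5
sigma2 = rng.uniform(0.1, 0.2, size=r)
sd = np.sqrt(sigma2)
corr = np.full((r, r), rho)
np.fill_diagonal(corr, 1.0)
omega = corr * np.outer(sd, sd)            # covariance matrix Omega
eps = rng.multivariate_normal(np.zeros(r), omega, size=n)

# Assumed model: y_ji = x_ji * beta_j + sin(pi * t_ji) + eps_ji
beta = np.ones(r)                          # illustrative slopes (not stated in the text)
y = x * beta + np.sin(np.pi * t) + eps
```

Each column of `y` is one response; the columns share the sine-shaped nonparametric pattern in `t` and carry cross-response error correlation through `omega`.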
Next, based on these data, the parametric and nonparametric estimators were computed using R software. Due to space limitations, the computational results are presented briefly in the following tables and figures. Table 1 shows various smoothing parameters and their G scores; the G score changes from the smallest (line 1) to the largest smoothing parameters (line 11). As mentioned before, the partial spline estimator depends on the smoothing parameter. The following figures show the impact of the smoothing parameter on the shape of the surface plot of the predicted response (vertical axis) against the predictors x and t (horizontal axes); again, only the surface plot of response 1 is presented, to avoid too many figures with the same information. Figure 3 shows that for a small smoothing parameter (line 1 of Table 1), the spline estimate implies that the predicted values tend to interpolate the actual response values; comparing the surface plots in Fig. 2 and Fig. 3 shows that their shapes are almost the same. Figure 4 shows that for a large smoothing parameter (line 11 of Table 1), the predicted values tend to be forced to fit the linear regression: along both horizontal axes the predicted values have a linear pattern, meaning both the parametric and nonparametric components are linear. We conclude that for a large smoothing parameter, the partial spline estimator tends to be forced toward the linear regression fit. Table 1 shows that the optimal condition is reached when the G score is minimum, equal to 0.320519 (line 6 of Table 1), with smoothing parameters for responses 1, 2 and 3 of 2.667644e-08, 6.223154e-09 and 2.962001e-09, respectively. Using the optimal smoothing parameters, the corresponding estimator for the parametric component β is presented in Table 2. Figure 5 shows the surface plots of the predicted values of each response using the optimal smoothing parameters.

DISCUSSION
There are several issues regarding multiresponse semiparametric regression. The first concerns the estimator of the nonparametric component. In this study, the nonparametric estimator in Equation (6) is a general form of smoothing spline, produced by the weighted penalized least squares method. However, if we apply some assumptions to the function f, the estimator becomes a special form of spline, such as the polynomial natural spline in this study. In other words, different assumptions on f and its norm produce different forms of spline. We do not discuss this further here; a deeper discussion can be found in Wahba (1990). Other methods can also be used to estimate the parameters, such as penalized likelihood; that method, however, requires a distributional assumption on the residuals.
The second issue is the smoothing parameter, especially the method for choosing the optimal smoothing parameter and how many smoothing parameters must be chosen. In this study, we modify generalized cross validation to choose the optimal smoothing parameters in the context of multiresponse semiparametric regression. There are other methods for selecting the optimal smoothing parameter, such as Unbiased Risk (UBR), Generalized Maximum Likelihood (GML) and Cross Validation (CV); Wang (1998) compared these methods. In our case, the model involves multiple smoothing parameters, which complicates the optimization problem. Therefore, to make the optimization easier, we chose the GCV method, which is relatively easy to modify and optimize.

CONCLUSION
The nonparametric and parametric estimators in the multiresponse semiparametric regression model can be obtained using weighted penalized least squares. More specifically, the estimator of the nonparametric component is a partial spline function; in particular, if the nonparametric component satisfies some assumptions, the spline is a polynomial natural spline. This estimator depends on the smoothing parameters, which implies that the predicted values also depend on them; the optimal smoothing parameters can be chosen using the G criterion. A numerical example shows that the model can be applied well to simulated data. The remaining problems are to derive the statistical properties of the estimator and to apply this model to real-life problems.