Simulation Study for Evaluating a New Three Stages Procedure for Selecting the Right Multivariate Regression Model

Problem statement: This article considers the analysis of multivariat e regression experiment that is used frequently in variety of applications re earch such as psychiatric epidemiologic studies. Our study concerned with multivariate regression model in which the responses were correlated in particula r ways for both standard and non-standard multivariat e model structures. Our objective is to find reliab le procedure that can be used to guide the selection o f the best multivariate regression model that has t e right covariance structure and in the same time has t e right multivariate model structure for both standard and non-standard multivariate model struct u es. Approach: In this study, we were proposing and evaluating a new three stages procedure that co uld be used to guide the selection of the best multivariate regression model that has the right co variance structure and in the same time has the rig ht multivariate model structure using bootstrap simula tion procedure. Results: The simulation results indicated that the performance of the new procedure in identifying the right multivariate regression model that has the right covariance structure and i n the same time the right multivariate model struct ure from both standard and non-standard multivariate mo del structures was excellent overall. Conclusion/Recommendations: We recommended using the new procedure as stander tools to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure.


INTRODUCTION
The multivariate linear regression model composed of multiple correlated dependent variables for each subject, in addition to a set of predictor variables. Multivariate linear regression allows researchers to fit a single model for each response, taking into account the correlation among the multiple responses on a given subject. The basic assumptions of multivariate regression are multivariate normality of the residuals, homogenous variances of residuals conditional on predictors, common covariance structure across observations and independent observations. When these assumptions are satisfied, the coefficients will be unbiased, the least-squares estimates will have minimum variance and the relationships among the coefficients will reflect the relationships among the predictors. In general, this is what REG procedure of the SAS System is set up to do. When we deal with the multivariate linear regression model a companion to the estimation problem is the model selection problem, which consists of choosing an appropriate model from a class of candidate models to characterize the data under study. The covariance structures of the observed multiple responses makes multivariate linear regression data analysis different from univariate multiple linear regression data in term of the prediction of an individual response component given some or all of the remaining components (McCullagh, 2006).
Although the MIXED procedure of the SAS System is used as tools for fitting mixed effects and repeated measures models, it is also a very useful tool for fitting multivariate regression. The most advantages of using the MIXED procedure instead of stander multivariate procedure are MIXED uses observations have incomplete responses, Mixed has the ability to deal with non-stander (e.g., multiple design) multivariate models and MIXED enables researchers to fit correlated error model with different covariance structure. The MIXED procedure of the SAS System has different selections for modeling the covariance structure. The MIXED procedure of the SAS System can be used to develop either Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML) estimates in order to complete the analysis of the multivariate regression, where REML estimation are generally preferred to ML. A lot of effort is usually needed to decide what the suitable covariance structure of the data is at the beginning of the statistical analysis. Statisticians often use information criteria such as AIC (Akaike, 1974), BIC (Schwarz, 1978), CAIC (Bozdogan, 1987), HQIC (Hannan and Quinn, 1979) to guide the selection of the covariance structure in mixed models (AL-Marshadi, 2007;Keselman et al., 1999;Littell et al., 2000;Singer, 1998) Many studies have investigated performance of those information criteria in selection of the covariance structure considering repeated measures models (Keselman et al., 1999;Ferron et al., 2002;Gomez et al., 2005;Guerin and Stroup, 2000). One study compared the following, AIC, BIC, AICC and RIC to select stander multivariate regression model with the stander covariance structure (Azari et al., 2006). A simulation study by Beal (2005) compared different information criteria methods in SAS for selecting the right multiple linear regression models for large sample size. Another research by Seghouane (2006) had developed and compared a new small sample model selection criterion for multivariate regression models to other known criterion with the stander covariance structure. A new paper proposed new criterion deals with selection of variables in multivariate linear regression models with fewer observations than the dimension by using Akaike's Information Criterion (AIC) (Yamamura et al., 2008). A psychiatric epidemiologic study proposed multivariate linear regression as the preferred method when the multiple informant outcome data are continuous using MIXED procedure in SAS (Goldwasser and Fitzmaurice, 2001). In our resent simulation study, we were concerned with multivariate regression model in which the responses were correlated in particular ways for both standard and nonstandard multivariate model structures. Our goal in that study was evaluating six model selection criteria in SAS using MIXED procedure to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure for both standard and nonstandard multivariate model structures. The study indicated that the percentages of identifying the right multivariate regression model from both standard and non-standard multivariate model structures were very low overall, except for specific models that involve indicator variable (AL-Marshadi, 2009b).
In the current study we are still concerned with the multivariate regression model in which the responses were correlated in particular ways for both standard and non-standard multivariate model structures. Our objective is to find reliable procedure that can be used to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure for both standard and non-standard multivariate model structures using MIXED procedure in SAS.
In general form, the mixed effects linear model can be written as (McCulloch and Searle, 2001;Littell et al., 1996): Where: β = p×1 vector of fixed effects U = q×1 vector of random effects e = n×1 vector of residuals X = n×p design matrix for fixed effects Z = n×p design matrix for random effects U~N(0,G), e~N(0,R), Y~N(Xβ,V) and V = ZGZ´+R When V is known, the Best Linear Unbiased Estimators (BLUE) of estimable functions h β of the fixed effects in (1) are given by: In most applications V is unknown. Therefore, it is estimated from the data where estimators based on (2) are not generally BLUE. Various procedures were proposed for testing hypotheses on fixed effects in mixed models with unknown V , most of which assume that V is estimated by the REML method (Fai and Cornelius, 1996;Giesbrecht and Burns, 1985;Kenward and Rogers, 1997). Standard error estimates based on (3) are biased downwards when V replaced by its estimate (Kacker and Harville, 1984). Fixed effects are estimated based on (2), with V replaced by a plug-in REML estimate. Null hypotheses of the form 0 H : h 0 β = are tested by: when rank(h)>1in general, the test statistics in (4) only have approximate F-distribution. The approximate denominator degree of freedom υ of F-distribution can be determined using one of the four different methods implemented in MIXED procedure of SAS. The four methods of the approximations are residual method, containment method (this is the default in MIXED), extended Satterthwaite (1941) method of Giesbrecht and Burns (1985); Fai and Cornelius (1996) and Kenward-Roger method. Kenward and Roger (1997) found good performance of their method across a number of designs. Also, Guerin and Stroup (2000) recommended using the Kenward-Roger method as standard operating procedure. Therefore, Kenward-Roger method was considered in this study for approximating the denominator degrees of freedom.

MATERIALS AND METHODS
The following model reflects the standard multivariate linear model: Where: Y = n×r matrix of r response variables measured on n subjects X = n×p design matrix for explanatory variables β = p×r matrix of regression coefficients e = n×r matrix of residuals whose rows are iid normal, i.e., the rows of e~N(0,Σ) In the standard multivariate linear regression model the response distribution is Y~N(XβI n ,⊗Σ), where the parameter space consists of the coefficient matrix β of order p×r plus the covariance matrix, Σ∈PD r , the cone of symmetric positive semi-definite r-matrices. This multivariate regression setting is called the standard multivariate regression model because both components of the parameter space are unconstrained. In the following we give an example showing the format of the standard multivariate linear model (REG format) and its relationship to the MIXED format. The example considers the format with two response variables and one explanatory variable in addition to an intercept term for three subjects: To use the MIXED format, we need to write Y, β and e as vectors and rearrange X accordingly as follow: The MIXED format in matrix notation is Y X e = β + ɶ ɶ ɶ ɶ , which is a special case of the mixed model (1) (Singer, 1998). The situation of interest in this study is one in which the responses are correlated in particular ways. Different covariance matrix structures of ∑ were used to simulate correlated error models for the simulated study data.
In this study we are proposing and evaluating a new three stages procedure which could be used to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure.
The new procedure is based on three stages using the MIXED procedure. At the first stage, the right multivariate model structure will be determined with a new model selection criterion using the bootstrap simulation procedure. The new model selection criterion will be called Variance Information Criterion (VARIC). At the second stage, the right covariance structure will be determined with the model selection criteria available in MIXED procedure using the bootstrap simulation procedure considering the right multivariate model structure that was determined in the first stage. At the third stage, the right multivariate model structure and right covariance structure that determined in the first and second stage will be used to fit the right multivariate model using MIXED procedure. Also, the study involves comparing the performance of the model selection criteria that are available in MIXED procedure using the bootstrap simulation procedure to determine the right covariance structure.
The first stage of the procedure concerns with evaluating our new model selection criterion in terms of its ability to identify the appropriate multivariate model structure with the help of the bootstrap simulation procedure where the corresponding value of the Variance Information Criterion (VARIC) calculated as the variance of the absolute value for the residual of the multivariate regression model when the multivariate model was fitted with the unstructured covariance structure using MIXED procedure. We may use the new information criterion to guide the selection of the right model structure such as selecting the model structure with the smallest value of the new information criterion.
The bootstrap simulation procedure for the first and second stage involves using the bootstrap technique (Efron, 1983;1986) and the Multiple Comparisons with the Best (MCB) procedure (Hsu, 1984) as tools to help the information criterion in identifying the right multivariate model structure in the first stage and identifying the right covariance structure in the second stage. The idea of the new approach can be justified and applied in a very general context, one which includes the selection of the right multivariate model structure and the selection of the right covariance structure (AL-Marshadi, 2007;2009a). The idea of using the bootstrap to improve the performance of a model selection rule was introduced by Efron (1983;1986) and is extensively discussed by Efron and Tibshirani (1993). Recent studies applied the bootstrap technique with different approaches to select the best model in different context (AL-Marshadi, 2007;2009a;Uraibi et al., 2009).
In the context of multivariate regression models, (5), the algorithm for using parametric bootstrap in our bootstrap simulation procedure for the selection of the right multivariate model structure in the first stage can be outlined as follows: Let the observation vector O i is defined as follows: • Generate the bootstrap sample on case-by-case using the observed data (original sample) i.e., based on resampling from (O 1 ,O 2 ,…,O n ). The bootstrap sample size is taken to be the same as the size of the observed sample (i.e., n). The properties of the bootstrap when the bootstrap sample size is equal to the original sample size are discussed by Efron and Tibshirani (1993) • Fit all the class of candidate multivariate regression model structures, which we would like to select the right multivariate model structure from, to the bootstrap data with the unstructured covariance structure, thereby obtaining the bootstrap value of the new information criteria VARIC * for each multivariate model structure • Repeat the first and the second steps (W) times • Statisticians often use the information criteria in MIXED procedure into guide the selection of the true model structure such as selecting the model structure with the smallest value of the information criteria (Keselman et al., 1999;Littell et al., 2000;Singer, 1998). We will follow the same rule in our information criteria, but we have the advantage that our information criterion has (W) replication values result of the bootstrapping of the observed data (from the first three steps). To make use of this advantage, we propose using MCB procedure (Hsu, 1984) to pick the winners (i.e., selecting the best set of models or single model if possible), when we consider the bootstrap replicates of the information criteria, that is produced by each of the model structure, as groups. The value of W = 10 was used for this study as suggested in our pervious simulation study (AL-Marshadi, 2009a). The general linear mixed effects model approach was used to pick the winners using MCB procedure in MIXED procedure in order to accommodating the violation of the equal variances assumption that was exist in the analysis of this study as suggested in (AL-Marshadi, 2008).
The simulation setup of the experiment is described below: There are seven correlated response variables (y 1 , y 2 , y 3 , y 4 , y 5 , y 6 and y 7 ) which are related to two predictor variables (x 1 and x 2 ) with five different multivariate model structures and seven different covariance structures for the seven correlated response variables. The multivariate model structures of the simulated experiment are described as follow: • The first multivariate model structure is a standard multivariate model structure which fits seven intercepts (one for level of responses), seven slopes for x 1 and seven slops for x 2 (plus the elements of the covariance matrix of the multiple responses) as follow: MIXED procedure is a very useful tool for fitting multivariate regression in which users find five model selection criteria available, which give users tools can be used to select an appropriate covariance structure for a multivariate regression model (Littell et al., 1996). The five model selection criteria are: The second stage of the procedure concerns with comparing the five information criteria available in MIXED procedure in terms of their ability to identify the right covariance structure with the help of the bootstrap simulation procedure considering the right multivariate model structure that was determined in the first stage. The multivariate model structures that were considered in the first stage involve both standard and non-standard multivariate model structures i.e., from multivariate model structures with both "single design" such as the first three model structures and "multiple design" such as the last two model structures that were considered in this study. The algorithm for using the bootstrap simulation procedure for the selection of the right covariance structure in the second stage was applied in similar way as the one explained in the first stage. Seven covariance structures were considered in the second stage. The seven covariance structures were Independent Errors (VC), Compound Symmetry (CS), Heterogeneous Compound Symmetry (CSH), First-Order Autoregressive (AR(1)), Heterogeneous First-Order Autoregressive (ARH(1)), Banded Main Diagonal (UN(1)) and Unstructured (UN).
The multivariate regression analyses for the first multivariate model structure design can be implemented by the following example SAS code (Singer, 1998): PROC MIXED DATA = one; CLASS time; MODEL y = time time*x 1 time*x 2 /noint notest ddfm = kr; REPEATED time / type = UN subject = subject; The multivariate regression analyses for the second multivariate model structure design can be implemented by the following example SAS code (Singer, 1998): PROC MIXED DATA = one; CLASS time; MODEL y = time time* 1 x / noint notest ddfm = kr; REPEATED time / type = UN subject = subject; The multivariate regression analyses for the third multivariate model structure design can be implemented by the following example SAS code (Singer, 1998): PROC MIXED DATA = one; CLASS time; MODEL y = time time* 2 x / noint notest ddfm = kr; REPEATED time / type = UN subject = subject; Note: The class variable "time" in the first, second and third structure is used to identify the multiple responses.

The simulation study:
A simulation study of multivariate regression data was conducted to evaluate the new three stages procedure in terms of the percentage of number of times that it identified the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure.
Correlated multivariate normal data were generated according to MIXED format model. There were 35 scenarios to generate data involving five multivariate regression model structures and seven covariance structures with one setting of covariance matrix parameter values for each covariance structure and sample sizes 40 (n = 40 subjects). The 7 settings of covariance matrix parameter values are given in Table 1. For each scenario, we simulated 150 datasets. SAS code was written to generate the datasets according to the described setup using the SAS®9.1.3 package (SAS Institute Inc., 2008). We will consider the case when we have 12 subjects as an example to explain the process of generating the datasets. A 12 7×1 vector of standard normal random deviates were generated using SAS's NORMAL function. Denoted the vector: where, i 1, 2,3,...,12 = . Note that the 12 represents the 12 subjects and the 7 represents the 7 levels of time effect within each subject. Then the 12 7×1 vectors of residuals for model (5)  Therefore, the vector e i is defined as the rows of the residuals matrix, e, such that e~N(0, ) ∑ . The fixed portion of the model, Xβ, is added to the residuals matrix, e, according to the model structure to give the vector of response, Y. The first explanatory variable was considered as indicator variable with two levels and the second explanatory variable was considered as random variable generated from normal distribution with mean equal 30 and variance equal to 5. Each one of the 150 generated data sets was fitted to all the possible combination of the selected model structures and covariance structures for the two set of model structures and covariance structures mentioned before.
Then each one of the information criteria was calculated according to the process of the new three stages procedure in order to identify the best multivariate regression model that has the right covariance structure and in the same time the right multivariate model structure.
The 7 settings of the covariance matrix are given in Table 1 which can be categorized to seven covariance structures. The first one, (Setting No. 1) represents Compound Symmetry (CS) covariance structure.

RESULTS
The simulation results indicated that the procedure in the first stage selects the right multivariate model structure as member of the best subset hundred percent of the times from the class of candidate multivariate model structures for each of the covariance structures with the new Variance Information Criterion (VARIC). Table 2 summarizes results of the percentage of number of times that the procedure in the first stage selects the right multivariate model structure alone from the class of candidate multivariate model structures for each of the covariance structures with the new Variance Information Criterion (VARIC). Table 2 indicate that the new Variance Information Criterion (VARIC) showed very good performance in identifying the right multivariate model structure in the first stage except for specific models with Unstructured (UN) covariance structure, banded main diagonal (UN(1)) covariance structure and independent errors (VC) covariance structure that have slightly low percentage comparing to others which could be related to conversion difficulty.
The simulation results indicated that the procedure in the second stage selects the right covariance structure as member of the best subset hundred percent of the times from the class of candidate covariance structures for each multivariate model structure determined in the first stage as the right model structure with the five criteria. Table 3 present the percentage of number of times that the procedure in the second stage selects the right covariance structure alone from the class of candidate covariance structures for each multivariate model structure determined in the first stage as the right model structure with the five criteria. Table 3 indicate that although all the criteria showed very good performance in identifying the right covariance structure in the second stage, CAIC and BIC criteria have the best performance overall. Covariance

DISCUSSION
In our simulation, we considered multivariate regression models in which the responses were correlated in particular ways, looking at the performance of a new three stages procedure that could be used to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure from both standard and non-standard multivariate model structures. The main result of our article is that the performance of the new three stages procedure in identifying the best multivariate regression model that has the right covariance structure and in the same time the right multivariate model structure was excellent overall in terms of identified the right multivariate model structure in the first stage then identifying the right covariance structure in the second stage for the multivariate model structure that was determined in the first stage. Finally, the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure is fitted in the third stage using MIXED procedure. Hence, the new procedure is recommended to be used to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure from both standard and non-standard multivariate model structures, taking into consideration that if the MCB procedure suggested the best subset of model structures (covariance structures) contains more than one model structure (covariance structure), then we recommend selecting the right model structure (covariance structure) as the one with a simplest structure since the careful examination of simulation results showed that in such case the others model structures (covariance structures) in the best subset are just overfitted structures. In case of the first stage of the procedure, the overfitted model structures contain the predictors of the right model, plus any additional predictors. I.e., if the best subset of model structures consists of more than model structure then they could be for example the first multivariate model structure considered as the overfitted model structure and the second multivariate model structure considered as the right model structure. In case of the second stage of the procedure, the overfitted covariance structure contains the same parameters of the right covariance structure, plus any additional parameters. I.e., if the best subset of covariance structures consists of more than one covariance structure then they could be for example Heterogeneous Compound Symmetry covariance structure as the overfitted covariance structure and compound symmetry covariance structure as the right covariance structure. Also, the careful examinations of the whole simulation results reveal that proceeding with the second stage using other than the right selected model structure that was determined in the first stage could be resulted with misleading selection for the covariance structure or sometimes with conversion problem in MIXED procedure in the second stage therefore it is important to follow the right sequences suggested in the three stage procedure to insure selecting the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure.

CONCLUSION
The evaluation of the new three stages procedure indicate that its performance in identifying the right multivariate regression model that has the right covariance structure and in the same time the right multivariate model structure was excellent for both standard and non-standard multivariate model structures. Therefore, the new procedure is recommended to be used as stander tools to guide the selection of the best multivariate regression model that has the right covariance structure and in the same time has the right multivariate model structure.