Linear Regression Model Selection Based on Robust Bootstrapping Technique

Problem statement: The bootstrap approach has introduced new advances in modeling and model evaluation. It is a computer-intensive method that can replace theoretical formulation with extensive use of the computer. The Ordinary Least Squares (OLS) method is often used to estimate the parameters of the regression models in the bootstrap procedure. Unfortunately, many statistics practitioners are not aware of the fact that the OLS method can be adversely affected by the existence of outliers. As an alternative, a robust method is put forward to overcome this problem. The existence of outliers in the original sample may create problems for the classical bootstrapping estimates. There is a possibility that the bootstrap samples may contain more outliers than the original dataset, since the bootstrap re-sampling is with replacement. Consequently, the outliers will have an unduly large effect on the classical bootstrap mean and standard deviation. Approach: In this study, we propose a robust bootstrapping method which is less sensitive to outliers. In the robust bootstrapping procedure, we propose to replace the classical bootstrap mean and standard deviation with robust location and robust scale estimates. A number of numerical examples were carried out to assess the performance of the proposed method. Results: The results suggest that the robust bootstrap method is more efficient than the classical bootstrap. Conclusion/Recommendations: In the presence of outliers in the dataset, we recommend using the robust bootstrap procedure as its estimates are more reliable.


INTRODUCTION
Model selection is an important subject in scientific research, especially in regression prediction. Riadh et al. [1] proposed utilizing bootstrap techniques for model selection. The bootstrap method, introduced by [2], is very attractive because it can be applied without relying on any assumptions about the underlying population. It is a computer-intensive method that can replace theoretical formulation with extensive use of the computer. There is a considerable number of papers related to the bootstrap method [3][4][5][6][7].
Despite the good properties of the bootstrap method, it suffers from numerical instability when outliers are present in the data. The bootstrap distribution might be a very poor estimator of the distribution of the regression estimates because the proportion of outliers in the bootstrap samples can be higher than that in the original data set [4]. Most bootstrap techniques use the Ordinary Least Squares (OLS) procedure to estimate the parameters of the model. It is well known that the OLS is extremely sensitive to outliers and will produce inaccurate estimates [8]. In this study, we propose using a robust method in which the final solutions are not easily affected by outliers.

MATERIALS AND METHODS
Classical bootstrap based on the fixed-x resampling: Consider the general multiple linear regression model with additive error terms:

y = Xβ + ε

where y is the n×1 vector of responses, X is the n×p design matrix, β is an unknown p×1 vector of regression coefficients and ε is the n×1 vector of error terms, assumed to be independent, identically and normally distributed with mean 0 and constant variance σ². In the regression setting, there are two different ways of conducting bootstrapping, namely random-x re-sampling and fixed-x re-sampling; the latter is also referred to as bootstrapping the residuals. Riadh et al. [1] used random-x re-sampling together with the OLS method in their bootstrap algorithm. In this study, the fixed-x re-sampling technique with the OLS method is adopted. We call this estimator the Classical Bootstrap fixed-x Resampling Method (CBRM).
The CBRM procedure, as enumerated by Efron and Tibshirani [3], is summarized as follows:

Step 1: Fit the OLS to the original sample of observations to get β̂ and the fitted values ŷ_i = f(x_i, β̂).

Step 2: Obtain the residuals ε̂_i = y_i − ŷ_i, giving probability 1/n to each ε̂_i value.
Step 3: Draw a bootstrap random sample of size n with replacement, that is, ε̂ᵇ_i is drawn from the ε̂_i and attached to ŷ_i to get the fixed-x bootstrap responses yᵇ_i = ŷ_i + ε̂ᵇ_i.

Step 4: Fit the OLS to the bootstrapped values yᵇ_i on the fixed X to obtain β̂ᵇ.

Step 5: Repeat Steps 3 and 4 B times to get β̂ᵇ¹, …, β̂ᵇᴮ.

According to Imon and Ali [5], there is no general agreement among statisticians on the number of replications needed in the bootstrap. B can be as small as 25, but for estimating standard errors B is usually in the range of 25-250. They point out that for bootstrap confidence intervals a much larger value of B is required, normally taken to be in the range of 500-10,000. Riadh et al. [1] pointed out that the bootstrap standard deviation can be estimated as in Eq. 2, where MSR is the mean squared residual. The drawback of using the classical standard deviation and the classical mean to estimate the bootstrap scale and location in Eq. 2 and 4 is that they are very sensitive to outliers. As an alternative, robust location and scale estimates which are less affected by outliers are proposed; the robust bootstrap location and scale estimates are given in Eq. 5 and 6.

Robust Bootstrap Based on the Fixed-x Resampling (RBRM): Unfortunately, many researchers are not aware that the performance of the OLS can be very poor when the data set, for which one often makes a normality assumption, has a heavy-tailed distribution, which may arise as a result of outliers. Even a single outlier can have an arbitrarily large effect on the OLS estimates [8]. It is now evident that the bootstrap estimates can be adversely affected by outliers, because the proportion of outliers in the bootstrap samples can be higher than that in the original data [4]. These situations are not desirable because they might produce misleading results. An attempt has been made to make the bootstrap estimates more efficient. We propose to modify the CBRM procedure by using a logical procedure with the robust Least Trimmed Squares (LTS) estimator, so that outliers have less influence on the parameter estimates.
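The classical fixed-x (residual) bootstrap described above can be sketched as follows. This is an illustrative implementation, not the authors' code; the function name `cbrm` and the default of B = 250 replications are assumptions chosen for the sketch:

```python
import numpy as np

def cbrm(X, y, B=250, rng=None):
    """Sketch of the classical fixed-x residual bootstrap with OLS (CBRM).

    Returns the B bootstrap coefficient vectors (intercept first)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # Step 1: OLS fit to the original sample
    resid = y - Xd @ beta                           # Step 2: residuals, each with probability 1/n
    betas = np.empty((B, Xd.shape[1]))
    for b in range(B):
        # Step 3: resample residuals with replacement and attach to the fitted values
        y_b = Xd @ beta + rng.choice(resid, size=n, replace=True)
        # Step 4: refit OLS on the fixed X
        betas[b], *_ = np.linalg.lstsq(Xd, y_b, rcond=None)
    return betas                                    # Step 5: B replications collected
```

The classical bootstrap location and scale are then the ordinary mean and standard deviation of the rows of `betas`, which is precisely where outliers exert their undue influence.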
We call this estimator the Robust Bootstrap fixed-x Re-Sampling Method (RBRM). We summarize the RBRM as follows:

Step 1: Fit the LTS to the original sample of observations to get β̂ and the fitted values ŷ_i = f(x_i, β̂).

Step 2: Obtain the residuals ε̂_i = y_i − ŷ_i, giving probability 1/n to each ε̂_i value.

Step 3: Draw n bootstrap residuals with replacement and attach them to the fitted values. Standardize the residuals and flag an observation as an outlier if the absolute value of its standardized residual is larger than three. At this step, we built a dynamic subroutine program for the detection of outliers; this program has the ability to identify the percentage of outliers in each bootstrap sample.
Step 4: Fit the LTS to the bootstrapped values yᵇ_i on the fixed X to obtain β̂ᵇ. The percentage of outliers that should be trimmed depends on the proportion of outliers detected in Step 3.
Step 5: Repeat Steps 3 and 4 B times to get β̂ᵇ¹, …, β̂ᵇᴮ.

The bootstrap scale and location estimates in Eq. 2 and 4 are based on the Mean Squared Residual, which is sensitive to outliers. We propose to replace the Mean Squared Residual (MSR) with a more robust measure, the Median Squared Residual (RMSR). The proposed robust bootstrap location and robust bootstrap scale estimates are given in Eq. 7 and 8, where RMSR is the Median Squared Residual computed for each observation i = 1, 2, …, n and for each b = 1, 2, …, B. We also compare the performance of Eq. 7 and 8 with the classical formulation of the bootstrap standard deviation and location, but based on the Median Squared Residual instead of the Mean Squared Residual.

The RBRM procedure commences with estimating the robust regression parameters using the LTS method, which trims some of the extreme values; that is, some values in the data which are labeled as outliers are deleted. In this situation β̂_LTS will be either larger or smaller than β̂. In Step 2, outliers might be present and can be candidates for selection in Step 3. Since we consider sampling with replacement, each outlier might be chosen more than once. Consequently, there is a possibility that a bootstrap sample may contain more outliers than the original sample. We try to overcome this problem by determining the alpha value based on the percentage of outliers in the bootstrap resamples detected in Step 3. In this respect, we develop a dynamic detection subroutine program that can detect the proportion of outliers in each bootstrap resample.
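The motivation for replacing the MSR with the RMSR is easy to see numerically. The following sketch uses synthetic residuals invented purely for illustration: a few gross outliers inflate the mean of the squared residuals dramatically, while the median barely moves:

```python
import numpy as np

# 95 well-behaved residuals plus 5 gross outliers (5% contamination)
rng = np.random.default_rng(1)
r = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])

sq = r ** 2
msr = sq.mean()       # mean squared residual: pulled far upward by the 5 outliers
rmsr = np.median(sq)  # median squared residual: essentially unaffected
```

With 5% contamination at magnitude 10, the MSR is inflated several-fold while the RMSR stays near the median of a squared standard normal, which is the behavior the robust formulation relies on.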
Step 4 of the RBRM includes the computation of the bootstrap estimates using the LTS based on the first three steps. The LTS is expected to be more reliable than the OLS when outliers are present in the data, because it is a robust method which is not sensitive to outliers. As mentioned earlier, the number of outliers that should be trimmed in the LTS procedure depends on the alpha value that corresponds to the percentage of outliers detected. In this way, the effect of outliers is reduced. According to Riadh et al. [1], the best model to be selected among several models is the one with the smallest location and scale estimates or the minimum scale estimate.
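The RBRM steps can be sketched as below. This is not the authors' implementation: a crude trimmed least squares refit stands in for the LTS estimator, MAD-standardized residuals stand in for the standardization in Step 3, and the names `trimmed_ls`, `rbrm` and the `base_trim` parameter are assumptions of the sketch. What it does preserve is the key idea of setting the trimming fraction per resample from the detected proportion of outliers:

```python
import numpy as np

def trimmed_ls(Xd, y, trim_frac):
    """Crude stand-in for the LTS fit (a sketch, not the paper's exact estimator):
    fit OLS, drop the given fraction of largest-squared-residual cases, refit."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    if trim_frac > 0:
        r2 = (y - Xd @ beta) ** 2
        keep = np.argsort(r2)[: int(np.ceil(len(y) * (1 - trim_frac)))]
        beta, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
    return beta

def rbrm(X, y, B=250, base_trim=0.1, rng=None):
    """Sketch of the RBRM: robust fit, residual resampling on the fixed X, and a
    per-resample trimming fraction set by the detected proportion of outliers."""
    rng = np.random.default_rng(rng)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta = trimmed_ls(Xd, y, base_trim)        # Step 1: robust fit (LTS in the paper)
    resid = y - Xd @ beta                      # Step 2: residuals, each with probability 1/n
    med = np.median(resid)
    mad = np.median(np.abs(resid - med)) * 1.4826 + 1e-12
    z = (resid - med) / mad                    # standardized residuals
    betas = np.empty((B, Xd.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # Step 3: resample residual indices
        frac = np.mean(np.abs(z[idx]) > 3.0)   # dynamic outlier detection in this resample
        y_b = Xd @ beta + resid[idx]
        betas[b] = trimmed_ls(Xd, y_b, frac)   # Step 4: refit with matching trimming
    return betas                               # Step 5: B bootstrap coefficient vectors
```

In line with the paper's proposal, the robust bootstrap location and scale would then be taken as robust summaries (e.g., median-based) of the rows of `betas`, rather than the classical mean and standard deviation.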

RESULTS
Several well-known data sets in robust regression are presented to compare the performance of the CBRM and the RBRM procedures. Comparisons between the estimators are based on their bootstrap location and scale estimates. We have examined many examples; due to space constraints, we include only three real examples and one simulated data set. The conclusions of the other results are consistent and are not presented. All computations were done using S-Plus® 6.2 Professional Edition for Windows.

Hawkins, Bradu and Kass Data:
Hawkins et al. [8] constructed an artificial three-predictor data set containing 75 observations, with 10 outliers in both spaces (cases 1-10), 4 outliers in the X-space (cases 11-14) and 61 low-leverage inliers (cases 15-75). Most single-case deletion identification methods fail to identify the outliers in the Y-space, though some of them point out cases 11-14 as outliers in the Y-space.
We consider four models.

Stackloss data [8]: The Stackloss data is a well-known data set presented by Brownlee [9]. The data describe the operation of a plant for the oxidation of ammonia to nitric acid and consist of 21 four-dimensional observations. The stack loss (y) is related to the rate of operation (x1), the cooling water inlet temperature (x2) and the acid concentration (x3). Most robust statistics researchers have concluded that observations 1, 2, 3 and 21 are outliers.
We consider four models.

Coleman data [8]: This data set, studied by Coleman et al. [10], contains information on 20 schools from the Mid-Atlantic and New England states. Mosteller and Tukey [11] analyzed these data with measurements on five independent variables. Previous studies refer to observations 3, 17 and 18 as outliers [8].

We consider fifteen models.

Simulation study: A simulation study similar to that of Riadh et al. [1] is presented to assess the performance of the RBRM procedure. Consider the problem of fitting a linear model. We simulate a data set in which ε_i is a random variable with the N(0, 0.04) distribution. We then contaminate the residuals: at each step, one 'good' residual is deleted and replaced with a contaminated residual generated from N(10, 9). We consider 5, 10, 15 and 20% contaminated residuals and three models. Tables 7 and 8 show the results of the CBRM and RBRM procedures. Graphical displays are used to explain why a particular model is selected. We present only the results for Models 1-3 of the simulated data at 5% outliers due to space limitations. The residual plot before the bootstrap procedure is shown in Fig. 1.

It is important to note that the median-based scale estimate is smaller than the mean-based scale estimate. This indicates that the formulation of the scale estimate based on the median is more efficient than that based on the mean. In this respect, the CBRM suggests that Model 4 is the best model. However, the results of Table 2 for the RBRM procedure signify that Model 1 is the best model. It can be seen that the median-based scale estimate for Model 1 is the smallest among the four models. It is interesting to note that the overall results indicate that the median-based scale estimate of the RBRM procedure is the smallest. Thus, the RBRM based on the median has increased the efficiency of the estimates.
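The contamination scheme of the simulation study can be sketched as follows. The regression line, sample size and seed are illustrative assumptions (the exact simulated model is not reproduced in the text), and N(10, 9) is read as mean 10 and variance 9:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                                     # assumed sample size, for illustration only
x = np.linspace(0.0, 1.0, n)
eps = rng.normal(0.0, np.sqrt(0.04), n)    # clean errors: N(0, 0.04), i.e., sd = 0.2
y = 2.0 + 3.0 * x + eps                    # hypothetical true line y = 2 + 3x

# replace 5% of the 'good' residuals with contaminated ones drawn from N(10, 9)
m = int(0.05 * n)                          # 5% level; the study also uses 10, 15 and 20%
bad = rng.choice(n, size=m, replace=False)
y[bad] = 2.0 + 3.0 * x[bad] + rng.normal(10.0, 3.0, m)   # sd = sqrt(9) = 3
```

Each contaminated response sits roughly ten units above the line, which is what makes the contaminated residuals stand out as gross outliers in the residual plots.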
It can be observed from Tables 3 and 4 of the Stackloss data that the median-based scale estimates are more efficient than the mean-based ones for both the CBRM and RBRM procedures. Similarly, the median-based RBRM has the smallest scale estimates. The CBRM indicates that Model 2 is the best model, while the RBRM suggests that Model 1 is the best model. Nonetheless, model selection based on the median-based RBRM is more efficient and more reliable, as indicated by its location and scale estimates, which are the smallest among the models considered.
Table 5 of the Coleman data reveals that Model 14 of the CBRM is the best model, as evidenced by the smallest values of the location and scale estimates. In fact, the median-based location and scale estimates of Model 14 are smaller than the mean-based ones. The results of the RBRM in Table 6 signify that Model 1's location and scale estimates are the smallest among the 15 models considered. For this model, the median-based RBRM is more efficient than the mean-based RBRM, as indicated by its location and scale estimates, which are smaller than those of the mean-based RBRM.
The results of the simulated data in Table 7 show that Model 2 is the best model for all outlier percentage levels because the scale estimate of Model 2 is the smallest compared to the other models. Nonetheless, the RBRM results of Table 8 suggest that Model 1 is the best model. Similar to the Hawkins, Stackloss and Coleman data, the median-based RBRM is more efficient than the mean-based RBRM. In fact, the scale estimates of the median-based RBRM are remarkably smaller than those of the mean-based RBRM for all outlier percentage levels. The results of the simulation study indicate that the median-based RBRM is the more efficient and reliable procedure.
Here, we explain further why Model 2 is selected, considering only the 5% outlier case due to space constraints. From Fig. 1 it is obvious that there are 5% outliers in the residuals before the bootstrap is employed. It can be seen from Fig. 2-4 that the number of outliers of the MSR equals 18 in Fig. 2, while there are only 14 in Fig. 3. The median of the MSR in Fig. 4 is very large compared to Fig. 2 and 3. Among the three models in Fig. 2-4, the CBRM chooses Model 2 as the best model because the proportion of outliers of the MSR is smaller than in the other two models.
On the other hand, the RBRM selects Model 1 as the best model. Comparing Fig. 5-7 with Fig. 2-4, it can be seen that there are no outliers in the distribution of the Median Squared Residuals when the RBRM method is employed, while apparent outliers are seen in the distribution of the Mean Squared Residuals when the CBRM is employed. In this situation, the RBRM has an attractive feature. Among the three models considered, the RMSR bootstrap resample of Model 1 is the most efficient as it is the most compact in the central region compared to the other two models. Model 1 is therefore recommended, as its RMSR is more consistent and more efficient.

CONCLUSION
In this study, we propose a new robust bootstrap method for model selection. The proposed bootstrap method attempts to overcome the problem of having more outliers in the bootstrap samples than in the original data set. The RBRM procedure employs a dynamic subroutine program that is capable of detecting the percentage of outliers in the data. The results indicate that the RBRM consistently outperforms the CBRM procedure. It emerges that the best model selected always corresponds to the median-based RBRM, which has the smallest bootstrap scale estimate. Hence, utilizing the median-based RBRM in model selection can substantially improve the accuracy and efficiency of the estimates. Thus, the median-based RBRM is more reliable for linear regression model selection.