The Performance of Latent Root-M based Regression

Problem statement: In the presence of multicollinearity, the estimation of parameters in multiple linear regression models by Ordinary Least Squares (OLS) is known to suffer severe distortion. An alternative approach is a modified OLS based on the latent roots and latent vectors of the correlation matrix of the independent and dependent variables. This procedure, called Latent Root Regression (LRR), serves to improve the stability of the estimates for data plagued by multicollinearity. However, there is evidence that LRR estimators are easily affected by a few atypical observations, which we call outliers. It is also evident that robust methods alone cannot rectify the combined problems of multicollinearity and outliers. Approach: In this study, we proposed a robust procedure for the estimation of the regression parameters in the presence of multicollinearity and outliers. We call this method Latent Root-M based Regression (LRMB) because it employs the weights of the M-estimator in a weighted correlation matrix. Numerical examples and simulation studies are presented to illustrate the performance of the newly proposed method. Results: The results of the study show that the LRMB method is more efficient than the existing methods. Conclusion/Recommendations: To obtain reliable estimates, we recommend using the LRMB when both multicollinearity and outliers are present in the data.


INTRODUCTION
Consider a multiple linear regression model:

Y = Xβ + ε (1)

Where:
Y = The n×1 vector of standardized dependent variables
X = The n×p full-rank matrix of standardized known constants
β = The p×1 vector of model parameters
ε = The n×1 vector of random disturbances with ε ~ NID(0, σ²)
p = The number of independent variables
n = The number of observations
Using the least squares criterion, the estimator of β is found by minimizing the residual sum of squares (Y − Xβ)'(Y − Xβ), which yields:

β̂ = (X'X)⁻¹X'Y (2)

According to the Gauss-Markov theorem, the OLS estimator has minimum variance in the class of linear unbiased estimators; that is, it is the Best Linear Unbiased Estimator (BLUE). Nonetheless, the presence of multicollinearity produces inflated standard errors that lead to misleading parameter inferences. To remedy this problem, Hawkins [1], Gunst and Mason [2], Gunst et al. [3] and Lawrence and Arthur [4] introduced a biased estimation procedure known as Latent Root Regression (LRR) to improve the precision of the regression estimates. The major advantage of LRR is that it not only identifies the multicollinearities present in the independent variables, but also allows the researcher to distinguish between predictive and non-predictive multicollinearity and hence to adjust the OLS estimates appropriately for the non-predictive multicollinearities. However, this technique is inefficient if the underlying disturbances are not normal, which may arise as a result of outliers. As an alternative, we may turn to robust methods, which are not sensitive to the presence of outliers [5-8]. Nevertheless, robust methods alone cannot overcome the combined problem of multicollinearity and outliers. In this study, we propose a robust latent root regression that rectifies these two problems simultaneously by basing the latent root regression on a robust weighted correlation matrix.
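As a brief illustration (ours, not part of the original study), the normal-equations estimator in (2) and the variance inflation caused by near-collinear predictors can be sketched in Python; the simulated data and the 0.01 noise scale are arbitrary choices for the demonstration:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares via the normal equations: (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n = 100
z = rng.standard_normal(n)
x1 = z + 0.01 * rng.standard_normal(n)   # x1 and x2 are almost identical,
x2 = z + 0.01 * rng.standard_normal(n)   # i.e. severely collinear
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.standard_normal(n)

beta_hat = ols(X, y)
# The coefficient variances are sigma^2 times the diagonal of (X'X)^{-1};
# near-collinearity makes these diagonal entries very large.
xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
```

With independent predictors the diagonal entries of (X'X)⁻¹ would be of order 1/n; here they are orders of magnitude larger, which is exactly the standard-error inflation the text describes.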

MATERIALS AND METHODS
The Latent Root Regression (LRR): Latent root regression utilizes the latent roots and latent vectors of the correlation matrix of the dependent and independent variables. Let A denote the matrix whose columns are the standardized dependent and independent variables, so that A'A is their correlation matrix. The latent roots λ_j and latent vectors γ_j of A'A are defined by:

A'A γ_j = λ_j γ_j, j = 0, 1, …, p

Analysis of these latent roots and latent vectors enables one to:
• Identify near singularities in X
• Determine whether the near singularities have predictive value
• Obtain modified least squares estimates of the parameters that adjust for non-predictive near singularities

The OLS estimator in (2) can also be expressed in terms of these latent roots and latent vectors:

β̂ = Σ_j α_j δ_j, with α_j = -η γ_0j λ_j⁻¹ / Σ_l γ_0l² λ_l⁻¹

where γ_j' = (γ_0j, γ_1j, …, γ_pj), δ_j' = (γ_1j, …, γ_pj) and η² is the corrected sum of squares of the dependent variable, and the residual sum of squares is given by:

SSE = η² (Σ_l γ_0l² λ_l⁻¹)⁻¹

Gunst et al. [3] and Lawrence and Arthur [4] suggested that small latent roots and latent vectors with λ_j ≤ 0.3 and |γ_0j| ≤ 0.1 indicate the presence of non-predictive near singularities. They later found that a tighter cut-off of λ_j ≤ 0.2 and |γ_0j| ≤ 0.1 could improve the analysis.
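The eigen-analysis just described can be sketched as follows; the function names and the use of NumPy's `eigh` are our own illustrative choices, with the λ_j ≤ 0.2 and |γ_0j| ≤ 0.1 cut-offs taken from the text:

```python
import numpy as np

def latent_roots_vectors(y, X):
    """Latent roots and vectors of the correlation matrix A'A, where A
    holds the standardized dependent variable (first column) followed by
    the standardized independent variables."""
    A = np.column_stack([y, X])
    # Center and scale each column to unit length so A'A is the
    # correlation matrix of [y, X].
    A = (A - A.mean(axis=0)) / (A.std(axis=0, ddof=1) * np.sqrt(len(A) - 1))
    lam, gamma = np.linalg.eigh(A.T @ A)   # latent roots (ascending) / vectors
    return lam, gamma

def nonpredictive(lam, gamma, root_cut=0.2, gamma0_cut=0.1):
    """Flag components with a small latent root AND a small first element
    of the latent vector (the vector-deletion criterion in the text)."""
    return (lam <= root_cut) & (np.abs(gamma[0, :]) <= gamma0_cut)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
lam, gamma = latent_roots_vectors(y, X)
flags = nonpredictive(lam, gamma)
```

Because A'A is a correlation matrix, its latent roots are non-negative and sum to the number of variables, which makes a quick sanity check possible.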
Suppose now that some of the latent vectors γ_j correspond to non-predictive near singularities. The non-predictive multicollinearities are eliminated and only the predictive ones retained. The OLS estimator above is adjusted by setting α_j = 0 for each non-predictive component. The modified least squares coefficients are then:

β̂_LRR = Σ_{j∈P} α_j* δ_j, with α_j* = -η γ_0j λ_j⁻¹ / Σ_{l∈P} γ_0l² λ_l⁻¹

where P indexes the predictive components, with residual sum of squares:

SSE_LRR = η² (Σ_{l∈P} γ_0l² λ_l⁻¹)⁻¹

If all of the principal components of the correlation matrix of the dependent and independent variables are predictive, then none of the α_j equal zero and the latent root estimator and the OLS estimator are identical.

It is well known that the variance-covariance matrix of the OLS estimator is σ²(X'X)⁻¹. When X'X is nearly singular, some of its eigenvalues approach zero and the diagonal elements of (X'X)⁻¹ approach infinity; that is, β̂ is subject to very large variance. This inflation makes the estimation less accurate and less precise, and thus unstable. Huber [5] developed a group of estimators called M-estimators, which are based on the idea of replacing the squared residuals e_i² with another function of the residuals, given by:

minimize Σ_i ρ(e_i)

where ρ is a symmetric function with a unique minimum at zero. The robust M-estimates (ROBM) are calculated using Iteratively Reweighted Least Squares (IRLS). In IRLS, an initial fit is calculated and a new set of weights is then computed from the residuals of that fit. The iterations continue until a convergence criterion is met.
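The IRLS scheme just described can be sketched as below. This is a minimal illustration, not the authors' implementation: the MAD-based scale estimate and the tuning constant c = 4.685 are conventional choices for Tukey's biweight that the text does not specify:

```python
import numpy as np

def tukey_biweight_weights(r, c=4.685):
    """Tukey biweight weights w(u) = (1 - u^2)^2 for |u| < 1, else 0,
    where u = r / (c * s) and s is a MAD-based robust scale estimate."""
    s = np.median(np.abs(r - np.median(r))) / 0.6745
    if s == 0:
        return np.ones_like(r)
    u = r / (c * s)
    w = (1 - u**2) ** 2
    w[np.abs(u) >= 1] = 0.0
    return w

def m_estimate_irls(X, y, max_iter=50, tol=1e-8):
    """M-estimation by IRLS: start from OLS, reweight by the biweight
    function of the current residuals, refit, and repeat to convergence."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial (OLS) fit
    w = np.ones(len(y))
    for _ in range(max_iter):
        r = y - X @ beta
        w = tukey_biweight_weights(r)
        W = np.diag(w)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, w
```

The final weights `w` are exactly what the LRMB method reuses: gross outliers receive weight zero, so they cannot distort the subsequent correlation matrix.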
Robust latent root regression: Robust latent root regression incorporates resistance into ordinary latent root regression. This is done by imposing weights on the correlation matrix of the dependent and independent variables, A'A. The pairwise Pearson correlation coefficient for two variables x and y is defined as:

r = Σ_i (x_i − x̄)(y_i − ȳ) / [Σ_i (x_i − x̄)² Σ_i (y_i − ȳ)²]^(1/2) (9)

where x̄ and ȳ are the sample means. These sample means are known to be very sensitive to the presence of outliers. As an alternative, robust location estimates that are less affected by outliers are proposed to replace x̄ and ȳ in (9). Following the idea of Mokhtar [9], we propose using a weighted correlation coefficient between the dependent and the independent variables. The weights may come from the final step of any robust estimator, but in this study they are confined to the final step of robust M-estimation. The pairwise correlation coefficient in Eq. 9 is modified to obtain a weighted pairwise correlation coefficient as follows:

r_w = Σ_i w_i(x_i − x̄_w)(y_i − ȳ_w) / [Σ_i w_i(x_i − x̄_w)² Σ_i w_i(y_i − ȳ_w)²]^(1/2) (10)

where x̄_w = Σ_i w_i x_i / Σ_i w_i and ȳ_w = Σ_i w_i y_i / Σ_i w_i are the weighted means. In this study, we have chosen Tukey's biweight function for the M-estimation [7,8]. Using (10), a robust weighted correlation matrix of the dependent and independent variables, originally denoted A'A, can be formulated. Based on this weighted correlation matrix, the latent roots and latent vectors are computed and the latent root routines are then applied to estimate the parameters of the model. We call this method Latent Root-M based Regression (LRMB) because we have employed the weights of the M-estimator in the weighted correlation matrix. We expect the modified method to be more robust than the OLS, ROBM and LRR.
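A minimal sketch of the weighted pairwise correlation in Eq. 10 and the resulting weighted correlation matrix follows; the function names are ours, and with unit weights the formula reduces to the ordinary Pearson correlation of Eq. 9:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted pairwise correlation: Pearson's formula with each term
    multiplied by the weight w_i and the means replaced by weighted means."""
    xw = np.sum(w * x) / np.sum(w)
    yw = np.sum(w * y) / np.sum(w)
    num = np.sum(w * (x - xw) * (y - yw))
    den = np.sqrt(np.sum(w * (x - xw) ** 2) * np.sum(w * (y - yw) ** 2))
    return num / den

def weighted_corr_matrix(A, w):
    """Robust weighted correlation matrix of the columns of A = [y, X],
    built entry by entry from the weighted pairwise correlations."""
    p = A.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            R[i, j] = R[j, i] = weighted_corr(A[:, i], A[:, j], w)
    return R
```

Feeding this matrix to the latent-root routine in place of the ordinary correlation matrix is the whole of the LRMB modification: outlying rows contribute little or nothing to any pairwise correlation.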

RESULTS
Numerical example: To compare the performance of the LRMB with existing methods such as the OLS, LRR and ROBM, two real data sets are considered. The first is the Palm Oil data set, taken from the Malaysian Palm Oil Board [10]. The dependent variable is the annual palm oil export (tonnes), while the independent variables are the oil palm planted area (hectares) and crude palm oil production (tonnes). Incorporating the weights obtained from the final step of the ROBM estimator yields the robust-weighted correlation matrix, whose latent roots and latent vectors are displayed in Tables 1 and 2, respectively. The presence of outliers in the data was detected using the Robust Mahalanobis Distance (RMD) [7,11]. The standard errors, confidence interval lengths and R² values of the four methods are shown in Table 3. The performance of the four estimators is further examined by applying them to a second data set, taken from Gujarati [12], where consumption expenditure is the dependent variable and the independent variables are income and wealth. Table 4 presents the standard errors, confidence interval lengths and R² values for Gujarati's data. The confidence interval lengths in Tables 3 and 4 are given in square brackets.
Simulation study: A simulation study similar to that of Lawrence and Arthur [4] was performed to compare the performance of the four estimators. The model used was:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i

The parameter values β_0, β_1 and β_2 were set equal to one. The explanatory variables x_i1 and x_i2 were generated as:

x_ij = (1 − ρ²)^(1/2) z_ij + ρ z_i3, j = 1, 2

where z_ij are independent standard normal random variables. The values of ρ² were chosen as 0.0, 0.5 and 0.95; ρ² represents the correlation between the two independent variables. Sample sizes n of 25 and 50 (corresponding to small and large samples) were examined. Four error-disturbance distributions were employed:
• Standard normal distribution
• Cauchy distribution with median zero and scale parameter one
• Student's t distribution with three degrees of freedom
• Contaminated normal distribution, where the underlying distribution is standard normal with probability 0.85 and normal with mean zero and standard deviation five with probability 0.15

The non-normal distributions, such as the Cauchy and the Student's t with 3 degrees of freedom, are symmetric bell-shaped distributions with heavy tails and are prone to produce a considerable number of outliers. They were generated to investigate the effect of the combined problems of multicollinearity and outliers on the different estimators.
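The data-generating scheme above can be sketched as follows. This is our reading of the design: the mixing formula x_ij = (1 − ρ²)^(1/2) z_ij + ρ z_i3 is a standard construction that makes corr(x_i1, x_i2) = ρ², and the exact scheme used by Lawrence and Arthur may differ in detail:

```python
import numpy as np

def generate_sample(n, rho, error="normal", rng=None):
    """One simulated data set: collinear predictors via a shared
    standard-normal component, plus one of the four error distributions
    listed in the text. corr(x1, x2) = rho**2 under this construction."""
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.standard_normal((n, 3))
    x1 = np.sqrt(1 - rho**2) * z[:, 0] + rho * z[:, 2]
    x2 = np.sqrt(1 - rho**2) * z[:, 1] + rho * z[:, 2]
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "cauchy":
        eps = rng.standard_cauchy(n)
    elif error == "t3":
        eps = rng.standard_t(3, size=n)
    else:
        # Contaminated normal: N(0,1) w.p. 0.85, N(0, 5^2) w.p. 0.15.
        clean = rng.random(n) < 0.85
        eps = np.where(clean, rng.standard_normal(n),
                       5.0 * rng.standard_normal(n))
    y = 1.0 + x1 + x2 + eps   # beta_0 = beta_1 = beta_2 = 1
    return np.column_stack([x1, x2]), y
```

To target a correlation of 0.95 between the predictors, one would call `generate_sample(n, rho=np.sqrt(0.95))`.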
All four methods were then applied to each set of generated data. Each simulation run consisted of 1,000 replications. Summary statistics such as the bias, Standard Error (SE) and Root Mean Squared Error (RMSE) over the 1,000 runs were computed; the results are presented in Tables 5, 7, 9 and 11. Tables 6, 8, 10 and 12 show the efficiency of the estimators via the MSE ratios of pairs of estimators. Values less than one indicate that the first estimator is more efficient than the second, values equal to one imply that both estimators are equally good, and values greater than one indicate that the second estimator is more efficient than the first.
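The summary statistics and the MSE-ratio convention described above amount to the following (a sketch with our own variable names):

```python
import numpy as np

def summary_stats(estimates, true_value):
    """Bias, SE and RMSE of replicated estimates of one coefficient;
    `estimates` is an array with one entry per simulation replication."""
    bias = np.mean(estimates) - true_value
    se = np.std(estimates, ddof=1)
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, se, rmse

def mse_ratio(est_a, est_b, true_value):
    """MSE(A)/MSE(B): < 1 means estimator A is more efficient, > 1 means
    estimator B is, and 1 means they are equally good."""
    mse = lambda e: np.mean((e - true_value) ** 2)
    return mse(est_a) / mse(est_b)

# Tiny worked example with hand-picked replication values:
a = np.array([1.1, 0.9, 1.0, 1.05, 0.95])   # tightly clustered around 1
b = np.array([1.2, 0.8, 1.0, 1.1, 0.9])     # twice the spread
ratio = mse_ratio(a, b, 1.0)                 # < 1, so "a" is more efficient
```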
The values in Tables 1-12 are for sample size 25; values for n = 50 are shown in bold. Table 6: MSE ratios of six pairwise estimators of β_1 and β_2 with disturbance distribution normal (0,1).

DISCUSSION
Here we discuss the results acquired in the previous section. Table 1 suggests that the oil palm planted area and oil production are highly correlated. The presence of two outliers in this data set was detected based on the RMD. Applying the LRR vector-deletion criterion, in which latent vectors are deleted if λ_j ≤ 0.2 and |γ_0j| ≤ 0.1, leads to the deletion of the second latent vector of the robust-weighted correlation matrix of Table 2. Through this deletion, the LRMB substantially reduces the standard errors of the estimates. It can be observed from Table 3 that the OLS estimates are strongly affected by outliers and multicollinearity: they have the largest standard errors among the four estimators, a smaller R² value and a negative coefficient for β_2. Moreover, the OLS confidence interval lengths are remarkably larger than the other intervals. The performance of the ROBM and the LRR is also not encouraging, since their standard errors and confidence interval lengths are still relatively large. The LRMB can be considered the best method because it has the smallest standard errors and confidence interval lengths and a higher R² value than the other three estimators.

Let us now focus on Gujarati's data. There is evidence that income and wealth in this data set are highly correlated. The data contain no outliers but do exhibit multicollinearity; we therefore expect the performance of the LRMB to be close to that of the LRR. It is interesting to note that the results of Table 4 are consistent with the earlier findings, except that the LRR and LRMB are equally good, as expected: when there are no outliers and only multicollinearity is present, the LRMB becomes close to the LRR.
Next, we discuss whether the simulation results obtained under the standard normal and heavy-tailed distributions confirm the conclusions of the numerical examples. Table 5 shows that for standard normal disturbances with ρ = 0, all four methods are virtually indistinguishable with respect to bias, SE and RMSE. The performance of the OLS and the LRR is slightly better than that of the ROBM and LRMB for small ρ. When the multicollinearity is high (ρ = 0.95), as is to be expected, the LRR gives the best results, followed by the LRMB, OLS and ROBM. This is supported by Table 6, where for high correlation the LRR is more efficient than the LRMB, as indicated by MSE ratios greater than one. Similarly, the MSE ratios signify that the LRR is better than the OLS and ROBM for high values of ρ. Evidently, in this situation the OLS is better than the ROBM. The LRR estimates emerge as conspicuously more efficient in the presence of high multicollinearity with no contamination in the model.

Heavy-tailed distributions of the disturbances:
Here we discuss the results for the Cauchy, the t with 3 degrees of freedom and the contaminated normal distributions. Let us first focus on Tables 7 and 8, for the Cauchy distribution.
The results in Table 7 show that when there is no multicollinearity (ρ = 0.0) and only outliers are present, the performance of the ROBM is, as expected, similar to that of the LRMB.
The OLS is as good as the LRR, and both are less efficient than the LRMB and ROBM. For small correlation (ρ = 0.5), the LRMB is slightly better than the ROBM and both are more efficient than the OLS and LRR. The presence of both outliers and high multicollinearity changes the situation dramatically: the biases and RMSEs of the OLS, LRR and ROBM estimates increase significantly. The LRMB, on the other hand, is not affected by the outliers and multicollinearity, as shown by its biases and RMSEs, which remain consistently the smallest among the four estimators. It is evident that the LRMB is the best estimator, followed by the ROBM, LRR and OLS. The MSE ratios in Table 8 support the results obtained from Table 7: for heavy-tailed data with little or no multicollinearity, the ROBM is fairly close to the LRMB and both perform much better than the LRR and OLS. The results of Table 8 also signify that the LRMB performs extremely well compared with the ROBM, LRR and OLS under high multicollinearity, as evidenced by MSE ratios less than one. The LRMB and ROBM are equally efficient when ρ is zero or low, as indicated by MSE ratios equal to one.

Tables 9 and 10 give the summary statistics for the t distribution with 3 degrees of freedom. As with the Cauchy distribution, the performances of the ROBM and LRMB estimators are equally good for little or no multicollinearity, and both are slightly better than the OLS and LRR in that situation. Nevertheless, when ρ = 0.95 and n = 25, the performance of the LRR is slightly better than that of the LRMB. It is interesting to note that when the sample size is increased to 50, the LRMB is better than the LRR, as indicated by its bias and RMSE, which are smaller than those of the LRR in this situation. Similar results are obtained from the MSE ratios in Table 10.
The results of Tables 11 and 12 for the contaminated normal data are consistent with the findings obtained for the t distribution.

CONCLUSION
The OLS performs poorly in the presence of outliers and multicollinearity. The ROBM is not sufficiently robust compared with the LRR and LRMB when the degree of multicollinearity is high. The LRR estimator is a better choice than the other estimators for eliminating the problem of multicollinearity; however, its performance is inferior to that of the ROBM and LRMB when contamination occurs in the data. The empirical study shows that the LRMB improves the accuracy of the estimates when both multicollinearity and non-normal disturbances are present. The results suggest that the LRMB estimator may provide a robust alternative to the LRR.