Two-Step Robust Diagnostic Method for Identification of Multiple High Leverage Points

Problem statement: High leverage points are extreme outliers in the X-direction. In regression analysis, the detection of these leverage points is important because of their arbitrarily large effects on the estimates as well as the multicollinearity problems they induce. The Mahalanobis Distance (MD) has been used as a diagnostic tool for the identification of outliers in multivariate analysis, where it measures the distance between the normal and abnormal groups of the data. Since the computation of the MD relies on non-robust classical estimators, the classical MD can hardly detect outliers accurately. As an alternative, Robust MD (RMD) methods based on the Minimum Covariance Determinant (MCD) and Minimum Volume Ellipsoid (MVE) estimators have been used to identify high leverage points in a data set. However, these methods tend to swamp some low leverage points even though they identify the high leverage points correctly. Since the detection of leverage points is one of the most important issues in regression analysis, it is imperative to introduce a novel detection method for high leverage points. Approach: In this study, we proposed a relatively new two-step method for the detection of high leverage points: the RMD (MVE) or RMD (MCD) is used in the first step to identify the suspected outlier points, and in the second step the MD is computed from the mean and covariance of the resulting clean data set. We call this method the two-step Robust Diagnostic Mahalanobis Distance (RDMD), which identifies high leverage points correctly while swamping fewer low leverage points. Results: The merit of the newly proposed method was investigated extensively on real data sets and in a Monte Carlo simulation study. The results indicate that, for small sample sizes, the best detection method is RDMD (MVE)-mad, while there is not much difference between RDMD (MVE)-mad and RDMD (MCD)-mad for large sample sizes. 
Conclusion/Recommendations: In order to swamp fewer low leverage points as high leverage points, the proposed robust diagnostic methods, RDMD (MVE)-mad and RDMD (MCD)-mad, are recommended.


INTRODUCTION
Outliers are observations which break the pattern shown by the majority of the data set. They can be classified into the following categories: (1) good leverage points: observations which follow the same regression line as the rest of the data although they fall far from the majority of the explanatory variables; (2) bad leverage points: observations which not only deviate from the regression line followed by the rest of the data but also fall far from the majority of the explanatory variables; (3) vertical outliers, or high-residual outliers: observations which are not leverage points but have large residuals in the response variable [19]. Generally, those points that are far from the rest of the x values are high leverage points. It is now evident that outliers have destructive effects on the fitted regression line. Rousseeuw and Van Zomeren [25] pointed out that high leverage points can affect the estimated slope of the Ordinary Least Squares (OLS) regression line and thus may cause more serious problems than other outliers, which might only affect the estimated intercept term. Moreover, their presence in regression models may make some low leverage points appear as high leverage points and vice versa; these two effects are called masking and swamping in linear regression [23]. Furthermore, the range of the explanatory variables increases when high leverage points exist, so the multiple coefficient of determination (R²), a well-known and popular measure of goodness-of-fit in regression models, will increase with any such change in a single x variable [26]. In addition, high leverage points may be the prime source of collinearity-influential observations, whose presence can create collinearity or destroy an existing collinearity pattern among the x variables [7]. In this respect, the identification of high leverage points, to prevent their effect on linear regression, becomes necessary.
Outlier detection is one of the most important tasks in data analysis. Outliers describe abnormal data behavior, i.e., data which deviate from the natural data variability. Various methods for detecting outliers have been studied [1,2,5,7,8,18,21,25]. One way to identify possible multivariate outliers is to calculate a distance from each point to a center of the data; an outlier is then a point whose distance is larger than some predetermined value. For a p-dimensional multivariate sample x_i (i = 1, ..., n), the Mahalanobis Distance (MD) is defined as:

MD_i = sqrt[(x_i − T(X))^T C(X)^(−1) (x_i − T(X))]

Where:
T(X) = the estimated multivariate location, usually the multivariate arithmetic mean
C(X) = the estimated covariance matrix, usually the sample covariance matrix

The distribution of the MD with both the true location and shape parameters and the conventional location and shape parameters is well known [5]. If there are only a few outliers, a large value of MD_i indicates that the point x_i is an outlier [2]. Any observation whose MD exceeds the cutoff √χ²_(p, 0.975) is considered an outlier, where p is the number of explanatory variables [16]. Data sets with multiple outliers are subject to problems of masking and swamping [20]. Masking occurs when a group of outlying points skews the mean and covariance estimates toward itself, so that the resulting distances of the outlying points from the mean are small. Swamping occurs when a group of outlying points skews the mean and covariance estimates toward itself and away from other inlying points, so that the resulting distances from the inlying points to the mean are large. The Mahalanobis Distance is known to suffer from masking [24]. Mahalanobis Distances give a one-dimensional measure of how far a point is from a location with respect to a shape; utilizing the MD, we can find the points that are unusually far from the location and call those points outlying. 
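The classical MD and its χ² cutoff can be sketched in a few lines. This is a minimal illustration, not the paper's code: the data, the planted outliers and the function name are ours.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_distances(X):
    """Classical MD_i from the sample mean T(X) and sample covariance C(X)."""
    T = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    diff = X - T
    # MD_i = sqrt((x_i - T)' C^{-1} (x_i - T)) for every row at once
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
X[:2] += 5.0                                       # two planted X-direction outliers
md = mahalanobis_distances(X)
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))   # sqrt of the chi^2_{p,0.975} quantile
flagged = np.where(md > cutoff)[0]
```

With a larger cluster of outliers the classical mean and covariance get pulled toward the cluster and this cutoff starts to miss genuine outliers, which is exactly the masking effect that motivates the robust distances discussed next.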
A large body of diagnostic tools is available in the literature for the detection of high leverage points in linear regression [4,11,12,27]. The Mahalanobis Distance (MD) is one of these well-known multivariate methods for detecting high leverage points. Although it is a reliable diagnostic tool, it suffers from the masking problem: most of the classical diagnostic methods fail to identify multiple high leverage points due to masking effects [14]. Problems of masking can be resolved by using robust estimates of shape and location, which by definition are less affected by outliers. Outlying points are less likely to enter into the calculation of the robust procedures, so they are not able to influence the parameters used in the MD; the inlying points, which all come from the underlying distribution, completely determine the estimates of the location and shape of the data. Several robust estimators of multivariate location and scatter have been proposed, such as the multivariate M-estimators of Maronna's pioneering paper [17] and the Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators of Rousseeuw [22]. For a thorough overview of robust multivariate estimation, one can refer to the article by Maronna and Yohai [18].
The Minimum Covariance Determinant (MCD) method of Rousseeuw [22] aims to find the h observations out of n whose covariance matrix C has the lowest determinant. In the Minimum Volume Ellipsoid (MVE) estimator, also proposed by Rousseeuw [22], the ellipsoid of smallest volume covering h objects (the presumed noncontaminated data) is constructed. In one of the proposed iterative algorithms, a subsample of p + 1 objects is selected at random in each iteration and its mean and covariance are determined. Then, the ellipsoid containing exactly h data objects is found by deflating or expanding the covariance ellipsoid. These steps are repeated until the subsample yielding the smallest volume of the covariance ellipsoid is found.
Finally, the robust MD can be written as:

RMD_i = sqrt[(x_i − T_R(X))^T C_R(X)^(−1) (x_i − T_R(X))]

where T_R(X) and C_R(X) are robust location and shape estimates such as the MCD or MVE. By using robust location and shape estimates in the RMD, outlying points do not skew the estimates and can be identified as outliers by large values of the RMD. Unfortunately, using robust estimates gives RMDs with unknown distributional properties [25]. Using the √χ²_(p, 0.975) quantile as a cutoff point for the RMD is prone to declaring some good and low leverage points as high leverage points, and often leads to identifying too many points as outliers [25].
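The RMD amounts to swapping robust estimates into the same distance formula. The MCD search below is only a crude random-h-subset approximation of Rousseeuw's lowest-determinant criterion, kept self-contained for illustration (production code would use the FAST-MCD algorithm, e.g. scikit-learn's MinCovDet); all names are ours.

```python
import numpy as np
from scipy.stats import chi2

def mcd_estimate(X, n_trials=500, seed=0):
    """Crude MCD approximation: among many random h-subsets, keep the one
    whose covariance matrix has the smallest determinant."""
    n, p = X.shape
    h = (n + p + 1) // 2                 # a common MCD subset size
    rng = np.random.default_rng(seed)
    best_det, best_T, best_C = np.inf, None, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=h, replace=False)
        C = np.cov(X[idx], rowvar=False)
        d = np.linalg.det(C)
        if 0 < d < best_det:
            best_det, best_T, best_C = d, X[idx].mean(axis=0), C
    return best_T, best_C

def robust_md(X, T_R, C_R):
    """RMD_i = sqrt((x_i - T_R)' C_R^{-1} (x_i - T_R))."""
    diff = X - T_R
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C_R), diff))

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 3))
X[:3] += 5.0                             # three planted high leverage points
T_R, C_R = mcd_estimate(X)
rmd = robust_md(X, T_R, C_R)
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
```

Because the lowest-determinant subset tends to exclude the planted points, their RMD values become very large; as the text notes, the same chi-square cutoff may also flag some genuine inliers (swamping).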
To develop robust multivariate estimators, Rousseeuw and Leroy [23] first proposed to detect outliers by the RMD and then obtain the estimates by reweighted least squares regression, where the weight function is a hard rejection function. Specifically, the latter proposal consists of discarding those observations whose RMD exceeds a certain fixed threshold value. Previously, the MVE was commonly used as the initial estimator for these procedures. In the context of linear regression, many estimators have been proposed that aim to reconcile high efficiency and robustness; typically, these methods are also two-stage procedures [6,10,15,22,28,29]. Let us consider a k-variable regression model:

y = Xβ + ε

The weight matrix W = X(X^T X)^(−1) X^T is the orthogonal projector onto the model space, or hat matrix, which is traditionally used as a measure of leverage in regression analysis. If a diagonal entry W_ii of W is large, changing y_i will move the fitted surface appreciably towards the altered value; therefore, W_ii is said to measure the leverage of the observation y_i. Different cutoff points for the hat matrix exist in the literature for finding high leverage points, such as the twice-the-mean rule (2k/n) of [11], the thrice-the-mean rule (3k/n) of [27], where k and n are the numbers of variables and observations respectively, and the three-interval range of Huber [12] (observations with 0.2 < W_ii < 0.5 are risky to include in the analysis and those with W_ii ≥ 0.5 should be avoided, where W_ii is a diagonal element of the hat matrix).
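The hat-matrix leverages and the twice-the-mean rule just described can be illustrated directly. This is a small sketch; the planted point and the data are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = rng.uniform(0, 1, size=(n, k))
X[0] = [8.0, 8.0, 8.0]                   # one planted high leverage point

W = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix W = X (X'X)^{-1} X'
w_ii = np.diag(W)                        # leverage of each observation
cutoff = 2 * k / n                       # twice-the-mean rule, 2k/n
high_leverage = np.where(w_ii > cutoff)[0]
```

Since W is a projection onto a k-dimensional space, the diagonal entries lie in [0, 1] and sum to k, so k/n is indeed their mean, which is what both the twice- and thrice-the-mean rules exploit.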
The hat matrix may fail to identify high leverage points because of the effect of the high leverage points themselves on the leverage structure [7]. Hadi [7] introduced another diagnostic tool, the potential, defined as:

p_ii = x_i^T (X_(i)^T X_(i))^(−1) x_i = w_ii / (1 − w_ii)

where w_ii = x_i^T (X^T X)^(−1) x_i is the i-th diagonal element of W and X_(i) is the data matrix X without the i-th row. He proposed a cutoff point for the potential values p_ii of Median(p_ii) + c Mad(p_ii), where Mad(p_ii) = median{|p_ii − median(p_ii)|}/0.6745 and c can be taken as a constant value of 2 or 3. Observations exceeding Hadi's cutoff point are considered high leverage points, but this method also cannot detect all of the high leverage points. Imon [13] introduced another diagnostic tool, generalized potentials for the whole data set, as follows. Let D denote the group deleted from the data set, i.e., the cases suspected as outliers (the choice of this deletion group is very important, since the omission of this group determines the weights for the whole data set), and let R be the remaining set after deleting d < (n − k) cases, so that R contains (n − d) cases. If we assume that the suspected cases are the last d rows of X and Y, the weight matrix W = X(X^T X)^(−1) X^T can be partitioned as:

W = [ W_R    V  ]
    [ V^T   W_D ]

where W_R and W_D are symmetric matrices of order (n − d) and d respectively.
Then Imon [14] introduced generalized potentials for all members of the data set, defined (with w_ii^(−D) = x_i^T (X_R^T X_R)^(−1) x_i) as:

p_ii^* = w_ii^(−D) / (1 − w_ii^(−D))   for i ∈ R
p_ii^* = w_ii^(−D)                     for i ∈ D

We should note that there is no finite upper bound for the p_ii^* and that the derivation of their theoretical distribution is not easy. He introduced the same form of cutoff point as for the potential values, Median(p_ii^*) + c Mad(p_ii^*), for the generalized potentials as well. Habshah et al. [6] developed a new method for determining outlying points in multivariate data sets by combining the RMD (MVE) method, for detecting the suspected group (the D group), with the generalized potential method proposed by [14]. This method, called DRGP (MVE), is also a two-step method for high leverage point detection. In their method, the mad cutoff point is used in both the first and second steps.
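The cutoffs in this family of diagnostics all follow the same Median + c·Mad pattern. A minimal sketch of the basic (non-generalized) version, Hadi's p_ii = w_ii/(1 − w_ii), shows the mechanics; the data and the helper name are ours.

```python
import numpy as np

def hadi_potentials(X, c=3.0):
    """Potentials p_ii = w_ii / (1 - w_ii) with the Median + c*Mad cutoff,
    where Mad = median|p - median(p)| / 0.6745."""
    W = X @ np.linalg.inv(X.T @ X) @ X.T
    w = np.diag(W)
    p = w / (1.0 - w)
    mad = np.median(np.abs(p - np.median(p))) / 0.6745
    return p, np.median(p) + c * mad

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(30, 3))
X[0] = [8.0, 8.0, 8.0]                   # planted high leverage point
p, cutoff = hadi_potentials(X)
flagged = np.where(p > cutoff)[0]
```

The Median + c·Mad form is attractive precisely because both the median and the Mad are themselves resistant to the extreme potentials being hunted, unlike a mean-based threshold.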
However, this method tends to swamp more low leverage points. According to Werner [28], "A successful method of identifying outliers in all multivariate situations would be ideal, but is unrealistic". By "successful", he means both highly sensitive, i.e., able to detect genuine outliers, and highly specific, i.e., able not to swamp regular points as outliers. Therefore, a practical and efficient robust detection method for high leverage points (outliers in the X-direction) is one which is sensitive enough to detect genuine high leverage points and specific enough to swamp fewer low leverage points as high leverage points.

MATERIALS AND METHODS
In this study, we propose a two-step diagnostic tool for detecting multiple high leverage points which swamps fewer low leverage points. In order to improve on the DRGP (MVE) performance proposed by [6], we follow the idea of Rousseeuw and Leroy [23] in developing robust multivariate estimators and propose a relatively new method for high leverage point identification, called the Two-Step Robust Diagnostic Mahalanobis Distance (RDMD^TS). In the first step, the RMD (MCD) or RMD (MVE) method is used to detect the suspected outlier group, which is deleted from the data set, yielding the clean data for the next step. In the second step, we apply the MD to the entire data set based on the mean and covariance matrix of the clean data set obtained in the first step. Therefore, the Two-Step Robust Diagnostic Mahalanobis Distance (RDMD^TS) is written as:

RDMD_i^TS = sqrt[(x_i − T_0(X))^T C_0(X)^(−1) (x_i − T_0(X))]

where T_0(X) and C_0(X) are the mean and covariance matrix of the clean data set. Two different cutoff points are considered, namely √χ²_(k, 0.975), where k is the number of explanatory variables, and a newly proposed one, Median(RDMD^TS) + c Mad(RDMD^TS). The procedure of this method can be summarized in the following algorithm:
Step 1: Compute the RMD based on the MCD or MVE estimates of location and scatter and delete the observations whose RMD exceeds the cutoff point; the remaining observations form the clean data set.
Step 2: Compute the mean T_0(X) and covariance matrix C_0(X) of the clean data set, compute RDMD_i^TS for all n observations and declare as high leverage points those observations exceeding the chosen cutoff point.
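The two steps can be sketched end-to-end. The first-step robust distance here uses a crude random-subset stand-in for a proper MCD/MVE estimator, and all names are illustrative, not the authors' code; only the overall shape (robust flagging, deletion, classical MD from the clean set, Median + c·Mad cutoff) follows the description above.

```python
import numpy as np
from scipy.stats import chi2

def md(X, T, C):
    """MD_i = sqrt((x_i - T)' C^{-1} (x_i - T)) for every row of X."""
    diff = X - T
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff))

def rdmd_ts(X, n_trials=500, seed=0):
    n, p = X.shape
    h = (n + p + 1) // 2
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    # Step 1: crude MCD search -- keep the h-subset with the smallest determinant
    for _ in range(n_trials):
        idx = rng.choice(n, size=h, replace=False)
        C = np.cov(X[idx], rowvar=False)
        d = np.linalg.det(C)
        if 0 < d < best[0]:
            best = (d, X[idx].mean(axis=0), C)
    rmd = md(X, best[1], best[2])
    suspects = rmd > np.sqrt(chi2.ppf(0.975, df=p))  # chi-square cutoff in step 1
    clean = X[~suspects]                             # delete the suspected group
    # Step 2: classical MD for ALL points from the clean mean and covariance
    T0, C0 = clean.mean(axis=0), np.cov(clean, rowvar=False)
    dist = md(X, T0, C0)
    mad = np.median(np.abs(dist - np.median(dist))) / 0.6745
    cutoff = np.median(dist) + 3 * mad               # Median + c*Mad cutoff (c = 3)
    return dist, cutoff

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(60, 3))
X[:5] += 6.0                                         # five planted high leverage points
dist, cutoff = rdmd_ts(X)
```

Because the second-step mean and covariance come only from the clean set, inlying points sit close to the center under this metric, which is the mechanism by which the method swamps fewer low leverage points than a purely robust distance.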

Numerical Examples:
Two well-known data sets, frequently referred to in studies of the identification of influential observations, high leverage points and outliers, are considered in this study. It is important to note that we changed the cutoff point from the mad used by [6] to chi-square in the first step of the examples and also in the simulation study.

Stack loss data:
Here we consider the stack loss data [3], which have been extensively analyzed in the statistical literature. This three-predictor data set (air flow, cooling water inlet temperature and acid concentration) contains 21 observations with five influential observations; three of them (cases 1, 3 and 21) are high leverage outliers, one (case 4) is an outlier and another (case 2) is a high leverage point. Table 2 illustrates the DRGP (MVE), DRGP (MCD), (RDMD^TS)(MVE), (RDMD^TS)(MCD), MD and their corresponding cutoff points. Another useful detection tool, proposed by Rousseeuw and Van Driessen [24], is the DD plot, in which the classical MD_i is plotted vs. the robust MD_i. The low leverage points should cluster below the cutoff point lines, while the high leverage points will be separated from the bulk of the data and thus located in the upper area beyond the cutoff points. The DD plot of the stack loss data set is shown in Fig. 1a.

Simulation study:
In order to investigate the merit of our newly proposed method, we designed a Monte Carlo simulation experiment. In this study, we compared the Robust Diagnostic Mahalanobis Distance (RDMD^TS) with other existing methods, with sample sizes equal to 20, 40, 60, 100 and 200. The first 100(1 − α)% observations of the three regressors are generated from Uniform (0, 1) and the remaining 100α% observations are constructed as high leverage points. The high leverage points are generated with unequal weights; the results are summarized in Table 3.
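A contaminated sample in the spirit of this design can be generated as below. The paper's exact "unequal weights" contamination scheme is not fully specified in this text, so the randomly varying shift magnitudes here are an assumption for illustration only, as are all names.

```python
import numpy as np

def make_sample(n=40, p=3, alpha=0.10, shift=10.0, seed=0):
    """100(1-alpha)% clean Uniform(0,1) rows followed by 100*alpha% rows
    pushed far out in the X-direction with unequal magnitudes."""
    rng = np.random.default_rng(seed)
    n_bad = int(round(alpha * n))
    X = rng.uniform(0, 1, size=(n, p))
    # unequal (randomly varying) shifts -- an illustrative assumption
    X[n - n_bad:] += shift * (1.0 + rng.uniform(0, 1, size=(n_bad, p)))
    return X, np.arange(n - n_bad, n)

X, bad_idx = make_sample()
```

Running a detector over many such replicates, for each (n, α) pair in the design, and counting how many planted indices are flagged (sensitivity) versus how many clean indices are flagged (swamping) reproduces the kind of comparison reported in Table 3.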

DISCUSSION
Let us focus our attention on the results for Hawkins' data presented in Table 1. The RMD (MCD) and the RMD (MVE) detect observations 1-10 as outliers. In addition, the RMD (MCD) identifies observations 11-14, 47 and 53 as outliers, while the RMD (MVE) swamps observations 11-14 and 53 (not shown due to space limitations). Although these robust methods are more powerful than the MD, which detects just two outliers (cases 12 and 14), their performance as high leverage detection tools can still be improved. As proposed in the second step of the RDMD^TS, we find the mean and covariance matrix of the clean data set, for both RMD (MCD) and RMD (MVE), after deleting the suspected outlier group. Finally, we compute the distance of the whole data set from this clean mean and clean covariance matrix for the x variables only. It is obvious from Table 1 that both our proposed method and the method of Habshah et al. [6] detect 14 high leverage points with both the mad and chi-square cutoff points. However, the values of the RDMD^TS are further from their corresponding cutoff points compared to the DRGP; thus, the new method enhances the chance of detecting these 14 observations as high leverage points. Let us now turn to the stack loss data, where the RMD (MVE) detects 4 outliers plus another outlier, case 2. Furthermore, the RMD (MCD) detects the 4 outliers and cases 2, 13, 14 and 20 as outliers as well. The RMD (MVE) and RMD (MCD) values are not presented due to space constraints. After deleting the outliers from the data set and utilizing the mean and covariance matrix of the data cleaned in the first step, the RDMD^TS identifies exactly 4 high leverage points. The DRGP (MCD) and DRGP (MVE) of Table 2 also identify these 4 high leverage points. As with the results for Hawkins' data, a similar conclusion can be drawn from this example regarding the higher chance of the RDMD^TS detecting high leverage points. 
The results of Table 2 show that the RDMD^TS detects these 4 high leverage points easily, whereas the MD, due to its masking problem, cannot detect any high leverage points.
Looking at Fig. 1 and 2, it is obvious that the MD could not identify any high leverage points, while the other four robust methods identify the 4 high leverage points easily.
Next, we discuss whether the simulation results confirm the conclusion of the numerical examples that our proposed method performs better than the DRGP and MD methods. It can be observed from Table 3 that, for small sample sizes, the RDMD^TS based on MCD or MVE with the chi-square cutoff point swamps more low leverage points than the RDMD^TS based on MCD or MVE with the mad cutoff point. Nevertheless, as the sample size increases, the chi-square cutoff point performs better and swamps fewer low leverage points, though still more than RDMD^TS-mad. It is obvious from the results of Table 3 that RDMD^TS (MVE)-mad outperforms RDMD^TS (MCD)-mad in swamping fewer low leverage points for small sample sizes.
For large sample sizes such as 200 (with 20 or 25% high leverage points), the two methods RDMD^TS (MVE)-mad and RDMD^TS (MCD)-mad are equally good and do a credible job of detecting high leverage points. Comparing RDMD^TS-mad with the DRGP based on MCD or MVE, fewer low leverage points are identified when our newly proposed methods are used. When the sample size is 100 or 200 and 20 or 25% high leverage points are added, RDMD^TS-mad detects exactly the high leverage points with no low leverage points, while the DRGP swamps some low leverage points. When the sample size and the number of high leverage points are very small (sample size 20 with 5% high leverage points), the DRGP swamps fewer low leverage points than RDMD^TS-mad. As the number of high leverage points and the sample size increase, RDMD^TS-mad overtakes the DRGP in swamping fewer low leverage points.

CONCLUSION
The presence of high leverage points affects all least squares models, which are extensively used in data exploration and modeling. In multivariate settings the identification of high leverage points is much more difficult; in particular, it is difficult to detect outliers in p-variate data when p > 2, as one can no longer rely on visual inspection. Among outlier detection tools, the Mahalanobis Distance is powerful for detecting a single outlier, but this approach is not applicable to multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distance values. It is better to use distances based on robust estimators of multivariate location and scatter [23]. In regression analysis, the robust distances are computed from the explanatory variables, which allows us to detect high leverage points. The main contribution of this study is to introduce a two-step robust diagnostic method based on the Robust Mahalanobis Distance. This relatively new method not only detects exactly the high leverage points but also identifies fewer low leverage points than existing methods such as the Diagnostic Robust Generalized Potential. To investigate the merit of the new method, a Monte Carlo simulation was carried out. The results of this study indicate that, for small sample sizes, the best detection method is RDMD^TS (MVE)-mad, whereas there is not much difference between RDMD^TS (MVE)-mad and RDMD^TS (MCD)-mad for large sample sizes. However, when the sample size is very small, such as 20, and the number of high leverage points is 5% of the data set, it is better to use the DRGP (MVE), which swamps fewer low leverage points.