Least Absolute Deviation and Least Squares Regression for Modeling Retention Indices of a Set of Food Compounds and Environmental Pollutants

Corresponding Author: Khadija Amirat, Department of Chemistry, Laboratory of Environmental Security and Food, Badji Mokhtar Annaba University, Annaba, Algeria. Email: kadijatoumi@yahoo.com

Abstract: Given the importance of regression analysis in modeling, Quantitative Structure-Retention Relationships (QSRR) are determined separately for the retention indices of 114 pyrazines on Carbowax 20M (I Cw20M) and OV-101 (I OV-101) columns. The detection of influential observations in the standard least squares regression model is a problem that has been extensively studied. Least Absolute Deviation (LAD) regression diagnostics offer alternative approaches whose main feature is robustness. Here a nonparametric method for detecting influential observations is presented and compared with classical diagnostic methods. Both methods were applied to model separately the retention indices of the same set of pyrazines (89 in the training set and 25 in the test set) eluted on OV-101 and Carbowax-20M columns, using theoretical molecular descriptors calculated with the DRAGON software. The results are validated graphically with a probability plot of the errors, statistically with the Anderson-Darling test and, finally, with a confidence interval, relying on the concept of robustness to check whether the error distribution is approximately normal.


Introduction
Since the 1970s, the term environment has been used to denote the global ecological context, i.e., the whole set of physical, chemical, biological, climatic and geographic conditions in which living organisms, and human beings in particular, develop. Air, earth, water, natural resources, flora, fauna, people and their social interactions are included.
Volatile heterocyclic compounds constitute a significant family of odorous molecules that is particularly interesting in flavor chemistry; an odor can also be regarded as a local pollution with a limited harmful effect on the population living near its potential sources. They represent more than one quarter of the 5000 volatile compounds characterized so far in our food. Pyrazines are heterocycles that are very common in food. More than 80 pyrazine derivatives have been identified in a great number of cooked foods, such as bread, meat, roasted coffee, cocoa or hazelnuts; they are flavoring compounds (Li et al., 2014; Buchbauer, 2000). Stanton and Jurs (1989) used the QSRR methodology to develop models linking the structural features of 107 differently substituted pyrazines to their retention indices obtained on two columns of different polarity (OV-101 and Carbowax-20M). The equations were calculated by multilinear regression, the explanatory variables (topological, electronic and physical properties) being selected by progressive elimination (Small and Jurs, 1983) among the 85 molecular descriptors obtained for each whole molecule. The retention indices (IR) obtained on each column were treated separately, while drawing on the same sets of descriptors. The calculated models with 6 explanatory variables give high standard errors (S = 23 index units, i.u., on OV-101 and S = 36.33 i.u. on Carbowax-20M), which do not indicate good predictive capacity for these models and suggest nonlinear relations between the descriptors and the property studied (IR) (Mebarki et al., 2016).
A large number of other estimation methods aimed at achieving robustness have been suggested and a considerable body of literature has developed; see, for example, Gonin and Money (1989), Dodge (1987) and the references therein. Generally, the robust estimators in the literature can be classified as M-estimators, L-estimators or R-estimators. Probably most attention has been paid to the L-estimators; for other types of estimators, see Judge et al. (1985).
The robustness of the Least Absolute Deviation method with respect to influential observations and its susceptibility to leverage points have been widely studied in the literature (Dodge, 1987; 1997). We propose the nonparametric Least Absolute Deviation (LAD) method to detect influential observations (outliers and leverage points) in comparison with the least squares method.
Theory-driven methods for assessing normality include formal normality tests such as the Anderson-Darling test. Seier, however, classified normality tests into major categories, including tests based on the empirical distribution of the observed data.
The Durbin-Watson statistic is conditioned on the order of the observations (rows). Minitab assumes that the observations are in a meaningful order, such as time order. The Durbin-Watson statistic determines whether or not the correlation between adjacent error terms is zero. To reach a conclusion from the test, you will need to compare the displayed statistic with lower and upper bounds in a table. If D > upper bound, no correlation exists; if D < lower bound, positive correlation exists; if D is in between the two bounds, the test is inconclusive.
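The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the residuals and the lower and upper bounds below are hypothetical placeholders (real bounds must be read from a Durbin-Watson table for the given sample size and number of predictors), and the function names are ours, not Minitab's.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator


def interpret(d, lower, upper):
    """Apply the rule quoted in the text, with bounds taken from a table."""
    if d > upper:
        return "no correlation"
    if d < lower:
        return "positive correlation"
    return "inconclusive"


# Hypothetical residuals, listed in observation (row) order.
residuals = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
d = durbin_watson(residuals)

# Placeholder bounds, NOT taken from a real table.
verdict = interpret(d, lower=1.0, upper=1.5)
```

Because the statistic depends on the order of the rows, shuffling the residuals would generally change its value.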
The objective of this work is to use the QSRR methodology, with the Least Absolute Deviation/Least Squares (LAD/OLS) approach, to model the retention indices of 114 pyrazines: 113 taken from Stanton and Jurs (1989) and one compound (2-vinylpyrazine) taken from Mihara and Enomoto (1985). The molecular descriptors are calculated solely from the chemical structure of the compounds. A linear fixed-effects model is used to examine the relationships between the retention index and different descriptors for the two columns. For the non-polar column (OV-101), the descriptors are connectivity indices (among the most popular topological indices used in structure-activity analysis), geometrical descriptors (which require knowledge of the relative positions of the atoms in 3D space) and 3D-MoRSE descriptors (3D-Molecule Representation of Structures based on Electron diffraction). For the polar column (CRW-20M), the descriptors are connectivity indices, 2D autocorrelations (molecular descriptors that describe how a considered property is distributed along a topological molecular structure) and 3D-MoRSE descriptors. The regression parameters are evaluated by the two methods and the work is based on the comparison between them. The applicability domain (AD) is discussed using the Williams diagram, which plots the standardized prediction residuals against the leverage values (hi) (Eriksson et al., 2003; Tropsha et al., 2003).
The approximate normality of the errors is assessed graphically with a probability plot, statistically with the Anderson-Darling test and, finally, with a confidence interval for compatibility with the normal law, in order to validate the agreement between the two methods at a risk α = 5% (Nornadiah and Yah, 2011; Damodar et al., 2009).

Methodology
The Data Set
The molecular modeling software Hyperchem 6.03 (AL-Noor and Asmaa, 2013) is used to build the molecules, and the semi-empirical AM1 method (Dewar et al., 1985; Holder, 1998) is used to obtain the final geometries. The compounds involved in this study have the general structure 1.

Descriptor Generation
The optimized geometries are transferred to the DRAGON software, version 5.4, for the calculation of 1320 descriptors for the 89 pyrazines of the training set; subsets of descriptors are then chosen by a genetic algorithm. These descriptors can be separated into four categories: topological, geometrical, physical and electronic. The topological descriptors included path counts and molecular connectivity indices. The geometrical descriptors included shadow areas, length-to-width ratios, van der Waals volumes, surface area and principal moments of inertia. The calculated physical property descriptors included molar refractivity and polarizability. The electronic descriptors included the most positive and most negative charges, as described by Kaliszan.

Regression Analysis
The multiple linear regression analysis was carried out with two methods: Least Absolute Deviation (LAD) using Matlab (2009) and ordinary least squares (OLS) using Minitab 16.
We consider the multiple regression model given by Berlin (1982). The detection of outliers and leverage points under the least squares method is a problem that has been widely studied. Diagnosis by Least Absolute Deviation regression offers alternative approaches whose principal characteristic is robustness. In our study, a nonparametric method for detecting outliers and leverage points is applied and compared with the traditional diagnostic method (least squares).

Ordinary Least Squares (OLS) Method
This is carried out with the software Minitab 16. The OLS method applied to multiple regression consists in defining the estimate of β that minimizes the sum of the squared error terms.
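For reference, the model and the least squares criterion can be written in the usual textbook notation. The symbols below are assumptions introduced here because the displayed equations are not available in this text: y_i is the retention index of compound i, x_ij the value of descriptor j for compound i, p the number of descriptors, n the number of compounds and ε_i the error term.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n

\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```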

Least Absolute Deviations (LAD) Method
The multiple linear regression analysis is carried out with Matlab (2009) using the Least Absolute Deviations (LAD) method, which is one of the principal alternatives to least squares for estimating regression parameters: it minimizes the absolute values, rather than the squares, of the error terms. The LAD method applied to multiple regression consists in defining the estimate of β that minimizes the sum of the absolute error terms (Jureckova, 2000; Dodge, 2004).
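To make the difference between the two criteria concrete, the following Python sketch fits the same toy data by OLS and by an approximate LAD estimator. It is not the Matlab routine used in this work: the LAD fit here is obtained by iteratively reweighted least squares (a swapped-in technique chosen for brevity), and the data, function names and tuning constants (`iterations`, `eps`) are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i . beta)^2 through the normal equations."""
    p = len(X[0])
    rows = range(len(X))
    xtwx = [[sum(w[i] * X[i][j] * X[i][k] for i in rows) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(w[i] * X[i][j] * y[i] for i in rows) for j in range(p)]
    return solve_linear(xtwx, xtwy)


def ols(X, y):
    """Ordinary least squares: all observations carry the same weight."""
    return weighted_least_squares(X, y, [1.0] * len(y))


def lad(X, y, iterations=50, eps=1e-8):
    """Approximate LAD fit: reweight each observation by 1 / max(|residual|, eps)."""
    beta = ols(X, y)
    for _ in range(iterations):
        resid = [yi - sum(xj * bj for xj, bj in zip(xi, beta))
                 for xi, yi in zip(X, y)]
        weights = [1.0 / max(abs(r), eps) for r in resid]
        beta = weighted_least_squares(X, y, weights)
    return beta


# Hypothetical data: exact line y = 2 + 3x with one gross outlier at x = 5.
xs = [float(i) for i in range(10)]
X = [[1.0, x] for x in xs]          # first column is the intercept term
y = [2.0 + 3.0 * x for x in xs]
y[5] += 30.0

beta_ols = ols(X, y)   # pulled away from (2, 3) by the outlier
beta_lad = lad(X, y)   # stays close to (2, 3)
```

On this toy example the single outlier shifts the OLS intercept noticeably, while the LAD estimate stays close to the line followed by the other nine points, which is the robustness property discussed in the text.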

Results and Discussion
An ideal model is one that has a high R value and a small standard error while using few independent variables. The best models found with the software MobyDigs have 3 descriptors for each stationary phase and are given below.
The criterion for identifying a compound as an outlier is that the compound is flagged by three or more of the six standard statistical tests used to detect outliers in regression analysis. These tests are (1) residual, (2) standardized residual, (3) Studentized residual, (4) leverage, (5) DFFITS and (6) Cook's distance. The residual is the difference between the actual value and the value predicted by the regression equation. The standardized residual is the residual divided by the standard deviation of the regression equation. The Studentized residual is the residual of a prediction divided by its own standard deviation.
Leverage allows the influence of a point to be determined.
DFFITS describes the difference in the fit of the equation caused by the removal of a given observation, and Cook's distance describes the change in the model coefficients caused by the removal of the indicated point.
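The six diagnostics can be illustrated with a short Python sketch. To stay self-contained it is restricted to a single predictor, for which the leverage has a closed form; the models in this work use three descriptors and would need the full hat matrix. The data and the function name are hypothetical, and the formulas are the standard textbook ones rather than the exact output of Minitab.

```python
import math


def simple_ols_diagnostics(x, y):
    """Per-observation outlier diagnostics for a one-predictor OLS fit."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    s = math.sqrt(sse / (n - p))  # standard deviation of the regression

    rows = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx            # leverage
        standardized = e / s                              # residual / regression s
        studentized = e / (s * math.sqrt(1.0 - h))        # residual / own s
        # Standard deviation of the regression refitted without this point.
        s_deleted = math.sqrt((sse - e ** 2 / (1.0 - h)) / (n - p - 1))
        t_deleted = e / (s_deleted * math.sqrt(1.0 - h))
        dffits = t_deleted * math.sqrt(h / (1.0 - h))
        cook = studentized ** 2 * h / (p * (1.0 - h))
        rows.append({"residual": e, "standardized": standardized,
                     "studentized": studentized, "leverage": h,
                     "dffits": dffits, "cook": cook})
    return rows


# Hypothetical data: nearly linear, with the last observation shifted upward.
x = [float(i) for i in range(10)]
y = [2.1, 4.9, 8.2, 10.8, 14.1, 16.9, 20.2, 22.8, 26.1, 45.0]
diagnostics = simple_ols_diagnostics(x, y)
most_influential = max(range(len(diagnostics)),
                       key=lambda i: diagnostics[i]["cook"])
```

A compound would then be counted as an outlier when three or more of these six values exceed their cutoffs; the cutoffs themselves are not specified in this text and are therefore left out of the sketch.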
The definition of each descriptor is given in Table 2. The coefficient of multiple determination (R²) indicates the amount of variance in the data that is explained by the model. The standard error of each regression coefficient is given in each case, and n indicates the number of molecules involved in the regression analysis. The best three-parameter models were constructed using the following descriptors.
For OV-101: the modified Randic connectivity index (XMOD), the Folding Degree Index (FDI) and the 3D-MoRSE signal 06 weighted by atomic van der Waals volumes (Mor06v). XMOD is a molecular descriptor proposed as the sum of atomic properties, accounting for valence electrons and extended connectivities in the H-depleted molecular graph using a Randic connectivity index-type formula. FDI is the largest eigenvalue of the distance/distance matrix, normalized by dividing it by the number of atoms nAT. This index tends to one for linear molecules (of infinite length) and decreases as the molecule folds; it can therefore be thought of as a measure of the folding degree of the molecule, because it indicates the degree of departure of a molecule from strict linearity. The 3D-MoRSE (3D-Molecule Representation of Structures based on Electron diffraction) descriptors are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical scattering curves. They are calculated for five different atomic properties w: the unweighted case (u), atomic mass (m), the van der Waals volume (v), the Sanderson atomic electronegativity (e) and the atomic polarizability (p).
For CRW-20M: the reciprocal distance Randic-type index (RDCHI), the Geary autocorrelation of lag 1 weighted by atomic polarizabilities (GATS1p) and the 3D-MoRSE signal 02 weighted by atomic masses (Mor02m). RDCHI is defined by analogy with the Randic connectivity index X1, where the vertex degrees are replaced by the row sums of the reciprocal distance matrix; moreover, the reciprocal distance squared Randic-type index RDSQ is obtained from the RDCHI index by substituting the exponent -1/2 with 1/2. The 2D autocorrelations calculated by DRAGON are spatial autocorrelations calculated on an H-depleted molecular graph weighted by atomic physico-chemical properties (i.e., the atom weightings w); the GATS autocorrelations are calculated with the Geary coefficient.

The Best Models
[Fig. 1: Anderson-Darling normality test at a significance level of 0.05; panels: Column CRW-20M and Column OV-101]
All relevant statistical parameters are reported in Table 3; these diagnostic statistics make it possible to compare the models and to draw several conclusions.
The values of R² and R²adj attest to the good fitting performance of the model which, moreover, is very highly significant (large value of the Fisher parameter F).
The model is robust: the difference between R² and Q² is small (0.05% for column OV-101 and 0.22% for column CRW-20M). The model demonstrates very good stability in internal validation, and bootstrapping (Q²boot) confirms the internal predictivity and stability of the model. SDEPext differs only slightly from SDEP: the model works slightly worse in external prediction than in internal prediction. The correlation matrix (Table 4), obtained using the Correlation command of MINITAB, shows that the descriptors are more or less correlated with each other (r ≥ 0.39 with p = 0.045 < α = 0.05).
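The comparison between R² and Q² can be illustrated with a leave-one-out sketch in Python. The definitions used here, Q² = 1 - PRESS/TSS and SDEP = sqrt(PRESS/n), are common conventions that we assume for illustration; the exact formulas applied by the validation software are not given in this text. The data are hypothetical and a single predictor is used to keep the example short.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - slope * x_mean, slope


def r2_q2_sdep(x, y):
    """Return fitted R2, leave-one-out Q2 and SDEP (assumed definitions)."""
    n = len(x)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)

    b0, b1 = fit_line(x, y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    # PRESS: each compound is predicted by a model fitted without it.
    press = 0.0
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a0 + a1 * x[i])) ** 2

    return 1.0 - rss / tss, 1.0 - press / tss, math.sqrt(press / n)


# Hypothetical, nearly linear data.
x = [float(i) for i in range(10)]
y = [2.1, 4.9, 8.2, 10.8, 14.1, 16.9, 20.2, 22.8, 26.1, 28.9]
r2, q2, sdep = r2_q2_sdep(x, y)
```

Since each left-out prediction error is at least as large in magnitude as the corresponding fitted residual, Q² never exceeds R²; a small gap between the two is what the text interprets as robustness.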

Correlation Matrix between Retention Indices and the Selected Descriptors
For the CRW-20M phase, all the descriptors are correlated with the retention index, the GATS1p descriptor less so. For the OV-101 phase, the descriptor XMOD is correlated with the retention index, while the descriptors FDI and Mor06v are less correlated.
The least squares method of estimating the parameters of linear (regression) models performs well provided that the residuals are well behaved. However, when the disturbances are prominently non-normally distributed, or are normally distributed but contain sizeable outliers, estimation by the least squares method fails. Intensive research has established that in such cases estimation by the Least Absolute Deviation (LAD) method performs well.

Multiple Linear Regression: Comparison of OLS and Robust Least Absolute Deviation Regression
We consider more particularly two estimation methods for the model parameters. The great advantage of the Least Absolute Deviation (LAD) method is robustness, i.e., the estimators are not affected by extreme values (they are said to be "robust"). It is thus particularly interesting to use the LAD method rather than the least squares (OLS) method in the presence of outliers.

Comparison of Hyperplanes of Regression
The model was estimated first by least squares (OLS) and then by Least Absolute Deviation; the resulting estimates are tabulated for column OV-101 and column CRW-20M. All the variables of the two models are strongly statistically significant on both columns with both the least squares and the Least Absolute Deviation methods (Table 4-7).
We noticed that the β coefficients calculated by least squares are not very different from those calculated by Least Absolute Deviation on the two columns; in particular, β1 and β3 from least squares are almost the same as β1 and β3 from Least Absolute Deviation on column OV-101 (Table 4-7).
It is therefore relevant to check for the presence of outliers using the following steps (Fig. 3), since the regression hyperplane can vary radically when its coefficients change.

Graphical Comparisons of Alternative Regression Models
The applicability domain was discussed with the help of the Williams diagram.
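The reading of a Williams diagram can be sketched as a simple classification of each compound by its leverage and standardized residual. The warning leverage h* = 3(p + 1)/n and the residual limit of 3 standard deviations used below are common conventions in QSAR work that we assume for illustration; they are not stated in this text. The example points are hypothetical; only n = 89 and p = 3 are taken from the paper.

```python
def williams_classify(leverages, std_residuals, n_training, n_descriptors,
                      residual_limit=3.0):
    """Label each point by its position in a Williams diagram."""
    # Assumed convention: warning leverage h* = 3 (p + 1) / n.
    h_star = 3.0 * (n_descriptors + 1) / n_training
    labels = []
    for h, r in zip(leverages, std_residuals):
        high_leverage = h > h_star
        response_outlier = abs(r) > residual_limit
        if high_leverage and response_outlier:
            labels.append("outlier with high leverage")
        elif high_leverage:
            labels.append("high leverage")
        elif response_outlier:
            labels.append("response outlier")
        else:
            labels.append("inside domain")
    return h_star, labels


# Hypothetical (leverage, standardized residual) pairs for four compounds.
leverages = [0.05, 0.05, 0.30, 0.30]
std_residuals = [0.4, -3.5, 1.0, 3.2]
h_star, labels = williams_classify(leverages, std_residuals,
                                   n_training=89, n_descriptors=3)
```

The `residual_limit` argument is exposed so that a narrower band, such as the (-2, 2) range mentioned later in the text, can be used instead.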

[Williams diagrams; panels: Column CRW-20M and Column OV-101]
The analysis of the residuals shows the following. On column OV-101, observations 82, 68, 14 and 1 have high residuals in both estimates; observations 72 and 2 have high residuals with the Least Absolute Deviation estimate and are leverage points with least squares; and, in the validation set, observations 2 and 4 have high residuals and are influential in both estimates. On column CRW-20M, observations 1, 7 and 85 have high residuals in both estimates; observation 86 has a high residual with the Least Absolute Deviation estimate and is a leverage point with least squares; and, in the validation set, observations 2 and 3 have high residuals and are influential with Least Absolute Deviation, whereas with least squares observation 2 is influential and observation 3 is a leverage point.
After elimination of the outliers common to the two methods and after the secondary treatment, on column CRW-20M observation 83 has a high residual in both estimates and observation 2 of the validation set is influential in both estimates; on column OV-101, observations 1 and 69 have high residuals in both estimates, observation 81 has a high residual in the least squares estimate, and observation 2 is influential in the least squares estimate.
Finally, the models were re-estimated with the outliers removed. We noticed, moreover, that the β coefficients calculated by least squares approach those calculated by Least Absolute Deviation on the two columns. More precisely, on OV-101, β1 and β3 from least squares are almost the same as β1 and β3 from Least Absolute Deviation, and β0 and β2 are of the same order; on CRW-20M, β1 from least squares is almost the same as β1 from Least Absolute Deviation, and (β1, β3 and β4) are of the same order.
The analysis of the residuals shows that, in this case, all the observations of the Least Absolute Deviation method lie within (-2, 2), whereas with the least squares method certain observations (for CRW-20M, training observation 46) lie outside this range; the Least Absolute Deviation estimate gives a good result, in contrast to the least squares estimate (Fig. 4).

[Fig. 4: Graphical comparisons of alternative regression models; panels: Column CRW-20M and Column OV-101]
We notice no change in the coefficients of the regression line after the outlier is introduced, which means that the line is stable: the Least Absolute Deviation method is not sensitive to the presence of outliers. We therefore conclude that the Least Absolute Deviation method is a stable and more robust method.
To confirm the agreement between the two methods and to identify the more robust of the two, a set of normality tests (of the standardized errors or residuals) is available; indeed, thanks to the concept of robustness, we can use simple techniques (e.g., descriptive statistics, graphical techniques) to check whether the distribution of the data is approximately normal.
Every test is associated with a risk known as the type I risk; in our work, we adopt the risk α = 5%.

Comparison of Normality Tests of the Errors between the Least Absolute Deviation and Least Squares Methods
Minitab 16 automatically estimates the two principal parameters of the normal law (the mean and the standard deviation).

Least Absolute Deviation Method
For both columns, a normal distribution appears to fit the data fairly well.
The plotted points form a reasonably straight line.

Test of Anderson-Darling
In our work, we find that the Anderson-Darling test accepts the assumption of normality of the errors.
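For readers who wish to reproduce this kind of check outside Minitab, the Anderson-Darling statistic for normality can be sketched in Python as follows. The mean and standard deviation are estimated from the sample, the small-sample adjustment is the one usually attributed to Stephens, and the 5% critical value of 0.752 for the adjusted statistic is a commonly tabulated figure that should be verified against a reference table. The two samples and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling_normal(data):
    """Return (A2, adjusted A2) for a normality test with estimated mean and sd."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    z = sorted((v - mean) / sd for v in data)

    total = 0.0
    for i in range(n):
        # 0-based i corresponds to the weight (2i - 1) of the 1-based formula.
        lower_tail = normal_cdf(z[i])
        upper_tail = 1.0 - normal_cdf(z[n - 1 - i])
        total += (2 * i + 1) * (math.log(lower_tail) + math.log(upper_tail))

    a2 = -n - total / n
    a2_adjusted = a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)
    return a2, a2_adjusted


CRITICAL_5_PERCENT = 0.752  # commonly tabulated value; check a reference table

# Hypothetical residual samples: one roughly symmetric, one strongly skewed.
symmetric = [-1.5, -1.0, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 1.0, 1.5]
skewed = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 100.0]

_, a2_symmetric = anderson_darling_normal(symmetric)
_, a2_skewed = anderson_darling_normal(skewed)

normality_kept_symmetric = a2_symmetric < CRITICAL_5_PERCENT
normality_kept_skewed = a2_skewed < CRITICAL_5_PERCENT
```

Normality is retained at the 5% level when the adjusted statistic stays below the critical value, which corresponds to a p-value above 0.05.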

Confidence Interval
The confidence interval and the risk α constitute a complementary approach (an estimation approach); the most widely used confidence interval is the one at 100(1-α) = 95%. The data may be compatible with the hypothesis when the limits of the interval are centered on the mean and the median, i.e., when the 50th percentile of the population lies at the center of the acceptance region of the null hypothesis with 95% confidence.
Overall, all the graphical and statistical tests accept the agreement between the two methods, especially the Anderson-Darling test, for which the value obtained with the Least Absolute Deviation method is close to that of the least squares method, and the confidence interval; these results show the closeness of the two methods.

Conclusion
Pyrazines are compounds naturally present in food that contribute to its odour; contrary to their biodegradation, pyrazine formation has been intensively studied.
The retention indices of 114 pyrazines (89 in the training set and 25 in the test set) eluted on two different columns, OV-101 and CRW-20M, were modeled with the best three-parameter models by two methods, Least Absolute Deviation and least squares.
The comparison of the two methods on columns OV-101 and CRW-20M is based on the following points.
Comparison of the hyperplane equations: the least squares equations are closer to the Least Absolute Deviation equations after elimination of the outliers. For column OV-101, β2 (Least Absolute Deviation) ≅ β2 (least squares) and the other coefficients remain of the same order; for column CRW-20M, β1 (Least Absolute Deviation) ≅ β1 (least squares) and the other coefficients remain of the same order. These results were obtained after the secondary treatments checking for the presence of outliers (training: 1, 2, 14, 68, 72, 82; test: 2, 4 on column OV-101 and training: 1, 7, 85, 86; test: 2, 3 on column CRW-20M), which made it possible to compare the two methods in the following stage.
Graphical comparison: the applicability domain is discussed using the Williams diagram. Lastly, it is noted that Least Absolute Deviation is a robust estimator that is not sensitive to the presence of outliers; we therefore conclude that the Least Absolute Deviation method is a stable and robust method.
Normality tests of the errors: graphical and statistical tests were used. Compatibility with the normal law was tested at the level α = 0.05. The agreement was also confirmed graphically by the probability plot of the errors. The Anderson-Darling test accepts the assumption of normality, and this is finally supported by the confidence interval, with a p-value greater than 0.1 on both columns. In general, this study shows that the two estimates give good results, both theoretically (equations) and graphically, as expressed by the models.