Factor Analysis, Target Factor Testing and Model Designing of Aromatic Solvent Effect of the Formyl Proton Nuclear Magnetic Resonance Chemical Shift in Para Substituted Benzaldehydes

Problem statement: The variations of formyl proton Chemical Shifts (CS) of p-substituted benzaldehydes in aromatic solvents were investigated. The validity of several physical and empirical solvent scales was examined. A further aim was to predict the dipolarity-polarizability (π*) solvent scale for some aromatic solvents. Model designing was also carried out to rationalize the aromatic solvent effect on the formyl proton CS. Approach: The previously recorded formyl proton CS for p-X-benzaldehydes, with X = NMe2, OMe, OC3H7, H, Br, CHO and NO2, in benzene, toluene, p-xylene, m-xylene and mesitylene were subjected to Factor Analysis (FA). Target Factor Testing (TFT) was performed for several solvent scales, namely Unity, the intrinsic aromatic solvent induced shift of TMS (IASISTMS), f(n), f(d), (n²-1)/(n²+2), (d-1)/(d+2), ET(30) and π*. Iterative TFT was applied to predict the unmeasured π* solvent scale for ethylbenzene, n-butylbenzene, sec-butylbenzene, tert-butylbenzene and isopropylbenzene. Results: Two factors were found to be responsible for the variation in the formyl proton CS. Unity, f(n), (n²-1)/(n²+2), IASISTMS, ET(30) and π* were real factors. Models of the formyl proton CS in benzene, toluene, p-xylene, m-xylene and mesitylene were designed. The models with the lowest root mean square error (RMSE) showed that Unity is a consistent term. The other term was either IASISTMS or π*. Iterative TFT predicted new π* values for ethylbenzene, n-butylbenzene, sec-butylbenzene, tert-butylbenzene and isopropylbenzene. Conclusion: FA has revealed that two real factors are responsible for the variation of formyl CS in benzene, toluene, p-xylene, m-xylene and mesitylene solvents. TFT has proved to be a powerful technique for predicting new values of the π* solvent scale. Model designing for the formyl proton CS has revealed that IASISTMS, π* and Unity are the best empirical solvent scales and are better than any physical solvent scale in reproducing the formyl CS. The IASISTMS reflects the dipolarity-polarizability of the aromatic solvent. The cofactor of the solvent scale was found to correlate with the σp substituent parameter.


INTRODUCTION
It is well known that a solvent exerts an effect on many solvent-dependent properties. These solvent-solute interactions can be related to physicochemical scales by constructing a solvent model (Koppel and Palm, 1972). Two types of solvent scales are generally used for modeling: (a) physical solvent scales, such as the dielectric function, the refractive index function or a modified function of both; and (b) empirical solvent scales, which are derived from a solvent-dependent process. There are many empirical solvent scales. A recent comprehensive review (Katritzky et al., 2004) lists 183 solvent polarity scales. We cite the most popular of them, i.e., Reichardt's (1979) solvent polarity parameter ET(30) and Taft's solvent dipolarity-polarizability scale π* (Kamlet et al., 1977).
The solvent effect is widely observed on the 1H (Buckingham et al., 1960), 13C (Engler and Laszlo, 1971; Eliasson et al., 1982) and 19F (Ager and Phillips, 1972; Dayal and Taft, 1973) NMR chemical shifts. Fowler et al. (1971) constructed solvent models for several solvent-dependent properties in non-aromatic solvents. The modeled NMR data were for 1,1-fluoroethylene and chloroform, mainly in non-aromatic solvents. The NMR solvent effects were better modeled by a single solvent scale; adding a further physical or empirical solvent scale did not improve model quality. A quantitative solvent property relationship treatment of 45 different solvent scales using the CODESSA programme for 350 solvents has enabled direct calculation of predicted values for any scale for any previously unmeasured solvent. Principal Component Analysis (PCA) of 40 solvent scales for 40 solvents has been carried out (Katritzky et al., 1992). The results allowed a comparison of solvent scales and a characterization of individual solvents. However, empirical scales for aromatic solvents have received very limited investigation and π* values are missing for many of these solvents. The original article on the π* scale (Kamlet et al., 1977) also covered few aromatic solvents. Bertrán and Rodríguez (1983a) measured the chemical shift of the formyl proton of several p-substituted benzaldehydes in different aromatic solvents. The aims of Bertrán and Rodríguez (1983b) were to derive a scale called the intrinsic aromatic solvent induced shift of the TMS (IASISTMS) proton and to gauge the effect of TMS on the linear correlations of the proton Aromatic Solvent Induced Shift (ASIS). They did not model the aromatic solvent effect in terms of empirical or physical solvent scales.
Factor Analysis (FA), also called PCA, and Target Factor Testing (TFT) have proven successful in tackling several chemical problems (Malinowski, 2002; Fadhil, 1992; Fadhil, 1993; Altun, 2005; Altun and Koseoglu, 2006). TFT allows the validity of each solvent scale to be tested individually before the model is constructed. TFT can also be used to predict unknown values of solvent scales from the experimental data under investigation.
The aim of this study is to target factor test several physical and empirical solvent scales, to model the aromatic solvent effect on the formyl proton in para-substituted benzaldehydes and to predict new π* scale values for unreported aromatic solvents.

MATERIALS AND METHODS
The formyl proton CS data (in ppm) for the p-substituted benzaldehydes were taken from Bertrán and Rodríguez (1983a). The formyl proton CS in a given solvent was referenced with respect to cyclohexane.
Factor analysis (Malinowski, 2002) can be used to analyze large data sets without relying upon a preconceived chemical model. The method is based on expressing a data matrix D as a product of two matrices R and C plus an error matrix E:

$$D = RC + E \tag{1}$$

In Eq. 1:
D = An r×c matrix
R = An r×n matrix
C = An n×c matrix
E = An r×c matrix composed of experimental errors

In other words, each element of the data matrix is assumed to have the form:

$$d_{ik} = \sum_{j=1}^{n} r_{ij} c_{jk} + e_{ik} = d_{ik}(n) + e_{ik} \tag{2}$$

where the sum is taken over n factors, $e_{ik}$ is the residual error unaccounted for by the factor model and $d_{ik}(n)$ is the reproduced data point based on n factors. The decomposition is readily accomplished by singular value decomposition, which yields:

$$D = U S V^{T} \tag{3}$$

Where:
U and V = Matrices whose columns contain unit-length eigenvectors associated, respectively, with the R and C matrices
S = A diagonal matrix containing the normalization constants for each pair of row-column eigenvectors

The elements of S are the square roots of the eigenvalues. An eigenvalue, $\lambda_j$, represents the variation in the data attributed to the associated eigenvectors. The largest set of eigenvalues ($\lambda_1$ to $\lambda_n$), also called primary eigenvalues, accounts for the principal components, whereas the smallest set ($\lambda_{n+1}$ to $\lambda_s$), also called secondary eigenvalues, accounts for experimental errors:

$$\underbrace{\lambda_1 > \lambda_2 > \dots > \lambda_n}_{\text{primary (components)}} > \underbrace{\lambda_{n+1} > \dots > \lambda_s}_{\text{secondary (errors)}} \tag{4}$$

In Eq. 4, s represents r or c, whichever is smaller. The sum of the smallest set of eigenvalues equals the sum of squares of the errors:

$$\sum_{j=n+1}^{s} \lambda_j = \sum_{i=1}^{r}\sum_{k=1}^{c} e_{ik}^{2} = \sum_{i=1}^{r}\sum_{k=1}^{c} \left(d_{ik} - d_{ik}(n)\right)^{2} \tag{5}$$

In Eq. 5, $d_{ik}(n)$ represents a data point reproduced using only n primary eigenvectors.
The Residual Standard Deviation (RSD) is defined by the left side equality in Eq. 6:

$$RSD = \left(\frac{\sum_{i=1}^{r}\sum_{k=1}^{c} e_{ik}^{2}}{(r-n)(c-n)}\right)^{1/2} = \left(\frac{\sum_{j=n+1}^{s} \lambda_j}{(r-n)(c-n)}\right)^{1/2} \tag{6}$$

The right side is the result of applying Eq. 5. The denominator in Eq. 6 represents the degrees of freedom. Expressing RSD in terms of eigenvalues affords a computationally efficient way to evaluate this important quantity.
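The decomposition and the eigenvalue form of the RSD in Eq. 6 can be illustrated with a short sketch. This is not the FACTANAL programme used in this study; it is a minimal NumPy illustration, and the function names `eigenvalues_and_rsd` and `reproduce` are hypothetical. It assumes the eigenvalues are taken as the squared singular values of the raw data matrix.

```python
import numpy as np


def eigenvalues_and_rsd(D):
    """Return the eigenvalues of D and the RSD for n = 1 .. s-1 factors."""
    D = np.asarray(D, dtype=float)
    r, c = D.shape
    s = min(r, c)
    # Squared singular values are the eigenvalues, sorted largest first.
    eigenvalues = np.linalg.svd(D, compute_uv=False) ** 2
    rsd = []
    for n in range(1, s):
        secondary_sum = eigenvalues[n:].sum()  # eigenvalues n+1 .. s
        rsd.append(np.sqrt(secondary_sum / ((r - n) * (c - n))))
    return eigenvalues, np.array(rsd)


def reproduce(D, n):
    """Rebuild D from its n largest factors only."""
    U, S, Vt = np.linalg.svd(np.asarray(D, dtype=float), full_matrices=False)
    return (U[:, :n] * S[:n]) @ Vt[:n, :]
```

For any data matrix, the sum of squared residuals of `D - reproduce(D, n)` equals the sum of the discarded eigenvalues, which is the identity that lets the RSD be computed from eigenvalues alone.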
If a reasonably accurate estimate of the standard deviation is known prior to the factor analysis, the number of primary factors can be determined by direct comparison with the value obtained from Eq. 6. When such information is not available, the problem becomes more difficult.
In factor analysis, Malinowski and McCue (1981) defined two functions to detect the significant number of factors in the data matrix: the indicator function IND (Malinowski, 2002) and the reduced eigenvalue function REV (McCue and Malinowski, 1981; Malinowski, 1987), defined by Eq. 7 and 8 respectively:

$$IND = \frac{RE}{(c-n)^{2}} \tag{7}$$

$$REV_j = \frac{\lambda_j}{(r-j+1)(c-j+1)} \tag{8}$$

Where:
RE = The real error
$\lambda_j$ = The jth eigenvalue (the smallest eigenvalues being due to error)
r, c and n = The number of rows, columns (such that r>c) and primary factors in the data matrix, respectively

The IND function depends on the secondary eigenvalues, the number of rows and columns in the data matrix and the number of factors. Hence, the behavior of the IND function varies with the number of factors. The number of factors is gradually increased and the corresponding IND value is observed. As the number of factors increases, the IND function decreases and reaches a minimum when the significant number of factors is reached. REV remains fairly constant for the error eigenvalues.
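The behavior of the two functions can be sketched as follows. The sketch assumes the commonly quoted forms IND = RE/(c-n)² with RE = [Σλj/(r(c-n))]^(1/2) summed over the secondary eigenvalues, and REVj = λj/((r-j+1)(c-j+1)); the function names are hypothetical and the sketch expects eigenvalues sorted from largest to smallest with r ≥ c.

```python
import math


def real_error(eigenvalues, r, c, n):
    """Real error RE for n primary factors (assumes r >= c)."""
    return math.sqrt(sum(eigenvalues[n:]) / (r * (c - n)))


def indicator_function(eigenvalues, r, c):
    """IND for n = 1 .. c-1; its minimum suggests the number of factors."""
    return [real_error(eigenvalues, r, c, n) / (c - n) ** 2
            for n in range(1, c)]


def reduced_eigenvalues(eigenvalues, r, c):
    """REV_j for j = 1 .. c; error eigenvalues give roughly equal REV."""
    return [lam / ((r - j + 1) * (c - j + 1))
            for j, lam in enumerate(eigenvalues, start=1)]
```

With a list of eigenvalues in which the first two dominate, `indicator_function` typically reaches its minimum at n = 2, while the `reduced_eigenvalues` entries level off from j = 3 onward.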
Recently, Malinowski (2009) developed a successful method called Determination of Rank by Median Absolute Deviation (DRMAD) to determine the number of principal factors responsible for a data matrix by direct application to the RSD obtained from principal component analysis. The MAD is defined as:

$$MAD = \operatorname{median}\left(\left|x_i - \operatorname{median}(x)\right|\right)$$

Where:
x = The RSD values

MAD is ideally suited to determine the set of error eigenvalues. If a primary eigenvalue is added to the set of error eigenvalues, the RSD will become much larger than the true RSD. The resulting RSD will be an outlier that can often be identified by MAD analysis. The factor level (n) represents the dividing line between the primary and secondary sets of eigenvalues. A zero in the test column of DRMAD indicates that the RSD based on n factors is an outlier. Unity indicates that the associated RSD is not an outlier of the secondary set.

TFT (Malinowski and Howery, 1980) involves the following matrix transformation:

$$D = \bar{R}\,\bar{C} = \left(\bar{R}\,T\right)\left(T^{-1}\,\bar{C}\right)$$

where $\bar{R}$ and $\bar{C}$ are the row and column matrices obtained from the factor analysis. They are called abstract matrices, because they represent a purely mathematical solution of the problem. The target testing is described as follows:

$$R_{predicted} = \bar{R}\,T$$

T is the target transformation vector generated from a least-squares operation involving the principal factor analysis solution and the individual target being tested as a vector $R_{test}$. If the test vector $R_{test}$ is a real factor, the predicted vector $R_{predicted}$ obtained from the last equation will be reasonably similar to the test vector, i.e., it will lie within the experimental error. Otherwise the tested vector will be rejected. The criterion upon which a tested vector is accepted or rejected was also developed by Malinowski and Howery (1980). TFT was carried out by monitoring the SPOIL function, defined in Eq. 11:

$$SPOIL = \frac{RET}{REP} \approx \frac{RET}{EDM} \tag{11}$$

Where:
RET = The real error in the target factor
REP = The real error in the predicted target factor
EDM = The real error from the data matrix

According to Malinowski and Howery, a SPOIL value between 0 and 3 indicates an acceptable factor and a SPOIL value greater than 3 is not acceptable.
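The least-squares target test described above can be sketched in a few lines. This is an illustration only, not the FACTANAL implementation: it assumes the data matrix is arranged with one row per solvent so that a solvent scale can be tested against the abstract row matrix, it takes that matrix as U·S from the singular value decomposition, and it reports only the root mean square difference between the test and predicted vectors rather than the SPOIL function, which needs error estimates not reproduced here. The name `target_test` is hypothetical.

```python
import numpy as np


def target_test(D, test_vector, n):
    """Project a test vector onto the n-factor space of D by least squares."""
    D = np.asarray(D, dtype=float)
    x_test = np.asarray(test_vector, dtype=float)
    U, S, _ = np.linalg.svd(D, full_matrices=False)
    R_abstract = U[:, :n] * S[:n]          # abstract row matrix, n factors
    # Transformation vector T minimizing |R_abstract @ T - x_test|.
    T, *_ = np.linalg.lstsq(R_abstract, x_test, rcond=None)
    x_predicted = R_abstract @ T
    rms_difference = float(np.sqrt(np.mean((x_predicted - x_test) ** 2)))
    return x_predicted, T, rms_difference
```

A test vector that is a real factor is reproduced almost exactly, whereas an unrelated vector leaves a large difference between its test and predicted values.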
FA was performed on the covariance matrix of the formyl chemical shifts. Standardization was not applied. FA and TFT calculations were performed using the FACTANAL computer programme (Malinowski). The DRMAD test was performed with a MATLAB programme (Malinowski, 2009).

Number of factors:
The formyl proton Chemical Shifts (CS) of seven p-X-benzaldehydes in 10 aromatic solvents are shown in Table 1.
Bertrán and Rodríguez (1983a) noticed that plotting the formyl CS of a given p-X-benzaldehyde, δH(X), against that of the unsubstituted benzaldehyde, δH(H), in 12 aromatic solvents gave a bilinear plot, such that the monoalkyl benzenes form a separate linear plot (for instance, the correlation coefficient for δH(NO2) versus δH(H) was 0.987). The remaining solvents form a separate straight line. For that reason, we performed FA on a subgroup of solvents, namely benzene, toluene, p-xylene, m-xylene and mesitylene. Our data matrix is composed of the formyl CS in the above solvents for the p-X-benzaldehydes with X = NMe2, OCH3, OC3H7, H, Br, CHO and NO2. Results of the FA for this matrix are presented in Table 2.
The IND function initially decreased as the number of factors increased, but started to increase as the number of factors exceeded two. The REV function decreased sharply as the number of factors increased from one to two, then stabilized as the number of factors exceeded two. The DRMAD test gives a value of unity at two factors. The three methods agree that two primary eigenvalues are necessary to account for the factor space of the formyl Aromatic Solvent Induced Shift (ASIS). This conclusion is further confirmed by the values of the Real Error (RE) and RSD functions at two factors, which are close to the experimental error of 0.005 (Bertrán and Rodríguez, 1983a).
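The outlier logic behind a MAD-based rank test can be sketched as below. This is a simplified illustration and not the DRMAD algorithm or MATLAB code of Malinowski (2009): the threshold of 3, the seed size of 2, the 1.4826 scaling constant and the function names are all assumptions made for the example. The list `rsd` is expected to hold the RSD for n = 1, 2, ... factors in order.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a sequence of numbers."""
    values = list(values)
    centre = median(values)
    return median(abs(v - centre) for v in values)


def is_outlier(x, reference, threshold=3.0):
    """Flag x if it lies far from the reference set in MAD units."""
    centre = median(reference)
    spread = 1.4826 * mad(reference)   # assumed normal-consistency scaling
    if spread == 0:
        return x != centre
    return abs(x - centre) / spread > threshold


def estimate_rank(rsd, threshold=3.0, seed_size=2):
    """Walk from the highest factor level down until an RSD is an outlier."""
    reference = list(rsd[-seed_size:])       # RSDs assumed to be pure error
    rank = len(rsd) - seed_size + 1          # lowest n already in the set
    for k in range(len(rsd) - seed_size - 1, -1, -1):
        if is_outlier(rsd[k], reference, threshold):
            break                            # k+1 factors are too few
        reference.append(rsd[k])
        rank = k + 1
    return rank
```

The returned value is the smallest number of factors whose RSD still belongs to the error set, which corresponds to the factor level at which the test column changes from zero to unity.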
Target factor testing: TFT was performed for several solvent scales at two factors for the same data matrix as in the FA described above. Table 3 lists these solvent scales together with their values. Results of the testing are presented in Table 4. Unity (U), which is equal to one for each solvent, the refractive index function f(n), the (n²-1)/(n²+2) function and the IASISTMS gave the lowest SPOIL values. This indicates that these solvent scales can be classified as primary factors. The IASISTMS was derived empirically from linear correlations of the ASIS of a group of sensor protons in two fixed aromatic solvents. Several solute systems were used, namely p-X-benzaldehydes, camphor, α-Br-camphor, 5-X-furfurals, p-X-acetophenones and methyl ketones (Bertrán and Rodríguez, 1983b).

Model designing of formyl chemical shift in benzene, toluene, p-xylene, m-xylene and mesitylene:
In order to construct an empirical model to elucidate the aromatic solvent induced shift, we must choose solvent scales with acceptable SPOIL values. Where a solvent scale was combined with another solvent scale (other than Unity), orthogonality of the combined solvent parameters was observed. The combined solvent scales in the designed models and their root mean square errors (RMSE) are presented in Table 4. Taking into consideration that the experimental error is 0.005, only three models gave a RMSE close to the experimental error. One of these models involved Unity and the IASISTMS scale. The second model involved Unity and the dipolarity-polarizability solvent scale π*. The third model was constructed by using the dipolarity-polarizability scale π* and the f(n) solvent scale. The fourth model involved the π* and (n²-1)/(n²+2) solvent scales. Models with a RMSE higher than twice the experimental error were not considered successful in reproducing the formyl CS. It is not surprising that model (1) has the lowest RMSE since, as noted earlier, the IASISTMS was itself derived from ASIS data that included p-X-benzaldehydes. In order to investigate the nature of the IASISTMS scale, we correlated it with π* for the same set of solvents used in this study. The correlation coefficient was -0.988, indicating that the IASISTMS solvent scale has a dipolarity-polarizability character. Table 5 presents details of the models with a RMSE less than twice the experimental error.

π* prediction for monoalkyl benzene solvents:
Kamlet et al. (1977) derived π* for a large number of solvents. However, π* is available for only a limited number of monoalkylated benzenes. Four of the monoalkylated benzenes in the study of Bertrán and Rodríguez (1983b) have no measured π*, namely ethylbenzene, n-butylbenzene, sec-butylbenzene and tert-butylbenzene. Results from Table 4 have shown that π* is an acceptable target in modeling the 1H substituent chemical shifts.
Therefore, we could use these 1H substituent chemical shifts in the above solvents to predict π* for ethylbenzene, n-butylbenzene, sec-butylbenzene and tert-butylbenzene by using the TFT technique. Thus, we constructed a data matrix of the 1H substituent chemical shifts for the previously investigated solvents, i.e., benzene, toluene, p-xylene, m-xylene, isopropylbenzene and mesitylene, and a monoalkylbenzene solvent with unknown π*. Target factor testing not only tests the validity of a solvent scale, it also predicts unmeasured values of π* for a given solvent in the tested factor array. The tested array for π* involved the known values for benzene, toluene, p-xylene, m-xylene, mesitylene and isopropylbenzene and a free-floated value of π*, initially set to zero, for the monoalkylbenzene solvent whose π* is unknown. The technique then predicts π* at the correct number of factors, in our case two. The predicted value of π* for the monoalkylbenzene solvent is then fed back into the tested factor array and another iteration is conducted until a self-consistent value of π* is reached. This usually takes four cycles of iteration. The SPOIL value for the tested π* array is lowest as the π* of the monoalkylbenzene solvent approaches the self-consistent value. The monoalkylbenzene solvent with the predicted π* is then replaced by another one and the free-floating iterations are repeated. Table 6 shows the results of the above iterations for the monoalkylbenzene solvents with unknown π*. The last predicted value is taken as the π*TFT of the given monoalkylbenzene solvent. To verify these predicted π*TFT values, they were correlated with the IASISTMS solvent scale, the R²% and Fisher F values being 91.8 and 33.63 respectively for the monoalkylated benzene solvents ethylbenzene, n-butylbenzene, sec-butylbenzene, tert-butylbenzene and toluene. However, when isopropylbenzene was included, the R²% and Fisher F values deteriorated to 86.5 and 25.6 respectively.
Thus, we tried to derive a new value of π* for isopropylbenzene by performing TFT for the following set of solvents: benzene, toluene, p-xylene, m-xylene, isopropylbenzene and mesitylene. The predicted π*TFT was 0.45 for isopropylbenzene. When this value was used instead of the π* derived by Kamlet et al. (1977) (0.41), the R²% and Fisher F values for the correlation between π*TFT and the IASISTMS became 91.6 and 43.9 respectively for the monoalkylbenzene solvents including isopropylbenzene. This R²% is similar to that obtained when isopropylbenzene was excluded from the set of monoalkylbenzenes.
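The free-floating iteration described above can be sketched as follows. This is a hedged illustration, not the FACTANAL procedure: it assumes a data matrix with one row per solvent, uses U·S from the singular value decomposition as the abstract matrix, and stops when successive predictions differ by less than a tolerance. The function name, the tolerance and the cycle limit are hypothetical choices.

```python
import numpy as np


def predict_missing_by_tft(D, scale, missing_index, n,
                           tol=1e-6, max_cycles=200):
    """Iteratively target-test a scale whose one entry is free-floated."""
    D = np.asarray(D, dtype=float)
    x = np.array(scale, dtype=float)
    x[missing_index] = 0.0                  # free-floated start value
    U, S, _ = np.linalg.svd(D, full_matrices=False)
    R_abstract = U[:, :n] * S[:n]
    for cycle in range(1, max_cycles + 1):
        T, *_ = np.linalg.lstsq(R_abstract, x, rcond=None)
        predicted = float((R_abstract @ T)[missing_index])
        if abs(predicted - x[missing_index]) < tol:
            return predicted, cycle         # self-consistent value reached
        x[missing_index] = predicted        # feed the prediction back
    return float(x[missing_index]), max_cycles
```

The number of cycles needed depends on the tolerance and on how strongly the free-floated solvent influences the fit, so it will not in general match the four cycles reported for the present data.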

Model designing of formyl proton chemical shift in monoalkylated benzenes solvents:
In order to investigate the efficiency of the π*TFT, IASISTMS and π*Regression solvent scales for monoalkylated benzenes in modeling the formyl proton CS, we constructed three models, each representing the regression of the formyl proton CS on one of the above solvent scales. Statistical results are presented in Table 9. The R²% and Fisher F results indicate that the models which use π* predicted by TFT (π*TFT) are superior to the other models. Thus, we have given the slope and intercept for that model only. The sensitivity to the π*TFT scale, as given by the (a) parameter in Tables 5 and 9, was correlated with Hammett's σp+ substituent constant. Equation 15 gives the statistical results for the correlation of σp+ with the (a) parameter derived from the benzene, toluene, p-xylene, m-xylene and mesitylene set of solvents, the original set used to derive π* by TFT for the monoalkylated benzenes. Equation 16 gives the statistical results for the correlation of σp+ with the (a) parameter derived from toluene, n-butylbenzene, sec-butylbenzene, isopropylbenzene and tert-butylbenzene. The term which includes the (a) parameter of Tables 5 and 9 represents the ASIS arising from the electric field due to the interaction between the solute and the solvent molecules. The correlation (vide supra) indicates that the intensity of the electric field is dictated not only by the solvent polarity but also by the polarity of the substituent, i.e., its σp+ constant. This may be attributed to the polarizing power of the substituent, which may influence the overall polarity of the solute molecule.
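The statistics used to compare such single-scale models can be computed as in the sketch below. It fits CS = a·(solvent scale) + b by ordinary least squares, where the intercept b plays the role of the Unity term, and returns R², the Fisher F value for one regressor and the RMSE. The function name and the returned keys are hypothetical, and the RMSE is taken here as the square root of the mean squared residual, which may differ from the convention used in the tables.

```python
import math


def linear_fit(x, y):
    """Least-squares fit y = a*x + b with R^2, Fisher F and RMSE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx                      # sensitivity to the solvent scale
    b = mean_y - a * mean_x            # coefficient of the Unity term
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_resid / ss_total
    # One regressor: F = explained variance / residual variance (n - 2 dof).
    # A perfect fit (ss_resid == 0) would make F undefined.
    f_value = (ss_total - ss_resid) / (ss_resid / (n - 2))
    rmse = math.sqrt(ss_resid / n)
    return {"a": a, "b": b, "r_squared": r_squared,
            "f_value": f_value, "rmse": rmse}
```

The same function can be applied to the correlation of the (a) parameters with a substituent constant, since that is also a one-variable regression.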

CONCLUSION
FA has revealed that two real factors are responsible for the variation of the formyl CS of p-substituted benzaldehydes in benzene, toluene, p-xylene, m-xylene and mesitylene solvents. TFT has proved to be a powerful technique for predicting new values of the π* solvent scale for ethylbenzene, isopropylbenzene, n-butylbenzene, sec-butylbenzene and tert-butylbenzene. Model designing for the formyl proton chemical shifts in benzene, toluene, p-xylene, m-xylene and mesitylene solvents has revealed that IASISTMS, π* and Unity are the best empirical solvent scales and are better than any physical solvent scale in reproducing the formyl CS. The highly significant correlation between IASISTMS and π* implies that the IASISTMS reflects the dipolarity-polarizability character of the aromatic solvent. The polarizing power of the substituent and the solvent polarity both play a significant role in the intensity of the electric field interaction between the solvent and the p-substituted benzaldehyde.

ACKNOWLEDGMENT
We are grateful to Professor E.R. Malinowski for providing us with the FACTANAL and DRMAD programmes.