Determination of Optimum Values of Descriptors to Set Filters for Synthetic Tri-Pyrrole Derivatives (Prodiginines) Against Multi Drug Resistant Strain of Plasmodium Falciparum

Department of Chemistry, Vidya Bharati College, Camp, Amravati, Maharashtra, India; Department of Chemistry, Faculty of Science, University of Qom, Qom, Iran; Laboratoire Chimie des Matériaux, Université Mohammed Premier, Oujda-60000, Morocco; Creighton University, Omaha, NE, USA; Division of Organic Chemistry, CSIR-National Chemical Laboratory, Pune 411008, India; Department of Pharmaceutical Chemistry, King Saud University, Alkharj, Saudi Arabia

Prodiginines (Mahajan et al., 2012; 2013; Masand et al., 2013b; Papireddy et al., 2011), oligopyrrole derivatives with a characteristic conjugated system, are promising antimalarial agents (Fig. 1). These compounds inhibit Plasmodium falciparum (P. falciparum) at very low concentrations, show marked clearance of the protozoan parasite and can be administered orally. Despite these advantages, the search for potent prodiginines with a good Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) profile and improved ease of synthesis has met with limited success (Mahajan et al., 2012; 2013; Masand et al., 2013b; Papireddy et al., 2011). Progress can be expedited using contemporary drug design methods such as QSAR, molecular docking and pharmacophore modelling. Of these methods, QSAR is an established technique that has had good success over the last few decades and is used in the present work (Huang and Fan, 2011; Myint and Xie, 2010; Scior et al., 2009; Tropsha, 2010).

Fig. 1. Synthetic prodiginines used in the present study
A typical QSAR study establishes a correlation between structure and activity (Mahajan et al., 2012; 2013; Masand et al., 2012a; 2012b; 2013a; 2013b; Rastija et al., 2013). Different characteristics or attributes of a chemical structure are expressed as numerical entities termed molecular descriptors (also known as parameters or features). One or more molecular descriptors are used to build a statistically robust linear regression equation. A properly validated QSAR equation is considered more useful if it is derived from descriptors that carry maximum useful information with minimum overlap and are interpretable in terms of structural features (Chirico and Gramatica, 2011; Gramatica, 2013; Gramatica et al., 2012; 2013; Martin et al., 2012; Mitra et al., 2010; Roy and Mitra, 2012; Saha and Roy, 2012; Tropsha, 2010). Unfortunately, only a limited number of validated QSAR equations are actually used to guide the development of new drugs or the modification of existing ones, for the following reasons (Chirico and Gramatica, 2011; Doweyko, 2008; Gramatica, 2013; Gramatica et al., 2012; 2013; Martin et al., 2012; Mitra et al., 2010; Roy and Mitra, 2012; Saha and Roy, 2012; Tropsha, 2010): (i) the QSAR equation is difficult to interpret in terms of structural features; (ii) the calculation or estimation of the descriptors is very complex or resource-intensive; (iii) computational resources, such as advanced and specialised software, may not be available to the organic chemist for calculating the descriptors that appear in the QSAR equation; (iv) the organic chemist may not be skilled or trained in QSAR; and (v) important descriptors that correlate well with activity may be left out of the QSAR equation.
To overcome these difficulties, many researchers use inverse-QSAR (i-QSAR). In i-QSAR, molecules are optimised using a set of physico-chemical properties or theoretical descriptors derived from a well-known marketed drug used as a 'reference' (Brown et al., 2006; Faulon et al., 2005). This approach has certain limitations: (i) selecting a suitable reference drug is a demanding and tricky process; (ii) the drug should be similar in structure or shape to the molecules of the data set in hand; (iii) for some diseases no marketed drug is available, whereas for others many are; and (iv) the physico-chemical properties or theoretical descriptors associated with one chemotype of drug may not be calculable or estimable for molecules of another chemotype. After the values of the different descriptors have been determined, the problem then lies in constructing a viable molecule from them. This is the real limiting factor of most inverse-QSAR methods, because most descriptors are not reversible.
A good solution is to determine the optimum values of useful, information-rich descriptors during QSAR equation development. The most striking feature of this approach is that the 'most active' compound in the given data set may or may not fit the optimum values of all the descriptors. The optimization is not based on a single 'reference' drug, as in i-QSAR; instead, the whole data set is used to derive a set of physico-chemical properties or theoretical descriptors with which to optimize the molecules. This increases the chances of finding better alternatives to the apparent 'most active' compound, as well as potential compounds outside the present data set. This approach can be viewed as 'hybrid-inverse QSAR'. It could significantly accelerate the discovery of novel small molecules with specified chemical properties.
A literature survey (Buchwald and Yamashita, 2014; Gidskehaug et al., 2008; Hansch et al., 2004; Jager and Kooijman, 2009; Kubinyi, 2002) reveals that a well-established method to determine the optimum value of a descriptor is to derive a non-linear equation, especially a bilinear, biexponential or parabolic equation. These functions assume that the relationship between the descriptor and the activity is non-linear, with the vertex of the curve representing the optimum value.
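For orientation, two of these functional forms and their optima can be written out as below. This is a sketch in assumed notation: x is a generic descriptor, and a, b, c and β are fitted coefficients introduced here for illustration. The bilinear form is the commonly quoted Kubinyi form with x on a logarithmic scale; the exact parameterisation of the equations fitted in this work may differ. In both cases the optimum follows from setting the first derivative with respect to x to zero.

```latex
\begin{align*}
% Parabolic model (a < 0 gives a maximum at the vertex)
\mathrm{pIC}_{50} &= a x^{2} + b x + c,
  & x_{\mathrm{opt}} &= -\frac{b}{2a} \\[4pt]
% Kubinyi-type bilinear model, x on a log10 scale (e.g. x = \log P)
\mathrm{pIC}_{50} &= a x - b \log_{10}\!\left(\beta \cdot 10^{x} + 1\right) + c,
  & x_{\mathrm{opt}} &= \log_{10}\frac{a}{\beta\,(b - a)} \quad (b > a > 0)
\end{align*}
```

In the bilinear form the ascending limb has slope a and the descending limb has slope a - b, so the two sides of the optimum need not be symmetric.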
In our previous work, we successfully performed CoMSIA, GUSAR and QSAR analyses of the antimalarial activity of synthetic prodiginines. The objectives of the present study are (i) to determine the optimum values of the easily interpretable descriptors used in the QSAR equation; (ii) to determine the optimum values of some other useful descriptors that correlate well with activity but are not included in the reported QSAR equations; and (iii) to compare the performance of linear, parabolic, bilinear and biexponential QSAR models in determining the optimum values of descriptors.

Data Set
The experimental in vitro inhibitory concentrations (IC50), expressed in nanomolar units, of forty-three synthetic prodiginines against the chloroquine (CQ)-resistant strain Dd2 were taken from a recent publication (Papireddy et al., 2011). The data set includes prodiginines with different substituents, such as alkyl chains of varying length and substituents at different positions of the benzene ring. Table 1 provides the experimental data. For molecular modelling, the values were converted into logarithmic units (pIC50 = -log10 IC50).
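The conversion to logarithmic units can be sketched as follows. The text gives the IC50 values in nanomolar units and defines pIC50 = -log10 IC50, but it does not state whether the values were first converted to mol/L. The sketch below assumes the common molar convention, and the function name and example values are hypothetical (they are not taken from Table 1).

```python
import math

def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50.

    Assumption: pIC50 = -log10(IC50 in mol/L). Since 1 nM = 1e-9 mol/L,
    this equals 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)

# Hypothetical IC50 values in nM, for illustration only.
example_ic50_nm = [1.0, 10.0, 250.0]
example_pic50 = [pic50_from_nanomolar(v) for v in example_ic50_nm]
```

Under this convention a lower IC50 (a more potent compound) gives a higher pIC50. If the logarithm were instead taken directly on the nanomolar values, every pIC50 would be shifted by a constant of 9, which changes the intercept of a regression model but not its slope or fit statistics.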
The structures were drawn using ACD ChemSketch 12 freeware and converted into 3D structures. This was followed by geometry optimization using a molecular mechanics method implemented in the program VegaZZ, with Gasteiger partial charges and the Tripos force field (Mahajan et al., 2012; 2013; Masand et al., 2013b). The optimized structures (β-isomer) were uploaded to the e-DRAGON server to calculate a large number of 1D-, 2D- and 3D-molecular descriptors (Fig. 2). Before QSAR model development, descriptors with constant or nearly constant values (for 80% of the molecules) were discarded. The Genetic Algorithm (GA) available in QSARINS (Chirico and Gramatica, 2011; 2013) was used to select the optimum number and set of descriptors for building a statistically sound multiple linear regression equation. Matlab and BuildQSAR were used to build the bilinear, biexponential and parabolic equations. In addition, Microsoft Excel was used for various statistical functions.
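The pre-filtering of constant and nearly constant descriptors can be illustrated with a short sketch. It assumes that "nearly constant" means that a single value is shared by at least 80% of the molecules; the function name, the toy matrix and the descriptor labels A to D are hypothetical and only stand in for the e-DRAGON output.

```python
import numpy as np

def drop_near_constant(X, names, max_fraction=0.80):
    """Keep only descriptor columns whose most frequent value is shared by
    fewer than `max_fraction` of the molecules (assumed criterion)."""
    X = np.asarray(X, dtype=float)
    n_molecules = X.shape[0]
    keep = []
    for j in range(X.shape[1]):
        # Count how often each distinct value occurs in column j.
        _, counts = np.unique(X[:, j], return_counts=True)
        if counts.max() / n_molecules < max_fraction:
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Toy matrix: 5 molecules x 4 hypothetical descriptors.
toy_X = [
    [1.0, 0.0, 1.0, 1.0],
    [2.0, 0.0, 1.0, 1.0],
    [3.0, 0.0, 1.0, 1.0],
    [4.0, 0.0, 1.0, 2.0],
    [5.0, 0.0, 2.0, 2.0],
]
toy_names = ["A", "B", "C", "D"]
filtered_X, kept_names = drop_near_constant(toy_X, toy_names)
```

Here B is constant and C has one value for 4 of 5 molecules (80%), so both are removed, while A and D are retained for descriptor selection.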
Several statistical parameters, namely R, R², R²adj, S and F, were calculated, along with R²cv (R²LOO) for internal validation and to check the robustness of the model.
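As a minimal sketch of how these parameters are conventionally defined for an ordinary least-squares model with an intercept, the following NumPy code computes them directly. It is not the QSARINS or BuildQSAR implementation; the function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def mlr_statistics(X, y):
    """Fit y = b0 + X.b by least squares and return common fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, p = X.shape                      # n compounds, p descriptors
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))     # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))     # total sum of squares
    dof = n - p - 1                              # residual degrees of freedom
    r2 = 1.0 - rss / tss
    # Leave-one-out: refit without compound i, then predict compound i.
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        c, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += float((y[i] - A[i] @ c) ** 2)
    return {
        "R": float(np.sqrt(max(r2, 0.0))),
        "R2": r2,
        "R2_adj": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "S": float(np.sqrt(rss / dof)),          # standard error of estimate
        "F": (r2 / p) / ((1.0 - r2) / dof),
        "Q2_LOO": 1.0 - press / tss,
    }

# Synthetic example: one descriptor, 20 compounds, noisy linear response.
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 20)
y_demo = 2.0 + 0.5 * x_demo + rng.normal(0.0, 0.3, size=20)
stats = mlr_statistics(x_demo, y_demo)
```

For a least-squares fit the leave-one-out value can never exceed R², so a large gap between the two is a warning sign of an over-fitted or unstable model.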

Results and Discussion
In the present study, we derived and compared linear, biexponential, bilinear (two equations) and parabolic equations. These equations are listed in Tables 2 and 3. They provide a useful correlation between activity and several easily interpretable descriptors such as Sv (sum of atomic van der Waals volumes), Sp (sum of atomic polarizabilities), X1v (first-order valence connectivity index, representing the steric factor), ALOGP (Ghose-Crippen octanol-water partition coefficient) and nAT (number of atoms).

Comparison of Different Models
For some descriptors, viz. nAT, Sv, Sp and ALOGP, the non-linear models are either superior or equivalent to the linear model, whereas for the rest of the descriptors the non-linear models fit better than the linear model. This indicates that the relationship between the activity and the selected descriptors is non-linear; in other words, a non-linear model better explains the variation in activity. Among the non-linear models, bilinear Equation 1 (based on the Kubinyi formula) fits better than the rest, while the biexponential models give the poorest fit for many descriptors. None of the models reaches the recommended threshold (>0.85) for the concordance correlation coefficient (CCC), although some come close.
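The CCC referred to here is taken to be Lin's concordance correlation coefficient, which penalises both scatter and systematic offset between observed and calculated values. The sketch below is a minimal NumPy version under that assumption; the function name and the toy values are hypothetical.

```python
import numpy as np

def concordance_correlation(y_obs, y_calc):
    """Lin's concordance correlation coefficient between two equal-length vectors."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    n = y_obs.size
    mean_obs, mean_calc = y_obs.mean(), y_calc.mean()
    # Numerator: twice the summed cross-deviations (covariance term).
    numerator = 2.0 * np.sum((y_obs - mean_obs) * (y_calc - mean_calc))
    # Denominator: both variance terms plus a penalty for differing means.
    denominator = (
        np.sum((y_obs - mean_obs) ** 2)
        + np.sum((y_calc - mean_calc) ** 2)
        + n * (mean_obs - mean_calc) ** 2
    )
    return float(numerator / denominator)

# Toy check: a constant offset of one unit lowers the CCC although the
# Pearson correlation between the two vectors is still exactly 1.
toy_obs = np.array([1.0, 2.0, 3.0, 4.0])
ccc_identical = concordance_correlation(toy_obs, toy_obs)
ccc_shifted = concordance_correlation(toy_obs, toy_obs + 1.0)
```

In this toy case the shifted predictions give a CCC of 10/14 (about 0.71), below the 0.85 threshold, which shows why the CCC is a stricter criterion than R alone.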
In the above models, the symbols have their usual meanings. Increasing the number of congeneric compounds in the data set, as well as the range of the biological data, might result in a better statistical fit. In many cases the fit of the equation (R² > 0.60), though not outstanding, is satisfactory. This indicates that optimum values exist for lipophilicity, number of atoms, number of bonds and X1v (representing the steric factor). Thus, the selected descriptors, for which the optimum values are determined, represent the overall descriptor space.
A comparison of the descriptor values for the four most active and four least active compounds (highlighted in bold and italic) justifies the importance of the optimum values of the descriptors (Table 3). The values of the selected descriptors for the four most active molecules are close to the optimum values, whereas the reverse is true for the four least active molecules. Thus, the optimum values of these descriptors could be helpful in finding a good "lead prodiginine" for anti-malarial activity.
Interestingly, the descriptor values for the 'most active' compound 30 in the present data set are close to the optimum values of many descriptors, but they do not match the optimum values of all of them. This confirms that lead/drug optimization using only the most active compound or a single drug as the 'reference' is not an ideal approach. In the present case, a plausible reason is the ability of the molecules (here, prodiginines) to adopt different conformations and tautomeric forms. Prodiginines exhibit azafulvene-pyrrole tautomerism because the three pyrrole rings are joined by a -CH= link. As prodiginines can exist in four different tautomeric forms, the tautomer that is energetically favoured in solution may not be the 'bioactive tautomeric form' that interacts with the specific receptor and is responsible for the pharmacological activity of this group.
Prodiginines may interact with different receptors in different tautomeric forms. In addition, prodiginines can exist in two conformations, viz. the α and β isomers, which have been discussed in our previous work (Reference). Another possible reason is the merely satisfactory fit (R² ≈ 0.60) of most of the developed models.

Variation of Activity with Various Parameters
Herein, the activities of some of the more active and less active molecules in the data set are discussed in terms of descriptors such as lipophilicity/hydrophobicity, number of rotatable bonds and steric factor, for which the optimum values were determined using bilinear Equation 2. The parabolic equation and bilinear Equation 1 can also be used to determine optimum values, but they have some serious drawbacks: (1) the parabolic approach forces the data into a symmetrical parabola, resulting in deviations between the experimental and parabola-calculated data; (2) the ascending slope of the parabola is curved and conflicts with the observed linear data; and (3) bilinear Equation 1 provides a better optimum value only if the data set is large and covers a wide range of activity values.
Bilinear Equation 2 is not subject to these limitations. Therefore, it has been used for optimum value determination in the present work.
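The first two drawbacks can be illustrated numerically with a small sketch. It generates noise-free synthetic data from a Kubinyi-type bilinear curve (the commonly quoted form, with hypothetical coefficients a, b, beta and c chosen only for illustration), fits a parabola with NumPy, and compares the vertex of the parabola with the true optimum. None of the numbers relate to the prodiginine data set.

```python
import numpy as np

def bilinear(x, a, b, beta, c):
    """Kubinyi-type bilinear response; x is on a log10 scale (e.g. log P)."""
    return a * x - b * np.log10(beta * 10.0 ** x + 1.0) + c

# Hypothetical coefficients: ascending slope a = 1.0, descending slope a - b = -0.5.
a, b, beta, c = 1.0, 1.5, 1e-3, 5.0
x = np.linspace(0.0, 6.0, 25)
y = bilinear(x, a, b, beta, c)

# Analytical optimum of the bilinear curve (requires b > a > 0).
true_optimum = np.log10(a / (beta * (b - a)))

# Least-squares parabola y = a2*x^2 + a1*x + a0 and its vertex.
a2, a1, a0 = np.polyfit(x, y, 2)
parabola_vertex = -a1 / (2.0 * a2)

# Largest deviation between the data and the parabola-calculated values.
max_abs_residual = float(np.max(np.abs(y - np.polyval([a2, a1, a0], x))))
```

Because the two limbs of the synthetic curve have different slopes, the symmetric parabola cannot follow both of them. In this toy case its vertex is displaced from the true optimum towards the shallower descending side, and the residuals are non-zero even though the data contain no noise.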
We note that, although the effect of each descriptor is discussed individually, the combined or opposing effects of other factors/descriptors also influence the activity profile of these compounds.

nBT (Number of Bonds)
The optimum value for the number of bonds from the bilinear equation (Table 2)

Conclusion
In summary, the present study shows that non-linear models should be developed to determine the optimum values of descriptors. A good lead compound (a prodiginine in the present work) can be identified and optimized if the optimum values of lipophilicity, sum of atomic van der Waals volumes, sum of atomic polarizabilities, first-order valence connectivity index, number of atoms, number of benzene-like rings and number of rotatable bonds are used correctly and efficiently. The "ready to use" optimum/desirability values will be useful to medicinal chemists in developing novel prodiginines with a good anti-malarial activity profile.