On Application of Carhart Atom Pairs to Predict Anticonvulsant Activity

About 50 million people in the world suffer from epilepsy, especially in childhood, adolescence and old age. Available treatment fails to control epilepsy in up to 30% of affected people. In developing countries, however, the amount of patients that do not receive adequate treatment climbs up to 75%. Moreover, the new generation of antiepileptic drugs (AEDs) causes important central and peripheral side effects, including ataxia, diplopia, dizziness, headache, nausea, allergies and sedation. A mathematical model previously developed by Bagchi and Maiti, involving Carhart atom pairs and similarity measures, is applied in the prediction of anticonvulsant activity of two sets of compounds which have shown to be active in the Maximal Electroshock Seizure (MES) test, meaning that their mechanism of action can be at least partially explained through sodium channels blockade. Nine structurally heterogeneous molecules define the first set of compounds, with Carhart similarities to carbamazepine ranging from 0.005 to 0.593. The second set is defined by four benzodiazepines derivatives with Carhart similarities to THIQ-10c ranging from 0.533 to 0.570. A new, more specific, model is constructed based on the one from Bagchi and Maiti and a pharmacophore previously identified in our laboratory through an active analog approach. Applied to both sets of compounds, our model shows smaller average percentage error and average absolute error in the prediction than the one form Bagchi and Maiti and smaller SD as well. Accuracy and precision in the prediction also increases compared to those obtained when using bare similarity coefficients as relative activity indicators.


INTRODUCTION
Despite the continuous development of 3D and 4D QSAR methodologies, 2D-based descriptors still remain among the most widely used descriptors in pharmaceutical research today [1][2][3][4][5] . Intuitively one would think that since 3D descriptors capture more features of the molecular structure than 2D descriptors and therefore, they would probe themselves more successful at activity selection and prediction. However, many examples in literature show that this is not always the actual case [6][7][8] . Particularly interesting might be the approach of Schuffenhauer et al. who have suggested that the two types of descriptors are complementary in nature [8] . Although their physicochemical meaning is not always clear, topological descriptors have some advantages that should be taken into account; among these: they have a low computational cost and they can be easily calculated for all the existing, new and in-development chemical structures [9,6] .
There are, essentially, two major types of molecular descriptors [10] . The first of them has been defined as "holistic". That is to say, this type of descriptor could be associated to an approach that does not consider a system as the mere sum of its parts. On the contrary, it assumes the individual parts interacting all together in the system they belong-to generate unique properties that may be unpredictable from a partial, fragment-based approach. Descriptors of this class are numbers that usually represent some important physical property of the whole molecule. These descriptors are either measured or calculated through algorithms. Examples that could be mentioned among many others are: octanol-water partition coefficient [11] , Kier Shape Index [12] , the shape indices of Hopfinger [13] and the Index of Electrotopological state [14] .
The other descriptor category consists of the so called chemical substructures [1,10] . These are pieces of information strung together that may include atomic species, bond types and connectivity of two or more atoms. They could be seen as subgraphs of the graph that is associated to a molecular structure, labeled or colored in a way that reflects the information mentioned above. Some examples that fall in this category are: Atom Pairs (AP) [15,16,1] , Topological Torsions (TT) [1,10] and Atom Sequences (AS) [16] . Such descriptors are particularly useful in retrieval systems, data analysis and compound classification [17,18] . Our present work focuses on this latter category of descriptors.
Atom pair definition: In 1985 Carhart et al. [15] defined an atom pair as a substructure composed of two nonhydrogen atoms and an interatomic separation.
(atom 1 description)-(separation)-(atom 2 description) Separation means how apart they are, measured as the number of atoms in the shortest bond-by-bond path that contains atoms 1 and 2. Description of each atom tells its chemical type, the number of non-hydrogen atom attached to it and the number of bonding electrons it bears. Only the more common types of atoms are represented explicitly: C, O, N, S, F, Cl, Br, I, P, Si and B. Any other atom type is represented through the symbol "Y". An asterisk or dot following an atom symbol represents the presence of a bonding electron. Xn following an atom symbol indicates the presence of n non -hydrogen neighboring atoms. For any given chemical structure denoted by j, the number of AP derived form the structure is calculated by: n being the number of non-hydrogen atoms present in the compound j. Carhart et al. defined an Euclidean-distance-based similarity measure between generic structures s and t as: (2) where n(i,j) is the number of AP of type i derived from structure j. D(s,t) is referred in literature both as the Hamming distance or the sem-theoretic form of Euclidean distance [16,18] . S(s,t) can take values between 0 (completely non -similar structures) and 1 (identical structures). Latter, Carhart et al. applied this concept to structure activity relationships through similarity probe and trend vector analysis [10,15] .
The Bagchi and Maiti model: In 2003, Bagchi and Maiti [19] used Carhart AP and Carhart similarity in the development of a mathematical model that was successfully applied to the prediction of the ordination of activities -relative to a reference drug-of antituberculosis drugs. This model can be described as follows: A denotes activity relative to the standard drug; S denotes similarity between the standard drug and its analog (as defined by Carhart et al.). The factor between brackets takes into account, only, heteroatom APs common to the standard drug and its analog; the influence of all APs is expressed only through the similarity factor. Bagchi and Maiti gave more importance to the APs containing heteroatoms on the basis that heterocyclic compounds are usually biologically active compared to carbocyclic compounds and that heterocyclic compounds are composed of heteroatom APs. They also assume that heteroatom APs with the highest path lengths have greater influence over activity; as they cover a major portion of the molecule, coincidence regarding this type of APs implies greater probability of similar activity. Following this idea, i denotes the path length of the ith heteroatom AP, n i denotes the difference in quantity of ith heteroatom AP between the standard drug and the analog that is being tested (note that if the ith heteroatom AP appears more times in the tested structure, n i becomes negative and the second term between brackets turns positive, adding to the activity). n i denotes the number of ith heteroatom APs present in the standard drug. The factor between brackets could be barely related to the pharmacophore concept (see below: Pharmacophore definition).
Bagchi and Maiti validated their model not only with a set of known antituberculosis drugs, but with a little set of diuretic drugs as well. Both set of compounds were formed by structurally similar drugs. On this basis we decided to verify is Bagchi and Maiti model could be applied to predict the activities (or at least, the ordination of activities relative to a lead compound) of anticonvulsant drugs with mechanisms of action involving sodium channels blockade. Our objectives can be summarized as follows:  [20,21] . Available treatment fails to control epilepsy in up to 30% of affected people. However, in developing countries about three-fourths of patients don´t receive the treatment they need [21] . Moreover, even the new generation of antiepileptic drugs (AEDs) causes important side effects, including ataxia, diplopia, dizziness, headache, nausea, sedation, allergies, blood dyscrasias and hepatotoxicity [22] . Thus, new AEDs with better safety, less toxicity and higher efficacy in difficult-to-control patients are urgently needed. All over the world the social consequences of epilepsy are often more difficult to overcome that seizures themselves. Fear, misunderstanding and even superstition are just some of the issues faced by epilepsy patients. In developing countries like Cameroon, India and Indonesia, at present, epilepsy is considered as related to evil, demoniac forces. At present, in both China and India, common opinion is that epilepsy is a valid reason for annulling marriage [23] .
Pharmacophore definition: Medicinal Chemistry usually faces the problem of how to generate libraries of promissory compounds starting from fragmentary but relevant information. The word pharmacophore, derived from Greek, means "medicine carrier". It refers to a substructure that embodies essential elements that must be present in a compound to elicit a specific physiological response, that is to say, to have a desired activity. Typically, a pharmacophore consists of a list of atom types that have been determined to be significant to the activity, along with the geometrical relation between them (including inter-atomic distances). Significance to the activity includes chemical functions such as charge, hydrogen bonding propensity, hydrophobicity or aromaticity [24,25] .
At present and in the past, our laboratory has identified pharmacophores related to anticonvulsant activity [26,27] through fitting in SYBYL 6.6 [28] on Silicon Graphics Octane. A model of the identified pharmacophore can be seen in Fig. 1.
In the present work we employed information from this identified pharmacophores to generate a new mathematical model on the basis of the one from Bagchi and Maiti (which already incorporated, indirectly, the pharmachophore notion).
On similarity determination: In general, there are two classes of similarity coefficients: association and distance coefficients. The essential difference between them is that the latter considers the common absence of certain structural features as the evidence of similarity between two chemical structures, whereas the former does not. Examples of association coefficients that can be mentioned are Tanimoto, Dice and Cosine coefficients. Examples of distance coefficients are Hamming and Euclidean distances [16,18] . Carhart et al. and therefore, Bagchi and Maiti -as can be seen in expression (3)-used for similarity determination the radio between the Hamming distance and the total number of features (APs) derived from the two chemical structures being compared. However, work from Chen et al. [16] demonstrates that Tanimoto coefficient showed better performance than Euclidean distance in 2D fragment-based similarity searching, whereas James et al. [19] have suggested that Hamming and Euclidean distances are useful in "relative" distance comparisons (i.e., the distance of two molecules to the same target) but not for "absolute" comparisons (between two independent pairs of molecules). This means Hamming distance might not be the best choice for the similarity factor if one wants to use Bagchi and Maiti model as a tool in the search of new leads, since in that case one would use the model to compare, in an independent manner, each of the tested analogs with the drug chosen as reference. Hamming distance, then, would be the most appropriate choice in optimization (which is the original use proposed by Bagchi and Maiti), since for this purpose obtaining an ordination of relatives activities between the possible analogs would be good enough.
At the moment of constructing our model on the basis of the Bagchi and Maiti one, we evaluated how the performance of our model varies when utilizing three different similarity coefficients: the one proposed by Carhart et al.; the binary form of the Tanimoto coefficient (in which features are confined to dichotomous values: 0 indicating the absence of a particular type of atom pair, 1 indicating its presence) and the algebraic form of the Tanimoto coefficient. The descriptions of these coefficients are given, respectively, in expressions (2), (4) and (5).
a is the number of unique fragments in compound A; b, the number of unique fragments in compound B; c, the number of unique fragments shared by compounds A and B.
where n A,,i is the number of fragment i in compound A; n B,i is the number of fragment i in the compound B. (5)

MATERIALS AND METHODS
Construction of the Set of Compounds: Two sets of compounds which have shown activity in the MES test were constructed. Set A (Fig. 2) is defined by nine anticonvulsant drugs previously used in our laboratory to identify a pharmacophore which seems to be responsible of the anticonvulsant activity through sodium channels blockade [26] . Carbamazepine was chosen as the reference drug of this set, provided that its measured activity at the MES test was the greatest among the group (ED 50 = 37 mol kg 1 ip). Set A can be referred as a heterogeneous set (and therefore, it can be utilized to test the efficacy of Bagchi and Maiti model as a predictive tool not only in the optimization of a compound also in the rational search of new lead compounds). Calculated ranges of Carhart Similarity Coefficient, the binary form of the Tanimoto Coefficient and the algebraic form of the Tanimoto Coefficient for set A were, in that order, 0.005 to 0.760, 0.009 to 0.586 and 0.001 to 0.804.
Set B (Fig. 2) is defined by four benzodiazepine derivatives tested in different experimental seizure models by De Sarro et al. [29,30] . THIQ -10c was chosen as reference drug, with a measured activity of 5.17 mol kg 1 in the MES test. The similarity coefficients determined by (2), (4) and (5) were calculated for compounds of this set, obtaining, in that order, similarity ranges of 0.51 to 0.54, 0.29 to 0.34 and 0.57 -0.60. Therefore, set B can be referred as a homogeneous set and can be used to verify the efficacy of the model in the optimization of compounds of known activity. defines if the relative activity estimated by the model for each structure is greater or smaller than the Carhart Fig. 2: Schematic representation of compounds that define set A. Structural diversity can be appreciated, as confirmed by the wide range of similarity coefficients similarity calculated from the AP. Each one of the terms composing the summatory can be interpreted as the contribution of the ith type of AP to the activity of the whole structure. The sign of each of the this terms is given by the ni factor, which is the only one among all the factors involved in the model that can take values either greater or smaller than zero. Since the reference drug of each set is chosen as the most active structure in it, the types of APs present in the reference drugs are thought as desired features involved in the interaction of the molecule and its site of action. This may as well be the fundamental hypothesis of our model. Therefore, if the number of APs of the ith type is greater in the reference drug than in the tested compound, then the tested molecule is lacking a desired feature and the fraction of the activity that can be explained through that particular element will be smaller in the tested compound than in the reference one. In the other hand, if the number of ith type elements is greater in the tested compound than in the reference drug, then the tested compound has more of a desired characteristic and a greater probability of this feature being expressed at the time of interacting with the site of action. On this basis we believe the presence and the place that Bagchi and Maiti have given to ni in their model is accurate. However; have all the types of atom pairs equal importance in determining the activity of a particular structure? Introducing i , the length of the ith element, in the numerator of (6) Bagchi and Maiti gave more importance to long atom pairs, while ni, being in the denominator gave low importance to those types of AP that have a high frequence of occurrence in the reference drug.   Since the reference is the drug with maximum activity, we believe that the more times a particular type of AP appears in the reference structure, the more important is the contribution of that type of AP to the activity. Thus, our model includes the factor ni not in the denominator but in the numerator of the second term between brackets. We also believe that, being in knowledge of an identified pharmacophore, we should give more attention not to the longest types of AP but to the types of AP that accomplish the characteristics of the pharmacophore. These would be those AP that include heteroatoms and a separation of six, since six is the maximum number of atoms comprised in the possible AP derived from the identified pharmacophore. On this consideration we propose the following model: which is an specific form of the more of the general expression: that could be applied not only to anticonvulsants but also to compounds of any given activity, provided that they interact with the same site of action. A is the activity relative to a reference drug, which is chosen as the most active structure of the working set. S is a coefficient of similarity. In this work we tested our model with both distance and association similarity coefficients: (2), (4) and (5). ni represents the number of ith AP in the reference drug; ni stands for the difference of ith AP between the reference drug and the tested drug; i is the number of atoms included in the ith AP and is the number of atoms involved in the longest AP of the identified pharmacophore. b and c are constants. Depending on the value assigned to it, b may moderate the influence of the similarity on the estimated activity. Although c main purpose is to normalize the second term between brackets, keeping its value between zero and one, eventually it may as well serve to compare the relative influence of different aspects of the pharmacophore. Lets say, for instance, the identified pharmacophore -essential for the occurrence of a given activity-has both an electrostatic and an hydrophobic component. The electrostatic component may be represented by the coincident heteroatom AP (or by certain kinds of common heteroatom AP) between the reference and the tested compounds. The hydrophobic component may require the introduction of another term in the model, which took into account the coincident non-heteroatom AP or, if aromaticity were an important feature, the non-heteroatom AP that included electrons. In that case the magnitude taken by c in each one of the terms would be an indicator of the importance of the electrostatic component relative to the hydrophobic one. E.g., if the constant c for the electrostatic term were 100 and the constant c for the hydrophobic term were 50, then the hydrophobic contribution can be though as twice important in activity determination as the electrostatic. This idea is widely applied in QSAR models, where the normalized coefficients associated to each independent variable (or descriptor) are compared in order to quantify the relative importance of the descriptors included in the model.

Application of both models:
The AP set derived of each of the thirteen compounds was constructed. Table  1 shows the complete sets of AP for carbamazepine and THIQ-10c. Once in knowledge of all the AP sets the similarity coefficients of Carhart and Tanimoto (this latter in its binary and algebraic form) were calculated as specified in expressions (2), (4) and (5); the values of these coefficients for each structure are shown in Table  3. In order to determine the common heteroatom APs, the APs of each one of the structures was compared with the APs of the correspondent reference drug. Table  2 shows the types of AP common to carbamazepine and phenytoin and the occurrence of each type in each one of these structures. It is also shown the calculated value for the negative term between brackets for the two models; c was chosen as equal to 100 in (7). Different values for b were evaluated for each set of compounds: 0.5, 0.6, 0.75 and 1. c was given value of 100 in set A and values of 50 and 100 in set B. Average percentage error and average absolute error of the predicted value of relative activity were calculated in all cases. We present the results for the model of Bagchi and Maiti and for the forms of our model that proved more accurate. We also calculated carbamazepine similarity to THIQ-10c for further analysis in set B, as explained in Results and discussion. Finally, bare similarity coefficients were compared to the relative activity in order to verify if the mathematical models improved the prediction that could be inferred using just the similarity coefficients as activity indicators.

RESULTS AND DISCUSSION
When applied to set B (structurally homogeneous) the Bagchi and Maiti model confirmed its usefulness as an optimization tool, since it predicted the correct ordination of relative activities in the same manner Bagchi and Maiti predicted the ordination of relative activities for diuretics and antituberculosis drugs. The actual ordination of relative activities of THIQ-10c, CFM2S, CFM11, CFM2 and carbamazepine is: THIQ-10c > CFM2S > CFM2 > CFM11 > carbamazepine, while the order predicted by Bagchi and Maiti model is THIQ-10c = CFM2S > CFM2 > CFM11 > carbamazepine. However, as can be seen in Table 4, the values of relative activity predicted are quite different from the actual values (average percentage error and average absolute error -not including carbamazepine-in set B equal to 69.9% and 0.20, in that order; when including carbamazepine this parameters rise, respectively, to 95.3% and 0.21).
Three forms of our model showed good in the prediction of absolute relative activities as well, as can be appreciated in Table 4. These forms of the model are: S being the algebraic form of the Tanimoto coefficient; S being the binary form of the Tanimoto coefficient; and S being in this case the binary form of the Tanimoto coefficient.
The average percentage errors in the prediction were, for this models and not including carbamazepine, 16.7% (9) 29.2% (10) and 31.5% (11). The average absolute error calculated were 0.09, 0.15 and 0.16, in that order. When considering carbamazepine, the average errors were, respectively, 44.7, 25.5 and 25.4%, while the average absolute errors were 0.11, 0.12 and 0.12. These values may be very good as a rough prediction, since a percentage error of 50% in a relative activity of 0.10 implies a predicted relative activity of 0.05 or 0.15, When comparing bare similarity measures to relative activities, percentage and average errors were considerable higher, meaning the models improved the     Table 6 shows the results obtained from applying Bagchi and Maiti model to Set A. Based on their low similarities to the reference drug, we discarded ethosuximide and topiramate. Being similarity a factor of much weight in both Bagchi and Maiti and our model, we considered the two models would fail to predict activity of molecules with high relative activity but very poor similarity, as is the case of topiramate. We believe this responds to a logical criteria: molecules with very dissimilar structures, even if they act through the same mechanism, probably interact with different receptors or at least with different zones of their molecular receptor or site of action. Not only this type of molecules have great probability of not being retrieved by a similarity based method but, if actives, they could be used to propose and alternative interaction site or mechanism of action. Table 5 shows the quite different values for several molecular descriptors widely used in QSAR analysis, when calculated for topiramate and carbamazepine; this may be a further explanation of topiramate and ethosuximide exclusion from set A. The series of descriptors used for comparison was calculated with Dragon Academic version [31] .
Obtained average percentage error and average absolute error in the prediction when applying Bagchi and Maiti model were, respectively, 61.4% and 0.27. When we applied the following form of our model: S being the Carhart similarity coefficient, we reduce the average percentage error to 41% and the absolute error to 0.24. Bare similarity, considering Carhart similarity as indicative of activity, predicted relative activity with a percentage error of 55.1% and an absolute error of 0.25.

CONCLUSION
We confirmed the applicability of Bagchi and Maiti model in the determination of relative activities ordination of a series of analogues referred to a lead drug. Our model seems to be adequate for the rough prediction of relative activity, therefore it may be useful as an approximation tool in the search of new leads and in drug optimization. Almost all tested forms of our model reduce the percentage error and the absolute error that are obtained when utilizing both bare similarity and Bagchi and Maiti model for the prediction. In every case, the predicted value is closer to the measured value when comparing the prediction by a particular form of our model to the bare similarity coefficient used in that particular form of our model. For instance, if one considers the bare Tanimoto binary coefficient as indicative of relative activity for the set B, predicted values have an average percentage error of 28.3%, which could be thought as quite a good rough prediction. However, if we apply either the forms (10) or (11) of our model, both including Tanimoto coefficient in its binary form as similarity coefficient, this average percentage error reduces, in that order, to 25.5 and 25.4%, meaning our model always improves the accuracy on the prediction; the standard deviation (SD) also decreases in both cases, meaning the proposed models increase the precision in the predictive capability. The same behavior appears when comparing average percentage error using bare algebraic form of the Tanimoto coefficient as indicator of relative activity (average percentage error: 109.4%) with the predicted values applying form (9) of our model, which includes that same form of the Tanimoto coefficient as similarity coefficient (average percentage error: 44.7%) The optimal form of our model still needs to be determined, as well as the most efficient similarity coefficient to be used in it, although it appears that, as suggested by James et al. [19] , association coefficients like Tanimoto coefficient may have a slight advantage over distance coefficients since three of the four best working forms of our model employ Tanimoto similarity coefficients and only one employs the one proposed by Carhart et al.
We believe optimizing the model would imply the next steps to be followed: * testing a wider set of molecules. * inclusion of a term that takes into account the nonheteroatom AP that may be present in the pharmacophore. * improving the selection of types of common heteroatom AP on the basis of a better study of the identified pharmacophores or new identified pharmacophores. * generation of a new type of AP that considers not atoms but bioisosters (meaning that the type of AP C*X3-4-O*X1 will be associated, for instance, to the AP C*X3-4-N*X2, since O*X1 and N*X2 are functional groups that work as bioisosters).