An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection

Problem statement: Similarity-based Virtual Screening (VS) deals with large amounts of data containing irrelevant and/or redundant fragments or features. The recent use of Bayesian networks as an alternative to existing tools for similarity-based VS has received noticeable attention from researchers in the field of chemoinformatics. Approach: To this end, different models of Bayesian network have been developed. In this study, we enhance the Bayesian Inference Network (BIN) using a subset of selected molecular features. Results: In this approach, a few features were filtered from the molecular fingerprint features based on a feature selection approach. Conclusion: Simulated virtual screening experiments with MDL Drug Data Report (MDDR) data sets showed that the proposed method provides a simple way of enhancing the cost effectiveness of ligand-based virtual screening searches, especially for higher-diversity data sets.


INTRODUCTION
Over the past few decades, drug discovery companies have used combinatorial chemistry approaches to create large and diverse libraries of structures, in which large arrays of compounds are formed by combining sets of different types of reagents, called building blocks, in a systematic and repetitive way (Willett et al., 1998; Walters et al., 1998). These libraries can be used as a source of new potential drugs, since compounds in the libraries can be randomly tested or screened to find a good drug compound.
With the increased capability for testing compounds offered by technologies such as High-Throughput Screening (HTS), it is possible to test hundreds of thousands of these compounds in a short time (Waszkowycz et al., 2001; Miller, 2002). Computers can be used to aid this process in a number of ways, for example, in the creation of virtual combinatorial libraries, which can be much larger than their real counterparts. These virtual libraries can be virtually screened either by docking into the active site of interest or by virtue of their similarity to a known active. Recently, searching chemical databases by computer instead of by experiment has been called virtual screening (Eckert and Bajorath, 2007; Sheridan, 2007; Geppert et al., 2010).
Many virtual screening approaches have been implemented for searching chemical databases, such as substructure search, similarity search, docking and Quantitative Structure-Activity Relationships (QSAR). Similarity searching is the simplest and one of the most widely used techniques for ligand-based virtual screening in drug discovery programmes.
There are many studies in the literature associated with the measurement of molecular similarity (Sheridan and Kearsley, 2002; Maldonado et al., 2006). However, the most common approaches are based on 2D fingerprints, with the similarity between a reference structure and a database structure computed using association coefficients such as the Tanimoto coefficient (Walters et al., 1998; Leach and Gillet, 2003).
The effectiveness of ligand-based virtual screening approaches can be enhanced by using data fusion (Willett, 2006; Feher, 2006). Data fusion can be implemented using two different approaches (Sheridan et al., 1996). The first, similarity fusion, involves searching for a single reference structure using multiple molecular descriptors. The similarity scores or rankings for each descriptor are combined to obtain the final ranking of the compounds in the database. The second approach is group fusion, in which multiple reference structures with a single similarity measure are used to search the database. Group fusion has been found to be generally more effective than similarity fusion.
In more recent studies, the Bayesian inference network (BIN) was introduced as a promising similarity search approach (Abdo and Salim, 2009; Chen et al., 2009; Abdo et al., 2010). The retrieval performance of the Bayesian inference network was observed to improve significantly when multiple reference structures were used or when more weight was assigned to some fragments in the molecular structure. Unfortunately, such information is unlikely to be available in the early stages of a drug discovery programme, when just a single weak lead is available (Abdo and Salim, 2011).
Features Selection (FS) is the process of selecting a subset of the features available in the data for application of a learning algorithm. The best feature subset contains the smallest number of features that contribute most to accuracy and efficiency. This is an important preprocessing stage and is one of the two ways of avoiding a high-dimensional feature space (the other is feature extraction). A molecular fingerprint consists of many features, not all of which have the same importance, and removing some features can enhance the recall of a similarity measure.
In this study, we enhance the screening effectiveness of the Bayesian inference network using a feature selection approach. In the proposed method, a few relevant features were filtered from the molecular 2D fingerprint features. A set of known active references and random unknown molecules was used as test data for each class of the data set. Only the subsets of selected features were used in calculating the similarity score.

MATERIALS AND METHODS
This study compared the retrieval results obtained using three different similarity-based screening models. The first screening system was based on the Tanimoto (TAN) coefficient, which has been used for ligand-based virtual screening for many years and is considered a reference standard. The second model was the basic BIN (Abdo and Salim, 2011) using the Okapi (OKA) weight, which was found to perform best in their experiments; we shall refer to it as the conventional BIN model. The third model, our proposed model, is the BIN based on feature selection, which we shall refer to as the BINFS model. In what follows, we give a brief description of each of these three models.
Tanimoto-based similarity model: This model used the continuous form of the Tanimoto coefficient, which is applicable to non-binary fingerprint data. The similarity $S_{K,L}$ between objects or molecules K and L is given by Eq. 1:

$$S_{K,L} = \frac{\sum_{j=1}^{M} w_{jK}\, w_{jL}}{\sum_{j=1}^{M} w_{jK}^{2} + \sum_{j=1}^{M} w_{jL}^{2} - \sum_{j=1}^{M} w_{jK}\, w_{jL}} \qquad (1)$$

For molecules described by continuous variables, the molecular space is defined by an M×N matrix, where entry $w_{ji}$ is the value of the jth feature (1 ≤ j ≤ M) in the ith molecule (1 ≤ i ≤ N). The origins of this coefficient can be found in a review paper.
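As an illustration of the continuous Tanimoto coefficient described above, the following minimal Python sketch computes the similarity between two count fingerprints. The function name, the example vectors and the convention of returning 0.0 for two all-zero fingerprints are assumptions made for illustration and are not part of the original study.

```python
def tanimoto_continuous(w_k, w_l):
    """Continuous Tanimoto similarity between two equal-length feature vectors."""
    if len(w_k) != len(w_l):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(w_k, w_l))   # sum of w_jK * w_jL
    norm_k = sum(a * a for a in w_k)             # sum of w_jK squared
    norm_l = sum(b * b for b in w_l)             # sum of w_jL squared
    denom = norm_k + norm_l - dot
    # Assumption: two all-zero fingerprints are given similarity 0.0.
    return dot / denom if denom else 0.0


# Hypothetical 4-element count fingerprints (illustrative values only).
reference = [2, 0, 1, 3]
candidate = [1, 0, 1, 3]
score = tanimoto_continuous(reference, candidate)
```

A molecule compared with itself scores 1.0, and the score decreases as the feature counts diverge.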

Conventional BIN model:
The conventional Bayesian inference network model, shown in Fig. 1, is used in molecular similarity searching. It consists of three types of nodes: compound nodes as roots, fragment nodes and a reference structure node as leaf. The roots of the network are the nodes without parent nodes and the leaves are the nodes without child nodes. Each compound node represents an actual compound in the collection and has one or more fragment nodes as children. Each fragment node has one or more compound nodes as parents and one reference structure node as child (or more when multiple references are used). Each network node is a binary variable, taking one of the two values from the set {true, false}. The probability that the reference structure is satisfied given a particular compound is obtained by computing the probabilities associated with each fragment node connected to the reference structure node. This process is repeated for all compounds in the database. The resulting probability scores are used to rank the database in response to a bioactive reference structure in order of decreasing probability of similar bioactivity to the reference structure.
To estimate the probability associating each compound with the reference structure, we need to compute the probabilities at the fragment and reference nodes. One particular belief function, called OKA, has the most effective recall (Abdo and Salim, 2011). This function was used to compute the probability at the fragment nodes and is given by Eq. 2:

$$bel_{OKA}(f_i) = \alpha + (1-\alpha) \times \frac{ff_{ij}}{ff_{ij} + 0.5 + 1.5 \times \frac{|c_j|}{|C_{avg}|}} \times \frac{\log\left(\frac{m + 0.5}{cf_i}\right)}{\log(m + 1.0)} \qquad (2)$$

Where:
α = Constant; experiments using the Bayesian network show that the best value is 0.4 (Abdo and Salim, 2009; Chen et al., 2009)
ff_ij = Frequency of the ith fragment within the jth compound (or reference structure)
cf_i = Number of compounds containing the ith fragment
|c_j| = The size (in terms of number of fragments) of the jth compound
|C_avg| = The average size of all the compounds in the database
m = The total number of compounds

To produce a ranking of the compounds in the collection with respect to a given reference structure, a belief function from InQuery, specifically the SUM operator, was used. If p_1, p_2,..., p_n represent the beliefs at the fragment nodes (parent nodes of r), then the belief at r is given by Eq. 3:

$$bel_{sum}(r) = \frac{\sum_{i=1}^{n} p_i}{n} \qquad (3)$$

Where:
n = The number of unique fragments assigned to the reference structure r
p_i = Value of the belief function bel(f_i) at the ith fragment node

BIN model based on feature selection: This model of BIN is based on using a subset of the molecular features. To achieve this objective, two steps were used. First, we prepared training data consisting of known active query molecules and unknown molecules. For each activity class (for DS1, DS2 and DS3), 10 different sets of 10 active compounds were randomly selected as the reference set (query), and each set was appended with 307548 unknown molecules as training data, so the size of the training data is 307548 molecules and that of the test data is 102516 molecules, which represents either DS1, DS2 or DS3. This step was done for all activity classes for each data set separately. In each class we used different reference sets of 10 active compounds that belong to that class.
The second step is responsible for generating the subset of molecular features. To achieve this goal, a classifier column (required by feature selection algorithms) is added; the value of this column is 1 for the first 10 rows (representing the reference queries) and 0 for the remaining rows (representing the unknown compounds). This column is the label or classifier used by the feature selection algorithm. The training data are used as input to the SPSS Clementine software, which implements the Principal Component Analysis (PCA) feature selection algorithm. The result of this step is a vector (row) of selected feature numbers, which we applied to the main data set to rearrange the entire data accordingly.
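The fragment-node belief (Eq. 2) and the SUM operator (Eq. 3) of the conventional BIN model can be sketched in Python as follows. The sketch assumes the Okapi-style form of the belief function used by InQuery, with the symbols α, ff_ij, cf_i, |c_j|, |C_avg| and m as defined in the text; the function names and the numeric inputs are hypothetical and serve only as illustration.

```python
import math

ALPHA = 0.4  # value reported in the text as performing best


def oka_belief(ff_ij, cf_i, size_j, avg_size, m, alpha=ALPHA):
    """Belief at a fragment node (assumed Okapi-style form of Eq. 2)."""
    if ff_ij == 0:
        # The frequency term vanishes, so the belief falls back to alpha.
        return alpha
    freq_part = ff_ij / (ff_ij + 0.5 + 1.5 * size_j / avg_size)
    rarity_part = math.log((m + 0.5) / cf_i) / math.log(m + 1.0)
    return alpha + (1 - alpha) * freq_part * rarity_part


def sum_belief(fragment_beliefs):
    """SUM operator (Eq. 3): mean of the beliefs at the parent fragment nodes."""
    return sum(fragment_beliefs) / len(fragment_beliefs)


# Hypothetical compound with three fragments shared with the reference.
# Each tuple is (ff_ij, cf_i); sizes and database size are illustrative.
fragments = [(2, 10), (1, 500), (0, 40)]
beliefs = [oka_belief(ff, cf, size_j=50, avg_size=50, m=1000) for ff, cf in fragments]
compound_score = sum_belief(beliefs)
```

Repeating this for every compound and sorting by `compound_score` in descending order would give the ranking described in the text.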
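The data preparation around the feature selection step can be sketched as below. The selection itself was performed in external software, so the sketch only shows how the label column is built and how a vector of selected feature numbers is applied to the fingerprints. The function names, the tiny fingerprints and the `selected_features` vector are hypothetical placeholders, not output of the actual study.

```python
def build_labels(n_references, n_unknown):
    """Label column: 1 for the reference rows, 0 for the unknown compounds."""
    return [1] * n_references + [0] * n_unknown


def keep_selected(fingerprints, selected):
    """Keep only the selected feature columns, in the order given."""
    return [[row[j] for j in selected] for row in fingerprints]


# Hypothetical 4-feature fingerprints: 2 references followed by 3 unknowns.
references = [[3, 0, 1, 0], [2, 0, 0, 1]]
unknowns = [[0, 1, 0, 0], [1, 1, 0, 2], [0, 0, 0, 1]]
training = references + unknowns
labels = build_labels(len(references), len(unknowns))

# Placeholder for the feature numbers returned by the selection software.
selected_features = [0, 2]
reduced = keep_selected(training, selected_features)
```

The same `selected_features` vector would then be applied to the test data so that similarity scores are computed on the reduced fingerprints only.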

Experimental design:
The searches were carried out on the MDL Drug Data Report (MDDR) database. The 102516 molecules in the MDDR database were converted to Pipeline Pilot ECFC_4 fingerprints and folded to give 1024-element fingerprints.
For the screening experiments, three datasets (DS1-DS3) were chosen (Hert et al., 2006) from the MDDR database. The dataset DS1 contains 11 MDDR activity classes, with some of the classes involving actives that are structurally homogeneous and with others involving actives that are structurally heterogeneous (i.e., structurally diverse). The DS2 dataset contains 10 homogeneous MDDR activity classes and the DS3 dataset 10 heterogeneous MDDR activity classes. Full details of these datasets are given in Tables 1-3. Each row of a table contains an activity class, the number of molecules belonging to the class and the class's diversity, which was computed as the mean pairwise Tanimoto similarity calculated across all pairs of molecules in the class using ECFP6. The pairwise similarity calculations for all data sets were conducted using Pipeline Pilot software.
For each data set (DS1-DS3), the screening experiments were performed with 10 reference structures selected randomly from each activity class, and the similarity measure produces an activity score for every compound in the data set. These activity scores are then sorted in descending order, and the recall of the active compounds provides a measure of the performance of the similarity method. By recall of active compounds, we mean the percentage of the compounds of the desired activity class that are retrieved in the top 1% and 5% of the sorted activity scores.
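Folding a count fingerprint to a fixed length can be illustrated with the following sketch. The text does not describe how Pipeline Pilot folds ECFC_4 fingerprints, so the modulo scheme used here is an assumption (a common way of folding), and the feature identifiers and counts are hypothetical.

```python
def fold_counts(sparse_counts, n_bins=1024):
    """Fold a sparse {feature_id: count} fingerprint into n_bins elements.

    Assumption: features are mapped to bins by feature_id modulo n_bins,
    and counts of features that collide in a bin are added together.
    """
    folded = [0] * n_bins
    for feature_id, count in sparse_counts.items():
        folded[feature_id % n_bins] += count
    return folded


# Hypothetical sparse fingerprint; features 5 and 1029 collide in bin 5.
sparse = {5: 2, 1029: 1, 4096: 3}
fingerprint = fold_counts(sparse)
```

Folding keeps the total count unchanged but can merge unrelated features into one element, which is one reason not all elements of the folded fingerprint are equally informative.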
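The recall measure described above can be sketched as follows. The function name, the rounding used to turn the 1% or 5% cutoff into a number of compounds, and the toy scores are assumptions for illustration only.

```python
def recall_at(scores, is_active, fraction):
    """Percentage of actives found in the top `fraction` of the ranked list."""
    # Rank compound indices by decreasing similarity score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: the cutoff is rounded to the nearest whole compound, minimum 1.
    cutoff = max(1, round(fraction * len(scores)))
    hits = sum(1 for i in order[:cutoff] if is_active[i])
    total_actives = sum(1 for flag in is_active if flag)
    return 100.0 * hits / total_actives if total_actives else 0.0


# Toy data: 100 compounds with increasing scores, 4 of them active.
scores = [i / 100 for i in range(100)]
active_indices = {99, 98, 50, 10}
is_active = [i in active_indices for i in range(100)]

recall_top1 = recall_at(scores, is_active, 0.01)
recall_top5 = recall_at(scores, is_active, 0.05)
```

In the experiments described in the text, such values would be averaged over the ten searches performed for each activity class.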

RESULTS
Our purpose is to compare the retrieval effectiveness of different search approaches. In this study, we tested the TAN, BIN and BINFS models on the MDDR database using three different data sets (DS1-DS3). The results of the searches of DS1-DS3 are presented in Tables 4-6, respectively, using cutoffs of both 1% and 5%. Each row in a table lists the recall for the top 1% and 5% of the sorted ranking when averaged over the ten searches for each activity class. The similarity method with the best recall rate in each row is strongly shaded and the recall value is boldfaced, and the shaded cell results are listed in Table 7 (e.g., the results shown in the bottom rows of Tables 4-6 form the lower part of results in Table 8).
Table 6: The recall is calculated using the top 1% and top 5% of the DS3 data sets when ranked using the TAN, BIN and BINFS
The results of the Kendall analyses for DS1-DS3 are reported in Table 7 and describe the top 1% and top 5% rankings for the various search models. Tables 4-6 enable one to make comparisons between the effectiveness of the various search models. However, a more quantitative approach is possible using the Kendall W test of concordance (Siegel and Castellan, 1988). This test shows whether a set of judges make comparable judgments about the ranking of a set of objects; here, the activity classes were considered as judges and the recall rates of the various search models as objects. The output of such a test is the value of the Kendall coefficient and the associated significance level, which indicates whether this value of the coefficient could have occurred by chance. If the value is significant (for which we used cut-off values of 0.01 or 0.05), then it is possible to give an overall ranking of the objects that have been ranked.
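The Kendall coefficient of concordance can be computed as in the sketch below, where each activity class (judge) ranks the search models (objects) by recall. The sketch uses the standard formula W = 12S / (k²(n³ - n)) without a correction for ties and does not compute the significance level; the function names and the recall values are hypothetical and are not taken from Tables 4-6.

```python
def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def kendall_w(table):
    """Kendall's W for table[judge][object]; no tie correction is applied."""
    k = len(table)          # number of judges (activity classes)
    n = len(table[0])       # number of objects (search models)
    rank_sums = [0.0] * n
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    mean_rank_sum = k * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (k ** 2 * (n ** 3 - n))


# Hypothetical recall values; columns are TAN, BIN, BINFS, rows are classes.
recalls = [
    [10.0, 12.0, 15.0],
    [20.0, 19.0, 25.0],
    [5.0, 7.0, 9.0],
]
w = kendall_w(recalls)
```

W ranges from 0 (no agreement among the activity classes) to 1 (all classes rank the models identically).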

In Table 7, the columns show the data set type, the recall percentage, the value of the coefficient, the associated probability and the ranking of the methods. Some of the activity classes may contribute disproportionately to the overall value of mean recall (e.g., low-diversity activity classes). Therefore, using the mean recall value as the evaluation criterion could favour some methods over others. To avoid this bias, the effectiveness of the different methods has been further investigated based on the total number of shaded cells for each method across the full set of activity classes, as shown in the bottom row of Tables 4-6.
Inspection of the DS1 search in Table 4 shows that BINFS produced the highest mean value compared with TAN and BIN. In addition, according to the total number of shaded cells in Table 4, BINFS is the best-performing search across the 11 activity classes in terms of mean recall. Table 7 shows that the values of the Kendall coefficient for DS1 (top 1% and 5%), 0.058 and 0.132 respectively, and for DS3 (top 1% and 5%), 0.04 and 0.09 respectively, are significant at the 0.01 level of statistical significance. Given that the result is significant, we can conclude that the overall ranking of the different procedures is BINFS>BIN>TAN and BINFS>TAN>BIN for DS1 (top 1% and 5%, respectively) and BINFS>TAN>BIN for DS3. The good performance of the BINFS method is not restricted to DS1, since it also gives the best results for the top 1% and 5% for DS3.
The DS3 searches are of particular interest since they involve the most heterogeneous activity classes in the three data sets used and thus provide a tough test of the effectiveness of a screening method. Tables 6 and 7 show that BINFS gives the best performance of all the methods for this data set at both cutoffs.

CONCLUSION
This study has further investigated the enhancement of BIN using feature selection for ligand-based virtual screening. Simulated virtual screening experiments with MDDR data sets showed that the proposed techniques provide simple ways of enhancing the cost effectiveness of ligand-based virtual screening in chemical databases. Our experiments also showed that the increases in performance are particularly marked when the sought actives are structurally diverse.