Protein Sequences Features Extraction for Predicting Beta-Turns and their Types: A Review

Beta-turns are an important type of secondary structure with an essential role in molecular recognition, protein folding and stability. They account for about 25% of protein structure, which makes them the most common type of non-repetitive (tight turn) structure. Their prediction is an important problem in bioinformatics and molecular biology because it provides valuable input for fold recognition and drug design. Many statistical and machine learning approaches have been designed to predict beta-turns. Among the most successful machine learning approaches are those based on Neural Networks (NNs) and Support Vector Machines (SVMs). These approaches use different features and feature organizations. Among the most widely used features in beta-turn prediction are Position Specific Scoring Matrices (PSSMs) and predicted secondary structure. This work reviews the most successful methods for beta-turn prediction, the features they use and how those features are organized.


Introduction
Protein secondary structure is an important topic in bioinformatics. It consists of alpha-helices, beta-sheets, random coils and turns. Alpha-helices and beta-sheets are regarded as regular secondary structures because they are runs of residues with repeating φ and ψ values. The residues that form turns do not constitute a regular secondary structure. In a turn, the Cα atoms of two residues are separated by one to five peptide bonds and the distance between these Cα atoms is less than 7 Å. The number of peptide bonds separating the two end residues determines the turn type: four in alpha-turns, three in beta-turns, two in gamma-turns, one in delta-turns and five in pi-turns. Beta-turns are the most common type of turn, accounting for about 25% of the secondary structure of a protein sequence. They bring regular secondary structure elements together and allow them to interact, so their prediction is significant for protein folding (Petersen et al., 2010). Beta-turns are also important in the biological activity of peptides, as the bioactive structures that interact with other molecules such as receptors, enzymes and antibodies, and in the design of peptidomimetics for many diseases (Kee and Jois, 2003; Zheng and Kurgan, 2008). Their prediction therefore provides valuable insight and input for fold recognition and drug design. Beta-turn prediction is not only a two-state classification problem: beta-turns can be further classified into 9 types according to the dihedral angles of residues i + 1 and i + 2 of the turn (Hutchinson and Thornton, 1994). Table 1 shows the dihedral angles of the beta-turn types.
Methods for beta-turn prediction can be categorized as statistical methods and machine learning methods. The statistical methods include those of Chou and Fasman (1974), Wilmot and Thornton (1988; 1990), Chou (1997), Chou and Blinn (1997), Zhang and Chou (1997) and Fuchs and Alix (2005). Machine learning methods have proved the most successful because they handle the nonlinearity in the data well. Most of the successful machine learning methods are based on Neural Networks (NNs), Support Vector Machines (SVMs) or the k-nearest neighbour method. The NN-based methods include McGregor et al. (1989), BTPRED (Shepherd et al., 1999), BetaTpred2 (Kaur and Raghava, 2003), MOLEBRNN (Kirschner and Frishman, 2008) and NetTurnP (Petersen et al., 2010). The SVM-based methods include BTSVM (Pham et al., 2003), Zhang et al. (2005), Zheng and Kurgan (2008), Hu and Li (2008), Liu et al. (2009), DEBT (Kountouris and Hirst, 2010), Tang et al. (2011), our own H-SVM-LR (Elbashir et al., 2013a) and Nguyen et al. (2014). The k-nearest neighbour methods include the work of Kim (2004).
Features are the key inputs for prediction or classification with machine learning or statistical methods. Extracting or selecting the most informative features leads to high classification performance, and finding them requires experimenting with many candidate features. Some features can also be combined to improve the accuracy of machine learning methods. As the previous paragraph shows, many studies have developed methods for beta-turn prediction, and these methods used different features and feature combinations. The most commonly used feature is the Position Specific Scoring Matrix (PSSM). Because the various structural features of a protein are intercorrelated, secondary structure information has been widely used as an additional feature and it improves prediction accuracy substantially. Recent studies have added other features such as surface accessibility, predicted protein blocks, predicted backbone dihedral angles and predicted shape strings.

Dataset and Performance Measures
Many datasets are used to evaluate beta-turn prediction methods. The most commonly used dataset in almost all recent studies is BT426; therefore, the results reported in this paper are based on it. BT426 contains 426 non-homologous protein chains and was developed by Guruprasad and Rajkumar (2000). The structures of all its chains were determined by X-ray crystallography at a resolution of 2.0 Å or better, and each chain contains at least one beta-turn. In total, 24.9% (approximately 25%) of the amino acids in BT426 belong to a beta-turn. The dataset can be downloaded from http://crdd.osdd.net/raghava/bteval/. The measures most frequently used to evaluate beta-turn prediction methods are prediction accuracy and the Matthews correlation coefficient (MCC). It is important to report MCC alongside accuracy because the dataset is imbalanced (25% beta-turns versus 75% non-beta-turns): an accuracy of 75% can be achieved simply by predicting every residue to be a non-beta-turn. The results of the prediction methods in this paper are reported using these two measures.
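The effect of class imbalance on the two measures can be illustrated with a minimal sketch. The function names and the toy labels below are illustrative assumptions, not part of any reviewed method; the MCC formula is the standard one computed from the confusion matrix, and returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels: 1 = beta-turn, 0 = non-turn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Fraction of residues whose predicted label matches the true label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if the denominator is zero (assumed convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Toy labels with the 25% / 75% class balance described in the text.
y_true = [1] * 25 + [0] * 75
all_non_turn = [0] * 100  # a trivial predictor that never predicts a turn

trivial_accuracy = accuracy(y_true, all_non_turn)
trivial_mcc = mcc(y_true, all_non_turn)
```

On these toy labels the trivial predictor reaches an accuracy of 0.75 while its MCC is 0, which is why the two measures are reported together.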

Basic Sequence Information and Reliability Indices
Basic sequence information is normally obtained by encoding the protein sequence so that each amino acid type is represented by a single 1 at its position in a row of 20 positions, one for each of the 20 amino acids. For example, alanine, which occupies the first position, is represented by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 (a 1 followed by 19 zeros). The three predicted secondary structure states, alpha-helix, beta-sheet and coil, are normally encoded as 1,0,0 and 0,1,0 and 0,0,1 respectively. Shepherd et al. (1999) predicted beta-turns from the basic sequence information together with secondary structure information expressed as reliability indices. The reliability indices are the strengths of the prediction for each of the three target secondary structure states. They are three integers in the range 0-9, and each index is divided by 10 to give three real numbers between 0 and 1, one per predicted secondary structure state. Shepherd et al. used a window size of 9 with 23 network inputs per window position (20 for the amino acid information plus 3 for the secondary structure information). With a filtering strategy, their method achieved an accuracy of 74.9% and an MCC of 0.348. To predict the different beta-turn types, Shepherd et al. used a window size of four on the amino acid information only as input to a single-layer network; they obtained accuracy and MCC pairs of (91.2, 0.219), (95.5, 0.253), (95.7, 0.062) and (96.8, 0.033) for types I, II, VIII and IV respectively.
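The encoding described above can be sketched as follows. This is an illustration only: the alphabetical one-letter ordering of the amino acids, the zero padding beyond the sequence termini and all function names are assumptions, since the text specifies only that alanine is first, that each window position contributes 23 inputs and that the window size is 9.

```python
# Assumed ordering of the 20 amino acids (one-letter codes, alanine first).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(symbol, alphabet):
    """Binary vector with a single 1 at the position of `symbol` in `alphabet`."""
    vector = [0] * len(alphabet)
    if symbol in alphabet:
        vector[alphabet.index(symbol)] = 1
    return vector

def scale_reliability(indices):
    """Divide three integer reliability indices (0-9) by 10, as described in the text."""
    return [index / 10 for index in indices]

def window_inputs(sequence, reliabilities, centre, window=9):
    """Build the input vector for the residue at `centre`.

    Each window position contributes 20 one-hot values plus 3 scaled
    reliability indices (23 inputs). Positions outside the sequence are
    filled with zeros, which is an assumption of this sketch.
    """
    half = window // 2
    features = []
    for position in range(centre - half, centre + half + 1):
        if 0 <= position < len(sequence):
            features += one_hot(sequence[position], AMINO_ACIDS)
            features += scale_reliability(reliabilities[position])
        else:
            features += [0] * 23
    return features

# Hypothetical 9-residue fragment with made-up reliability indices.
example_sequence = "ACDEFGHIK"
example_reliabilities = [(9, 0, 1)] * len(example_sequence)
example_vector = window_inputs(example_sequence, example_reliabilities, centre=4)
```

With a window of 9 and 23 inputs per position, the resulting vector has 9 × 23 = 207 entries.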

PSSMs and Predicted Secondary Structure
PSSMs are M × 20 matrices, where M is the sequence length and 20 is the number of amino acid types. They are normally generated by running several rounds of PSI-BLAST (Altschul et al., 1997) against a sequence database. The database most widely used to generate them for beta-turn prediction is the NCBI non-redundant (nr) database. PSSMs improve beta-turn prediction significantly, so many researchers use them in their prediction methods, either alone or in combination with other features. Kirschner and Frishman (2008) used two neural networks derived from the Elman network (Elman, 1990). The two NNs are used as an ensemble in which the first NN is fed with the PSSM of the sequence and its output is fed to the second (structure-to-structure) NN to recognize beta-turns. They applied post-scaling, an adjustable threshold on the network output, to filter the prediction. The post-scaling (Montavon et al., 2012) is used to handle the unbalanced class distribution. The two-stage neural network design used by Kirschner and Frishman had previously been adopted in bioinformatics by Qian and Sejnowski (1988), Rost and Sander (1993) and Adamczak et al. (2005). Their two-stage neural network obtained an accuracy of 77.9% and an MCC of 0.45. For the beta-turn types, they used the threshold to obtain two results per type, one that maximizes the MCC and one that is a trade-off between the MCC and the accuracy. For type I, maximizing the MCC gives an accuracy of 82.5% and an MCC of 0.317; by using the trade-off,
Almost all recent studies combine the predicted secondary structure with the PSSM to improve prediction accuracy, but they organize the secondary structure information differently. Zhang et al. (2005) used a window size of 7 on the PSSM and then added the secondary structure prediction.
In total they used 143 features (7 × 20 = 140 PSSM values plus 3 secondary structure values). The accuracy and MCC that they obtained are 74.8% and 0.41 respectively. Kaur and Raghava (2003) used two feed-forward back-propagation networks, each with a single hidden layer of 10 units. They used a window of 9 residues on the PSSM as input to their networks. The output of the first network, a turn/non-turn prediction (0 or 1), is combined with the probabilities of the three predicted secondary structure states to form a 4-unit code that is the input to the second network. The three structure states are provided by the PSIPRED prediction method (Jones, 1999). The probabilities represent the strength of the prediction and lie in the range 0-1. Kaur and Raghava applied a filtering strategy to the final prediction before calculating the final accuracy and MCC, which were 75.5% and 0.43 respectively. Zheng and Kurgan (2008) developed a method that is considered the first to break the 80% accuracy barrier. They used a sliding window of 7 to extract features from the PSSM and employed four prediction methods to obtain the secondary structure features: PSIPRED v2.5 (McGuffin et al., 2000; Bryson et al., 2005), JNET (Cuff and Barton, 2000), TRANSSEC (Montgomerie et al., 2006) and PROTEUS2 (Montgomerie et al., 2006). Each of the four methods produces 3 features, one per structural state, giving 3 × 4 = 12 features. The confidence score of each of the four methods, divided by 10, is added to the feature vector, which contributes another 4 features. In addition, a binary value representing a specific arrangement of the secondary structure predicted by each of the four methods for the central residue and its two adjacent residues is included in the feature vector.
This binary value is calculated as follows. If the central amino acid is predicted as C and both adjacent residues are also C, the arrangement is CCC. If one adjacent residue is C and the other is either H or E, then, with X denoting the set {E, H}, the arrangement is CCX or XCC. If neither adjacent residue is C, the arrangement is XCX. The total number of features produced by these binary values is 4 (the number of prediction methods) × 3 (the three secondary structure states) × 4 (the patterns CCC, CCX, XCC and XCX), which equals 48 features. Lastly, Zheng and Kurgan added the ratio between the number of residues in a given secondary structure state and the window size, which adds another 12 features. The same feature organization was adopted by Elbashir et al. (2013b) for predicting beta-turns in proteins using kernel logistic regression. Elbashir et al. obtained an accuracy of 80.7% and an MCC of 0.50.
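The secondary structure part of this feature organization can be sketched as below. Several points are assumptions of the sketch rather than details confirmed by the text: the arrangement patterns are generalized from the central state C to the states H and E in the same way (any non-matching neighbour counts as X), the features are grouped per predictor rather than per feature type, the central residue is assumed to lie far enough from the termini for a full window, and all function names and the example predictions are hypothetical.

```python
STATES = "HEC"  # helix, strand, coil
# S = same state as the central residue, X = any other state.
PATTERNS = ("SSS", "SSX", "XSS", "XSX")

def arrangement_bits(left, centre, right):
    """12 binary values (3 central states x 4 patterns) with exactly one set to 1."""
    pattern = ("S" if left == centre else "X") + "S" + ("S" if right == centre else "X")
    bits = [0] * (len(STATES) * len(PATTERNS))
    bits[STATES.index(centre) * len(PATTERNS) + PATTERNS.index(pattern)] = 1
    return bits

def state_ratios(window_states):
    """Number of residues in each state divided by the window size."""
    return [window_states.count(state) / len(window_states) for state in STATES]

def secondary_structure_features(predictions, centre, window=7):
    """Collect 19 features per predictor for an interior residue `centre`.

    `predictions` maps a predictor name to (state string, list of 0-9 confidences).
    """
    half = window // 2
    features = []
    for states, confidences in predictions.values():
        features += [1 if states[centre] == s else 0 for s in STATES]  # 3 state bits
        features.append(confidences[centre] / 10)                      # 1 confidence
        features += arrangement_bits(states[centre - 1], states[centre], states[centre + 1])  # 12
        features += state_ratios(states[centre - half:centre + half + 1])                     # 3
    return features

# Made-up predictions for a 7-residue fragment from the four predictors named in the text.
example_predictions = {
    name: ("CCHCCEC", [9, 8, 7, 6, 5, 4, 3])
    for name in ("PSIPRED", "JNET", "TRANSSEC", "PROTEUS2")
}
example_features = secondary_structure_features(example_predictions, centre=3)
```

With four predictors the sketch yields 4 × (3 + 1 + 12 + 3) = 76 values, matching the 12 + 4 + 48 + 12 features counted in the text.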
Our own method (Elbashir et al., 2013a) used PSSMs and predicted secondary structure to predict beta-turns. Because the training sets used for beta-turn prediction are imbalanced (1:3 for beta-turns to non-beta-turns), a clustered model is used. In this model the non-beta-turns are grouped into 3 clusters and each cluster is paired with the beta-turn set to form a balanced set, giving three balanced sets that are used to train three localized SVMs. Each localized SVM produces beta-turn and non-beta-turn predictions. The outputs of the three SVMs are combined into a single beta-turn/non-turn output using a fractional polynomial. The method tried different organizations of PSSMs and secondary structure: applying a sliding window to the PSSMs and then adding the predicted secondary structure, or applying a sliding window to both the PSSMs and the predicted secondary structure. Applying the sliding window to both produced the best results. Our method obtained an accuracy of 82.87% and an MCC of 0.56.
Petersen et al. (2010) designed a method that consists of two layers of artificial neural networks. As input to the first-layer networks they used PSSMs, predicted secondary structure and surface accessibility, which is the surface area of a biomolecule that is accessible to a solvent. The first layer consists of five networks: one predicts whether an amino acid has a beta-turn conformation, and the other four predict the position of the amino acid within the beta-turn (position 1, position 2, position 3 or position 4). The surface accessibility is predicted using NetSurfP (Petersen et al., 2009). NetSurfP uses a primary network that accepts PSSMs and secondary structure and produces a 'B/E Classification', the raw buried/exposed output. The output of the primary network is combined with the PSSMs as input to a secondary network, which predicts whether the given amino acid is buried or exposed. The output of the five first-layer networks is then combined with the secondary structure and surface accessibility as input to the second-layer network, which produces the final beta-turn/non-turn prediction. The method of Petersen et al. obtained an accuracy of 78.2% and an MCC of 0.50 (Table 2).

PSSMs, Predicted Backbone Dihedral Angle and Secondary Structure
Backbone dihedral angles are highly correlated with the secondary structure elements of a protein, so the two can be combined in a feature matrix to improve predictions. Kountouris and Hirst (2010) added a further feature to the PSSM and predicted secondary structure: the seven-state predicted dihedral angle obtained from DISSPred (Kountouris and Hirst, 2009). DISSPred is also used to predict the three-state secondary structure. They used a sliding window of nine on the PSSM to obtain a 9 × 20 = 180 dimensional vector. A window size of five is used on the three predicted secondary structure states and the seven predicted dihedral angle states, which adds 3 × 5 + 7 × 5 = 50 dimensions. The resulting 230-dimensional vector is the input to their SVM classifier. After filtering the final prediction they obtained an accuracy of 79.2% and an MCC of 0.48. The same features are supplied to SVM classifiers to recognize the different beta-turn types; the accuracies obtained are 78.6, 87.4, 71.5, 71.1 and 97.6 for types I, II, IV, VIII and NS respectively, where all other types are combined into type NS.
Tang et al. (2011) and our own method (Elbashir et al., 2013a), both SVM methods, used shape strings together with the PSSM and predicted secondary structure to predict beta-turns. Shape strings can be predicted by a predictor built on a structural alignment approach. The eight states S, R, U, V, K, A, T and G represent the shape string of a protein. Shape strings provide detailed information about protein structure, including the random coil regions in which beta-turns are located, which makes them an important component for beta-turn prediction. Both methods used the protein shape string and its profile prediction server (DSP) (Sun et al., 2012) to obtain the predicted shape strings. The eight states of the shape string are encoded with a binary encoding scheme. In some parts of a protein sequence the φ and ψ angles are undefined, or the structure has not been determined. For these parts the DSP server defines an additional shape, N. For example, the binary encoding of shape S is (1 0 0 0 0 0 0 0 0) and that of shape N is (0 0 0 0 0 0 0 0 1). In our method (Elbashir et al., 2013a) we used a cluster model to deal with the imbalance problem in beta-turn prediction. In the cluster model the non-beta-turn set is divided into three subsets by the k-means clustering algorithm, three SVMs are each trained on one non-beta-turn cluster against the beta-turns, and a logistic regression model built with fractional polynomials aggregates the results of the three SVMs. The accuracy and MCC achieved by our method are 87.37% and 0.67 respectively.
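The binary encoding of shape strings can be illustrated with a short sketch. The ordering of the symbols follows the order in which the text lists them, with N placed last so that the two examples given above are reproduced; the function names and the example shape string are assumptions.

```python
# Eight shape-string states in the order listed in the text, plus N for
# residues whose dihedral angles are undefined.
SHAPES = "SRUVKATGN"

def encode_shape(symbol):
    """Return the 9-bit binary encoding of one shape symbol."""
    if symbol not in SHAPES:
        raise ValueError(f"unknown shape symbol: {symbol!r}")
    return [1 if shape == symbol else 0 for shape in SHAPES]

def encode_shape_string(shape_string):
    """Concatenate the encodings of every symbol in a predicted shape string."""
    return [bit for symbol in shape_string for bit in encode_shape(symbol)]

# Made-up 5-residue shape string used only to show the output size.
example_encoding = encode_shape_string("SRANK")
```

Each residue contributes 9 binary values, so the made-up 5-residue string above is encoded as 45 values.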

PSSM, Predicted Shape String and Predicted Protein Block
In addition to the PSSM and predicted shape string, Nguyen et al. (2014) added the predicted protein block (de Brevern et al., 2000; de Brevern, 2005; Joseph et al., 2010), which they obtained from the PB-kPRED web site. The structural alphabet of protein blocks consists of sixteen pentapeptide motifs labelled A, B, C, D, E, F, G, H, I, J, K, L, M, N, O and P (de Brevern et al., 2000; de Brevern, 2005; Tyagi et al., 2006). To deal with the imbalance problem in beta-turn prediction (25% turn vs 75% non-turn), Nguyen et al. used an oversampling technique with an SVM as the base classifier. They used a window size of 9 on the PSSM, predicted shape string and predicted protein blocks and obtained an accuracy of 87.48% and an MCC of 0.66. For beta-turn type prediction they combined types VIa1, VIa2 and VIb into one type named VI because these types are rare (Chou, 2000). Nguyen et al. obtained accuracies of 93.45, 99.28, 97.90, 99.44, 90.18, 98.07 and 90.18 and MCCs of 0.61, 0.75, 0.75, 0.64, 0.38, 0.14 and 0.30 for types I, I', II, II', IV, VI and VIII respectively.
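The idea of oversampling the minority class can be sketched as follows. The text does not state which oversampling variant Nguyen et al. used, so the sketch assumes the simplest one, random duplication of minority examples until both classes are the same size; the function name, the seed and the toy data are hypothetical.

```python
import random

def random_oversample(features, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority examples until the classes are balanced.

    Assumption: plain random oversampling with replacement. In practice it
    would be applied to the training folds only, never to the test data.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(labels) if label == minority_label]
    majority = [i for i, label in enumerate(labels) if label != minority_label]
    if not minority:
        raise ValueError("no minority examples to oversample")
    # range() of a negative number is empty, so nothing is added if already balanced.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    order = majority + minority + extra
    rng.shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]

# Toy data with the 25% turn / 75% non-turn balance described in the text.
toy_labels = [1] * 25 + [0] * 75
toy_features = [[index] for index in range(100)]
balanced_features, balanced_labels = random_oversample(toy_features, toy_labels)
```

After oversampling, the toy set contains 75 examples of each class, and the balanced set would then be passed to the base classifier.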

Discussion
Methods for predicting beta-turns and their types use different protein sequence features. These features include basic sequence information, PSSMs, predicted secondary structure, predicted dihedral angles, predicted surface accessibility and predicted shape strings. Table 2 summarizes the beta-turn prediction results obtained by the different methods together with the features they used, while Table 3 summarizes the corresponding results for beta-turn types. PSSMs have been shown to contribute significantly more to the accuracy of beta-turn prediction than basic sequence information. PSSMs are therefore used in almost all of the most successful beta-turn prediction methods. In most of these methods, PSSMs are generated using several rounds of the PSI-BLAST program (Altschul et al., 1997) against the National Center for Biotechnology Information (NCBI) non-redundant (nr) database. A window-based approach is used to compose the input vector from the PSSM. Some methods used a window size of 9, whereas most used a window size of 7. Figure 1 depicts the use of a window size of 7 on a PSSM.
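The window-based construction of the input vector can be sketched as below. The sketch makes two assumptions that the text does not specify: PSSM scores are rescaled to the range 0 to 1 with the logistic function, which is one common choice, and window positions that fall outside the sequence are filled with zeros. The function names and the toy matrix are hypothetical.

```python
import math

def logistic(score):
    """Map a raw PSSM score to the range (0, 1); an assumed scaling function."""
    return 1.0 / (1.0 + math.exp(-score))

def pssm_window(pssm, centre, window=7, scale=True):
    """Flatten the PSSM rows inside a window centred on residue `centre`.

    `pssm` is a list of M rows with 20 scores each. Positions beyond the
    sequence termini contribute zeros (an assumption of this sketch).
    """
    half = window // 2
    columns = len(pssm[0])
    features = []
    for position in range(centre - half, centre + half + 1):
        if 0 <= position < len(pssm):
            features.extend(logistic(v) if scale else v for v in pssm[position])
        else:
            features.extend([0.0] * columns)
    return features

# Toy 10 x 20 matrix of small integer scores standing in for a real PSSM.
toy_pssm = [[row - column for column in range(20)] for row in range(10)]
window7 = pssm_window(toy_pssm, centre=5)            # 7 x 20 values
window9 = pssm_window(toy_pssm, centre=5, window=9)  # 9 x 20 values
```

A window of 7 gives 140 values per residue and a window of 9 gives 180, the PSSM vector sizes discussed in this review.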
Most of the successful methods used an equation to scale the values of the PSSMs to a range between 0 and 1. Predicted secondary structures are combined with PSSMs to improve prediction accuracy. This combination is organized differently across studies.

Table 2: Results of beta-turn prediction obtained by the different methods and the features they used
Method | Features | Accuracy | MCC
Shepherd et al. (1999) | Basic sequence information and reliability indices | 74.9% | 0.35
Kirschner and Frishman (2008) | PSSMs | 77.9% | 0.45
Zhang et al. (2005) | PSSMs and predicted secondary structure | 74.8% | 0.41
Kaur and Raghava (2003) | PSSMs and predicted secondary structure | 75.5% | 0.43
Zheng and Kurgan (2008) | PSSMs and predicted secondary structure | 80.7% | 0.50
Elbashir et al. (2013a) | PSSMs and predicted secondary structure | 82.87% | 0.56
Petersen et al. (2010) | PSSMs, predicted secondary structure and surface accessibility | 78.2% | 0.50
Kountouris and Hirst (2010) | PSSMs, predicted backbone dihedral angle and secondary structure | 79.2% | 0.48
Tang et al. (2011) | PSSMs, predicted secondary structures and predicted shape string | 87.2% | 0.66
Elbashir et al. (2013a) | PSSMs, predicted secondary structures and predicted shape string | 87.37% | 0.67
Nguyen et al. (2014) | PSSM, predicted shape string and predicted protein block | 87.48% | 0.66

Some researchers applied a sliding window to the PSSMs only and then attached the three-state secondary structure to the feature vector, whereas others applied a sliding window to both the PSSMs and the predicted secondary structures. In our own work (Elbashir et al., 2013a), we tried both organizations and found that applying the sliding window to both PSSMs and predicted secondary structures gives better classification results. The method of Zheng and Kurgan (2008) was the first to predict beta-turns with over 80% accuracy. It used four protein secondary structure prediction methods to extract several kinds of secondary structure information and then combined this information with the PSSMs in different organizations. Figures 2 to 5 show these combinations.
Figure 2 depicts the secondary structure features extracted from the four secondary structure prediction methods (PSIPRED, JNET, TRANSSEC and PROTEUS). Figure 3 shows the confidence value of the central residue for each prediction method, which is used as a feature in addition to the secondary structure features. Figure 4 shows the binary values representing a specific arrangement of the secondary structure predicted by the four prediction methods for the central and the two adjacent residues; TRANSSEC is shown in the figure as an example. Figure 5 shows the features taken from the ratio between the number of residues in a given secondary structure and the window size for each prediction method. A great leap in beta-turn prediction came from adding the predicted shape string of the protein to the PSSM and the predicted secondary structures to form the input features. The accuracy and MCC obtained after adding the predicted shape string are more than 87% and 0.55 respectively. Figure 6 shows how the PSSMs, predicted secondary structures and predicted shape string features are represented.

Fig. 6: PSSMs created using PSI-BLAST, PSS predicted using Proteus and the shape strings predicted using the protein shape string and its profile prediction server (DSP)

de Brevern (2016) extended the classical classification (Venkatachalam, 1968; Richardson, 1981; Chan et al., 1993; Hutchinson and Thornton, 1996) of beta-turn types by adding further beta-turn types. Shapovalov et al. (2019) defined 18 new turn types. These newly added beta-turn types should be considered in future work on predicting beta-turn types in proteins. Deep learning is a rapidly growing research area for which many NN architectures have been designed, and it awaits wide application in bioinformatics (Zhang and Rajapakse, 2009).
Its NN architectures consist of multiple nonlinear layers, and there are several types of architecture depending on the characteristics of the input and the objective for which the network is designed (Liu et al., 2017). Deep learning could bring a breakthrough in beta-turn prediction because the network creates features automatically as it learns, but this does not make feature extraction and pre-processing irrelevant. Extracted features such as PSSMs and predicted secondary structures can be used as input to deep learning algorithms to ease the difficulties posed by complex biological data and to improve performance (Zhang and Rajapakse, 2009).

Conclusion
Protein secondary structure is considered the basis for analyzing the functional properties of a protein, which depend on its three-dimensional structure. Beta-turns are an important part of protein secondary structure, so their prediction is crucial for advances in protein folding and drug design. The methods designed for beta-turn prediction use different kinds of features. The feature most used in recent prediction methods is the PSSM. Although the beta-turn is itself a secondary structure type, predicted secondary structures obtained from different prediction servers are added to the PSSMs to form the input vector for prediction methods. These methods organize the PSSMs and predicted secondary structures differently, and some combine additional secondary structure information with them to form the input vector. Some methods use other features, such as surface accessibility and predicted backbone dihedral angles, combined with predicted secondary structures and PSSMs. The state-of-the-art methods with the highest classification performance use either predicted shape strings in combination with PSSMs and predicted secondary structure, or predicted protein blocks in combination with predicted shape strings and PSSMs.