Low-Homology Protein Structural Class Prediction from Secondary Structure Based on Visibility and Horizontal Visibility Network

Abstract: In this study, based on the predicted secondary structures of proteins, we propose a new approach to predict protein structural classes (α, β, α/β, α+β) for three widely used low-homology data sets. First, we obtain two time series from the chaos game representation of each predicted secondary structure; second, based on the two time series, we construct a visibility network and a horizontal visibility network, respectively, and generate a set of 17 network features; finally, we predict the structural class of each protein using a support vector machine and Fisher's linear discriminant algorithm, respectively. To evaluate our method, the leave-one-out cross-validation test is employed on the three data sets. Results show that our approach provides an effective tool for the prediction of low-homology protein structural classes.


Introduction
The roles of proteins are varied and complex. Levitt and Chothia (1976) first proposed the protein structural classes. In their pioneering work, four structural classes of proteins were identified, namely all-α, all-β, α/β and α+β. The all-α and all-β classes represent structures that consist mainly of α-helices and β-strands, respectively. The α/β and α+β classes contain both α-helices and β-strands, which are mainly interspersed and segregated, respectively (Murzin et al., 1995).
Knowledge of a protein's structural class is very important in both theoretical and experimental studies in protein science. Structural class information has been employed to improve the prediction accuracy of protein secondary structure (Gromiha and Selvaraj, 1998) and to reduce the search space of possible conformations of the tertiary structure (Carlacci et al., 1991; Bahar et al., 1997). For newly discovered proteins, automated and accurate structural class prediction methods are urgently needed, which makes protein structural class prediction an important step towards the protein structure prediction problem. Despite its significance, when sequence similarity is low, finding a precise computational method for this problem remains unsolved.
Current methods for predicting the protein structural class mainly focus on two aspects: feature extraction and classification algorithms. Feature extraction draws on several kinds of information, such as physicochemical information (Dehzangi et al., 2013a; Sharma et al., 2013) and structural information (Yang et al., 2009; Zhang et al., 2013; Liu and Jia, 2010; Zhang et al., 2011; Ding et al., 2012; Han et al., 2014; Dehzangi et al., 2013b; Wang et al., 2014). Yu et al. (2017) use Chou's pseudo amino acid composition and wavelet denoising to predict structural classes. Since 2014, several papers (Dehzangi et al., 2014; Wang et al., 2014; Jones, 1999; Faraggi et al., 2012) have shown that the predicted protein secondary structure is significant for predicting protein structural classes. After the features are extracted, various algorithms can be used to implement the classification, such as Fisher's linear discriminant algorithm (Yang et al., 2009) and the Support Vector Machine (SVM) (Cai et al., 2003).
In this study, based on the predicted protein secondary structure, we attempt to predict the protein structural classes of three low-homology data sets. First, we obtain two time series from the chaos game representation of each predicted secondary structure; based on these two time series, we generate a set of features using 17 network features of the visibility or horizontal visibility network. The structural class of each protein is then predicted with a support vector machine and Fisher's linear discriminant algorithm, respectively. To evaluate our approach, the leave-one-out cross-validation test is employed on the three data sets. The results show that network features are valid features.

Data Sets
To evaluate our proposed approach, we employ three benchmarks with low sequence identity: 25PDB (with sequence homology between 22 and 45%) (Yang et al., 2009), 1189 (less than 40% sequence similarity) (Yang et al., 2009) and 640 (with 25% sequence identity) (Yang et al., 2010). The data sets used in this study and the number of proteins belonging to the four structural classes are shown in Table 1.

Secondary Structure Prediction
First, each amino acid in a protein sequence is predicted as one of three secondary structural elements: C (coil), E (strand) and H (helix). For instance, the amino acid sequence of protein 1A1W is MDPFLVLLHSVSSSLSSSELTELKYLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH. In this study, if we submit this amino acid sequence to the PSIPRED web server (http://globin.bio.warwick.ac.uk/psipred or http://bioinf.cs.ucl.ac.uk/psipred/) (Jones, 1999), the predicted secondary structure returned is CCHHHHHHHHHHHHCCHHHHHHHHHHHHHHCCCHHHHHCCCHHHHHHHHHHCCCCCCCCHHHHHHHHHHHCHHHHHHHHHHHHHHHCCCCC. Fiser et al. (1994) first proposed the concept of the Chaos Game Representation (CGR) of protein structures. Yang and co-workers proposed the CGR of the predicted protein secondary structure sequence (Yang et al., 2010) to predict the protein structural class.

Chaos Game Representation of Predicted Secondary Structure
In this study, based on the method of Yang et al. (2010), the CGRs of the secondary structure sequences of four proteins are shown in Fig. 1. The blue points represent the CGR points and the blue edges represent the sides of the equilateral triangle. Corresponding to the order in the predicted secondary structure, the order of the blue points is preserved, but not shown in the figure. We can see that the plotted points tend to be distributed around the sides HC and EC, respectively, for proteins in the α and β classes. However, the points lie around both sides HC and EC without preference for proteins in the mixed classes.
Each secondary structure sequence generates a distinct (x, y)-coordinate sequence of plotted points. Hence we model a CGR plot as two time series, one composed of the x-coordinates, namely the x-time series, and the other of the y-coordinates, namely the y-time series, as shown in Fig. 2.
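As an illustration, the conversion from a predicted secondary structure string to the two coordinate time series can be sketched in a few lines of Python. The triangle vertex coordinates and the centroid starting point below are assumptions made for illustration; the exact placement used by Yang et al. (2010) may differ:

```python
import math

# Assumed vertex placement: an equilateral triangle with one corner per symbol.
VERTICES = {"H": (0.0, 0.0), "E": (1.0, 0.0), "C": (0.5, math.sqrt(3) / 2)}

def cgr_time_series(ss):
    """Map a secondary-structure string (over H, E, C) to x- and y-coordinate
    time series via the chaos game: each new point is the midpoint between the
    previous point and the vertex of the current symbol."""
    x, y = 0.5, math.sqrt(3) / 6  # start at the triangle's centroid (assumption)
    xs, ys = [], []
    for s in ss:
        vx, vy = VERTICES[s]
        x, y = (x + vx) / 2, (y + vy) / 2
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Because every point is a midpoint toward a vertex, the whole trajectory stays inside the triangle, which is why points cluster near the HC or EC side when one element dominates.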
Recent research has shown that complex network theory is an effective approach to analyzing time series (Lacasa et al., 2008; Luque et al., 2009; Liu et al., 2014). In this study, we hope to reveal information in the above time series from the perspective of the visibility network (Lacasa et al., 2008) and the horizontal visibility network (Luque et al., 2009).
Visibility Network (VN): Let $\{x_t\}$, $t = 1, 2, \ldots, N$, be a time series of length N. We can obtain a visibility graph by mapping a time series of N data into a network of N nodes (each datum is associated with a specific node and temporal order is preserved in the node labelling) according to the following visibility criterion: two arbitrary data $(t_i, x_i)$ and $(t_j, x_j)$ in the time series have visibility, and consequently become two connected nodes in the associated graph, if every other datum $(t_n, x_n)$ such that $t_j < t_n < t_i$ fulfills (Lacasa et al., 2008):

$x_n < x_j + (x_i - x_j)\dfrac{t_n - t_j}{t_i - t_j}$

Some basic properties of the mapping include undirectedness, connectedness (the visibility graph is always connected by definition) and invariance under affine transformations of the series.
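The visibility criterion reduces to a straightforward pairwise check in code. In the sketch below, node indices play the role of the times $t_i$ (the series is sampled uniformly), and a pair is linked only when every intermediate value lies strictly below the connecting segment:

```python
def visibility_edges(series):
    """Edge set of the (natural) visibility graph of a series.
    Nodes are indices 0..N-1; i and j are linked when every intermediate
    point lies strictly below the straight line joining (i, x_i) and (j, x_j)."""
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = True
            for k in range(i + 1, j):
                # height of the segment from (i, x_i) to (j, x_j) at abscissa k
                line = series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                if series[k] >= line:
                    visible = False
                    break
            if visible:
                edges.add((i, j))
    return edges
```

This brute-force construction is O(N^2) per pair check in the worst case, which is adequate for protein-length series.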
Horizontal Visibility Network (HVN): Let $\{x_t\}$, $t = 1, 2, \ldots, N$, be a time series of length N. The algorithm assigns each datum of the series to a node in the network. Two nodes i and j in the network are connected if one can draw a horizontal line in the time series joining $x_i$ and $x_j$ that does not intersect any intermediate data height. Hence, i and j are two connected nodes if the following geometrical criterion is fulfilled within the time series (Luque et al., 2009):

$x_i, x_j > x_n$ for all n such that $i < n < j$

As a result, given each time series, its HVN is unweighted, undirected and connected (each node sees at least its nearest neighbors on the left-hand side and the right-hand side).
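The horizontal visibility criterion is even simpler to express: i and j are linked whenever every intermediate value lies strictly below both endpoints. A minimal sketch:

```python
def horizontal_visibility_edges(series):
    """Edge set of the horizontal visibility graph: nodes i and j are linked
    when x_n < min(x_i, x_j) for every n strictly between i and j."""
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if all(series[k] < min(series[i], series[j]) for k in range(i + 1, j)):
                edges.add((i, j))
    return edges
```

Since adjacent indices have no intermediate data, every (i, i+1) pair is always linked, which is what guarantees the connectedness noted above.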
Network features: Here, we briefly introduce the considered features, namely network characteristics, that we extract from the visibility network and the horizontal visibility network. The network can be represented by a graph. Each CGR in Fig. 1 gives rise to two time series (the x- and y-coordinates, respectively); as a result, we obtain eight time series for the four CGRs.

Table 1: Number of proteins belonging to the four structural classes in each data set

Data set   α     β     α/β   α+β   Total
25PDB      443   443   346   441   1673
1189       223   294   334   241   1092
640        138   154   177   171   640

Characteristic path length (L): It is calculated as:

$L = \dfrac{1}{N_p}\sum_{i \neq j} d_{ij}$

where $N_p$ represents the number of pairs of nodes of the network and $d_{ij}$ is the shortest path length (Floyd, 1962) between nodes i and j (Chang et al., 2008). The characteristic path length L is the average of the shortest path lengths.

Diameter (D): The diameter D is defined as the largest value of all the shortest path lengths in a network. The diameter is a measure of the compactness of a network and is computed as (Emerson and Gothandam, 2012): $D = \max\{d_{ij}\}$, over all i-j pairs of shortest paths.
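A sketch of how L and D might be computed from an edge set: since our networks are unweighted, breadth-first search yields the same shortest-path lengths as the cited Floyd algorithm, with less bookkeeping:

```python
from collections import deque

def shortest_path_lengths(n, edges):
    """All-pairs shortest-path lengths of an unweighted, undirected graph
    with nodes 0..n-1, computed by one BFS per source node."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {}
    for s in range(n):
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        for t, dv in d.items():
            dist[(s, t)] = dv
    return dist

def path_length_and_diameter(n, edges):
    """Characteristic path length L (mean over node pairs) and diameter D."""
    dist = shortest_path_lengths(n, edges)
    lengths = [d for (s, t), d in dist.items() if s < t]
    return sum(lengths) / len(lengths), max(lengths)
```

For a connected graph with n nodes, the list `lengths` holds exactly $N_p = n(n-1)/2$ entries, matching the definition of L above.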
Clustering coefficient of the network (C): The clustering coefficient of any node i is the ratio between the number of links actually connecting its neighbors and the total number of possible links between these neighbors. It is given by:

$C_i = \dfrac{2e_i}{k_i(k_i - 1)}$

where $e_i$ is the actual number of edges between the neighbors of node i and $k_i$ is the degree of node i. The clustering coefficient of the network is the average of $C_i$ over all nodes. It is calculated as (Chang et al., 2008):

$C = \dfrac{1}{N}\sum_{i=1}^{N} C_i$

Pearson correlation coefficient (r): To understand whether our unweighted undirected networks are of assortative or disassortative type, we calculate the Pearson correlation coefficient r of the degrees at either end of an edge. For this, we use the expression suggested by Newman (2002):

$r = \dfrac{M^{-1}\sum_i j_i k_i - \left[M^{-1}\sum_i \tfrac{1}{2}(j_i + k_i)\right]^2}{M^{-1}\sum_i \tfrac{1}{2}(j_i^2 + k_i^2) - \left[M^{-1}\sum_i \tfrac{1}{2}(j_i + k_i)\right]^2}$

Here, $j_i$ and $k_i$ are the degrees of the nodes at the two ends of the i-th edge, with $i = 1, 2, \ldots, M$, where M is the number of edges.
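The two statistics above can be sketched as follows; `degree_assortativity` implements Newman's expression for r term by term (note it is undefined for regular graphs, where the denominator vanishes):

```python
def clustering_coefficient(n, edges):
    """Average clustering coefficient C of an undirected graph on nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cs = []
    for i in range(n):
        k = len(adj[i])
        if k < 2:
            cs.append(0.0)  # convention: nodes with fewer than 2 neighbors get 0
            continue
        # e_i: edges actually present among the neighbors of node i
        e = sum(1 for u in adj[i] for v in adj[i] if u < v and v in adj[u])
        cs.append(2.0 * e / (k * (k - 1)))
    return sum(cs) / n

def degree_assortativity(n, edges):
    """Newman's degree correlation r over the list of edges."""
    deg = {i: 0 for i in range(n)}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    pairs = [(deg[u], deg[v]) for u, v in edges]
    s1 = sum(j * k for j, k in pairs) / m
    s2 = sum((j + k) / 2 for j, k in pairs) / m
    s3 = sum((j * j + k * k) / 2 for j, k in pairs) / m
    return (s1 - s2 * s2) / (s3 - s2 * s2)
```

A star graph is maximally disassortative (r = -1), since every edge joins the hub to a degree-1 leaf.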
Average closeness centrality (ACC): Network centrality measures were developed by Freeman (1978), Beauchamp (1965) and Sabidussi (1966). Basically, the "closeness centrality" of node i is calculated as:

$CC_i = \dfrac{N - 1}{\sum_{j \neq i} d_{ij}}$

The closeness value is therefore the inverse of the average distance between node i and the other nodes. The average closeness centrality is calculated as:

$ACC = \dfrac{1}{N}\sum_{i=1}^{N} CC_i$

In summary, given a secondary structure sequence, we convert a protein into two series: the x-time series and the y-time series. From each time series we construct the corresponding visibility and horizontal visibility networks, respectively. Nine network features can be obtained from a network: the number of nodes (N), average degree (K), characteristic path length (L), network diameter (D), clustering coefficient of the network (C), Pearson correlation coefficient (r), average closeness centrality (ACC), energy (E) and Laplacian energy (LE). For different time series of the same protein, under the same network construction, the number of nodes is the same. Hence we obtain 1 + 8×2 = 17 features in total for each protein, so each protein is described as a real-valued vector of 17 features.
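The closeness-based feature can be sketched with the same BFS idea used for the path-based features; the helper below is self-contained:

```python
from collections import deque

def average_closeness(n, edges):
    """Average closeness centrality of a connected, unweighted graph:
    closeness of node i is (n-1) divided by the sum of its shortest-path
    distances to all other nodes; ACC averages this over all nodes."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0.0
    for s in range(n):
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        total += (n - 1) / sum(d.values())  # d[s] = 0 contributes nothing
    return total / n
```

On a path graph 0-1-2, the endpoints have closeness 2/3 and the middle node 1, giving ACC = 7/9.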

Feature Space of Proteins
As mentioned above, suppose we use n features to represent a protein sample. The i-th protein sample $P_i$ is then a real-valued vector in an n-D (dimensional) space, i.e.:

$P_i = (p_1^i, p_2^i, \ldots, p_n^i)^T$ (1)

Here $p_j^i$ is the j-th (j = 1, 2, …, n) feature of $P_i$ and can be derived by following the steps above.
Before prediction, each of the n features in Equation (1) is normalized by:

$p_j^i \leftarrow \dfrac{p_j^i - \bar{p}_j}{\sigma_j}, \qquad \bar{p}_j = \dfrac{1}{m}\sum_{i=1}^{m} p_j^i$

where m is the number of proteins in the data set, $\bar{p}_j$ is the mean of the j-th feature over the data set and $\sigma_j$ is its standard deviation.

Support Vector Machine

Vapnik (1995) introduced the Support Vector Machine (SVM), a machine learning method. In our study, we choose the Gaussian kernel function. The kernel width parameter γ and the regularization parameter c are optimized using a grid search strategy within a limited range, where γ = 2^i, i = −15, −14, −13, …, 4, 5 and c = 2^i, i = −5, −4, −3, …, 14, 15. We find the optimal SVM parameters c and γ using 10-fold cross-validation on the training set for each turn of the leave-one-out cross-validation process. The publicly available LIBSVM software (Chang and Lin, 2001), version 3.22 released on December 22, 2016, is used to implement the SVM classifier in our paper. The toolbox can be freely downloaded from http://www.csie.ntu.edu.tw/cjlin/libsvm.
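A sketch of the preprocessing step: column-wise z-score normalization over the m proteins (our reading of the normalization, stated as an assumption since the printed formula is incomplete in the source), together with the parameter grids swept by the search:

```python
def normalize_features(X):
    """Z-score each of the n feature columns across the m rows (proteins):
    subtract the column mean and divide by the column standard deviation."""
    m, n = len(X), len(X[0])
    out = [[0.0] * n for _ in range(m)]
    for j in range(n):
        col = [X[i][j] for i in range(m)]
        mean = sum(col) / m
        sd = (sum((v - mean) ** 2 for v in col) / m) ** 0.5 or 1.0  # guard sd = 0
        for i in range(m):
            out[i][j] = (X[i][j] - mean) / sd
    return out

# Parameter grids stated in the text: gamma = 2^i, i = -15..5; c = 2^i, i = -5..15.
GAMMAS = [2.0 ** i for i in range(-15, 6)]
CS = [2.0 ** i for i in range(-5, 16)]
```

In the actual pipeline, each (γ, c) pair from these grids would be scored by 10-fold cross-validation on the training portion of each leave-one-out split, and the best pair kept.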

Fisher's Discriminant Algorithm
Fisher's discriminant algorithm (Duda et al., 2001) is used to find a classifier in the parameter space for a training set. A training set H = {x_1, x_2, …, x_n} contains training vectors from two classes. There are n_1 training vectors from one class forming a subset H_1 and n_2 training vectors from another class forming a subset H_2. Hence, H = H_1 ∪ H_2 and n_1 + n_2 = n. Suppose that each x_i is an m-dimensional vector. Then, a parameter vector ω = (ω_1, ω_2, …, ω_m)^T is estimated such that it allows as many training vectors as possible to be accurately predicted. Specifically, a vector x is assigned to the class of H_1 if $\omega^T x \geq \omega_0$ for a suitable threshold $\omega_0$ and to the class of H_2 otherwise (Duda et al., 2001). The above algorithm is designed for a two-class problem. In this study, we transform the four-class problem of protein structural class prediction into six two-class problems, namely α-vs-β, α-vs-α/β, α-vs-α+β, β-vs-α/β, β-vs-α+β and α/β-vs-α+β (Yang et al., 2010).
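The textbook form of Fisher's direction, $\omega = S_w^{-1}(\mu_1 - \mu_2)$ with $S_w$ the pooled within-class scatter, can be sketched without any linear-algebra library. The restriction to 2-D samples below is only to keep the sketch short (a 2x2 inverse fits in one line); the paper works in 17 dimensions:

```python
def fisher_direction(class1, class2):
    """Fisher's discriminant direction w = Sw^{-1} (mu1 - mu2)
    for two classes of 2-D samples (sketch; 2-D only)."""
    def mean(pts):
        return [sum(p[k] for p in pts) / len(pts) for k in range(2)]

    def scatter(pts, mu):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        return s

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    # invert the 2x2 pooled scatter matrix
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

Projecting a sample onto w and comparing against a threshold then gives the two-class decision rule described above; the six pairwise classifiers vote to produce the four-class prediction.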

Performance Evaluation
The jackknife test (leave-one-out test) (Chou, 1995) is employed in our study.
The individual sensitivity S_n, the individual specificity S_p and the overall accuracy OA over the entire data set, as well as the Matthews correlation coefficient MCC (Xu et al., 2013), are used to evaluate performance.
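Under the usual one-vs-rest reading of these metrics (an assumption; the source does not spell out the formulas), the per-class quantities can be sketched as:

```python
import math

def class_metrics(tp, fp, tn, fn):
    """Per-class sensitivity, specificity and Matthews correlation coefficient
    from one-vs-rest confusion counts (tp, fp, tn, fn all assumed nonzero
    enough to avoid division by zero)."""
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc

def overall_accuracy(correct, total):
    """OA: fraction of proteins in the whole data set predicted correctly."""
    return correct / total
```

In the leave-one-out setting, each protein is held out once, predicted, and these counts are accumulated over the whole data set before the metrics are computed.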

Prediction Performances of our Method
The prediction approach is examined on the three low-similarity benchmark data sets by the leave-one-out test; we report the sensitivity, specificity and MCC for each structural class, as well as the OA.
By constructing visibility networks, each protein is described as a real-valued vector of 17 features. The results are shown in Table 2, from which we can see that the overall accuracies for the three data sets are close to or above 80%. Specifically, when SVM is used to implement the classification, overall accuracies of 82.07, 79.03 and 80.00% are achieved for the data sets 25PDB, 1189 and 640, respectively; when Fisher's linear discriminant algorithm is used, the overall accuracies are 80.69, 79.40 and 80.00%, respectively. Comparing the four protein structural classes with each other, the predictions for proteins in the α class are always the best (with accuracies higher than 90% for all the data sets).

By constructing horizontal visibility networks, each protein is likewise described as a real-valued vector of 17 features. The results are shown in Table 3, from which we can see that the overall accuracies for the three data sets are again close to or above 80%. Specifically, when SVM is used to implement the classification, overall accuracies of 82.85, 79.21 and 81.25% are achieved for the data sets 25PDB, 1189 and 640, respectively; when Fisher's linear discriminant algorithm is used, the overall accuracies are 82.19, 79.30 and 81.41%, respectively. Again, the predictions for proteins in the α class are always the best (with accuracies higher than 90% for all the data sets).
From Tables 2 and 3, our method also performs satisfactorily on the β class, with prediction accuracies of about 80%. However, it seems very challenging to predict the α/β and α+β classes, as their prediction accuracies are relatively low compared with the other classes.

Comparison with Existing Methods
In this section, the proposed approach is further compared with other recently reported prediction approaches on the same three data sets. The results are shown in Table 4.
As can be seen from Table 4, our methods obtain the highest prediction accuracies for the all-α and all-β classes among all the tested methods, but the lowest prediction accuracies for the α/β and α+β classes. Nevertheless, the results show that network features are useful for the prediction of the protein structural class.

Conclusion
The problem of protein structural class prediction is still a challenging one. Though some approaches have shown state-of-the-art performance, there is always room for improvement. In this study, we wrote our programs in MATLAB and utilized 17 network features to predict the low-homology protein structural class. Comparisons with other existing approaches suggest that the high prediction accuracy may be attributed to these network features. Three widely used data sets with low sequence similarity, 25PDB, 1189 and 640, are adopted to evaluate the performance of our approach. Results of the leave-one-out test show that our proposed method provides an effective tool for the accurate prediction of low-homology protein structural classes.