A Novel 2D Graphical Representation and its Application in the Similarities/Dissimilarities Analysis of Protein Sequences

Abstract: In this study, a novel 2D graphical representation of protein sequences is first proposed based on the physicochemical feature pK2 of amino acids. On the basis of this representation, a new concept, the Feature Appearance Model, is then introduced to analyze the similarity/dissimilarity of protein sequences. Theoretical and simulation results show that the proposed method is effective in the similarity/dissimilarity analysis of protein sequences.


Introduction
Graphical representation of protein sequences is a powerful tool for the visual comparison of protein sequences (Yao et al., 2010; Wang et al., 2014; 2015). Many effective graphical representation methods have been proposed to facilitate the analysis of similarities/dissimilarities among protein sequences. For example, Feng and Zhang (2002) proposed a 2D graphical representation of protein sequences based on the hydrophobicity and charge properties of amino acid residues along the primary sequence. Wen and Zhang (2009) proposed a 2D graphical representation of protein sequences with no circuit or degeneracy, based on chosen physicochemical properties of amino acids. Huang et al. (2013) introduced a 2D graphical representation of protein sequences, called the HR-Curve, based on classification and dual vectors. Qi et al. (2012) proposed a 2D graphical representation of protein sequences based on the Huffman tree. Abo-Elkhier (2012) proposed a 3D graphical representation of protein sequences on the basis of a right cone of unit base and unit height on protein sequences interfaces. Hea et al. (2012) introduced a 3D graphical representation, which is a cyclic order of the 20 amino acids, based on the order of the 6-bit binary Gray code. Abo el Maaty et al. (2010) introduced a 3D graphical representation of protein sequences based on three physicochemical properties of amino acid side chains.
In this study, a novel 2D graphical representation of protein sequences is first proposed based on a chosen physicochemical feature of amino acids, pK2. Then, 4 descriptors are extracted from the 2D graphical representation and adopted to analyze the similarities/dissimilarities of protein sequences quantitatively. Theoretical and simulation results show that the proposed method is effective in the similarities/dissimilarities analysis of protein sequences and yields results that are consistent with the known facts of evolution.

Graphical Representation of Protein Sequences
Proteins are composed of 20 different amino acids, and these amino acids have many different physicochemical and biological properties, such as the molecular weight (mW), isoelectric point (pI), the pKa value of the terminal carboxyl group COOH (pK1), the pKa value of the terminal amino group NH3+ (pK2), van der Waals radius (Vdwa), kdHydrophobicity (kh) (Kyte and Doolittle, 1982), wwHydrophobicity (wh) (Wimley and White, 1996), hhHydrophobicity (hh) (Hessa et al., 2005), the occurrence in human proteins (%) (Oihp (%)), abundance (Abu), ATP cost in synthesis under aerobic conditions (Csae) and ATP cost in synthesis under anaerobic conditions (Csan). The names and symbols of the 20 amino acids and the values of their 12 major properties are listed in Table 1.
Let $\{F_1, F_2, \ldots, F_{12}\}$ represent the 12 properties of amino acids listed in Table 1 and let $\Omega$ denote the set of the 20 amino acids. For any amino acid $\tau \in \Omega$, the standardized features of $\tau$ are obtained according to Formula 1, where $i \in \{1, 2, \ldots, 12\}$.
Let $\Psi = p_1 p_2 \ldots p_n$ ($p_i \in \Omega$, $\forall i \in \{1, 2, \ldots, n\}$) represent a protein sequence with $n$ amino acids. For any given letter $u \in \Omega$, suppose that $u$ appears $k$ times in $\Psi$ and that the location of the $j$th occurrence of $u$ in $\Psi$ is $u_j$; then we call the vector $\langle u_1, u_2, \ldots, u_k \rangle$ the "Feature Appearance Model" of $u$ in $\Psi$.
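The Feature Appearance Model defined above can be illustrated with a minimal Python sketch. The function name `feature_appearance_models` and the ten-letter toy sequence are hypothetical and serve only as an illustration; positions are counted from 1, as in the definition.

```python
from collections import defaultdict


def feature_appearance_models(sequence):
    """Map each amino-acid letter to the 1-based positions where it occurs.

    The list returned for a letter u is the vector <u_1, u_2, ..., u_k>
    of the definition above, with k the number of occurrences of u.
    """
    models = defaultdict(list)
    for position, letter in enumerate(sequence, start=1):
        models[letter].append(position)
    return dict(models)


# Toy sequence (hypothetical, not taken from any data set in this study).
toy_sequence = "MKVLAAGIVK"
models = feature_appearance_models(toy_sequence)
for letter in sorted(models):
    print(letter, models[letter])
```

Letters that do not occur in the sequence simply have no entry, which corresponds to an empty Feature Appearance Model.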
Based on the concept of the Feature Appearance Model proposed above, the graphical representation of each protein sequence $\Psi$ can be obtained according to the following steps:
Step1: According to the concept of Feature Appearance Model, obtain the 20 different Feature Appearance Models of the amino acids in the protein sequence $\Psi$.
Step2: For $j = 1$ to 12, select the $j$th feature from the 12 features of amino acids $\{F_1, F_2, \ldots, F_{12}\}$. $\forall p_i \in \Psi$, let the standardized values of the $j$th feature of $p_{i-1}$, $p_i$ and $p_{i+1}$ be $Y^j_{i-1}$, $Y^j_i$ and $Y^j_{i+1}$ respectively. Then, for each amino acid, obtain the coordinates $(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$ according to Formulas 2 and 3, where $t \in \{1, 2, \ldots, k\}$. For $p_1 \in \Psi$, since there is no $p_0$ in $\Psi$, we define $Y^j_0 = 0$ for $j \in \{1, 2, \ldots, 12\}$.
After connecting all the coordinates of each amino acid in the protein sequence $\Psi$, we obtain 20 different curves for $\Psi$, one for each of the 20 amino acids. Therefore, through the above steps, a protein sequence is translated into a graph with 20 curves for each feature of amino acids. Over the 12 different features of amino acids, we obtain 12 groups of curves for the protein sequence $\Psi$, each group containing 20 curves.

Similarities/Dissimilarities analysis Model of Protein Sequences
Let $G_\Psi$ represent a graph of $\Psi$ obtained by the method given in the previous section. $\forall p_t \in \Psi$, $t \in [1, 20]$, let the Feature Appearance Model of $p_t$ in $\Psi$ be $\Delta_t = \langle u_1, u_2, \ldots, u_k \rangle$ and let $E_t \in G_\Psi$ represent the curve of $p_t$ in $G_\Psi$. Then we can obtain the ED, PD, D/D and L/L matrices of $E_t$ according to Formulas 5-8 respectively (Randic et al., 2000; 2003a; 2003b; 2003c; Randic, 2003; Randic et al., 2004; Randic and Vracko, 2003; Bajzer et al., 2003).
Based on these matrices, descriptors of each matrix are then computed (Li and Wang, 1966; Shrock and Tsai, 1997; Biggs, 1974). Hence, each curve $E_t \in G_\Psi$ can be described by a 20-dimensional vector $V_t$ (Formula 9). Thereafter, the graph of $\Psi$ can be represented as a 20×20 matrix $M_\Psi$, the Descriptor Matrix, formed from the vectors $V_1, V_2, \ldots, V_{20}$.
Based on the Descriptor Matrix, we randomly select $k$ ($k \in [1, 20]$) columns from $M_\Psi$ each time, which yields a new $k$-dimensional vector for each $j \in [1, 20]$, that is, a 20×$k$ matrix. Therefore, for any two protein sequences $\Psi_1$ and $\Psi_2$, supposing that we have obtained their two 20×$k$ matrices, with one $k$-dimensional row vector for each $i \in \{1, 2\}$ and $j \in [1, 20]$, the distance $d(\Psi_1, \Psi_2)$ between $\Psi_1$ and $\Psi_2$ is obtained according to Formula 10.
Based on Formula 10, we can obtain three distance matrices $M_{oN}$, $M_{oG}$ and $M_{oS}$ for three groups of protein sequences: the 16 ND5 protein sequences, the 13 globin protein sequences and the 29 spike protein sequences, respectively. The basic information of these three groups of protein sequences is given in Tables 2 to 4.
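As an illustration of the ED, PD, D/D and L/L matrices of a curve, the following Python sketch uses definitions that are common in the graphical-representation literature. These definitions are assumptions and are not confirmed as the exact Formulas 5-8 of this study: ED holds the Euclidean distance between two points of the curve, PD is taken here to be the path length along the curve between them, D/D is ED divided by the number of edges between the two points, and L/L is ED divided by PD. The function name `curve_matrices` and the three sample points are hypothetical.

```python
import math


def curve_matrices(points):
    """Return (ED, PD, D/D, L/L) for a zigzag curve given as (x, y) points.

    Assumed definitions (not confirmed by the source formulas):
      ED[i][j]  = Euclidean distance between points i and j
      PD[i][j]  = length of the curve between points i and j
      D/D[i][j] = ED[i][j] / |i - j|
      L/L[i][j] = ED[i][j] / PD[i][j]
    All diagonal entries are 0.
    """
    n = len(points)
    # Cumulative length of the curve from the first point to each point.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))

    ed = [[0.0] * n for _ in range(n)]
    path = [[0.0] * n for _ in range(n)]
    dd = [[0.0] * n for _ in range(n)]
    ll = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ed[i][j] = math.hypot(points[i][0] - points[j][0],
                                  points[i][1] - points[j][1])
            path[i][j] = abs(cumulative[i] - cumulative[j])
            dd[i][j] = ed[i][j] / abs(i - j)
            # Guard against coincident consecutive points (zero path length).
            ll[i][j] = ed[i][j] / path[i][j] if path[i][j] > 0 else 0.0
    return ed, path, dd, ll


# Hypothetical three-point curve used only to exercise the function.
ed, path, dd, ll = curve_matrices([(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)])
print(ed[0][2], path[0][2], dd[0][2], ll[0][2])
```

Under these assumed definitions every entry of L/L lies between 0 and 1, and adjacent points always give the value 1, since the straight-line distance between neighbours equals the path length.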
In addition, by applying the ClustalW algorithm (Thompson et al., 1994) and the software MEGA (Tamura et al., 2013) to the same three groups of protein sequences (the 16 ND5 protein sequences, the 13 globin protein sequences and the 29 spike protein sequences), we can also obtain three distance matrices $M_{sN}$, $M_{sG}$ and $M_{sS}$ respectively.
Considering the above two groups of distance matrices, let $\mathrm{AvgCC} = (\mathrm{Acorr}(M_{sN}, M_{oN}) + \mathrm{Acorr}(M_{sG}, M_{oG}) + \mathrm{Acorr}(M_{sS}, M_{oS}))/3$. Since there are 12 kinds of features of amino acids, we can obtain 12 different graphs for each protein sequence, each containing 20 different curves. Additionally, according to Formula 9, each curve in a graph can be described by a 20-dimensional vector, so different choices of feature and of descriptor columns yield many different values of AvgCC.
Suppose that we finally obtain $r$ different values of AvgCC, $\{\mathrm{AvgCC}_1, \mathrm{AvgCC}_2, \ldots, \mathrm{AvgCC}_r\}$, with $\mathrm{AvgCC}_1 \geq \mathrm{AvgCC}_2 \geq \ldots \geq \mathrm{AvgCC}_r$, and that the value $\mathrm{AvgCC}_1$ is obtained by selecting the $J$th ($J \in [1, 12]$) feature from the 12 features of amino acids and $K$ ($K \in [1, 20]$) columns of the Descriptor Matrix. We then call this feature the Optimal Feature and these columns the Optimal Descriptors. The Optimal Feature can be utilized for the graphical representation of protein sequences, and the Optimal Descriptors can be utilized for analyzing the similarities/dissimilarities of protein sequences.
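The AvgCC quantity can be sketched in Python as follows. The text does not define Acorr here, so this sketch assumes it is the Pearson correlation coefficient between the upper-triangle entries of two symmetric distance matrices; this choice, the helper names, and the toy 3×3 matrices are all assumptions for illustration only.

```python
import math


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def acorr(a, b):
    """Assumed form of Acorr: correlation of the off-diagonal distances."""
    return pearson(upper_triangle(a), upper_triangle(b))


def avg_cc(matrix_pairs):
    """Average of Acorr over the (reference, proposed) matrix pairs."""
    return sum(acorr(a, b) for a, b in matrix_pairs) / len(matrix_pairs)


# Toy symmetric matrices (hypothetical values, not from Tables 2 to 4).
m_ref = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
m_scaled = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]    # same ordering of distances
m_reversed = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]  # opposite ordering
print(avg_cc([(m_ref, m_scaled), (m_ref, m_reversed), (m_ref, m_scaled)]))
```

Only the upper triangle is used because a distance matrix is symmetric with a zero diagonal, so the remaining entries carry no additional information.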
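The selection of the Optimal Feature and Optimal Descriptors can be sketched as a search that keeps the feature and column subset with the largest AvgCC. In the Python sketch below, the `score` argument stands in for the full AvgCC computation, and `toy_score`, the feature labels `F1` to `F3`, the number of trials and the seed are hypothetical placeholders, not values from this study.

```python
import random


def search_optimal(features, n_columns, k, score, trials=200, seed=0):
    """Randomly sample k-column subsets per feature and keep the best score.

    `score(feature, columns)` is expected to return the AvgCC obtained with
    that feature and those descriptor columns; here it is supplied by the
    caller, since computing it requires the full pipeline.
    """
    rng = random.Random(seed)
    best_score, best_feature, best_columns = float("-inf"), None, None
    for feature in features:
        for _ in range(trials):
            columns = tuple(sorted(rng.sample(range(n_columns), k)))
            value = score(feature, columns)
            if value > best_score:
                best_score, best_feature, best_columns = value, feature, columns
    return best_feature, best_columns, best_score


def toy_score(feature, columns):
    """Placeholder score with invented numbers, used only to run the loop."""
    base = {"F1": 0.2, "F2": 0.6, "F3": 0.4}[feature]
    return base + 0.01 * sum(1 for c in columns if c < 4)


best_feature, best_columns, best_score = search_optimal(
    ["F1", "F2", "F3"], n_columns=20, k=4, score=toy_score)
print(best_feature, best_columns, best_score)
```

Random sampling mirrors the random column selection described in the text; for small `k` an exhaustive enumeration with `itertools.combinations` would be a possible alternative.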
Based on the three groups of protein sequences (the 16 ND5 protein sequences, the 13 globin protein sequences and the 29 spike protein sequences), experiments show that the Optimal Feature is pK2, with a corresponding set of Optimal Descriptors. Hence, we can adopt pK2 and these Optimal Descriptors as parameters to construct a new Similarities/Dissimilarities Analysis Model according to the following steps:
Step1: For each protein sequence $\Psi$ in two further groups of protein sequences, the 16 ND6 proteins and the 15 myoglobin proteins, obtain its graphical representation $G_\Psi$ based on the feature pK2.
Step2: According to Formula 10 and the matrix of Optimal Descriptors, obtain two distance matrices $M_{oND6}$ and $M_{oGlobin}$ for the 16 ND6 proteins and the 15 myoglobin proteins respectively.
Step3: Utilize the distance matrices $M_{oND6}$ and $M_{oGlobin}$ to analyze the similarity/dissimilarity of protein sequences numerically.

Graphical Representation of Protein Sequences
According to the new Similarities/Dissimilarities Analysis Model given above, Fig. 1 illustrates some graphs of the protein sequences in the group of 16 ND6 proteins.
From Fig. 1, it can be seen that the four graphs (a), (b), (c) and (d) are similar to each other, which is consistent with the known facts of evolution.

Similarity/Dissimilarity Analysis of Protein Sequences
According to the Similarities/Dissimilarities Analysis Model proposed above, the distance matrices of the 16 ND6 protein sequences and the 15 myoglobin protein sequences are given in Tables 5 and 6 respectively. Table 5 contains several similar pairs, such as (Human, P-Chim) with the distance 245.57, (Human, C-Chim) with the distance 340.17, (Human, Gorilla) with the distance 428.27, (Gorilla, P-Chim) with the distance 336.88, (Gorilla, C-Chim) with the distance 447.87, (P-Chim, C-Chim) with the distance 268.25, (Fin-Wha, Blu-Wha) with the distance 1254.55 and (Sheep, Goat) with the distance 906.41. Among the 16 species, Opossum and Gallus stand out as two peculiar species, since the shortest distance between Opossum and the remaining species is more than 3371.80 and the shortest distance between Gallus and the remaining species is more than 3071.20. This result is consistent with the fact that Opossum is the most remote species from the remaining mammals and that Gallus is not a mammal.
From Table 6, we can also obtain some similar pairs, such as (Black rockcod, Neopagetopsis ionah) with the distance 788.16, (Cattle, Sheep) with the distance 712.94 and (Norway rat, House mouse) with the distance 770.12. Although there are a few errors in our experiments, the basic conclusions are consistent with the known facts of evolution.
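The way similar pairs are read off a distance matrix can be illustrated with the six primate distances quoted from Table 5 above. The following Python sketch finds the nearest neighbour of each species within this four-species subset; the function and variable names are hypothetical, and the remaining entries of Table 5 are not used.

```python
# Pairwise distances among four primates, as quoted from Table 5 in the text.
pair_distance = {
    ("Human", "P-Chim"): 245.57,
    ("Human", "C-Chim"): 340.17,
    ("Human", "Gorilla"): 428.27,
    ("Gorilla", "P-Chim"): 336.88,
    ("Gorilla", "C-Chim"): 447.87,
    ("P-Chim", "C-Chim"): 268.25,
}


def nearest_neighbour(name):
    """Return (distance, other species) for the closest partner of `name`."""
    candidates = []
    for (a, b), distance in pair_distance.items():
        if name == a:
            candidates.append((distance, b))
        elif name == b:
            candidates.append((distance, a))
    return min(candidates)


species = ["Human", "Gorilla", "P-Chim", "C-Chim"]
for name in species:
    distance, partner = nearest_neighbour(name)
    print(f"{name}: nearest is {partner} at {distance}")

# The overall closest pair within this subset.
closest_pair = min(pair_distance, key=pair_distance.get)
print("closest pair:", closest_pair)
```

Within this subset, every species other than P-Chim has P-Chim as its nearest neighbour, and (Human, P-Chim) is the closest pair overall.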

The Phylogenetic Tree of the Protein Sequences
To demonstrate the performance of our new Similarities/Dissimilarities Analysis Model, Fig. 2 shows the phylogenetic trees obtained by our model together with those obtained by the ClustalW algorithm (Thompson et al., 1994). From Fig. 2, it can be seen that the phylogenetic trees of the 16 ND6 protein sequences and the 15 myoglobin protein sequences obtained by our model and by the ClustalW algorithm are almost the same. For example, in the phylogenetic tree of the 16 ND6 protein sequences obtained by our model, Human, P_Chim, Gorilla and C_Chim are classified into the same category, as are sheep, goat and cattle, fin_wha and blu_wha, and rabbit and hare. Overall, the results obtained by our model agree with reality, except for the rat and mouse.
Similarly, in the phylogenetic tree of the 15 myoglobin protein sequences obtained by our model, human, polar bear and pig are classified into the same category, as are cattle and sheep, and black rockcod, Neopagetopsis ionah and ocellated icefish. These groupings are the same as those in the phylogenetic tree of the 15 myoglobin protein sequences obtained by the ClustalW algorithm. Thus, we conclude that our method is effective.

Conclusion
In this study, a new 2D graphical representation of protein sequences is first proposed by mapping a protein sequence into curves based on the physicochemical and biological features of each amino acid. A new similarities/dissimilarities analysis model for protein sequences is then proposed based on this 2D graphical representation. Finally, on the basis of three well-known groups of protein sequences, simulation results show that the proposed method is effective.