Protein Secondary Structure Prediction using Hybrid Recurrent Neural Networks

Protein secondary structure prediction is one of the most important and challenging problems in bioinformatics. Proteins fold into three-dimensional structures that are largely determined by their amino acid sequences. Protein secondary structure refers to the local configuration of the polypeptide backbone. In its most common form, the prediction task is: given the amino acid sequence of a protein, predict whether each amino acid lies in an α-helix (H), a β-sheet (E), or a random coil (C). In this study, Hybrid Recurrent Neural Networks (HRNN) are proposed to improve the performance of protein secondary structure prediction. The aim of the work is to predict protein secondary structure with high accuracy, in a form that researchers in computational biology can readily understand and apply to protein structure prediction problems. In our experiments, the proposed method performs better than previous work. Five techniques are used in the implementation: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Bidirectional, Bidirectional Gated Recurrent Unit (BGRU), and Bidirectional Long Short-Term Memory (BLSTM) neural networks. Specifically, the proposed two-dimensional recurrent neural network (2D-RNN) framework consists of five models: 2D-GRU_RNN, 2D-LSTM_RNN, 2D-Bi_RNN, 2D-BiGRU_RNN, and 2D-BiLSTM_RNN. First, the 2D recurrent neural network is built and the features extracted from the protein sequence are combined with the Position-Specific Scoring Matrix (PSSM). The models are then trained and tested on the datasets. Finally, the models are evaluated, and their prediction accuracy is compared with that of existing methods.
The best Q3 accuracies obtained on the ten datasets are 91% (BiGRU and BiLSTM), 92% (BiGRU), 89% (BiGRU and BiLSTM), 93% (BiGRU and BiLSTM), 88% (BiGRU), 86% (BiGRU), 91% (GRU), 87% (BiLSTM), 88% (BiGRU), and 93% (GRU and BiGRU).


Introduction
A protein is a macromolecular polypeptide (Zhao et al., 2020). Secondary structures are the building blocks of the macromolecule's structure. Secondary structure formally refers to the pattern of hydrogen bonds between the amino hydrogen and carboxyl oxygen atoms within the peptide backbone.
There are two types of secondary structure. One is regular secondary structure, which has three types: α-helix (H), β-sheet (E), and coil (C) (Akkaladevi et al., 2005; Hobbes, 2019). The other is irregular secondary structure, which has many types, such as tight turns and bulges. According to DSSP, there are eight classes of secondary structure: G (3₁₀ helix), H (α-helix), I (π-helix), B (isolated bridge), E (β-strand), C (coil), S (bend), and T (tight turn). These are converted into three classes (Hu et al., 2019; Zhang et al., 2018; Yang et al., 2018; Hanson et al., 2019): {B, S, T, G, I, C} → C, {H} → H, and {E} → E. Predicting protein secondary structure with deep learning techniques is among the most important problems in molecular biology. These techniques are applied to the prediction of 3-state or 8-state secondary structures, and their performance is measured by Q3 or Q8 accuracy, with PSSM as input. Q3 accuracy (C, H, E) is regarded as the most suitable measure for protein secondary structure prediction (PSSP) (Zhang et al., 2018; Wang et al., 2017). Over the years, researchers have predicted protein secondary structure using various techniques (Zhang et al., 2018; Wardah et al., 2018), and prediction accuracy has improved over earlier methods. Such strategies rest on the assumption that all the information needed to determine the three-dimensional structure is encoded in the sequence of amino acids.
A recurrent neural network (RNN) is a class of artificial neural networks. Unlike a feed-forward neural network, it has recurrent connections, which makes it well suited to sequential data. There are many types of RNN for processing sequentially organized data (Babaei et al., 2010). We used a two-dimensional recurrent neural network with long short-term memory cells and gated recurrent units, combined with a bidirectional recurrent neural network, to predict protein secondary structure, and we measured the prediction accuracy on several datasets.
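The 8-to-3 state reduction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `reduce_to_three_states`, the example label string, and the choice to treat any unlisted label as coil are assumptions made here.

```python
# Reduction of 8-state DSSP labels to 3 states, using the grouping stated in the text:
# {H} -> H, {E} -> E, and {B, S, T, G, I, C} -> C.
DSSP8_TO_3 = {
    "H": "H",
    "E": "E",
    "B": "C", "S": "C", "T": "C", "G": "C", "I": "C", "C": "C",
}


def reduce_to_three_states(dssp8: str) -> str:
    """Map a string of 8-state labels to the 3-state alphabet (H, E, C)."""
    # Assumption: any label outside the eight listed classes (for example a
    # blank or '-') is treated as coil.
    return "".join(DSSP8_TO_3.get(label, "C") for label in dssp8)


example = "CHHHHGGGTTEEEEBSC"  # hypothetical 8-state label string
print(reduce_to_three_states(example))  # CHHHHCCCCCEEEECCC
```

Keeping the grouping in a single dictionary makes it easy to swap in a different reduction scheme if another convention is preferred.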
Deep RNNs are used in the model's hidden layers, which are built from Gated Recurrent Units (GRUs). GRU is a form of RNN that can model data in sequential order. GRU is faster and consumes less memory than LSTM; however, LSTM tends to be more accurate on datasets with longer sequences. The GRU and LSTM networks are each combined with a bidirectional network, giving Bi-GRU and Bi-LSTM layers. Ten datasets are used, with all features supplied to each model; after the models were selected, their performance was evaluated and the results were incorporated into the final model. BiGRU, which uses a deep RNN, improves the performance of the algorithms. It also increases the accuracy of PSSP compared with single RNN techniques and performs well on both smaller and larger datasets. BiLSTM also improves prediction accuracy on sequentially organized data (Hu et al., 2019) and increases the amount of information available to the network (Hu et al., 2019).
Zhao et al. (2020) proposed a new strategy based on generative adversarial ("confrontation") networks and convolutional neural networks to predict protein secondary structures. They created an adversarial network to extract protein features and then combined the extracted features with the original position-specific scoring matrix data as input to the convolutional neural network to obtain the predicted results.
Hu et al. (2019) introduced a Bi-LSTM-based ensemble algorithm to predict the secondary structure of proteins. The ensemble included five sub-models (PSSM, HMM (Hidden Markov Model), PSSM-Count, Wordem, and PPS). Bi-LSTM layers were created for these sub-models: each contained two Bi-LSTM layers, and together they composed the ensemble model. Finally, a Bi-LSTM layer was joined with the sub-models.
The ensemble model and the sub-models were trained concurrently, and the performance of each model was observed. This model achieved a highest accuracy of 84%.
Zhang et al. (2018) presented a novel deep learning architecture based on a convolutional neural network, a residual network, and a bidirectional recurrent neural network to improve the prediction performance of protein secondary structure. This model applied an RNN to verify the structural class of proteins for low- and high-dimensional datasets. The Stockwell transform was applied to improve the prediction performance of protein structural class.
Guo et al. (2018) introduced a hybrid deep learning framework, 2-dimensional Convolutional Bidirectional Recurrent Neural Networks (2C-BRNNs), to improve the prediction of 8-state secondary structures. This model also extracted differential local interactions between amino acid residues using 2-dimensional convolutional neural networks. The hybrid framework comprised four models: 2DConv-BGRUs, 2DCNN-BGRUs, 2DConv-BLSTM, and 2DCNN-BLSTM. It performed better than the benchmark models for the prediction of protein secondary structure and is also helpful for feature extraction.
Li and Yu (2016) proposed an end-to-end deep network (EEDN) that predicted protein secondary structures from integrated local and global contextual features. They presented Cascaded Convolutional Neural Networks (CCNN) and an RNN to predict the secondary structure. The model comprised four parts: (1) a feature embedding layer; (2) multi-scale Convolutional Neural Network (CNN) layers; (3) three stacked Bidirectional Gated Recurrent Unit (Bi-GRU) layers; and (4) two fully connected hidden layers. The embedded sequence features and the original profile features are fed into multi-scale CNN layers with different kernel sizes to extract multi-scale local relevant features and improve the prediction performance. This model achieved state-of-the-art performance.
Cheng et al. (2020) proposed a prediction method for protein secondary structure based on CNN and LSTM models. The CNN has two convolutional layers, one pooling layer, and a ReLU activation layer in its architecture. A softmax classifier is fed with the feature maps extracted from the second convolutional layer to obtain the first probability output. The LSTM model has a sequence layer and a last layer. To get the second probability output, the features are retrieved from the last layer and fed into a random forest classifier. In that study, the two probability outputs are weighted and combined to obtain the prediction model EN-CSLR.

Related Work
Recent studies show that the prediction of protein secondary structure is a vital issue. The accuracy and time complexity of existing deep learning methods for this prediction are not yet satisfactory. The proposed Hybrid Recurrent Neural Networks (HRNN) are therefore intended to improve the prediction of protein secondary structure.

Protein
Proteins are large, complex organic molecules that play a vital role in the body. Proteins are built from sequences of amino acids (Babaei et al., 2010; Dongardive and Abraham, 2017). They function mostly in cells and are essential for the formation, function, and control of body tissues and organs (Wardah et al., 2018). Figure 1 shows the structure of the protein. Protein structure is generally described at four levels: primary, secondary, tertiary, and quaternary structure (Protein Structure, 2019; OPS, 2022).

Primary Structure (PS)
The most basic level of protein structure is called the primary structure. It is the linear sequence of amino acids in a polypeptide chain (Wardah et al., 2018). For example, the hormone insulin contains two polypeptide chains, and each chain has its own set of amino acids arranged in a specific order. Each amino acid is linked to the next by a peptide bond formed during biosynthesis. A protein chain starts at the amino-terminal (N) end and finishes at the carboxyl-terminal (C) end (Protein Structure, 2019; OPS, 2022).
Primary Structure = Sequence of Amino Acids. Figure 2 shows the 3-letter codes of the amino acid sequence. The amino acid order of a protein is determined by the DNA of the gene that encodes the protein or a subunit of a multisubunit protein.

Secondary Structure (SS)
The second level of protein structure is called the secondary structure. The most common secondary structures are α-helices and β-strands (Li and Yu, 2016). The chain is folded or pleated, and the structure is formed within a polypeptide chain by hydrogen bonds between the carbonyl oxygen (O) and the amino hydrogen (H). In addition, secondary structure includes random coils, bulges, turns, β-bends, etc. (Protein Structure, 2019; OPS, 2022). Regular secondary structure = α-helix (H), β-strand (E), and coil (C) (Guo et al., 2019). Figure 3(a) shows the α-helix structure and its intramolecular hydrogen bonding. The α-helix has 3.6 amino acids per turn, and its inner-facing side chains are hydrophobic. Figure 3(b) shows the β-sheet of the secondary structure. Regions of 5 to 10 amino acids each form the strands of β-sheets.

Tertiary Structure (TS)
The tertiary structure refers to the three-dimensional structure of a protein (Babaei et al., 2010; Guo et al., 2018). It describes the overall fold of the polypeptide chain and arises from R-group interactions. These include hydrophobic interactions, hydrogen bonding, ionic bonding, disulfide bridges, and dipole-dipole interactions. Hydrogen bonds can be formed by polar R-groups. Hydrophobic interactions and dipole-dipole interactions are very important for the three-dimensional structure.
Tertiary structure = helices and strands folded into domains. Figure 4 shows the tertiary structure, in which helices and strands are folded into domains (Protein Structure, 2019; OPS, 2022).

Quaternary Structure (QS)
The quaternary structure gives a protein its specific overall shape (Guo et al., 2018). It involves interactions and cross-links between the polypeptide chains. Several kinds of interaction stabilize QS (Wardah et al., 2018), for example hydrophobic and hydrophilic interactions, salt bridges, hydrogen bonds, and disulfide bonds (Protein Structure, 2019; OPS, 2022). Quaternary Structure (Biological Units) = functional assemblies of chains (subunits). Figure 5 shows the quaternary structure, which has two or more tertiary subunits. For example, hemoglobin contains two alpha chains and two beta chains.
In this study, the secondary structure of proteins is the prediction target. Many techniques exist for this prediction, such as machine learning algorithms, the Chou-Fasman method, and the Hidden Markov Model, but hybrid recurrent neural network techniques are used in this study. These techniques help to improve on the prediction performance and accuracy of existing methods. The proposed hybrid recurrent neural network consists of Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional RNN, together with their combinations, BiGRU and BiLSTM.

Gated Recurrent Unit (GRU)
A gated recurrent unit (GRU) is one kind of recurrent neural network. This unit can be used to improve the memory capacity of a recurrent neural network as well as to make a model easier to train. The hidden unit also helps to address the vanishing gradient problem in recurrent neural networks. GRUs perform well in sequence-based processing models and in predictions on smaller and larger datasets (Wardah et al., 2018; Phi, 2018). Figure 6 illustrates the architecture of the GRU. The gates of the GRU are described below:
1. Update gate: It determines how much knowledge of the past is passed on to the future. It is analogous to the LSTM output gate (Li and Yu, 2016; Phi, 2018)
2. Reset gate: It determines how much of the past knowledge is reset. It is analogous to the combination of the input gate and the forget gate in LSTM (Li and Yu, 2016; Phi, 2018)
3. Current memory gate: It is incorporated into the reset gate, in the same way that the input modulation gate is a sub-part of the input gate. It introduces some nonlinearity into the input and makes the input zero-mean. It also reduces the impact of past data on the current data that is passed on to the future (Li and Yu, 2016; Phi, 2018)
For the fully gated unit, with input vector $y_t$, output vector $g_t$, and initial value $g_0 = 0$ at $t = 0$:
$$u_t = \sigma_g(P_u y_t + V_u g_{t-1} + C_u) \tag{1}$$
$$q_t = \sigma_g(P_q y_t + V_q g_{t-1} + C_q) \tag{2}$$
$$\hat{g}_t = \phi_g(P_g y_t + V_g (q_t \odot g_{t-1}) + C_g) \tag{3}$$
$$g_t = u_t \odot g_{t-1} + (1 - u_t) \odot \hat{g}_t \tag{4}$$
where $u_t$ is the update gate vector, $q_t$ is the reset gate vector, $\hat{g}_t$ is the candidate activation vector, and $g_t$ is the output vector of the fully gated unit.
Here, $\sigma_g$ is the sigmoid function, $\phi_g$ is the hyperbolic tangent, and P, V, and C are the parameter matrices and vector (Zhang et al., 2018; Wang et al., 2017; Li and Yu, 2016; Panda and Majhi, 2021; Phi, 2018).

Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is one kind of artificial Recurrent Neural Network (RNN) architecture used in deep learning (Cheng et al., 2020). Unlike standard feedforward neural networks, LSTM has feedback connections. It can process both single data points and entire sequences of data. LSTM cells can capture both long- and short-range interactions by enforcing constant error flow. This allows the entire protein sequence to be used as input, without segmenting the sequence. LSTM can be used for prediction problems, classification problems, processing problems, protein homology detection, and other sequential problems. Element-wise multiplication (⊙) is used in the calculations (Sønderby and Winther, 2014; Phi, 2018). Figure 7 shows the general components of a typical LSTM unit: a cell state, an input gate, an output gate, and a forget gate.
Hidden state and new inputs: The input at the current timestep and the hidden state from the prior timestep are combined before being passed through the various gates.
Forget gate: This gate determines which information should be forgotten. The sigmoid function outputs values between 0 and 1, and each cell state value is multiplied by one of these outputs, so it is deleted (0), fully recalled (1), or partially remembered (a value in between).
Input gate: This gate identifies the critical components that must be added to the cell state. The cell state candidate is multiplied by the output of the input gate, so that only the information the input gate deems important is included in the cell state.
Update cell state: The prior cell state is multiplied by the output of the forget gate. The product of the input gate and the cell state candidate is then added, which supplies the new information for the latest cell state.
Update hidden state: This is the last part. The most recent cell state is passed through the tanh activation function and multiplied by the output of the output gate.
The equations of the LSTM cell with forget gate are:
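In the widely used standard form, which matches the gate descriptions above, they can be written as below. The symbol names are assumptions chosen for illustration and may differ from the notation of the cited works: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\tilde{c}_t$ the cell state candidate, $f_t$, $i_t$, and $o_t$ the forget, input, and output gates, and $W$, $U$, and $b$ the parameter matrices and bias vectors.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The fifth line is the "update cell state" step and the sixth is the "update hidden state" step; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication.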

Bidirectional Recurrent Neural Networks (BRNN)
A bidirectional recurrent neural network is a special kind of recurrent neural network. It connects two hidden layers that run in opposite directions to the same output (Guo et al., 2018; BRNN, 2022), so the output layer can draw on information from both the forward and the backward states (BRNN, 2022). This network can be used for protein secondary structure prediction and other prediction problems, as well as for speech recognition, handwriting recognition, and translation. Two bidirectional RNNs, one with GRU and one with LSTM, are stacked to increase the prediction performance (Guo et al., 2018).
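The idea of two hidden layers running in opposite directions and feeding the same output can be sketched with NumPy. This is a minimal sketch using a plain tanh recurrent cell rather than the GRU or LSTM cells of the proposed models; the function names, dimensions, and random weights are illustrative assumptions.

```python
import numpy as np


def rnn_pass(inputs, W, U, b):
    """Run a tanh RNN over inputs of shape (T, d_in); return states of shape (T, d_h)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return np.stack(states)


def bidirectional_rnn(inputs, fwd, bwd):
    """Concatenate forward and backward hidden states for every position."""
    forward = rnn_pass(inputs, *fwd)
    # Process the reversed sequence, then flip the states back so that
    # row t of `backward` corresponds to position t of the input.
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=1)


def init_params(d_in, d_h, rng):
    """Hypothetical random initialisation: (input weights, recurrent weights, bias)."""
    return (rng.normal(scale=0.3, size=(d_h, d_in)),
            rng.normal(scale=0.3, size=(d_h, d_h)),
            np.zeros(d_h))


rng = np.random.default_rng(0)
T, d_in, d_h = 7, 20, 4          # 7 residues, 20 features each, 4 hidden units
fwd_params = init_params(d_in, d_h, rng)
bwd_params = init_params(d_in, d_h, rng)
x = rng.normal(size=(T, d_in))
out = bidirectional_rnn(x, fwd_params, bwd_params)
print(out.shape)  # (7, 8)
```

At each position the first `d_h` values summarise the residues seen so far and the last `d_h` values summarise the residues still to come, which is what lets a per-residue classifier use context on both sides.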

Bidirectional Gated Recurrent Unit (BiGRU)
A bidirectional gated recurrent unit is a sequence processing model that consists of two GRUs: one takes the input in the forward direction and the other in the backward direction (Guo et al., 2018). It is a bidirectional recurrent neural network whose gates play the roles of the input and forget gates (Li and Yu, 2016). This bidirectional network increases the performance of PSSP (Li and Yu, 2016; Kumar et al., 2020; BGRU, 2016).

Bidirectional Long Short-Term Memory (BiLSTM)
A bidirectional LSTM is a bidirectional recurrent neural network. It is a sequence processing model based on two LSTMs: one takes the input in the forward direction and the other in the backward direction (Guo et al., 2018). Like the LSTM network, Bi-LSTMs improve the performance of the algorithms, and they increase the amount of information available to the network. This technique increases the accuracy of PSSP compared with other simple RNN techniques (Hanson et al., 2019; Kumar et al., 2020).

Proposed Methods
In this study, a prediction method based on hybrid recurrent neural networks is proposed for protein secondary structure. Recurrent neural network-based algorithms have been implemented, and the amino acid sequences are used together with PSSM for a better prediction of secondary structure. Figure 9 shows the workflow. First, the dataset is loaded and unnecessary raw data are removed. The dataset is then preprocessed. After preprocessing, a widely used sequence comparison tool, the PSSM, is applied. Next, a 2D layer of the recurrent neural network model is created. Before the 2D layer is created, five types of RNN are defined and combined within the 2D-RNN. In the end, the models 2D-GRU_RNN, 2D-LSTM_RNN, 2D-Bidirectional_RNN, 2D-BiGRU_RNN, and 2D-BiLSTM_RNN are built. These models are trained and tested on the datasets. Finally, the accuracy of the models is calculated.
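One plausible way to combine sequence features with a PSSM into a 2D input is sketched below. The text does not specify the encoding, so everything here is an assumption for illustration: the one-hot encoding of the 20 standard residues, the sigmoid scaling of PSSM scores, the function names, and the all-zero placeholder PSSM.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as an (L, 20) one-hot matrix."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:  # non-standard residues (e.g. 'X') stay all-zero
            encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded


def build_features(sequence, pssm):
    """Concatenate one-hot sequence features with a scaled PSSM into an (L, 40) matrix."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.shape != (len(sequence), 20):
        raise ValueError("PSSM must have one row of 20 scores per residue")
    # Assumption: raw PSSM scores are squashed into (0, 1) with a sigmoid,
    # a common normalisation that is not stated in the text.
    scaled = 1.0 / (1.0 + np.exp(-pssm))
    return np.concatenate([one_hot(sequence), scaled], axis=1)


seq = "MKVLA"                            # hypothetical 5-residue sequence
placeholder_pssm = np.zeros((len(seq), 20))  # stand-in for a real PSSM
features = build_features(seq, placeholder_pssm)
print(features.shape)  # (5, 40)
```

Each row of the resulting matrix describes one residue, so the matrix can be fed position by position to any of the recurrent models.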

CB513 Dataset
CB513 is a widely used and suitable dataset for improving algorithms and predicting secondary structure. In this study, we used this dataset because it helps to increase the prediction performance of protein secondary structure (Wang et al., 2017; Zhou et al., 2018).

PDB and other Datasets
The PDB dataset is used to predict 2-D and 3-D protein structures. It is the largest dataset. It is also used for protein fold analysis and for organizing classification data. We used PDB datasets and other types of datasets. For all of the datasets, the accuracy of 's dataset is better than that of the other types of datasets (SPS, 2022). The RS126 dataset is one of the oldest datasets for protein secondary structure prediction. It was created by Rost and Sander. It is one of the most effective datasets for predicting protein structure, and it is also applied in bioinformatics research. It contains 23,347 residues with an average protein sequence length of 185. The RS126 dataset consists of 32% α-helix (H), 21% β-sheet, and 47% random coil (RS126Data, 2022).

GRU Recurrent Neural Network Algorithm
GRU (D, P, S, C)
Input: D: training data, P: test data, S: sample size, C: number of sequences
Output: Prediction results
1. Initialize the sizes S and C
2. For j = 1 to n do
3. Calculate the update gate $u_t$ using Eq. 1; initially, at t = 0, the input vector is $y_t$ and the output vector is $g_0 = 0$:
$$u_t = \sigma_g(P_u y_t + V_u g_{t-1} + C_u)$$
4. Calculate the reset gate $q_t$ from the sigmoid function and the parameter matrices, using Eq. 2:
$$q_t = \sigma_g(P_q y_t + V_q g_{t-1} + C_q)$$
5. Finally, calculate the output vector: the new memory uses the reset gate $q_t$ to store the previous information. The candidate activation $\hat{g}_t$ is needed to hold this information, using Eq. 3 and 4:
$$\hat{g}_t = \phi_g(P_g y_t + V_g (q_t \odot g_{t-1}) + C_g)$$
$$g_t = u_t \odot g_{t-1} + (1 - u_t) \odot \hat{g}_t$$
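The algorithm above can be sketched in NumPy as a single-step function applied along the sequence. This is an illustrative sketch, not the authors' implementation: the dictionary keys, the dimensions, the random initialisation, and in particular the form of Eq. 4 used here (the update gate weighting the previous output) are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(y_t, g_prev, p):
    """One GRU step; p maps hypothetical keys such as 'P_u' to NumPy arrays."""
    u_t = sigmoid(p["P_u"] @ y_t + p["V_u"] @ g_prev + p["C_u"])            # Eq. 1: update gate
    q_t = sigmoid(p["P_q"] @ y_t + p["V_q"] @ g_prev + p["C_q"])            # Eq. 2: reset gate
    g_hat = np.tanh(p["P_g"] @ y_t + p["V_g"] @ (q_t * g_prev) + p["C_g"])  # Eq. 3: candidate
    # Eq. 4 (assumed convention): u_t weights the previous output.
    return u_t * g_prev + (1.0 - u_t) * g_hat


def init_params(d_in, d_hidden, rng):
    """Hypothetical random initialisation of the P, V, and C parameters."""
    p = {}
    for gate in "uqg":
        p[f"P_{gate}"] = rng.normal(scale=0.1, size=(d_hidden, d_in))
        p[f"V_{gate}"] = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
        p[f"C_{gate}"] = np.zeros(d_hidden)
    return p


rng = np.random.default_rng(0)
params = init_params(d_in=20, d_hidden=8, rng=rng)
sequence = rng.normal(size=(15, 20))  # 15 residues with 20 features each (illustrative)
g = np.zeros(8)                       # initial output vector g_0 = 0
outputs = []
for y_t in sequence:
    g = gru_step(y_t, g, params)
    outputs.append(g)
outputs = np.stack(outputs)
print(outputs.shape)  # (15, 8)
```

In a trained model the parameters would be learned by backpropagation; here they are random, so only the shapes and the data flow are meaningful.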

Performance Analysis and Results
Accuracy improves when a better-quality PSSP dataset is available. In this study, ten datasets have been used. The models train quickly. For training and testing, batch size = 128, verbose = 1, validation_split = 0.5, and epochs = 20 were used. The prediction process took 125 ms/step. As a result, these models achieved higher accuracy than existing methods.
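To make these settings concrete, the short sketch below works out how many samples are used for training and how many steps an epoch contains. The setting names match the arguments of Keras `Model.fit`, although the text does not name the framework. The dataset size of 5000 is hypothetical, and applying the reported 125 ms/step to training steps is an assumption, since the text reports it for the prediction process.

```python
import math


def training_schedule(n_samples, batch_size=128, validation_split=0.5,
                      epochs=20, seconds_per_step=0.125):
    """Derive sample counts, step counts, and a rough run time from the settings."""
    # A validation_split of 0.5 holds out half of the samples for validation.
    n_train = math.floor(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_train
    # One step processes one batch; the last batch may be smaller.
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "train_samples": n_train,
        "validation_samples": n_val,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "estimated_seconds": total_steps * seconds_per_step,
    }


plan = training_schedule(n_samples=5000)  # hypothetical dataset size
print(plan["steps_per_epoch"], plan["estimated_seconds"])  # 20 50.0
```

With these settings only half of the data updates the weights, which keeps epochs short but may limit accuracy on the smaller datasets.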

Performance Indices and PSSM
Accuracy: Accuracy measures the prediction performance. Q3 accuracy, with PSSM as the input representation of the datasets, is used to evaluate the prediction of secondary structure. It is calculated by:
$$Q_3 = \frac{N_C + N_H + N_E}{N} \times 100$$
where:
$N_C$ = the number of accurately predicted residues of structural class C
$N_H$ = the number of accurately predicted residues of structural class H
$N_E$ = the number of accurately predicted residues of structural class E
$N$ = the total number of amino acids
The per-state accuracy is:
$$Q_j = \frac{N_j}{\text{total number of amino acid residues in state } j} \times 100$$
where j denotes the state (C, H, or E).
Figure 10 compares the predicted test sequences with the actual sequences from the original data. Tables 1 and 2 show the Q3 accuracy, together with QC, QH, and QE, on the tested datasets for BiGRU, GRU, and BiLSTM. Figure 11 shows the performance on the 10 datasets, which are sequentially organized for computing the accuracy. The five techniques were evaluated on these datasets; together they form the hybrid recurrent neural network model, which uses the GRU, LSTM, Bidirectional, BiGRU, and BiLSTM techniques. The results of BiGRU and BiLSTM are higher than those of the single GRU, LSTM, and Bidirectional RNN. The accuracy reaches almost 93% on the PDB and validation secondary datasets, which we attribute to the larger size of these datasets. This hybrid technique increases the performance of protein secondary structure prediction and can handle sequential and protein fold data. Figure 12 compares the proposed model with the CNN model on the test datasets. The accuracy of the proposed model is better than that of the existing methods. On the test datasets CASP10, CASP12, CB513, and PDB25, the proposed model reaches 91, 92, 89, and 93%, which is 4, 5, 1, and 4 percentage points higher than the CNN model (Zhao et al., 2020), whose 3-state accuracy on these datasets is less satisfactory. The proposed model therefore helps to move prediction performance and sequence alignment closer to accurate prediction.
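The Q3 and per-state accuracies defined in this section can be computed from a pair of label strings as sketched below. This is an illustrative helper, not the evaluation code used for the reported results; the function name and the example strings are hypothetical, and labels are assumed to be restricted to C, H, and E.

```python
def q3_scores(true_states, predicted_states):
    """Return Q3 and the per-state accuracies QC, QH, QE as percentages."""
    if len(true_states) != len(predicted_states):
        raise ValueError("true and predicted sequences must have equal length")
    if not true_states:
        raise ValueError("sequences must not be empty")
    totals = {state: 0 for state in "CHE"}   # residues whose true state is j
    correct = {state: 0 for state in "CHE"}  # of those, residues predicted correctly
    for truth, pred in zip(true_states, predicted_states):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    scores = {"Q3": 100.0 * sum(correct.values()) / len(true_states)}
    for state in "CHE":
        # A state that never occurs has no defined per-state accuracy.
        scores["Q" + state] = (100.0 * correct[state] / totals[state]
                               if totals[state] else float("nan"))
    return scores


truth = "CCHHHHEEEC"      # hypothetical actual states
predicted = "CCHHHCEEEE"  # hypothetical predicted states
scores = q3_scores(truth, predicted)
print(scores["Q3"], scores["QH"], scores["QE"])  # 80.0 75.0 100.0
```

In this example 8 of 10 residues are correct, so Q3 is 80%, while QC is 2 of 3, or about 66.7%, which shows how a single overall figure can hide weaker performance on one state.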

Conclusion
In this study, a hybrid recurrent neural network has been applied to improve the overall prediction performance of protein secondary structure. The technique includes five types of RNN: GRU, LSTM, Bidirectional, BiGRU, and BiLSTM. These techniques are applied to extract the features of protein structural class and sequence alignment. A 2D RNN has been used for better prediction performance. BiGRU and BiLSTM also help to improve the 3-state accuracy of predictions compared with the other strategies in the recurrent neural network model. This hybrid recurrent neural network model provides a significant first step towards predicting the three-dimensional structure, as well as providing information about protein activity, relationships, and functions. Predicting protein folds from amino acid sequences can transform the design of accurately targeted drugs and help explain the causes of new and old diseases. Knowing a protein's structure provides a broader understanding of how the protein works, allowing us to form hypotheses about how to affect, control, or alter it. For example, knowing the structure of a protein allows site-directed mutations to be designed to change its function. We improved the prediction performance and achieved a highest accuracy of 93%, which is higher than that of existing works. As part of future work, we will propose three-dimensional protein structure prediction using deep learning techniques. We can also explore more advanced techniques, such as unsupervised or semi-supervised learning and other machine learning techniques, combined with convolutional and recurrent neural network models, multilayer perceptrons, inductive learning, and more.