© 2008 Science Publications
Development of Deduced Protein Database Using Variable Bit Binary Encoding

A large amount of biological data is semi-structured and stored in one of the following file formats: flat, XML or relational files. These databases must be integrated with the structured data available in relational or object-oriented databases. Sequence matching is difficult in such file formats because string comparison is costly in both computation and time. To reduce the storage size of amino acid sequences in a protein database, a novel probability-based variable bit length encoding technique is introduced. The probability value of each amino acid is derived from the number of triplet CODONs that map to it. A binary tree is then constructed to assign a unique binary code to each amino acid. This derived bit pattern replaces the existing fixed-byte representation. The reduction in protein database space is analysed and found to lie between 42.86 and 87.17%. To validate the method, amino acid sequences of several major organisms, such as Sheep and Lambda phage, were collected from NCBI and represented using the proposed method. The comparison shows minimum and maximum reductions in storage space of 43.30% and 72.86%, respectively. In future, the biological data can be reduced further by applying lossless compression to this deduced data.


INTRODUCTION
RNA sequences are composed of four nucleotides: adenine (A), uracil (U), guanine (G) and cytosine (C). Any ordered combination of three nucleotide bases is called a triplet or CODON. Hence there are 4^3 = 64 possible CODONs. Combinations of these CODONs form different proteins. The sequence of amino acids represents a particular gene or protein [7]. Protein databases are generally written in shorthand, using single-letter designations. In existing protein databases, the amino acid sequences are in text format and are stored in one of the following file formats: flat, XML or relational files. This representation requires more storage space [3], and the sequences have to be manipulated with high-level programming languages. Hence it incurs a high computation cost and a long execution time.
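The count of 64 CODONs follows directly from taking every ordered triplet of the four bases. A minimal sketch of that enumeration is shown here; the variable names are illustrative only and do not come from the paper.

```python
from itertools import product

NUCLEOTIDES = "AUGC"  # the four RNA bases named in the text

# Every ordered triplet of bases is one CODON.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))   # number of triplets, 4 ** 3
print(codons[:4])    # first few triplets in enumeration order
```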

MATERIALS AND METHODS
Details of some existing universal biological databases are given below.
Universal protein sequence database: There are different categories of biological databases, such as nucleic acid sequence, protein sequence and protein structure databases. Swiss-Prot [2] provides a biological database with a minimum level of redundancy and a high level of integration with other databases. The Protein Research Foundation Sequence Database [1] is a database of protein primary structures. AceDB is an integrative database system that has been used for the management of genome-oriented biological data. The Protein Data Bank (PDB) is the single worldwide archive of structural data of biological macromolecules. The generated sequence data are stored in large genomic repositories, of which the most commonly used are EMBL/European Bioinformatics Institute (EBI) and NCBI [2]. The Basic Local Alignment Search Tool (BLAST) is used to compare a novel sequence with those contained in nucleotide and protein databases [5,6]. ClustalW is a general-purpose multiple sequence alignment program for DNA or proteins. The data storage formats used in the different existing databases are shown in Table 1.

Types of data model:
Bioinformatics databases are maintained by different organizations using different DBMSs with different data models, such as flat file, XML, relational or object-oriented [4]. Figure 1 shows biological data stored in flat, XML and RDB file formats.
Flat file: A flat file database is the simplest database model, in which the data are stored as one record per line. Flat-file libraries contain data structured as ASCII text, as shown in Fig. 1a. ASCII is the de facto standard for data exchange (e.g., BLAST, CLUSTALW etc.) [4].
XML file: XML is a hierarchical, semi-structured model that uses text-based files. An XML databank that stores data as a structured text file using XML tags (e.g., PIR-PSD) is shown in Fig. 1b.
Fig. 1: Structure of a flat-file, XML and RDB entry

Relational database file:
A relational database file is a highly structured model. SQL statements are used to retrieve information from the database (e.g., AceDB). The results of a query are converted into a standard text format. Figure 1c shows the storage format of a relational database entry.
Iskandar et al. [9] described software-based approaches, namely navigational, mediator and data warehousing, for integrating the different databases. All the file types discussed above use a fixed-byte representation to store amino acid sequences. In contrast, our method uses a variable number of bits to encode the data, which reduces the storage space and makes it possible to explore hardware support for searching.
Proposed work: All existing protein databases use the ASCII code to encode each character, so a single amino acid requires 7 bits. Let n be the number of amino acids present in the protein. Then the memory needed to store the protein is 7n bits. To reduce the size of the protein database, a new approach is proposed here. This approach converts the existing protein databases from ASCII to a machine-readable binary format using a probability-based variable bit length encoding technique. The proposed method reduces the storage space by 43.85 to 87.17%. This reduction factor is described in detail in the discussion of results. Here, a unique binary code is assigned to each amino acid. These codes are generated using a binary tree, whose construction is discussed in the subsection on building the binary tree. For every amino acid, a probability value is calculated from the number of CODONs that map to it. In this method, fewer bits are assigned to amino acids that appear frequently and more bits to those that appear less frequently. The binary code representation of amino acids can be obtained using three techniques, namely fixed length encoding, unique bit pattern variable length encoding and probability-based variable bit length encoding.
Fixed length encoding: With fixed length encoding, each of the 21 amino acid symbols is represented as a bit pattern of fixed size 5 (2^5 = 32). Some amino acids occur more frequently than others, but all are assigned the same number of bits. This results in a lengthy encoded sequence. Hence a variable length encoding scheme is proposed below to overcome this issue.
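The fixed-length scheme can be sketched as follows. This is only an illustration under assumptions: the symbol order (and therefore each numeric code) is an arbitrary choice made here, and the function names are hypothetical rather than taken from the paper.

```python
import math

# The 21 single-letter symbols used in this paper. Their order here, and hence
# the numeric value of each code, is an arbitrary choice for this sketch.
SYMBOLS = "RLSPVAGTIOHKFYNDCQEMW"
WIDTH = math.ceil(math.log2(len(SYMBOLS)))  # smallest width that can hold 21 values

def encode_fixed(sequence: str) -> str:
    """Return the sequence as a string of fixed-width binary codes."""
    return "".join(format(SYMBOLS.index(aa), f"0{WIDTH}b") for aa in sequence)

def decode_fixed(bits: str) -> str:
    """Split the bit string into WIDTH-sized chunks and map each back to a symbol."""
    return "".join(SYMBOLS[int(bits[i:i + WIDTH], 2)]
                   for i in range(0, len(bits), WIDTH))

encoded = encode_fixed("MRW")
print(WIDTH, encoded, decode_fixed(encoded))
```

Because every symbol costs the same number of bits, a sequence of n symbols always occupies WIDTH × n bits, regardless of how common each amino acid is.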
Variable bit length encoding: Variable bit length encoding is done either by assigning a unique bit pattern to each amino acid or by assigning the bit pattern based on the probability of occurrence of the CODONs that map to a single amino acid.
Unique bit pattern variable bit length encoding: If each amino acid is represented by a unique bit pattern (R = 0, L = 1, S = 00, P = 01, V = 10, A = 11, G = 000, T = 001, I = 010, O = 011, H = 100, K = 101, F = 110, Y = 111, N = 0000, D = 0001, C = 0010, Q = 0011, E = 0100, M = 0101, W = 0110), it is distinguishable only when presented separately. The difficulty arises when these codes are concatenated into a data stream (00101010011110). Without a predictable bit length, the boundaries between codes are unknown and the stream can be misinterpreted. The proposed algorithm, known as probability-based variable bit length encoding, solves this issue.
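The ambiguity can be made concrete by counting how many different amino acid sequences produce the same bit stream under the code table listed above. The sketch below uses that table and the example stream from the text; the function name `count_parses` is hypothetical.

```python
from functools import lru_cache

# Code table copied from the text. It is not prefix-free: "0" is a prefix of "00".
CODES = {
    "R": "0", "L": "1", "S": "00", "P": "01", "V": "10", "A": "11",
    "G": "000", "T": "001", "I": "010", "O": "011", "H": "100", "K": "101",
    "F": "110", "Y": "111", "N": "0000", "D": "0001", "C": "0010",
    "Q": "0011", "E": "0100", "M": "0101", "W": "0110",
}

def count_parses(stream: str) -> int:
    """Count the distinct amino acid sequences that encode to `stream`."""
    @lru_cache(maxsize=None)
    def ways(pos: int) -> int:
        if pos == len(stream):
            return 1  # reached the end: one complete parse
        # Try every code that matches the stream at this position.
        return sum(ways(pos + len(code))
                   for code in CODES.values()
                   if stream.startswith(code, pos))
    return ways(0)

stream = "00101010011110"  # the example stream from the text
print(count_parses(stream))
```

Any count above one means that the stream cannot be decoded unambiguously, which is the problem a prefix-free code avoids.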

Probability based variable length encoding:
In this method, a probability value is calculated for each amino acid based on its frequency of occurrence, using Eq. 1. These estimated probability values are assigned to the leaf nodes of the binary tree to be constructed. After the binary tree is constructed, a value of zero is assigned to the left edges and one to the right edges of the tree. The binary code of an amino acid is then obtained by traversing the tree from the root along the branches that lead to that amino acid. Figure 2 shows an overview of the probability-based variable length encoding process.
Calculation of probability value: Among the 64 possible CODONs, the number (N) of triplet CODONs that map to a single amino acid is identified. Amino acids that have the same number of triplet CODONs are grouped. The probability (P) of each amino acid group occurring among the total combinations is computed using Eq. 1, where N is the number of triplet CODONs that map to a single amino acid and C is the count of amino acids that have the same number of triplet CODONs. The probability value calculated for each amino acid group is shown in Table 2. The relative probability values assigned to each amino acid are given in Table 3.
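The grouping step can be sketched under two assumptions that are not confirmed by the text: that Eq. 1 has the form P = (N × C) / 64, which is consistent with the definitions of N and C, and that the symbol "O" stands for the three stop CODONs. The CODON counts are those of the standard genetic code. The values printed here are not claimed to reproduce Table 2.

```python
from collections import Counter
from fractions import Fraction

# Number of CODONs per symbol in the standard genetic code.
# Treating "O" as the three stop CODONs is an assumption of this sketch.
CODON_COUNT = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3, "O": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}
TOTAL_CODONS = 64

# C: how many symbols share each CODON count N.
group_size = Counter(CODON_COUNT.values())

# Assumed form of Eq. 1: P = (N * C) / 64 for each group.
group_probability = {n: Fraction(n * c, TOTAL_CODONS)
                     for n, c in group_size.items()}

for n in sorted(group_probability, reverse=True):
    print(n, group_size[n], float(group_probability[n]))
```

Under this reading the group probabilities sum to one, since every one of the 64 CODONs belongs to exactly one group.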

Process for building the binary tree based on the probability value:
The process for building the binary tree is explained in the following steps.
Step 1: Arrange the entire amino acid character set in a row, ordered by probability value from highest to lowest (or vice versa). Each amino acid character is now a node at the leaf level of the tree.
Step 2: Find the two nodes with the smallest combined probability value, that is, the pair whose summed weight is smaller than that of any other pair. Join them under a new parent node, so that the two original nodes become the children of the new node. This new node, one level up from its children, is itself eligible to be combined with other nodes.
Step 3: Repeat Step 2 until all the nodes, on every level, are combined into a single binary tree.
Figure 3 shows the leaf-level nodes representing the original probability values of the amino acids, arranged in ascending order of value. The two nodes with the smallest values are combined first. The second row shows the next level of combined nodes. The probability value of each new node is the sum of the two smallest values in the list. This choice can cause branch lines to cross within the tree; hence the nodes are rearranged for clarity.
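The merging procedure in Steps 1 to 3 has the same shape as the classic Huffman construction. The sketch below follows that construction with a priority queue and reports only the depth of each leaf, which is the length of its code. The weights are the standard-genetic-code CODON counts, with "O" assumed to denote the stop CODONs. The tie-breaking order is an arbitrary choice of this sketch, so the individual lengths need not match Table 3.

```python
import heapq
from itertools import count

# CODONs per symbol in the standard genetic code ("O" = stop is an assumption).
WEIGHTS = {
    "L": 6, "S": 6, "R": 6, "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3, "O": 3, "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2,
    "N": 2, "Q": 2, "Y": 2, "M": 1, "W": 1,
}

def code_lengths(weights: dict) -> dict:
    """Merge the two lightest nodes repeatedly; return each symbol's tree depth."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    depth = {sym: 0 for sym in weights}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            depth[sym] += 1  # every leaf under the new parent moves one level down
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return depth

lengths = code_lengths(WEIGHTS)
print(sorted(lengths.items(), key=lambda item: (item[1], item[0])))
```

With these weights the resulting code lengths range from 3 to 6 bits, and the single-CODON symbols M and W receive the longest codes.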

Assigning the code:
Bit values are assigned to each branch of the constructed binary tree such that the left branch of each node is assigned a 0 bit and the right branch a 1 bit. The bit code assignment for the whole tree is shown in Fig. 4. The bit representation of any amino acid is obtained by concatenating the bits along the path from the root to the leaf of that amino acid. The binary code for each amino acid is shown in Table 3, along with its code length and the positions containing the bit value 1. With this probability-based variable length encoding technique, no code is the prefix of another code. Hence, there is no chance of misinterpreting an encoded stream.
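A small sketch of the code assignment and of why a prefix-free code decodes unambiguously is given here. The tree is a hypothetical five-symbol example, not the tree of Fig. 4, and all function names are illustrative.

```python
# Hypothetical tree for five symbols: leaves are strings, internal nodes are
# (left, right) pairs. This is not the tree of Fig. 4.
TREE = (("R", "L"), ("A", ("M", "W")))

def assign_codes(node, prefix=""):
    """Walk the tree, appending 0 for a left branch and 1 for a right branch."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    return {**assign_codes(left, prefix + "0"),
            **assign_codes(right, prefix + "1")}

def is_prefix_free(codes):
    """True if no code word is a prefix of another code word."""
    words = sorted(codes.values())
    # After sorting, a word that prefixes any other word is directly followed
    # by a word that starts with it.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever the buffer is a code."""
    lookup = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            out.append(lookup[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

codes = assign_codes(TREE)
print(codes, is_prefix_free(codes), decode("0011010", codes))
```

Because every symbol sits at a leaf, no code can be extended into another code, so the decoder never has to guess where one code ends and the next begins.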

Implementation of variable bit length encoding:
This variable length encoding method maps an amino acid to a binary code, and vice versa, without any ambiguity. The binary code representation of an amino acid must now be stored in memory, so that the resulting protein database is a deduced one. The technique used to store the binary code representation of an amino acid sequence is explained below.
Encoding algorithm: The encoding algorithm stores the binary codes of an amino acid sequence in byte form. Initially, the number of bytes required to store the sequence is calculated from the code lengths stored in the symbol table (Table 3), and the memory space is allocated. The symbol table also gives, for each amino acid, the length of its binary code and the positions containing the bit value 1. The pseudo code of the encoding algorithm is given below. The function and input variables in this encoding algorithm are also given below.
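One way to store variable-length codes in byte form is sketched here. It is not the paper's pseudo code: the code table is a hypothetical prefix-free example standing in for Table 3, and the sketch builds a bit string and slices it into bytes rather than setting individual bit positions.

```python
# Hypothetical prefix-free table; the paper's actual codes are those of Table 3.
CODES = {"R": "00", "L": "01", "A": "10", "M": "110", "W": "111"}

def pack(sequence: str):
    """Concatenate the codes and pack them into bytes, most significant bit first.

    Returns the packed bytes and the number of meaningful bits, because the
    last byte may be padded with zeros.
    """
    bits = "".join(CODES[aa] for aa in sequence)
    n_bytes = (len(bits) + 7) // 8            # bytes needed, rounded up
    padded = bits.ljust(n_bytes * 8, "0")
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, len(bits)

def unpack(data: bytes, n_bits: int) -> str:
    """Recover the amino acid sequence from packed bytes."""
    bits = "".join(format(byte, "08b") for byte in data)[:n_bits]
    lookup = {code: aa for aa, code in CODES.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            out.append(lookup[buffer])
            buffer = ""
    return "".join(out)

data, n_bits = pack("MRLAW")
print(data.hex(), n_bits, unpack(data, n_bits))
```

Keeping the bit count alongside the bytes is a design choice of this sketch: without it, the zero padding in the final byte could be mistaken for further codes.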

RESULTS AND DISCUSSION
The space needed to store an amino acid sequence using the proposed variable length encoding is compared here with that of existing databases in ASCII format. The space occupied by an existing protein database is equal to seven times the number of amino acids (n) in the sequence. The comparison is carried out for three cases, namely the worst case, the best case and the rest.
In the best case, the selected amino acid sequence contains R (Arginine) alone. Then the size of the deduced protein database file created by the proposed method is 3(n-1)+6, including the start CODON. Hence the reducing factor is 87.17%. In the worst case, the selected amino acid sequence contains only W (Tryptophan) and M (Methionine). Then the size of the deduced protein database file created by the proposed method is 6n, that is, 6(n-1)+6. Hence the reducing factor is 42.86%. In the other cases, the selected protein database contains a combination of amino acids with 3-, 4-, 5- and 6-bit representations. Here, the size of the deduced protein sequence database file is given in Eq. 2. In all of the above cases, the size of the deduced protein database file is less than the 7n required by any existing protein database.
To evaluate our method, we collected amino acid sequences of major organisms, namely Sheep, Lambda phage, E. coli, Chlamydomonas, Tetrahymena, Budding yeast, Fission yeast, Neurospora, Maize, Arabidopsis, Medicago truncatula, C. elegans, Drosophila, Xenopus, Zebrafish, Rat and Mouse, from NCBI and reduced them using the proposed method. The resulting reduction in space is shown in Table 4. It is found that the average reduction in storage space is 43.77%, with a minimum of 43.30% and a maximum of 72.86%.
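The size expressions for the best and worst cases can be checked with a short calculation. The sketch below uses only the code lengths stated in the text (3 bits for R, 6 bits for M and W) and the 7-bit ASCII baseline. Reading the "+6" term as a 6-bit M for the start CODON is an interpretation, and the function names are hypothetical.

```python
ASCII_BITS = 7  # bits per amino acid in the existing representation

# Code lengths stated in the text: R uses 3 bits, M and W use 6 bits.
CODE_LENGTH = {"R": 3, "M": 6, "W": 6}

def encoded_bits(sequence: str) -> int:
    """Total bits needed when each symbol uses its variable-length code."""
    return sum(CODE_LENGTH[aa] for aa in sequence)

def ascii_bits(sequence: str) -> int:
    """Total bits needed at a fixed 7 bits per symbol."""
    return ASCII_BITS * len(sequence)

n = 10
best = "M" + "R" * (n - 1)   # start CODON (M) followed by R only
worst = "MW" * (n // 2)      # only 6-bit symbols
print(encoded_bits(best), encoded_bits(worst), ascii_bits(best))
```

A sequence mixing symbols of all four code lengths is handled in the same way once the remaining lengths from Table 3 are added to `CODE_LENGTH`.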