The Meaning of a Redundant Codon: There is Protein Folding Information in Nucleic Acids in Addition to the Genetic Code

Jan C. Biro and Josephine M.K. Biro
Homulus Foundation, San Francisco, 94 195 CA, USA

Abstract: All the information necessary for protein folding is supposed to be present in the amino acid sequence. It is still not possible to provide specific ab initio structure predictions by bioinformatical methods. It is suspected that additional folding information is present in protein coding nucleic acid sequences, which is not represented by the known genetic code. Nucleic acid subsequences comprising the 1st and/or 3rd codon residues in mRNAs express significantly higher free folding energy (FFE) than the subsequence containing only the 2nd residues (p<0.0001, n=81). This periodic FFE difference is not present in introns and therefore it is a specific physico-chemical characteristic of coding sequences and it might contribute to unambiguous definition of codon boundaries during translation. The FFE in the 1st and 3rd residues is additive, which suggests that these residues contain a significant number of complementary bases and contribute to selection for local RNA secondary structures in coding regions. This periodic, codon-related structure-forming of mRNAs indicates a connection between the structure of exons and the corresponding (translated) proteins. The folding energy dot plots of RNAs and the residue contact maps of the coded proteins are indeed similar. Residue contact statistics using 81 different protein structures confirmed that amino acids that are coded by partially reverse and complementary codons (Watson–Crick (WC) base pairs at the 1st and 3rd codon positions and translated in reverse orientation) are preferentially co-located in protein structures. Exons are distinguished from introns and codon boundaries are physico-chemically defined by periodically distributed FFE differences between codon positions. 
There is a selection for local RNA secondary structures in coding regions and this nucleic acid structure resembles the folding profiles of the coded proteins. The preferentially (specifically) interacting amino acids are coded by partially complementary codons, which strongly supports the connection between mRNA and the corresponding protein structures and indicates that there is protein folding information in nucleic acids that is not present in the genetic code. This might give some additional explanation of codon redundancy.


INTRODUCTION
The protein folding problem has been one of the grand challenges in computational molecular biology. The problem is to predict the native three-dimensional structure of a protein from its amino acid sequence. It is widely believed that the amino acid sequence contains all the necessary information to make up the correct three-dimensional structure, since protein folding is apparently thermodynamically determined; i.e., given a proper environment, a protein will fold up spontaneously. This is called Anfinsen's thermodynamic principle [1].
The thermodynamic principle has been confirmed many times on many different kinds of proteins in vitro. Critics say that the in vivo chemical conditions are different from those in vitro, that correct folding is determined by interactions with other molecules (chaperons, hormones, substrates, etc.), and that protein folding is much more complex than renaturation of denatured poly-amino acids. The fact that many naturally occurring proteins fold reliably and quickly to their native state, despite the astronomical number of possible configurations, has come to be known as Levinthal's Paradox [2]. Anfinsen's principle was formulated in the 1960s using purely chemical experiments and a lot of intuition. Today, we have a lot of sequences and structures available to establish a logical and understandable link between sequence, structure and function. But it is still not possible to correctly predict the structure (or a range of possible structures) purely from the sequence, ab initio and in silico [3].
There are two potential, external sources of additional and specific protein folding information: (a) the chaperons (other proteins that assist in the folding of proteins and nucleic acids) [4]; and (b) the protein coding nucleic acid sequences themselves (which are templates of protein synthesis, but are not defined as chaperons).
The idea that the nucleotide sequence itself could modulate translation and hence affect cotranslational folding and assembly of proteins has been investigated in a number of studies [5][6][7]. Studies on the relationships between synonymous codon usage and protein secondary structural units are especially popular [8][9][10]. The genetic code is redundant (61 codons code 20 amino acids) and as many as 6 synonymous codons can code the same amino acid (Arg, Leu, Ser). The "wobble" base has no effect on the meaning of most codons but still the codon usage (wobble usage) is not randomly defined [11,12] and there are well known, stable species-specific differences in the codon usage. It seems to be logical to search for some meaning (biological purpose) of the wobble bases and try to associate them with protein folding.
Another observation concerning the code redundancy dilemma is that there is a widespread selection (preference) for local RNA secondary structure in protein coding regions [13]. A given protein can be encoded by a large number of distinct mRNA species, potentially allowing mRNAs to simultaneously optimize desirable RNA structural features in addition to their protein coding function. The immediate question is whether there is some logical connection between the possible, optimal RNA structures and the possible, optimal biologically active protein structures.

MATERIALS AND METHODS
Single-stranded RNA molecules can form local secondary structures through the interactions of complementary segments. WC base pair formation lowers the average free energy, dG, of the RNA and the magnitude of change is proportional to the number of base pair formations. Therefore the free folding energy (FFE) is used to characterize the local complementarity of nucleic acids [13]. The free folding energy is defined as $\mathrm{FFE} = (dG_{\mathrm{shuffled}} - dG_{\mathrm{native}})/L \times 100$, where L is the length of the nucleic acid, i.e., the free energy difference between native and shuffled (randomized) nucleic acids per 100 nucleotides. Higher positive values indicate stronger bias toward secondary structure in the native mRNA, and negative values indicate bias against secondary structure in the native mRNA.
We used a nucleic acid secondary structure prediction tool, mfold [14], to obtain dG values, and the lowest dG was used to calculate the FFE. mfold also provided the folding energy dot plots, which are very useful to visualize the energetically most favored structures in a 2D matrix.
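The FFE definition above can be written out as a short calculation. The following is a minimal sketch, assuming the dG values have already been obtained from a folding tool; the function names and the example numbers are hypothetical and are not taken from the study.

```python
import random


def free_folding_energy(dg_native, dg_shuffled, length):
    """FFE = (dG_shuffled - dG_native) / L * 100, i.e. per 100 nucleotides."""
    if length <= 0:
        raise ValueError("length must be positive")
    return (dg_shuffled - dg_native) / length * 100


def shuffle_sequence(seq, seed=None):
    """Return a randomized copy of seq with the same base composition."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Hypothetical dG values (kcal/mol) for a 300-nt sequence, for illustration only.
ffe = free_folding_energy(dg_native=-90.0, dg_shuffled=-75.0, length=300)
print(ffe)  # positive: the native sequence folds more stably than the shuffled one
```

A positive result corresponds to a bias toward secondary structure in the native sequence, as described in the text; in practice the shuffled dG would be averaged over several randomized copies.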
A series of JAVA tools were used: SeqX to visualize the protein structures in 2D as amino acid residue contact maps [15]; SeqForm for selection of sequence residues in predefined phases (every third in our case) [16]; SeqPlot for further visualization and statistical analyses of the dot-plot views [17]; Dotlet as a standard dot-plot viewer [18]. Structural data were downloaded from PDB [19], NDB [20], and from a wobble base oriented database called Integrated Sequence-Structure Database (ISSD) [21].
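The idea behind a residue contact map can be illustrated in a few lines. This is only a sketch and not the SeqX implementation: it assumes one representative coordinate per residue (for example the C-alpha atom), uses the 6 Å co-location cutoff reported in the Results as the default, and skips immediate sequence neighbors. The function name and the toy coordinates are hypothetical.

```python
import math


def contact_pairs(coords, cutoff=6.0, min_separation=2):
    """Return residue index pairs (i, j), i < j, that lie within `cutoff` angstroms.

    coords: list of (x, y, z) tuples, one per residue (assumed representative atom).
    min_separation=2 skips residues that are direct neighbors in the chain.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Toy coordinates for four residues (not from any real structure).
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.0, 0.0)]
print(contact_pairs(toy))
```

Plotting the returned pairs as points in an N×N matrix gives the 2D contact map that is compared with the RNA energy dot plots.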
Structures were generally randomly selected regarding species and biological function (a few exceptions are mentioned in the Results). Care was taken to avoid very similar structures in the selections. A propensity for alpha helices was monitored during selection and structures with very high and very low alpha helix content were also selected to make sure of a wide range of structural representation.
Linear regression analyses and Student's t-tests were used for statistical analyses of the results.

RESULTS
Observations were made on human peptide hormone structures. This group of proteins is very well defined and annotated, the intron-exon boundaries are known and even intron data are easily accessible. The coding sequences were phase separated by SeqForm into three subsequences, each containing only the 1st, 2nd or 3rd letters of the codons. Similar phase separation was made for intronic sequences immediately before and after the exon. There are, of course, no known codons in the intronic sequences, therefore we continued the same phase that we applied for the exon, assuming that this kind of selection is correct and maintained the name of the phase denotation even for non-coding regions. Subsequences corresponding to the 1st and 3rd codon letters in the coding regions had significantly higher FFEs than subsequences corresponding to the 2nd codon letters. No such difference was seen in non-coding regions (Figure 1).
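The phase separation described above amounts to taking every third residue starting from each of the three codon positions. A minimal sketch follows; the function name is hypothetical and this is not the SeqForm code.

```python
def phase_separate(seq):
    """Split a sequence into three subsequences holding the 1st, 2nd and 3rd
    residue of every consecutive triplet, starting from the first residue."""
    return tuple(seq[phase::3] for phase in range(3))


# Four codons: AUG GCU AAG UGA
first, second, third = phase_separate("AUGGCUAAGUGA")
print(first, second, third)  # AGAU UCAG GUGA
```

For intronic sequences the same function can be applied by continuing the exon's reading phase, which is the assumption stated in the text.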
In a larger selection of 81 different protein structures, the corresponding protein and coding sequences were used to extend the observations. These 81 proteins represented different (randomly selected) species and different (also randomly selected) protein functions and therefore the results might be regarded as more generally valid. The propensity of different secondary structure elements was recorded (as annotated in different databases) (Figure 2).
The proportion of alpha helices varied from 0 to 90% in the 81 proteins and showed a significant negative correlation to the proportion of beta sheets (Figures 3 and 4).

[Figure caption fragment: The dG values were determined by mfold and the FFE was calculated. Each bar represents the mean±SEM, n=18.]

[Figure caption fragment: ... structures. L=317±20 (mean±SEM, n=81). Secondary structure codes: H, alpha helix; B, residue in isolated beta bridge; E, extended strand, participates in beta ladder; G, 3-helix (3/10 helix); I, 5 helix (pi helix); P, polyproline type II helix (left-handed); T, hydrogen bonded turn; S, bend.]

The original observation made on human hormone proteins, that significantly more free folding energy is associated with the 1st and 3rd codon residues than with the 2nd, was confirmed on a larger and more heterogeneous protein selection. A significant difference showed up even between the 1st and 3rd residues in this larger selection (Figure 5). There is a correlation between the protein structure and the FFE associated with codon residues. The correlation is negative between the FFE associated with the 2nd (middle) codon residues and the alpha helix content of the protein structure. The correlation is especially significant when the FFE ratios are compared to the helix/sheet ratios (Figures 6 and 7).

The alpha helix is the most abundant structure element in proteins. It shows a negative correlation to the frequency of the second most prominent protein structure, the beta sheet. The propensity of some amino acids and the major physico-chemical characteristics (charge and polarity) show significant correlation (positive or negative) to this structural feature. We include statistical analyses of alpha helix content and other protein characteristics in this article to show the complexity behind the term "alpha helix" and to show the uncertainty in interpreting any correlation to this structural feature (Figures 8 and 9). Detailed analyses of these data are outside the scope of this article.
Higher FFE in the subsequences of 1st and 3rd codon residues than in the 2nd indicates the presence of a larger number of complementary bases at the right positions of these subsequences. However, this might be the case only because the 1st and 3rd codon residues form simpler subsequences and contain longer repeats of the same nucleotide than the 2nd residues. This would not be surprising for the 3rd (wobble) base but would not be expected for the 1st residue, even though it is known that the central codon letters are the most important to distinguish between amino acids (as shown in the Common Periodic Table of Codons and Amino Acids [22]). It is more significant that the FFEs in the 1st and 3rd residues are additive and together they represent the entire FFE of the intact mRNA (Figure 10). Higher FFE at the 1st and 3rd codon positions than at the 2nd indicates that the number of complementary bases (a-t and g-c) is higher in the 1st and 3rd subsequences than in the 2nd. This is possible only if more complementary bases are in 1-1, 1-3, 3-1, 3-3 position pairs than in 1-2, 2-1, 2-3, 3-2 position pairs. We wanted to know whether the 1-1, 3-3 (complement) or the 1-3, 3-1 (reverse-complement) pairing is predominant.
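One simple way to approach this question from base composition alone is to count the bases at each codon position and ask how many WC pairs the counts could support at most, either within one position class or between two classes. The sketch below is an illustrative upper bound under that assumption; it ignores sequence order and folding geometry, the function names are hypothetical, and it is not the counting procedure used in the study.

```python
from collections import Counter

# WC partner of each base (U is treated as T so mRNA and DNA input both work).
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def position_counts(cds):
    """Count bases at codon positions 1, 2 and 3 of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    return [Counter(cds[phase::3]) for phase in range(3)]


def max_pairs_within(counts):
    """Upper bound on WC pairs formed inside one position class (e.g. 1-1)."""
    return min(counts["A"], counts["T"]) + min(counts["G"], counts["C"])


def max_pairs_between(counts_x, counts_y):
    """Upper bound on WC pairs formed between two position classes (e.g. 1-3)."""
    return sum(min(counts_x[base], counts_y[PAIR[base]]) for base in "ATGC")


c1, c2, c3 = position_counts("AUGGCUAAGUGA")
print(max_pairs_within(c1), max_pairs_within(c3), max_pairs_between(c1, c3))
```

Within a class, the bound is largest when the counts of a and t, and of g and c, are equal, so any imbalance lowers the number of pairs that class can form on its own.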
The length of the phase-separated nucleic acid subsequences (l) is a third of the original coding sequence (L). The number of different residues (a, t, g, and c) varies at different codon positions (1, 2, 3):

$a_1 + t_1 + g_1 + c_1 = a_2 + t_2 + g_2 + c_2 = a_3 + t_3 + g_3 + c_3 = l = L/3$

The highest number of complementary pairs might occur in the 1st subsequence if $a_1 = t_1$ and $g_1 = c_1$, i.e., $a_1/t_1 = g_1/c_1 = 1$. If, for example, $a_1 > t_1$ and $g_1 = c_1$, an excess of unpaired $a_1$ occurs, $a_1/t_1 > g_1/c_1 = 1$, and the possible FFE in subsequence 1 will be less. Following the same logic for other pairs in other subsequences, we can conclude that any deviation from a/t = g/c = 1 is suboptimal regarding the FFE. Counting the different residue ratios and combinations indicates that the optima are obtained if the residues in the first position form WC pairs with residues at the third positions (1-3) and vice versa. This is consistent with the expectation that mRNA will form local loops, in which the direction of more or less double-stranded sequences is reversed and (partially) complemented (Figure 11).

The partial (suboptimal) reverse complementarity of codon-related positions in nucleic acids suggested some similarity between protein structures and the possible structures of the coding sequences. This possibility was examined by visual comparison of 16 randomly selected protein residue contact maps and the energy dot plots of the corresponding RNAs. We could see similarities between the two different kinds of maps (Figure 12). However, this type of comparison is not quantitative and statistical evaluation is not directly possible.

Another similar, but still not quantitative, comparison of protein and coding structures was performed on four proteins that are known to have very similar 3D structures but whose primary structures (sequences) are less than 30% similar, as are the sequences of their mRNAs. These four proteins are examples of the fact that the tertiary structure of proteins is much more conserved than the amino acid sequence.
We asked whether this is also true for RNA structures and sequences. We found that there are signs of conservation even of the RNA secondary structure (as indicated by the energy dot plots) and there are similarities between the protein and nucleic acid structures (Figure 13). Comparisons of the protein residue contact maps with the nucleic acid folding maps suggest similarities between the 3D structures of these different kinds of molecules. However, this is a semi-quantitative method.
More direct statistical support might be obtained by analyzing and comparing residue co-locations in these structures. Assume that the structural unit of mRNA is a tri-nucleotide (codon) and the structural unit of the protein is the amino acid. The codon may form a secondary structure by interacting with other codons according to the WC base complementarity rules, and contribute to the formation of a local double helix. The 5′-A1U2G3-3′ sequence (Met, M codon) forms a perfect double strand with the 3′-U3A2C1-5′ sequence (His, H codon, reverse and complementary reading). Suboptimal complexes are 5′-A1X2G3-3′ partially complemented by 3′-U3X2C1-5′ (AAG, Lys; AUG, Met; AGG, Arg; ACG, Thr; and CAU, His; CUU, Leu; CGU, Arg; CCU, Pro, respectively).
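The pairing rule in the example above can be checked mechanically. The following sketch tests whether two codons, both written 5′→3′, are reverse complementary at the 1st and 3rd positions (with the 2nd position free), and lists the four partner codons of a given codon under that rule. The function names are hypothetical and chosen for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def is_rc_1x3(codon_a, codon_b):
    """True if codon_b (5'->3') pairs with codon_a in reverse orientation
    at codon positions 1 and 3; the middle position is not checked."""
    return (COMPLEMENT[codon_a[0]] == codon_b[2]
            and COMPLEMENT[codon_a[2]] == codon_b[0])


def is_perfect_rc(codon_a, codon_b):
    """True if codon_b is the full reverse complement of codon_a."""
    return is_rc_1x3(codon_a, codon_b) and COMPLEMENT[codon_a[1]] == codon_b[1]


def partners_1x3(codon):
    """All codons (5'->3') that are reverse complementary to `codon`
    at positions 1 and 3, with any base in the middle."""
    return [COMPLEMENT[codon[2]] + middle + COMPLEMENT[codon[0]]
            for middle in "UCAG"]


print(is_perfect_rc("AUG", "CAU"))  # Met codon and His codon from the text
print(partners_1x3("AUG"))
```

Applied to any codon of the form AXG, `partners_1x3` returns the four CXU codons, which is the suboptimal pairing set listed in the text.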
Our experiments with FFE indicate that local nucleic acid structures are formed under this suboptimal condition, i.e., when the 1st and 3rd codon residues are complementary but the 2nd is not. If this is the case, and there is a connection between nucleic acid and protein 3D structure, one might expect that the 4 amino acids coded by 5′-A1X2G3-3′ codons will preferentially co-locate with the 4 amino acids coded by 3′-U3X2C1-5′ codons. We constructed 8 different complementary codon combinations and found that the codons of co-locating amino acids are often complementary at the 1st and 3rd positions and follow the D-1X3/RC-3X1 formula but not the 7 other formulas (Figures 14 and 15).

[Figure caption fragment: The tool detected co-locations when two amino acids were within 6 Å of each other (neighbors on the same strand were excluded). The total number of co-locations was 34,630. Eight different complementary codes were constructed for the codons (2 optimal and 6 suboptimal). In the two optimal codes, all three codon residues (123) were complementary (C) or reverse complementary (RC) to each other. In the suboptimal codes, only two of three codon residues were C or RC to each other (12, 13, 23), while the third was not necessarily complementary (X).]
[Figure caption fragment: (For example, Complementary Code RC_1X3 means that the first and third codon letters are always complementary, but not the second, and the possible codons are read in reverse orientation.) The 400 co-locations were divided into 20 subgroups corresponding to 20 amino acids (one of the co-locating pairs), each group containing the 20 amino acids (corresponding to the other amino acid in the co-locating pair). If the codons of the amino acid pairs followed the predefined complementary code the co-location was regarded as positive (P); if not, the co-location was regarded as negative (N). Each symbol represents the mean frequency of P or N co-locations corresponding to the indicated amino acid. Paired Student's t-test, n=20.]

DISCUSSION
It is well known that coding and non-coding DNA sequences (exon/intron) are different and this difference is somehow related to the asymmetry of the codons, i.e., that the third codon letter (wobble) is poorly defined. Many Markov models have been formulated to find this asymmetry and de novo predict coding sequences (genes). These in silico methods work rather well but not perfectly and some scientists remain unconvinced that the codon asymmetry explains the exon-intron differences satisfactorily.
Another codon-related problem is that the well-known, non-overlapping, triplet codon translation is extremely phase-dependent and there is theoretically no tolerance for any phase shift. There are famous examples of how a single nucleotide deletion can destroy the meaningful translation of a sequence, with consequences that are incompatible with life. However, considering the magnitude and complexity of the eukaryotic proteome, the precision of translation is astonishingly good. Such physical precision is not possible without a massive and consistent physico-chemical foundation. Therefore, discovery of the existence of secondary structure bias (folding energy differences) in coding regions of many organisms [13] was a very welcome observation because it differentiates exons from introns on a physico-chemical basis.
Our experiments with free folding energy (FFE) confirmed that this bias exists. In addition, there is a very consistent and very significant pattern of FFE distribution along the nucleotide sequence. Comparing the FFE of phase-selected subsequences, subsequences comprised of only the 1st or only the 3rd codon letters showed significantly higher FFE than those consisting only of the 2nd letters. This FFE difference was not present in intronic sequences preceding and following the exons, but it was present in exons from different species including viruses. This is an interesting observation because this phenomenon might not only distinguish between exons and introns on a physico-chemical basis, but it might even clearly define the trinucleotide codons and thus the phase of the translation. This codon-related phase-specific variation in FFE may explain why mRNAs have greater negative free folding energies than shuffled or codon choice randomized sequences [23].
Free folding energy in nucleic acids is always associated with WC base pair formation. Higher FFE indicates more WC pairs (presence of complementarity) and lower FFE indicates fewer WC pairs (less complementarity). The FFE in the 1st and 3rd codon positions was additive, while the 2nd letter did not contribute to the total FFE; the total FFE of the entire (intact) nucleic acid was the same as that of subsequences containing only the 1st and 3rd codon letters (2nd deleted). This indicates that the local RNA secondary structure bias is caused by complementarity of the 1st and 3rd codon residues in local sequences. This partial, local complementarity is closer to optimal in the reverse orientation of the local sequences, as expected with loop formation.
It is known that single stranded RNA molecules can form local secondary structures through the interactions of complementary segments. The novel observation here is that these interactions preferentially involve the 1st and 3rd codon residues. This connection between the RNA secondary structure and codons immediately directed attention toward the question of protein folding and its long suspected connection to RNA folding [24,25].
Only about one-third (20/64) of the genetic code is used for protein coding, i.e., there is a great excess of information in the mRNA. At the same time, the information carried by amino acids seems to be insufficient (as stated by some scientists) to complete unambiguous protein folding. Therefore, it is believed that the third codon residue (wobble base) carries some additional information to that already present in the genetic code. A specialized wobble base oriented database, the ISSD [21], was established in an effort to connect different features of protein structure to wobble bases [26] with more or less success.
We found a significant negative correlation between the FFE of the 2nd codon residue and the helix content of protein structures, which was not expected even though this possibility is mentioned in the literature [9]. Our previous work on a Common Periodic Table of Codons and Amino Acids [22] indicated that the second codon residue is intimately coupled with the known physico-chemical properties of the amino acids. Almost all amino acids show significant positive or negative correlation to the helix content of proteins. Therefore, the real biological meaning and significance of any connection between the FFE of the 2nd codon residue and the propensity of a protein structural element is highly questionable.
It was possible to make direct visual comparison of mRNA structure (as statistically predicted by mfold energy dot-plot) and protein structures (as 2D residue contact maps). This method suggests similarity between nucleic acid and protein structures. It is known that some complex protein structures are very similar even if there is less than 30% sequence similarity. It was interesting to see that the same principle might apply for nucleic acids, and structural similarity might exist even when the sequence similarity is low. Furthermore, significant similarity between nucleic acid and protein structures might exist even without translational connection. Structure seems to be more preserved, even in nucleic acids, than sequence.
However, even if the matrix comparisons are suggestive, they remain semi-quantitative methods. Better support was necessary. A working hypothesis grew out of these observations, namely that (a) partial, local reverse-complementarity exists in nucleic acids that form the nucleic acid structure; (b) there is some degree of similarity between the folding of nucleic acids and proteins; (c) protein structure determines the amino acid co-locations; (d) as a consequence, amino acids coded by the interacting (partially reverse complementary) codons might show preferential co-locations in the protein structures.
And this seems to be the case: codons which contain complementary bases at the 1st and 3rd positions and are translated in reverse orientation result in amino acids which are preferentially co-located (interacting) in the 3D protein structure. Other complementary residue combinations or translation in the same (not reverse) direction (as many as seven combinations in total) did not result in any preferentially co-locating subset of amino acid pairs. Construction of residue contact maps for protein structures and statistical evaluation of residue co-locations is a frequently used method for visualization and analysis of spatial connections between amino acids [27][28][29]. The amino acid co-locations in real protein structures are clearly not random [30,31] and therefore residue co-location matrices are often used to assist in the prediction of novel protein structures [32,33]. We have carefully examined the physico-chemical properties of specifically interacting amino acids in and between protein structures, and we concluded that these interactions follow the well-known physico-chemical rules of size, charge and hydrophobicity compatibility (unpublished data), well in line with Anfinsen's prediction. The present study supports the view that there is a previously unknown connection between the codons of specifically interacting amino acids; those codons are complementary at the 1st and 3rd (but not the 2nd) codon positions.
The idea that sequence complementarity might explain the nature of specific protein-protein interactions is not new and was already suggested in 1981 [34]. I was never able to experimentally confirm my own original theory, which suggested a perfect complementarity between codons of interacting amino acids [34,35], in contrast to others [36]. The explanation is that this codon complementarity is suboptimal and does not involve the 2nd codon residue. Experimental in vitro confirmation is required to validate this theoretical and in silico prediction.