Document Type: Research Paper
Authors
^{1} Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, I.R. IRAN
^{2} School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran
^{3} School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, I.R. IRAN
Abstract
Keywords
1. Background
RNA plays key role in many aspects of biological processes. The function of RNA is related to its tertiary structure. Since dealing with tertiary structure of RNA is very complicated, RNA secondary structure has been studied in the literature (1-3). A secondary structure of an RNA sequence is basically a set of pairing connections among bases in the sequence, where each base can be paired with at most one another base. RNA secondary structure establishes various significant portions of RNA tertiary structure. Since the biological function of RNA is concluded indirectly from its primary structure, therefore it would be important to analyze the relationship between the RNA sequences and their structures. In this regard, the neutral network would be of great interest (4). Neutral network is a collection of RNA sequences, all coding the same secondary structure and each RNA sequence is distinct from other sequences by no more than a single base mutation (4). Neutral networks consequent to common structures saturate the space of RNA sequences (5, 6) and thus simplify the examination of a huge amount of alternative structures. This is achievable since diverse neutral networks are greatly meshed, i.e. all familiar structures can be achieved within a few (mutational) walks starting from any arbitrary sequence (6).
Several structural properties of the RNA neutral networks have been studied previously (4-10), and the remarkably complex structures underlying it have been discovered. An upper bound S l=1.4848 ´ l-3/2 ´ (1.8488)l for the amount of distinct secondary structures for sequences of length l was obtained in (4). This indicates that the expected size of a neutral network develops as 4l / Sl = 0.673 ´ l 3/2 ´ (2.1636) l is an enormous quantity even for modest values of l. This expected amount is not illustrative for the real spreading of neutral network sizes, which is a very broad function (4, 11). The space of RNA sequences of length l, which is surrounded in a usual l dimension lattice, is directed by a reasonably small number of common structures which are tremendously plentiful and occur to be located as structural motifs in natural and functional RNA molecules (8, 12).
Among the considerable parts of the works that have been studied, Aguirre et al., (13) presented results of specific significance. They concentrated on the topology of RNA neutral networks and studied local and global parameters describing their structures. They have plotted neutral networks over all RNA sequences of length 12, using the RNAfold as a folding method (14). Then they have obtained the topological properties of these neutral networks. Unfortunately, the information obtained in (13) cannot be generalized to the larger space (natural space) of RNA sequences and structures.
An additional useful representation of an RNA secondary structure is the RNA shape. RNA shape notion plots structure in a compact form, holding vicinity and nesting of structural components and reducing their lengths to one unit. It is motivated by the dot-bracket representation identified from the Vienna RNA package (14). Respect to the behind sequence and secondary structure from folding area in dot-bracket demonstration, the shape attitude proposes five deduction levels prepared in their level of abstraction. Also they shorten the loop and stack sizes, where unpaired areas are symbolized by an underline and stacking areas by a couple of formed brackets (15). Figure 1 represents the relationship between a small RNA sequence, its structure, and its shape. RNA shape can be well participated with dynamic programming algorithms, and consequently it can be employed through structure prediction rather than afterwards. This prevents exponential explosion and can still provide a non-heuristic and complete report of properties of the given RNA folding space (15).
The rest of this paper is organized as follows. In section 2, the short-term objectives to do this research are provided. In Section 3, the basic definitions as well as the Variation Network are presented. In section 4, the details of datasets construction and different measures are discussed. The obtained results and conclusions are presented in sections 5 and 6, respectively.
2. Objectives
In this paper, a new concept entitled Variation Network, which is based on the RNA shapes is introduced. Although the RNA shapes are obsolete in previous studies, here we have a special attitude. Based on the proposed Variation Network, different measures from frequency point of view, including frequency and variation rate, and thermodynamic energy point of view, including shape energy, neighborhood energy, as well as the stability are obtained in this paper. Also, the correlations among these measures are calculated for random and natural sequences and their corresponding shapes. Based on our analysis, we conclude that from the thermodynamic energy point of view, the natural RNA sequences are different from those generated randomly.
3. Basic Definitions
An RNA molecule is composed of a chain of nucleotides, namely Adenine (A), Cytosine (C), Guanine (G), and Uracil (U). An RNA sequence d of length l can be considered as a string over Sl , where S is the set of alphabet (S = {A,C,G,U}). Let D denotes the set of all RNA sequences. An RNA sequence tends to fold to itself and forms pairs of bases by the creation of hydrogen bonds between Watson-Crick bases (A-U or C-G) and Wobble base (G-U). This set of base pairs is called the RNA secondary structure and it is defined as follows.
Definition 1. An RNA secondary structure corresponding to an RNA sequence d of length l is a set of pairs (i, j), where i, j Î{1,...,l}) and i < j , and for any two base pairs i1 ,j1 and i2 ,j2 form l,, i1+i2 Û j1=j2, and either i1< j1 <i2< j2 (disjoined) or i1< j1 <i2< j2 (nested) holds.
Let L denotes the set of all RNA secondary structure. Suppose that j :D ®L maps any RNA sequence d into its corresponding minimum free energy secondary structure l = j (d) Considering f as a relation, two sequences d1 and d2 are equivalence under f if and only if j (d1) = j (d2). Based on this equivalence relation, the induced equivalence class of any structure could be defined as follows.
Definition 2. The equivalence class of a structure λ under the mapping φ, denoted by [λ]φ, is the set of RNA sequences having the same structure as λ, i.e. [l]φ = {d |d ÎD and φ(d) = l}.
Suppose that Γ represents the set of all shapes and y :D®Γ maps any RNA secondary structure, say λ, to its corresponding shape γ, where γ =y (l) . As a result, maps any RNA sequence to its corresponding shape. Similar to the equivalence class of structures, we can define the equivalence class of shapes under the mapping χ as follows.
Definition 3. The equivalence class of a shape γ under the mapping χ, denoted by [γ]χ, is the set of RNA sequences having the same shape as γ, i.e. [γ]χ ={d |d ÎD and χ(d) = l}.
Each equivalence class may contain different amount of RNA sequences. In order to measure this cardinality, the following two definitions are presented.
Definition 4. For any structure l Î L, fφ (l) is the cardinality of the equivalence class [l]φ.
Definition 5. For any shape g Î G, fx (g) is the cardinality of the equivalence class [g]x.
Now, based on our terminology, the classical neutral network is defined as follows.
Definition 6. For any structure λ, a graph NNl = (V,E) is called a neutral network under the mapping f where,
· V = {d |d ÎSl,j (d)=l},
· E = {(d1,d2) | d1,d2 ÎV,dist (d1,d2) =1}
Figure 2 shows an example of the neutral network. The neutral network does not take into account the number of sequences that are transformed from [l1]j to [l2]j by performing a single mutation. Considering the equivalence classes of RNA shapes, we could measure how many sequences from [g1]χ are transformed to [g2]χ by performing a single mutation. To do this, the variation rate is defined as follows.
Definition 7. The variation rate between two shapes g1 and g2 , denoted by w(g1 ,g2) is defined as follows (1):
w(g1 ,g2) = S | N(d) Ç [g2]χ | = S | N(d) Ç [g1]χ | dÎ[g1]χ d Î[g2]χ
where N(g) indicates all the sequences that are obtained from g by performing a single mutation in different positions.
Figure 3 is a schematic representation of the previous definitions. Here, the solid lines between RNA sequences indicate the single base mutational neighborhoods, the dashed arcs represent the mappings and the dashed eclipses show the equivalent classes for both structures and shapes. The variation rate between the shapes g1 and g2 is calculated as 3.
Based on the above definitions, the Variation Network can be defined over the set of all shapes as follows.
Definition 8. The Variation Network for the set of all shapes Γ is a weighted graph VN = (V,E,W), where
· V = {g | g ÎG},
· E = {(g1 , g2) | w(g1 , g2) > 0},
· "(g1 , g2) Î E,W(g1 , g2) = w(g1 , g2)
The Variation Network represents the imposed relations among the set of all shapes under the mapping χ as it is presented in Figure 4. With respect to the above definitions, we perform many experiments to explore the relations between RNA sequences, their structures and their shapes, both for frequency and thermodynamic energy points of view. The Variation Networks are created for random and natural RNA sequences and different measures, including frequency, shape energy, variation rate, neighborhood energy, and stability as well as the correlations coefficient between these measures are obtained. The details of our experiments as well as the results are presented in the following sections.
4. Materials and Methods
Since the number of RNA sequences grow exponentially with respect to their lengths, therefore analyzing the huge number of these sequences becomes difficult, especially for long sequences. The idea to tackle this difficulty is to use a small fraction of such sequences. On the other hand, using a small fraction of sequences may corrupt the accuracy of our analysis. But, our experiments show that the accuracy is not affected if sufficient number of sequences is employed. To justify this idea, two datasets of RNA sequences were constructed. The first one (RDS12A) contains all 412 sequences of length 12 and the second one (RDS12R) contains only 111000 of such sequences (generated randomly with uniform distribution over the nucleotides). Our analysis over these two datasets indicates the high correlation rates among different properties of the corresponding Variation Networks. Therefore, employing small fraction of RNA sequences gives us the reasonable accuracy in our analysis.
After justifying the idea for sequences of length 12, we then employed sequences of length 50. To do this, seven different datasets of RNA sequences, each of length 50, were constructed. The first one (RDS50R) contains 20,000,101 sequences, which were generated randomly with uniform distribution over the nucleotides. Since the length of each RNA sequence is 50 and 150 sequences were generated by performing a single mutation in different positions, the smallest number greater than 20,000,000 which is a multiple of 151 is 20,000,101. The other six datasets were constructed by selecting all subsequences of length 50 from the natural RNA sequences (namely, Synthetic RNA, 5S Ribosomal RNA, Hammerhead Ribozyme, other Ribosomal RNA, other Ribozyme, Cis-regulatory element) taken from RNA STRAND server (16). After construction of the datasets, the RNAShape software (15) was employed to fold each RNA sequence in order to obtain its structure, its minimum free energy, and also its shape. Table 1 summarizes the construction details of the above mentioned datasets, as well as the number of distinct sequences, structures, and shapes in each one.
Since the neighborhood sequences of each RNA sequence play an important role in our analysis, therefore in construction process of random datasets (RDS12R and RDS50R), the small number of RNA sequences of desired length (l) were first generated and then all nucleotides appeared in each sequence weremutated (to three other nucleotides) to generate 3l more sequences (each of length l).
In order to perform the analysis and explore the potential relations between random and natural RNA sequences, as well as their corresponding structures, different measures have been employed in our analysis for each dataset. These measures are related to both frequency and thermodynamic energy as follow:
· frequency: For each shape g, the cardinality of the equivalence shape class [g]x is referred to as frequency (fx (g)).
normalized frequency (nf): This is the normalized value of frequency for each shape g and it can be calculated as follows:
(2)
where fxmin and fxmax denote the minimum and maximum frequency among the equivalence shape classes, respectively.
· shape energy average (sea): For each shape g, consider the sequences in [g]x. For all these sequences, the minimum free energies over the corresponding structures were calculated and averaged, i.e.
(3)
· variation rate (vr): Although the variation rate is defined for any two shapes, here the variation rate for a shape g is taken over all other shapes as follows:
(4)
The variation rate vr(g) indicates that by performing a single mutation in different positions, how many sequences from [g]x were transformed to the other equivalence shape classes.
normalized variation rate (nvr): This is the normalized value of variation rate and it can be calculated as follows:
(5)
where vrmin and vrmax denote the minimum and maximum variation rates, respectively.
· neighborhood energy average (nea): For any shape g, consider its neighborhood shapes in the Variation Network. The neighborhood energy average indicates the average minimum free energies over the sequences appeared in neighborhoods’ equivalence shape classes as follows:
(6)
· stability: This measure indicates the stability of a shape g and it is calculated as follows:
(7)
where e is considered as 0.0001 to avoid the divide by zero exception.
All the above measures could be applied on a single shape. In order to compare the above mentioned measures obtained from different datasets, we employ a dimensionless metric as follows:
· Population Pearson correlation coefficient metric
This metric is used to explore the linear dependency among two random variables. It is achieved by dividing the covariance of two random variables by the product of their standard deviations. Regarding the expected values (mx and my) and the standard deviations (sx and sy) of two random variables (X and Y), the population Pearson correlation px,y is calculated as:
(8)
where E and cov denote the expected value and covariance, respectively. Also the corresponding p-value, which is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, is calculated.
All the above mentioned measures were evaluated on different datasets and the obtained results are presented in the next section.
5. Results
As it is mentioned, the RNAshape software (15) is employed to fold each RNA sequence in order to obtain its structure, its minimum free energy, and also its shape.
For datasets RDS12A and RDS12R, six different shapes were obtained. Among them, two shapes have very low frequency and therefore they were not considered in our further analysis (they do not have much information). For each shape in these datasets, the corresponding measures were evaluated and presented in Table 2. Also, the correlation and p-value among different measures for these two datasets are presented in the last rows of Table 2. As it is understood from this table, all measures are correlated and therefore the Variation Network corresponding to the small fraction of sequences gives us a reasonable result about the properties of the Variation Network corresponding to the whole sequences.
The same evaluations have been done for other seven datasets of length 50 constructed from random and natural RNA sequences. The five most frequent shapes of each dataset, as well as their corresponding measures are presented in Table 3. To analyze the relationship between the Variation Networks of random and natural RNA sequences, the correlation and p-value of each measure is calculated separately. To do this, five different percentages of the most frequent shapes of each dataset, namely 100%, 20%, 10%, 5%, and 1%, were employed.
Then, for each measure, the correlation and p-value between any pairs of datasets with respect to the selected percentages of the most frequent shapes were calculated. For the normalized frequency measure, the correlations between different datasets, as well as the corresponding p-value, are presented in Table 4. The results for the other measures, namely shape energy average, normalized variation rate, neighborhood energy average, and stability are presented in Tables 5, 6, 7, and 8, respectively.
As it is understood from these tables, the frequency and variation rate measures in these datasets are highly correlated to each other. This indicates that the most frequent shapes and their variation rates are almost identical in random and natural datasets. On the other hand, the shape energy average, neighborhood energy average, and stability measures are not well correlated. This indicates that, the natural RNA sequences are specialized to do a specific function inside the cell, where the random RNA sequences do not have such a special function (17).
Also, the correlation between the random dataset (RDS50R) and six other natural datasets are higher than the correlation between natural datasets. This indicates that, the natural RNA sequences do not follow the uniform distribution because of their special functionality inside the cell. On the other hand, the big amount of employed random RNA sequences gives us a large number of shapes which covers almost many natural produced shapes (from natural sequences).
The correlations among natural datasets become higher if the small percentages of most frequent shapes were employed in our analysis. This indicates that the most frequent shapes appeared in different families of natural RNA sequences.
6. Conclusions
In this paper, the Variation Network concept has been introduced to analyze the relationship between RNA sequences, structures, and shapes. Although the function of an RNA sequence is related to its structure, the RNA shape indicates the higher level of representation of its functionality inside the cell. Bases on the Variation Networks corresponding to the random and natural RNA sequences, different measures were calculated and the correlation among them are presented in this study. The obtained results indicate that from the frequency point of view, all the employed datasets are highly correlated to each other, but from the thermodynamic energy point of view they are not well correlated. These conclude that the natural RNA sequences are not generated randomly.