Document Type: Research Paper
Authors
^{1} Department of Mathematics, University of Wisconsin, Madison, USA
^{2} Department of Computer Science, Faculty of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, IR Iran
^{3} Department of Computer Science, University of British Columbia, Vancouver, Canada
^{4} Department of Computer Science , School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, IR Iran
Abstract
Keywords
1. Background
RNA is an active component in various cellular process
es. They act as carriers of genetic information, catalysts
in cellular processes and mediators in determining the
expression level of genes. Whereas the folding of RNA
or its secondary structure is very important for func
tionality of RNA, thus it is of great interest to propose
computational techniques for predicting RNA second
ary structural features. Several algorithms are currently
designed for the prediction of RNA secondary structure
(1-8). The most successful approach for RNA structure
prediction is comparative sequence analysis in which
covering residues are identical to a multiple sequence
alignment of diffrent RNAs sequences with similar
structures (9, 10). The accepted secondary structures
of most structural and catalytic RNAs were generated
by comparative sequence analysis. Comparative meth
ods (7, 11-13) use a given set of RNA sequences and their
corresponding structures to predict a structure for a
new RNA sequence. In the comparative methods, the
sequences in the given set must be in the same family
with the new RNA, structurally or as a nucleotides ar
rangement. Hence, these methods fail to predict a struc
ture for a single RNA sequence or a small family of RNAs
with little sequence diversity. For these cases, prediction
of RNA structure based on free energy minimization is
the most widely used technique (5, 6, 14-17). One of the
well-known applied approaches in these methods is the
dynamic programming. Zuker introduced the dynamic
programming approach for predicting RNA structures
(17). Furthermore, Walter et al. (18) improved the algo
rithm by using thermodynamics rules and Zuker (8) pre
sented a web-based predictor application. Nevertheless,
due to undiscovered biochemical reasons, results of the
comparative methods are more similar to RNA natural
structure than results of latter methods. However, one
of the diffilties on the problem of RNA structure pre
diction is the existence of a certain type of RNA second
ary structures called pseudoknots. This type of struc
ture is composed of both nested and crossed base pairs.
Figure 1 represents the structure of Hepatitis Delta Virus
(HDV) in two graphical shapes. Rivas and Eddy (6) intro
duced an approach to predict pseudoknot structures
by using a subset of context sensitive grammar. This ap
proach is also considered by Uemura et al. (16), Lyngsø
and Pedersen (3), and Cai et al. (19). In addition, heuristic
approaches were also applied to capture nested struc
tures. Shapiro and Navetta (20) introduced a parallel
genetic algorithm to predict nested structures that was
one of the famous predictors. Thereafter van Batenburg
et al. (1) demonstrated that GA is a promising method
in RNA structure prediction. Batenburg's GA method is
part of Structure Analysis of RNA (STAR) package. Ren et
al. (21) presented a heuristic algorithm called Hotknots
for prediction of RNA structure including pseudoknots.
They have improved the quality of the former predic
tion algorithms has become and their algorithm often
performs better than the genetic algorithm from the
STAR package and PKnotsRG-mfe algorithm by Reeder
and Giegerich (15). Uemura et al. (16), Akutsu (14), and
Kato et al. (22) demonstrated, Tree Adjoining Grammars
(TAG) are suitable data structure for a capturing an RNA
secondary structure.
A TAG is a kind of formal rewriting system to generate
a set of trees associated to strings (23). More specially,
the structured information of the derived tree in TAGs
is of great signifiance in both syntactical and semanti
cal concepts. Uemura et al. (16) and Kato et al. (22) illus
trated wide ranges of crossed structures which could be
captured by a subclass of TAG, called Extended Simple
Linear Tree Adjoining Grammars (ESL-TAG), where their
corresponding parsing algorithm uses O(n5) time and
O(n3) space. There may be numerous adjunct trees in
the grammar which represent secondary structures of
an RNA. This means that for an RNA sequence, many
structures might be possibly found. Thus, the problem
is like as a search problem for an optimal structure with
respect to thermodynamics energy parameters.
2. Objectives
In this manuscript, we present a new genetic algorithm,
PreRkTAG, to predict RNA secondary structures includ
ing pseudoknots that is developed based on a sub-class
of the adjoining grammars. PreRkTAG uses a practical
data structure for capturing RNA crossed structures.
Our designed genetic algorithm can search a broad
area of feasible solutions of the given RNA structure.
This genetic algorithm is composed of two evolution
ary functions which allow free pairing of nucleotides in
structures. This increases the probability of discovering
new structures that cannot be represented by the regu
lar ESL-TAG. This ability yields to more coverage area of
the feasible solutions rather than ESL-TAG covers. On the
other hand, a common problem of heuristic algorithms
is converging to local minimums, which is highly pos
sible especially with a wide range of feasible solutions.
In PreRkTAG we attempt to tackle this problem by using
novel evolutionary functions. The evaluation of PreRk
TAG against celebrated energy-based prediction algo
rithms, namely, the STAR, Hotknots, and Pknots, shows
high sensitivity and high specifiity of PreRkTAG in the
most cases. The evaluation was performed on 24 RNA
sequences chosen from the RNAseP database. Our algo
rithm correctly identifis more than 80% of base pairs.
3. Materials and Methods
The employed data structure for capturing RNA crossed
structures was introduced in preceding research (14)
and in order to synchronize the notifiations the fist
part of this section is focused on ESL-TAG. Consequently,
the details of PreRkTAG algorithm are described includ
ing the employed inner functions.
3.1. RNA Secondary Structure Modeling
Generally, an ESL-TAG is a 5-tuple grammar G = (N, T, S, I,
A) where N and T are fiite sets of non-terminal and ter
minal symbols, respectively. S is the start symbol and I is
fiite set of initial trees (center trees). Finally, the fiite
set A is a set of adjunct trees (auxiliary trees). The mem
bers of the set N are divided into two subsets; Active and
Inactive. During transition process of the grammar G,
the active nodes can be replaced by a member of the set
A, where inactive nodes do not. Based on the division,
when a tree does not have any active node, it would be
a matured tree. By adjoining two trees, s and t, a tree t'
would be obtained, and t →s t' represents the adjuncted
transition. The transitive closure of → is →* and a tree t'
is derived from t, where t →* t' for some t ϵ (A ᴜ I). A set
of trees of an ESL-TAG G is defied as T (G) = {t | s →*t, s ϵ
I where t is matured tree}. For modeling the secondary
structure of an RNA with pseudoknot, the tree adjoining
grammar G = (N, T, S, I, A) is defied as follows. The set N
is a fiite set of non-terminals {X, Y, Z}, where active non
terminals are marked with asterisk {X*, Y*, Z*}. The set T
is fiite set of terminals {a, u, c, g} as abbreviation of nu
cleotide acids. Moreover, S is a start symbol and the set
of initial trees I is composed of just one tree as depicted
in Figure 2 (this tree is tree of type 1 and denoted by cen
ter tree). The last parameter A, is a fiite set of adjunct
trees, such as given in Figure 2. This set is composed of
four types. As depicted in the Figure 2 , type 2 and type 3
represent base pairs. Moreover via preorder traversal of
trees of these types, the type 2 represents crossed base
pairs denoted by T2u and T2d, and the type 3 represents
nested base pairs denoted by T3L and T3R. In addition,
type 4 is introduced to generate a single base without
any pairing. Trees of type 5 are used to combine sub
structures by using trees of type T5Ld and T5Rd, and
inserting a tree into the current one by using trees of
type T5Lu and T5Ru. As mentioned, the key idea of this
modeling is from Uemura et al. (16). Figure 3 depicts the
process of constructing the crossed base pairs "agacuu"
which is not generated by any context free grammar.
Uemura et al. (16) described a computational obstacle of
ESL-TAG trees via an example that is shown in Figure 4 .
In this example, it is shown that two dissimilar sequenc
es can be captured by one ESL-TAG structure.
3.2. Genetic Algorithm
Genetic Algorithm (GA) is kind of Evolutionary Algo
rithms (EA) which via broad wide searching inside of
probable solution space, analyzes a population of solu
tions instead of a single candidate. Goldberg (24) proved
the GA convergence theory. Thereafter applicability of
the solutions is a critical step in designing the evolu
tionary functions. Novel evolutionary functions are de
signed according to tree adjoining grammars and equip
the PreRkTAG algorithm towards global convergence.
In advance to illustrating the evolutionary functions, a
short description of feasible population and data struc
ture is presented as follows.
3.3. Generating Population
Population of a GA is a set of feasible solutions and a so
lution is feasible if the length of its corresponding struc
ture is the same as the length of the given sequence. In
dividuals of the population in PreRkTAG are generated
the algorithm was extended in wide range of optimiza
tion problems.
The principle of GA is competition of the solutions
where a scoring function (finess function) measures
their optimality. GA methodology is composed of two
evolutionary processes; the fist one is applied on to
a solution and by randomly modifying the solution’s
properties creates a new solution (Mutation). Another
one works on two diffrent randomly selected solutions
and generates two new solutions by combining prede
termined features of the randomly selected solutions
(Crossover). These two evolutionary functions are ap
plied during the progress of the algorithm, where both
these functions are derived from biological concepts
(24). In this optimization method, the evolutionary
functions are supposed to inhibit solutions to converge
to local minimums. Therefore, designing effient and
practical algorithms for these functions play signifiant
role as they converge to desirable results. Since these
functions apply on the domain of feasible solutions,
consideration of the data structure which represents
randomly by using ESL-TAG. As described in the previous
section, an ESA-TAG starts with an initial state and could
be extended easily to cover as many nodes as required.
Hence, using ESL-TAG we are able to generate a tree with
any given numbers of residues. To construct each ele
ment of population, we randomly select a transitive clo
sure operator and apply it on initial trees. In this way
a matured tree can be constructed in length of a given
RNA sequence. Hence, a tree shows a probable structure
for RNA sequences with the same length. Generating the
population only depends on the length of the given RNA
sequence and the rest of the algorithm adjusts these
structures to nucleotide arrangement of the given RNA
sequence. It is noted that outcome of the algorithm is
an optimal structure for the given RNA sequence, which
is a feasible structure for all RNA sequences with the
same length but is not the optimal structure for them.
Generating the population in this way leads to a wide
range of feasible solutions where the finess function
will narrow them down to the optimal solution of the
given RNA sequence. Figure 5 shows an example of a
constructed structure for a sequence of length 5.
3.4. Evolutionary Functions
We present a detailed description of the evolutionary
functions. These functions perform blindly in terms of
considering the context of the given RNA. The muta
tion function takes a structure from the population and
based on its length calculates the number of terminal
nodes that might be mutated. Subsequently the mutat
ed terminal nodes are selected by uniform distribution.
Terminal nodes inside of adjunct trees have three diffr
ent types, thus this function acts in three diffrent ways.
These treatments are discussed as follows:
1) If the selected node is a single node, then the mutation
function selects another single node uniformly at ran
dom and pairs them up. The novel structure has a new
pair which its ESL-TAG transitive closure is unknown.
Hence, the new structure must be parsed by ESL-TAG
parser to obtain the corresponding transitive closure.
2) If the selected node is a member of nested pair, then
the tree reconstruction is not needed and just their ad
junct tree is replaced with two adjunct trees with single
terminals.
3) If the selected node is a member of crossed pair, then
breaking up this pair is equal to replacing the corre
sponding adjunct tree by two adjunct trees; each of
them has a terminal node in opposite side of the tree.
Algorithm 1 shows the mutation function. This algorithm
includes two sub-functions; the "Random (L)" returns a
random value L' (0 < L' < L), and the "Length (tree)" re
turns the number of terminal nodes of the given tree.
In addition, as illustrated (Algorithm 1), in some cases
new obtained structures must be parsed again to recon
struct the ESL-TAG tree, so in this algorithm the "Recon
struct" function is employed to fid ESL-TAG tree of new
structures. The next evolutionary function is Crossover
that crosses two selected solutions and merges com
plimentary parts. The crossover function selects two
solutions via normal distribution from the population.
Thereafter this function extracts two sub-trees with the
same length (number of terminal nodes) and random
starting points from each of the selected trees and the
crossover function swaps the sub-trees. Removing a sub
tree from a tree might come with breaking pairs when
ever a pair contains a node inside and another node out
side of the sub-tree (this pair is called an exteriorly pair).
These nodes will remain single when the swapping
process is done. Constructing the trees blindly, with re
spect to context of the given RNA sequence, makes the
crossover function much easier and helps to avoid lo
cal minimums. The crossover algorithm is presented in
Algorithm 2. Two introduced inner functions in the Mu
tation function (Algorithm 1) are used in this algorithm
as well. In addition, two other functions are applied;
"Move Down (node, length)" is a function for traversing
the given tree from its node to attain length number
of terminal nodes, and "Swap (Sub Tree0, Sub Tree1)" is
a function for swapping sub-trees which results in two
new trees. Figure 6 demonstrates a sample of the cross
over process.
3.5. Fitness Function
To measure the optimality of a predicted RNA structure,
as it has been recently proposed by most methods, ther
modynamics rules (17, 18, 25) are applied as the finess
function. In this case, the finess function estimates the dis
crepancy of simulated structures by measuring energy of
structures, where better structure has lower energy. Upon
thermodynamics rules, constructing a base pair needs a
specifi amount of energy, where in some cases the base pair
releases energy (negative value) and in other cases it gains
energy (positive energy). In the thermodynamics rules, many
parameters effct on the amount of energy and in a simpli
fid model, this amount depends on nucleotide acids that
compose a pair and also their neighbor base pairs. For exam
ple when ‘AA’ in 5’ - 3’ and ‘UU’ in 3’ - 5’, the orientation(AAUU )
Algorithm 1. Mutation Function
Algorithm 2. Crossover Function
This function includes six inner functions; “Random (L)” returns
a random value L’ (0 < L’ < L), the “Length (Sequence)” function
that counts the number of terminal nodes of Sequence, “Popu
lationGeneration (Pop, Number, S)” that generates Number
trees of length of S using ESL-TAG, “GetRand ()” that returns a
random value between [0:1], our fitness function which is called
“ComputScore (Pop)” function by using thermodynamic roles
computes the Pop’s members energy value, and finally the
“RouletteWheel (N, Pi, Pj)” via the roulette wheel method copies
((N*PopulationNumber) / 100) members ofPi into Pj.
makes two consecutive pairs, and the energy of correspond
ing structure will be reduced by -0.9. One example of assign
ing energy to two consecutive pairs is borrowed from (5) that
is shown in Table 1. In this table, every row and every column
shows a pair and the entry (i, j) shows amount of energy of
appearing j’th pair after i’th pair. Since these are thermody
namics energies and also were assessed experimentally, there
is no need to have a symmetric table. We also used the same
energy values in our finess function. For a given sequence S,
with no base pair, the energy value is equal to zero. Assume
e (i, j) denotes the energy value of base pair between i th and
j th nucleotides, thus ∑i,jϵSe(i,j) shows the energy value of se
quence S. The energy of RNA sequences obtained by the ther
modynamic rules implies that number of single bases (with
out pair), coaxial, and stacked base pairs have impressive
effcts on the energy value of the corresponding sequence
(17, 18). Our finess function uses the current version of Turner
group thermodynamic rules (5).
Within the ith iteration of the genetic algorithm, ten percent
of highest score of the current population (that is called Pi) are
directly copied into the next population (Pi+1). Then remaining
90 percent of Pi+1 are produced by applying the roulette wheel
method on the union of P
i and evaluated individuals (Emem
bers) of Pi. The evaluated individuals (Emembers) are new mem
bers generated by performing the evolution functions on the
members of P
i.
Function Mutation ( t: Tree)
Begin
MutationNumber = Random (Length(t))
For i=1 to MutationNumber Do Begin
node0 = A random node from t;
If node0 is a bases of a pair Then Begin
node1 = pair of node0;
If pair (node0, node1) is a crossed pair Then Begin
Break up pair (node0, node1);
Reconstruct t;
End
Else
Mark bases node0 and node1 as single bases;
End
Else Begin
node1 = A random single node from t;
Mark node0 and node1 as a base pair;
Reconstruct t;
End;
End;
Return (t);
End.
Function Crossover ( t0, t1: Tree)
Begin
CrossLength = Random (Length (t0));
node0 = A random node fromt0 ;
node1 = A random node from t1 ;
SubTree0 = MoveDown (node0, CrossLength);
SubTree1 = MoveDown (node1, CrossLength);
If sub-trees have exteriorly pair Then
Break them up;
Swap (SubTree0, SubTree1);
Return (t0, t1);
End.
4. Results
To assess our algorithm its accuracy has been evaluated
and compared with three common thermodynamic
based algorithms: Hotknot (21), Pknots (9), and the
STAR (1). To compare PreRkTAG with these algorithms,
we measured the accuracy of these algorithms on a set
of RNAs from the RNAseP data base. Since genetic algo
rithms have random behavior, every execution of our
algorithm on a given sequence may result in diffrent
structure. Hence, the PreRkTAG algorithm was applied
three times on every sequence and the best structure of
each run was stored. Then one of these three with the
minimum energy was reported as the predicted struc
ture. Because of the proposed evolution algorithms, in
many cases, the mean of the three energy values were
close to the picked minimum energy. The RNAseP data
base includes natural structures of this set, therefore
in this study the accuracy is measured by considering
two diffrent parameters; sensitivity and specifiity of
folded sequences. Sensitivity is defied by the number
of base pairs that are predicted correctly per number of
base pairs in natural structure. The specifiity is equal
to the number of true predicted base pairs per number
of predicted base pairs. Predictors were employed on 24
sequences from RNAseP data base, which are tabulated
in Table 2 . The comparison of sensitivity between our al
gorithm and the other algorithms are given in Table 3 .
Also Table 4 depicts results of comparing specifiity of
these algorithms. Unfortunately because of the length
of the test sequences, which are fairly long, the Pknots
algorithm could not obtain results within a reasonable
timeframe on our hardware. Therefore, in Table 3 and
Table 4 the Pknots results are not shown. We also provide
running time complexity of the examined algorithms in
the Table 5. As mentioned, for every RNA sequence, the
PreRkTAG has been applied three times and we indicat
ed outcome of the one with minimum energy. For calcu
lating the running time of the algorithm, we followed
the same idea and considered the running time of the
one with minimum energy. It is well known that choos
ing suitable parameter values for genetic algorithms is
very diffilt (24). There is an inverse relationship be
tween the number of generations and the size of popu
lation, so that large population size converges fast and
Algorithm 3. PreRkTAG Algorithm
Function PreRkTAG ( S: RNA Sequences)
Begin
PNumber = Random (Length( S));
PopulationGeneration ( P0, PNumber, S);
i= 0; stop-condition = False ;
While (stop-condition = False) Do Begin
Emembers = { } ;
ComputScore(Pi) ;
Copy ten percent of high scores Pi into Pi1+;
For each t in Pi Do Begin
If GetRand() > Mutation Probability Then Begin
t' = Mutation (t) ;
Emembers = Emembers t';
End;
End;
For each t0 in Pi Do Begin
If GetRand() > Crossover Probability Then Begin
t1 = A random member of Pi ;
(t'0, t'1) = Crossover(t0, t1);
Emembers = Emembers (t'0, t'1);
End;
End ;
ComputScore(Emembers) ;
RouletteWheel (90, Pi Emembers, Pi1+);
If Three previous iterations had equal best score Then
stop-condition = True ;
If i exceeds flip flops limitation Then
stop-condition=True ;
i++;
End;
End.
requires the consideration of a few number of genera
tions. On the other hand, large population size requires
more evaluations per generation so as to avoid the de
cline of the rate of convergence. Generally, the optimal
size of population depends on the domain complexity
of the feasible solutions. In addition, the probabilities of
performing evolutionary functions play a critical role to
avoid trapping in local minimums. Small values for the
crossover and mutation rates yield to copying most of
the individuals to the next population directly which is
benefiial in terms of decreasing the running time com
plexity as it is unnecessary to compute finess values for
most of the individuals. However, the small rates may
cause to converge to local minimums. Considering the
behavior of genetic algorithms according to these rates
shows that high probabilities for performing the muta
tion function allow for a greater diversity of solutions
at very little expense. However, these high probabilities
change the concept of genetic algorithms and convert
them into random search procedures. The probability
of performing the crossover function also has its own
diffilties. The high value of the crossover rate helps
the evolution progress and causes to it to jump out of
local minimums while increasing the chance of con
verging to the global minimum. On the other hand, it
increases the running time complexity dramatically. A
theoretical investigation of the optimal size of popula
tion, crossover and mutation rates has been carried out
by Goldberge (24) who has found that the best values for
the population size is a value base on the length of the
solutions. Goldberge showed that a practical crossover
rate is in the range of [0.6, 0.95] and for the mutation
rate should be in [0.001, 0.2]. In order to evaluate PreRk
TAG algorithm, we used Goldberg’s investigation for de
termining our GA’s control parameter values. We tested
diffrent values for the size of population, crossover and
mutation rates on a training set of RNAs. We empirically
determined the best value for “population size” as a func
tion of length of RNA sequences. In addition, we found
that the best “mutation rate” is 0.2 and “crossover rate”
is 0.8.
5. Discussion
In this manuscript, we proposed a new genetic algorithm
for RNA secondary structure prediction based on a sub
class of Tree Adjoining Grammars. This algorithm gives
accurate results in a reasonable amount of time and has
specifiity and sensitivity comparable with existing tools
such as Hotknot, Pknots, and the STAR algorithms. Be
cause of the capability of regular ESL-TAG to cover a large
class of RNAs' knotted structures, the time complexity of
the parsing algorithm damages its integrity, where in our
study, we modifid implementation of the parsing algo
rithm to reduce running complexity.
Since using thermodynamics rules has been demonstrat
ed as the most successful approach for prediction of sin
gle RNA sequence, we used these rules in finess function.
We designed novel ideas for the evolutionary functions
which yielded to accurate predictions close to the natural
structure. The eligibility of tree adjoining grammar for
capturing RNAs with knotted structure and also its ef
iient data structure led us to employ this representation
of RNA in the genetic algorithm. It is strongly believed
this data structure could be improved and extended to
ward covering bigger class of knotted structures and a
heuristic algorithm like genetic algorithm can use it to
predict complicated knotted structures. The algorithm,
PreRkTAG, was implemented and tested on a set of 24
diffrent RNA sequences and compared with three well
known algorithms.
The results of the algorithm show that sensitivity of sev
enteen out of twenty-four samples and specifiity of eigh
teen out of the twenty-four samples are higher than three
well-known algorithms. These results show the practical
ity of the algorithm. This algorithm handles certain class
of pseudoknots captured by ESL-TAG and in future work,
we intend to extend the algorithm for larger class of
pseudoknots.
Acknowledgements
We would like to express our sincere thanks to Profes
sor C.W. Pleij for his kindly helps and guidance on dis
cussions that led to an improved understanding of the
problem characteristics. The authors are also very grate
ful to Professor D.H. Turner for his guidance on solving
our biophysical problems. Finally we thank Professors
F.H.D. van Batenburg, M. Zuker, R. Markham, and E. Rivas
for their helps for comparing our method vs. theirs. The
authors would like to thank the anonymous referees for
their helpful suggestions.
Authors’ Contribution
Initial idea of the research was from Torabi Dashti, Ah
rabian, Nowzari-Dalini, and Zare-Mirakabad. All authors
participated in algorithm design, and the structure
and organization of the manuscript. Torabi Dashti and
Aghaeepour implemented the algorithms and tested on
diffrent data sets. All authors contributed to read and
approved the fial manuscript.
Financial Disclosure
None declared.
Funding/Support
The authors would like to acknowledge the fiancial
support of University of Tehran for this research under
grant number 6103010/1/02.