Helix Segment Assignment in Proteins Using Fuzzy Logic

Document Type: Research Paper

Authors

1 Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, P.O. Box 13145-1384, Tehran, I.R. Iran

2 Institute for Studies in Theoretical Physics and Mathematics (IPM), Niavaran Square, P.O. Box 19395-5746, Tehran, I.R. Iran

3 Faculty of Mathematical Sciences, Shahid Beheshti University, Evin, Tehran, I.R. Iran

4 Department of Biochemistry, National Institute of Genetic Engineering and Biotechnology, P.O. Box 14155-6343, Tehran, I.R. Iran

Abstract

The automatic assignment of protein secondary structure from three-dimensional coordinates is an essential step in the characterization of protein structure. Although the recognition of secondary structures such as alpha-helices and beta-sheets seems straightforward, there are many different definitions, each based on different criteria. We have developed a new algorithm for protein helix assignment that uses fuzzy logic based on backbone torsion angles. In this method, each residue receives a number from 0 to 100 that indicates its degree of helical membership. The method can be converted to a classical two-state method by assuming that any residue with a membership degree greater than 83 is helical. Comparison of the results with the assignments reported in the protein data bank (PDB) and with those of the dictionary of secondary structure of proteins (DSSP) and structure identification (STRIDE) for 324 proteins indicates that our algorithm works as well as DSSP, showing 93% agreement. We believe that fuzzy secondary structure assignment offers advantages over the classical approaches used for protein structure comparisons and alignments.

Keywords


INTRODUCTION
The automatic assignment of protein secondary structure from three dimensional coordinates is an essential
step in the characterization of protein structure. The
secondary structure assignment plays an important
role in structural genomics. The secondary structure
segments are used in protein structure classification
(Pearl et al., 2005; Andreeva et al., 2004; Hogue and
Bryant, 1998), protein structure alignment (Sternberg
et al., 1999; Marti-Renom et al., 2000; Sauder et al.,
2000), comparative modeling and threading (Rost,
2000; Rice and Eisenberg, 1997; Kolinski et al., 1999;
Xu et al., 1999), and also influence sequence alignment (Smith and Smith, 1992; Fischel-Ghodsian et al.,
1993; Henneke, 1989). Although the recognition of secondary structures such as alpha-helices and beta-sheets seems straightforward, there are still many different definitions, each based on different criteria.
The main criteria used in secondary structure assignment are hydrogen bonding patterns, as in the dictionary of secondary structure of proteins (DSSP) (Kabsch and Sander, 1983); quantification of the backbone curvature (Richards and Kundrot, 1988); inter-Cα distances (Levitt and Greer, 1977); and a combination of hydrogen bond energy and torsion angle information, as in structure identification (STRIDE) (Frishman and Argos, 1995). A comparison of three such algorithms on a protein database showed only 63% agreement among them (Colloc'h et al., 1993). Although different methods may assign different secondary structure states to a given residue, they are similar in one respect: each residue is assigned to exactly one state, so the final result is a string of secondary structure states for the protein sequence. Despite the similarity between instances of an assigned state, such as the alpha-helix, in different parts of a protein or in different proteins, these structures are not exactly the same (Barlow and Thornton, 1988). For example, two alpha-helices of the same length in two different proteins may not have exactly the same geometry, but the assignment methods do not consider this difference. Since most protein structure comparison methods are based on secondary structure alignment, ignoring these geometrical differences leads to an inexact three-dimensional comparison.
Thus, it is necessary to define parameters for secondary structures so that different and similar structures can be compared more precisely. In this study, we use fuzzy logic to assign a membership degree to each residue by considering the geometry of consecutive residues, using the Phi and Psi angles to indicate whether consecutive residues form regular or irregular turns. These fuzzy numbers range from 0 to 100 and can be used to quantify how similar or different two helices are.
The exclusive use of backbone torsion angles is not sufficient for the assignment of all secondary structure elements; however, the geometry of helices contains enough information for their detection. Although the algorithm presented in this article is based solely on dihedral angles, the results show that the assigned fuzzy numbers identify helical regions of protein structure as well as the classical methods do.
MATERIALS AND METHODS
A representative set of X-ray and NMR protein structures with resolutions better than 2.5 Å and without chain breaks was gathered from the protein data bank (PDB), based on the PDBSELECT list of proteins with less than 25% sequence similarity. In total, 324 proteins with 48644 amino acids were selected. These are listed in Table 1.
Alpha-helices assigned by PDB were chosen as the standard assignment. Backbone dihedral angles (φ and ψ) of each residue were taken as in DSSP. From a mathematical point of view, Δ and Δ² are approximations of the first and second derivatives. Since our fuzzy algorithm is based on the geometrical structure of helices, and first and second derivatives are tools for studying the shape of a curve, we used Δφ and Δ²φ, Δψ and Δ²ψ. To assign a helix fuzzy number to each residue, the following steps were carried out:
1. For all amino acids in the data set, Δφ, Δψ, Δ²φ and Δ²ψ were calculated for each residue as follows:
where n denotes the nth amino acid in the protein.
2. Amino acids which are not located in the helix
domain of the Ramachandran plot and with the following conditions were excluded from the data set.
These residues form the set A.
3. All segments assigned as alpha-helix by PDB with lengths of more than seven residues were selected. Three residues from the N-cap and three residues from the C-cap were excluded, and the averages of Δ²φ and Δ²ψ for the remaining residues were calculated and denoted by αφ and αψ, respectively.
4. For all residues in the helix state, in the data set
with ∆ 2ϕ ≥ α ϕ, average of ϕ was calculated
and named . was also calculated as above
for Ψ angles. Hence, and parameters are defined as:
In fact, αφ and αψ denote the maximum variations allowed for a helix to be considered a standard helix. Similar to the rationale behind a 95% confidence interval for the mean of a normal distribution, we
consider a confidence region for an amino acid to be
in a helix structure, based on and simultaneously. It should be mentioned here that the information on amino acids discarded in step 3, is now
being considered at this stage. This means no information has been missed. Since we are only interested
in helix structure, all those amino acids considered in steps 3 (internal) and 4 (C-cap and N-cap) are not to be considered.
5. fϕ and fΨ functions were defined as follows:
Finally, the function f gives the fuzzy value for helicity according to the following formulation:
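Steps 1 and 3 above can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that Δ is a forward difference between consecutive residues (Δφ_n = φ_{n+1} − φ_n, and Δ²φ_n = Δφ_{n+1} − Δφ_n), that angle differences are wrapped into [−180°, 180°), and that the averages in step 3 are taken over absolute values of Δ². None of these choices is stated in the text, and the function names `wrap_deg`, `forward_diffs`, `helix_cores` and `alpha` are hypothetical.

```python
def wrap_deg(x):
    """Wrap an angle difference in degrees into [-180, 180)."""
    return (x + 180.0) % 360.0 - 180.0

def forward_diffs(angles):
    """Assumed forward first and second differences of a per-residue angle list.

    Both returned lists have the same length as `angles`; positions where
    residue n+1 (or n+2) does not exist are padded with NaN.
    """
    n = len(angles)
    nan = float("nan")
    d1 = [wrap_deg(angles[i + 1] - angles[i]) for i in range(n - 1)]
    d2 = [wrap_deg(d1[i + 1] - d1[i]) for i in range(len(d1) - 1)]
    return d1 + [nan] * (n - len(d1)), d2 + [nan] * (n - len(d2))

def helix_cores(labels, min_len=8, cap=3):
    """Yield (start, stop) ranges of helix segments longer than seven residues,
    with `cap` residues trimmed from each end, as described in step 3."""
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == "H":
            j = i
            while j < n and labels[j] == "H":
                j += 1
            if j - i >= min_len:
                yield i + cap, j - cap
            i = j
        else:
            i += 1

def alpha(d2, labels):
    """Mean |second difference| over trimmed helix cores.

    Using absolute values is an assumption made for this sketch.
    """
    vals = [abs(d2[k]) for start, stop in helix_cores(labels)
            for k in range(start, stop) if d2[k] == d2[k]]  # skip NaN
    return sum(vals) / len(vals) if vals else float("nan")
```

With PDB helix labels given as a string such as `"C" + "H" * 10 + "C"`, `helix_cores` keeps only the four central residues of the ten-residue helix, and `alpha` averages the second differences over that core.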
RESULTS AND DISCUSSION
Analysis of helix regularity using the variation in the dihedral angles φ and ψ of consecutive residues gives a helix fuzzy number between 0 and 100 for each residue. Table 2 shows these numbers for two proteins. In this table, the helix assignments by PDB, DSSP and STRIDE are compared with those given by fuzzy numbers greater than 83. Usually the central residues of helices take numbers close to 100, while the N- and C-terminal residues of each helix take lower values and show less regularity. Consecutive residues with the same or similar fuzzy numbers indicate a regular helix turn, although it may be far from the standard helix structure. Segments with fuzzy numbers close to 100 are regular helices with standard helix geometries. Helix distortion has been studied in detail and can be attributed to factors such as solvent-side chain interactions, local sequence and side chain packing (Barlow and Thornton, 1988). These factors cause the residues in helices to have different main chain conformations, and such distortions are revealed by differences in consecutive dihedral angles.
Figure 1 shows the superposition, using the CE program (http://cl.sdsc.edu/) (Shindyalov and Bourne, 1998), of fragments assigned as helices by PDB that have the same length and either the same or different fuzzy numbers. Root mean square (RMS) calculation shows a relation between the fuzzy numbers and the geometry of the compared helices. Two superposed helices with the same fuzzy numbers show a lower RMS, which increases when the fuzzy numbers of the two helices differ. In addition to showing helix regularity, the fuzzy numbers assigned for residue helicity can be used for comparison and alignment of protein structures. In contrast, existing comparison methods are based on a string of secondary structure elements in which each residue is defined as belonging to one state or another, and in which the regularity and geometry of secondary structure are ignored. Fuzzy numbers also show helicity for small segments of two or three residues that are not classified as helices but share similar geometry with the helix. Although the main goal of this method is the assignment of a helical fuzzy number to each residue, it can also be converted simply to the classical method, in which each residue is assigned a helical or non-helical state. For this purpose, residues with fuzzy numbers greater than a threshold number k were assigned as H (helix) and the others as H̄ (non-helix). In a window of five residues, if one H̄ is surrounded by four Hs, it is converted to H, and vice versa. By allowing k to vary, we can find all helix structures near to or far from the standard helix structure. For example, for k close to 100, helix structures near to the standard are found, and for k far from 100, structures far from the standard are detected. In order to compare with PDB, we looked for the k that maximizes the correlation coefficient between the data generated by our algorithm after applying the threshold k and those generated by PDB. This leads to k = 83. Comparison of the results with the crystallographers' assignments, expressed as the percentage of residues correctly assigned to two states (helix or non-helix), gives 90% for all amino acids in the dataset.
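The conversion to a two-state assignment described above can be illustrated with a short sketch. It is a hedged illustration rather than the authors' code: it assumes that the five-residue rule is applied in a single pass over the thresholded states, and that the correlation coefficient is the Pearson correlation of the two binary assignments (the phi coefficient), which the text does not specify. The function names are hypothetical.

```python
import math

def to_two_state(fuzzy, k=83):
    """True (helix) where the fuzzy number exceeds the threshold k."""
    return [f > k for f in fuzzy]

def smooth(states):
    """Flip any residue whose four neighbours in a five-residue window
    all share the opposite state (single pass, an assumption)."""
    out = list(states)
    for i in range(2, len(states) - 2):
        neighbours = (states[i - 2], states[i - 1], states[i + 1], states[i + 2])
        if len(set(neighbours)) == 1 and states[i] != neighbours[0]:
            out[i] = neighbours[0]
    return out

def phi_coefficient(a, b):
    """Pearson correlation between two binary assignments."""
    tp = sum(1 for x, y in zip(a, b) if x and y)
    tn = sum(1 for x, y in zip(a, b) if not x and not y)
    fp = sum(1 for x, y in zip(a, b) if x and not y)
    fn = sum(1 for x, y in zip(a, b) if not x and y)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(fuzzy, reference, candidates=range(0, 100)):
    """Threshold k whose smoothed two-state assignment correlates best
    with the reference assignment (first such k if several tie)."""
    return max(
        candidates,
        key=lambda k: phi_coefficient(smooth(to_two_state(fuzzy, k)), reference),
    )

# Made-up fuzzy numbers for nine residues, for illustration only.
example = [10, 20, 90, 95, 50, 92, 96, 30, 5]
print("".join("H" if s else "-" for s in smooth(to_two_state(example))))
# prints --HHHHH--
```

In the example, the fifth residue falls below the threshold but is flanked by four helical residues, so the window rule converts it to helix.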
Comparison of DSSP with our method shows that they have 94% agreement for H and H̄. Although many crystallographers define secondary structure based on the DSSP algorithm, comparison of the DSSP and PDB assigned secondary structures in our dataset shows 8% difference between them. Analysis
of differences between results of this study and DSSP
showed that 1342 residues were assigned by the
method of this study to H, while DSSP assigned them
to H. There were 1783 residues that our method
assigned to H, while DSSP assigned them to H.
Comparison of our method and STRIDE shows approximately 94% agreement for H and H̄. Table 3 shows the details of the comparisons of the method described here with PDB, DSSP and STRIDE, and also the comparisons of DSSP and STRIDE with PDB. Most of the false positive and false negative assignments between the method of this study and PDB occurred at the edges of helices. The major assumption of this work is that helices can be defined by fuzzy logic and that, instead of being assigned to one state, each residue may be assigned a fuzzy number, which is far more valuable for comparing protein structures. However, this approach can also be used for the classical assignment of helix structure. The results obtained are as good as those of the DSSP and STRIDE algorithms, which are the most widely used methods for secondary structure assignment.
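The pairwise comparisons reported above (percentage agreement, and counts of residues that one method assigns to helix while the other does not) can be computed along the following lines. This is a minimal sketch that assumes each assignment is available as a string with `H` for helix and any other character for non-helix; the function name `compare_two_state` and the returned keys are hypothetical, and the example strings are made up.

```python
def compare_two_state(a, b, helix="H"):
    """Compare two per-residue assignments of equal length.

    Returns the two-state percentage agreement and the number of residues
    that are helix in only one of the two assignments.
    """
    if len(a) != len(b):
        raise ValueError("assignments must have the same length")
    if not a:
        raise ValueError("assignments must not be empty")
    pairs = list(zip(a, b))
    agree = sum(1 for x, y in pairs if (x == helix) == (y == helix))
    only_a = sum(1 for x, y in pairs if x == helix and y != helix)
    only_b = sum(1 for x, y in pairs if x != helix and y == helix)
    return {
        "agreement_pct": 100.0 * agree / len(pairs),
        "helix_only_in_a": only_a,
        "helix_only_in_b": only_b,
    }

# Made-up ten-residue assignments, for illustration only.
result = compare_two_state("HHHH--HH--", "HHH---HHH-")
print(result["agreement_pct"])  # prints 80.0
```

Because the two disagreement counts are reported separately, the same function shows whether disagreements cluster on one side, for example at helix ends where one method extends the helix by a residue.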
In this article the main goal was only the assignment of fuzzy numbers to helices, followed by a demonstration of their regularities. Fuzzy number assignment to other secondary structures such as beta-strands and turns can be the subject of independent work, and in fact we are developing a method for the fuzzy assignment of secondary structures. For this reason the title “Helix segment assignment in proteins using fuzzy logic” was selected for this article.
It is also believed that the combination of dihedral angles with other parameters such as H-bonds can lead to a different method with better results, which can also be the subject of another independent work.

REFERENCES
Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C,
Murzin AG (2004). SCOP database in 2004: refinements integrate structure and sequence family. Nucleic Acids Res. 32:
226-229.
Barlow DJ, Thornton JM (1988). Helix geometry in proteins. J Mol
Biol. 201: 601-619.
Colloc'h N, Etchebest C, Thoreau E, Henrissat B, Mornon JP
(1993). Comparison of three algorithms for the assignment of
secondary structure in proteins: the advantages of a consensus
assignment. Protein Eng. 6:377-82.
Fischel-Ghodsian F, Mathiowitz G, Smith TF (1993). Alignment of
protein sequences using secondary structure: a modified
dynamic programming method. Protein Eng. 3: 577-81.
Frishman D, Argos P (1995). Knowledge-based protein secondary
structure assignment. Proteins 23: 566-79.
Henneke CM (1989). A multiple sequence alignment algorithm for
homologous proteins using secondary structure information
and optionally keying alignments to functionally important
sites. Comput Appl Biosci. 5: 141-50.
Hogue CW, Bryant SH (1998). Structure databases. Methods
Biochem Anal. 39: 46-73.
Kabsch W, Sander C (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers 22: 2577-2637.
Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J (1999). A method
for the improvement of threading-based protein models.
Proteins 37: 592-610.
Levitt M, Greer J (1977). Automatic Identification of Secondary
Structure in Globular Proteins. J Mol Biol. 114: 181-239.
Marti-Renom MA, Stuart A, Fiser A, Sanchez R, Melo F (2000).
Comparative protein structure modeling of genes and
genomes. Annu Rev Biophys Biomol Struct. 29: 291-325.
Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett
C, Marsden R, Grant A, Lee D (2005). The CATH Domain
Structure Database and related resources Gene3D and DHS
provide comprehensive domain family information for
genome analysis. Nucleic Acids Res. 33: Database Issue
D247-D251.
Rice DW, Eisenberg D (1997). A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary
structure of the sequence. J Mol Biol. 267: 1026-1038.
Richards FM, Kundrot CE (1988). Identification of structural
motifs from protein coordinate data: Secondary structure and
first level super-secondary structure. Proteins 3:71-84.
Rost B (2000). TOPITS: Threading one-dimensional predictions
into three-dimensional structures. The third international conference on Intelligent Systems for Molecular Biology, 314-321.
Sauder JM, Arthur JW, Dunbrack RL (2000). Large-scale comparison of protein sequence alignment algorithms with structure
alignments. Proteins 40: 6-22.
Shindyalov IN, Bourne PE (1998). Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path.
Protein Eng. 11: 739-747.
Smith RF, Smith TF (1992). Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modeling. Protein Eng. 5: 35-41.
Sternberg MJ, Bates PA, Kelley LA, MacCallum RM (1999).
Progress in protein structure prediction: assessment of
CASP3. Curr Opin Str Biol. 9: 368-73.
Xu Y, Xu D, Crawford OH, Einstein, Larimer F, Uberbacher E,
Unseren MA, Zhang G (1999). Protein threading by
PROSPECT: a prediction experiment in CASP3. Protein Eng.
12: 899-907.