Document Type: Research Paper
Author
Department of Computer Sciences, Shahid Beheshti University GC, Tehran, IR Iran
Abstract
Keywords
1. Background
More than half of eukaryotic proteins are multi-domain proteins (1). It is often assumed that the interaction between two proteins involves the binding of two or more specific domains (2), or the binding of a domain in one protein to short regions (approximately three to eight residues) of the other protein (3). For example, multiple domains of Nkx3.1 are involved in contacting SRF (4). Although more than two domains may be involved in mediating the interaction of two proteins, most previous work has focused on identifying interacting domain pairs, either to predict or to explain protein interactions (5-7). In particular, these studies have neglected the identification of interactions that require the presence of a third domain. A few exceptions exist (8-10). However, while these works considered domain combinations for predicting protein interactions, they did not evaluate whether these domain combinations are required to mediate protein interactions. Recently, Hou et al. used the concept of a mediator domain in the yeast proteome to construct a mediated protein-protein interaction network.
2. Objectives
In this study we find triplets of domains (A, B, C) such that the domain C has an effective role in the interaction of the proteins X and Y containing the domains A and B, respectively. For this purpose, we focus on two related issues: firstly, identifying those domain pairs that occur frequently in interacting proteins but may not be necessary or sufficient for mediating the interactions of these proteins; and secondly, characterizing those domain triplets that are necessary and sufficient for mediating the interactions of these proteins. A domain combination (a triplet of domains) is sufficient for mediating the interaction of a pair of co-localized proteins if they interact whenever the domain combination is observed in them. A domain combination is necessary for mediating the interaction of a pair of interacting proteins if the deletion of any domain in the domain combination stops the interaction of those proteins.
3. Materials and Methods
Our aim is to characterize the domain combinations that are necessary and sufficient for mediating protein interactions. The conditions of being “necessary” and “sufficient” cannot always be determined without additional laboratory experiments. For instance, to determine necessity, one has to delete a domain from a pair of interacting proteins and then test in a laboratory whether the two proteins still interact. We approximate these conditions by pragmatic statistical definitions.
Let D(X) be the set of domains of the protein X. For proteins X and Y and domains A, B and C, we define:
(A,B) ϵ (X,Y) if A ϵ D(X) and B ϵ D(Y) (Figure 1).
(A,B,C) ϵ (X,Y) if (A,B) ϵ (X,Y) and either C ϵ D(X) or C ϵ D(Y) (Figure 2).
We say that (X,Y) contains (A,B), or that (X,Y) contains (A,B,C), if (A,B) ϵ (X,Y) or (A,B,C) ϵ (X,Y), respectively.
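The containment relations above translate directly into set-membership tests. The following is a minimal sketch, assuming that domain annotations are available as a mapping from protein names to sets of domain names; the mapping `DOMAINS`, the function names and the toy data are hypothetical and are not part of the original study.

```python
from typing import Dict, Set

# Hypothetical toy annotation: protein name -> D(protein), its set of domains.
DOMAINS: Dict[str, Set[str]] = {
    "X": {"A", "C"},
    "Y": {"B"},
}


def contains_pair(domains: Dict[str, Set[str]], x: str, y: str,
                  a: str, b: str) -> bool:
    """(A,B) in (X,Y): A is a domain of X and B is a domain of Y."""
    return a in domains[x] and b in domains[y]


def contains_triplet(domains: Dict[str, Set[str]], x: str, y: str,
                     a: str, b: str, c: str) -> bool:
    """(A,B,C) in (X,Y): (A,B) in (X,Y) and C is a domain of X or of Y."""
    return contains_pair(domains, x, y, a, b) and (
        c in domains[x] or c in domains[y]
    )
```

As written, the relation is ordered: A is looked up in X and B in Y. If protein pairs are treated as unordered, both orientations would need to be checked.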
Let I be a set of gold-standard interacting protein pairs and NI be a set of gold-standard non-interacting protein pairs. Let O be a domain combination (a triplet of domains). We define I_O = {(X,Y) ϵ I | O ϵ (X,Y)} and NI_O = {(X,Y) ϵ NI | O ϵ (X,Y)}.
A domain combination that is observed in only one pair of interacting proteins is not easily justified as a real mediator of the interaction. Therefore it is reasonable to restrict our attention to domain combinations that are observed in at least k interacting protein pairs (|I_O| ≥ k). In this manuscript we consider k ϵ {2,3,4}.
Let O be a domain combination that is observed in at least k interacting protein pairs. We define the odds of O by Equation 1, which can be thought of as the probability that O mediates the interaction of the proteins containing it. Since |NI|/|I| is independent of O, we denote it by m in our calculations (the Results section shows that m = 797.09).
\[
\mathrm{Odds}(O) = \frac{|I_O| / |I|}{|NI_O| / |NI|} = \frac{|I_O| \cdot |NI|}{|NI_O| \cdot |I|}
\]
Equation 1. The probability that O mediates the interaction of the proteins containing it
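As an illustration of Equation 1, the sketch below computes the odds from four counts, reading the equation as the ratio of the fraction of interacting pairs that contain O to the fraction of non-interacting pairs that contain O. The function name is hypothetical, and returning infinity when no non-interacting pair contains O is an assumption, since the text does not specify that case.

```python
import math


def odds(n_io: int, n_nio: int, n_i: int, n_ni: int) -> float:
    """Odds(O) = (|I_O| / |I|) / (|NI_O| / |NI|).

    n_io  -- number of interacting pairs containing O      (|I_O|)
    n_nio -- number of non-interacting pairs containing O  (|NI_O|)
    n_i   -- size of the gold-standard interacting set     (|I|)
    n_ni  -- size of the gold-standard non-interacting set (|NI|)
    """
    if n_nio == 0:
        # Assumption: treat a combination never seen in NI as having
        # unbounded odds; the paper does not state how this is handled.
        return math.inf
    return (n_io / n_i) / (n_nio / n_ni)
```

With the set sizes reported in the Results section (|I| = 6955 and |NI| = 5543821), a hypothetical combination seen in 4 interacting and 10 non-interacting pairs would have odds of m · 4/10, where m = |NI|/|I|.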
Let μP and σP be the mean and standard deviation of the odds of all domain combinations. We take μP as the odds expected of a random domain combination. Moreover, for a pair of proteins (X,Y) containing O, let:
Prob(X,Y|O) = Precision(O) = |I_O| / |I_O ∪ NI_O|
which is the probability that a pair of proteins (X,Y) containing O interacts. Let μR and σR be the mean and standard deviation of the precision of all domain combinations. We take μR as the precision expected of a random domain combination. Consider two thresholds t1 and t2.
Let U be the set of domain combinations that:
(i) Are observed in at least k pairs of interacting proteins.
(ii) Have odds at least t1.
(iii) Have precision at least t2.
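The three conditions defining U can be sketched as a single filtering pass. This is only an illustration under stated assumptions: the per-combination counts are given as a dictionary (hypothetical name `counts`), I and NI are disjoint so that the size of the union of I_O and NI_O is the sum of the two counts, k is at least 1, and a combination absent from NI is given infinite odds.

```python
import math
from typing import Dict, Set, Tuple

Triplet = Tuple[str, str, str]


def select_u(counts: Dict[Triplet, Tuple[int, int]], n_i: int, n_ni: int,
             k: int, t1: float, t2: float) -> Set[Triplet]:
    """Return the combinations meeting conditions (i)-(iii).

    counts maps each combination O to (|I_O|, |NI_O|).
    """
    selected: Set[Triplet] = set()
    for combo, (n_io, n_nio) in counts.items():
        if n_io < k:                       # condition (i)
            continue
        # Assumption: infinite odds when O never occurs in NI.
        odds = math.inf if n_nio == 0 else (n_io / n_i) / (n_nio / n_ni)
        # I and NI are assumed disjoint, so |I_O u NI_O| = |I_O| + |NI_O|.
        precision = n_io / (n_io + n_nio)
        if odds >= t1 and precision >= t2:  # conditions (ii) and (iii)
            selected.add(combo)
    return selected


# Hypothetical toy counts, not taken from the study.
TOY_COUNTS: Dict[Triplet, Tuple[int, int]] = {
    ("A", "B", "C"): (4, 1),
    ("A", "B", "E"): (1, 0),
    ("F", "G", "H"): (3, 9),
}
```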
In the next section we discuss how to obtain the thresholds t1 and t2. A domain combination O is defined to be “sufficient” for mediating the interaction of a pair of proteins (X,Y) if O is contained in (X,Y) and O is in U. We consider a domain combination O to be “necessary” for mediating the interaction of a pair of proteins (X,Y) if there does not exist a domain combination O’ that is “more sufficient” (i.e., has higher precision) for mediating (X,Y) than O. Note that O and O’ need not have any domain in common.
Thus the domain combination O is necessary and sufficient for mediating the interaction of a pair of proteins (X,Y) if and only if:
O = arg max_{O’ ϵ U, O’ ϵ (X,Y)} Prob(X,Y|O’)
Such a domain combination O is also the best explanation for the interaction of (X,Y), and we describe it as a necessary and sufficient domain combination, or “ns domain combination”, of (X,Y).
We denote the set of all ns domain combinations by U’.
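The arg max above can be sketched as follows, assuming that the combinations contained in a protein pair, the set U, and the precision of each combination have already been computed. All names are hypothetical. Ties are broken by taking the first maximal candidate, which is an assumption because the text does not say how ties are handled.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Triplet = Tuple[str, str, str]


def ns_combination(contained: Iterable[Triplet], u: Set[Triplet],
                   precision: Dict[Triplet, float]) -> Optional[Triplet]:
    """Return the combination in U, contained in (X,Y), with highest precision.

    contained -- the combinations O' with O' in (X,Y)
    Returns None when no contained combination belongs to U.
    """
    eligible = [o for o in contained if o in u]
    if not eligible:
        return None
    # max() keeps the first maximal element; tie-breaking is an assumption.
    return max(eligible, key=lambda o: precision[o])
```

Collecting the result of this function over all interacting pairs would give a candidate for the set U’.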
3.1. Odds and Precision Thresholds (t1 and t2)
In this section we use a statistical approach to obtain the thresholds t1 and t2. These statistical evaluations were carried out on the data set described in the Results section. All of the above formulas depend on k. Figures 3 to 8 present the distributions of the odds and precision of domain combinations for k ϵ {2,3,4}.
Table 1 reports some statistical parameters of the odds and precision distributions.
As Figures 3 to 8 show, the odds and precision are not normally distributed. To determine the thresholds t1 and t2, let H0 (the null hypothesis) be the assumption that two proteins containing at least one ns domain combination interact, and let H1 (the alternative hypothesis) be the assumption that two proteins containing at least one ns domain combination do not interact. We find t1 and t2 such that the type-I error for H0 does not exceed γ. We obtain t1 and t2 by solving P(Precision(O) ≥ t2 | H0) = γ and P(Odds(O) ≥ t1 | H0) = γ for each data set. In the present study, we consider γ = 0.05, 0.1, and 0.2. A hypothesis test is considered statistically significant if its P-value is less than or equal to the significance level, in which case the null hypothesis is rejected.
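The text does not say how the distributions in these two equations are estimated. One possible reading, shown here purely as an assumption, is to use the empirical distribution of the observed odds (or precision) values and to take as threshold the value whose upper tail holds a fraction of about γ of the observations. The function name below is hypothetical.

```python
import math
from typing import Sequence


def upper_tail_threshold(values: Sequence[float], gamma: float) -> float:
    """Empirical threshold t such that roughly a fraction gamma of the
    values satisfy value >= t (more than gamma if there are ties at t)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < gamma <= 1:
        raise ValueError("gamma must lie in (0, 1]")
    ordered = sorted(values, reverse=True)
    # Number of observations allowed in the upper tail; the small offset
    # guards against floating-point overshoot in gamma * n.
    rank = max(1, math.ceil(gamma * len(ordered) - 1e-9))
    return ordered[rank - 1]
```

Under this reading, t1 would be obtained by calling the function on the odds values and t2 by calling it on the precision values, once for each choice of k and γ.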
Typical values of this significance level are 0.05 and 0.1. To test more domain combinations, we also consider 0.2 as the significance level. The numbers of ns domain combinations with respect to k and γ are shown in Table 2.
Next, we define three laws for predicting interactions between proteins with respect to ns domain combinations.
3.2. Prediction
Let O = (A,B,C) be an ns domain combination and let X, Y and Z be three proteins. X and Y are predicted to interact according to the following laws:
Law I) X and Y are predicted to interact if:
1. X and Y have a common localization.
2. O = (A,B,C) ϵ (X,Y) (see Figure 9).
Law II) X and Y are predicted to interact if:
1. X, Y and Z have a common localization.
2. (A,B) ϵ (X,Y) and C ϵ D(Z).
3. X and Z interact, and Y and Z interact (see Figure 10).
Law III) X and Y are predicted to interact if:
Law I or Law II holds.
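The three laws can be sketched as Boolean predicates. The sketch below makes several assumptions that are not stated in the text: “common localization” is read as sharing at least one localization term, the condition on C in Law II is read as C being a domain of Z, known interactions are stored as unordered pairs, and Z ranges over all annotated proteins other than X and Y. All names and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Triplet = Tuple[str, str, str]


def co_localized(loc: Dict[str, Set[str]], *proteins: str) -> bool:
    """True if all given proteins share at least one localization term."""
    shared = set(loc[proteins[0]])
    for p in proteins[1:]:
        shared &= loc[p]
    return bool(shared)


def law_1(o: Triplet, x: str, y: str,
          dom: Dict[str, Set[str]], loc: Dict[str, Set[str]]) -> bool:
    a, b, c = o
    return (co_localized(loc, x, y) and a in dom[x] and b in dom[y]
            and (c in dom[x] or c in dom[y]))


def law_2(o: Triplet, x: str, y: str, dom: Dict[str, Set[str]],
          loc: Dict[str, Set[str]], known: Set[FrozenSet[str]]) -> bool:
    a, b, c = o
    if not (a in dom[x] and b in dom[y]):
        return False
    for z in dom:  # search for a third protein Z carrying C
        if z == x or z == y or c not in dom[z]:
            continue
        if (co_localized(loc, x, y, z)
                and frozenset((x, z)) in known
                and frozenset((y, z)) in known):
            return True
    return False


def law_3(o: Triplet, x: str, y: str, dom: Dict[str, Set[str]],
          loc: Dict[str, Set[str]], known: Set[FrozenSet[str]]) -> bool:
    return law_1(o, x, y, dom, loc) or law_2(o, x, y, dom, loc, known)


# Hypothetical toy data: C occurs only in the third protein Z.
DOM = {"X": {"A"}, "Y": {"B"}, "Z": {"C"}}
LOC = {"X": {"nucleus"}, "Y": {"nucleus"}, "Z": {"nucleus"}}
KNOWN = {frozenset(("X", "Z")), frozenset(("Y", "Z"))}
```

In this toy example Law I fails because neither X nor Y carries C, while Law II holds through Z, so Law III predicts an interaction between X and Y.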
4. Results
4.1. Dataset
We used the DIP dataset of yeast protein interactions (http://dip.doe-mbi.ucla.edu/dip/Downlod.cgi). This dataset contains 4928 proteins and 17451 interactions, and these proteins contain 3593 distinct domains. The domains were obtained from http://dip.doe-mbi.ucla.edu/dip/servises.cgi. The data set contains two different types of domains:
1) Domains that are obtained experimentally. There are 1077 such domains in this data set (the Pfam codes of these domains have the prefix PF).
2) Domains that are obtained by automatic methods. There are 2516 such domains in this data set (the Pfam codes of these domains have the prefix PB).
We derive a reliable subset I from this dataset by including only those interactions in which the two interacting proteins (i) have a common localization and (ii) have a common partner. The localization of a protein is its location in the cell. This information can be obtained from the Gene Ontology database, which is available at www.genontology.org. Each of these conditions is highly associated with reliable interactions (12).
Therefore we consider the set I as the gold-standard set of interacting protein pairs. The resulting subset I has 6955 interactions. Subsequently, we constructed a set NI of protein pairs that are assumed, with high probability, to be non-interacting: NI consists of the protein pairs that (i) do not have a common localization and (ii) do not have a common partner. As these protein pairs violate all the key conditions associated with reliable interactions (12), we regard NI as a gold standard of non-interacting protein pairs. The constructed set NI has 5543821 protein pairs.
Therefore, in the above calculations, m = |NI|/|I| = 797.09.
4.2. Evaluation of Prediction
To evaluate the performance of our prediction, we use two measures, precision and recall, which are defined by:
Recall = TP/|W|
Precision = TP/(TP+FP)
With respect to the refined data set (sets I and NI), we define:
W = I
TP = the number of predicted edges that are in I.
FP = the number of predicted edges that are in NI.
And with respect to the primary data set, we define:
W = the primary data set.
TP = the number of predicted edges that exist in the data set.
FP = the number of predicted edges that do not exist in the data set.
Tables 3 and 4 report the prediction results with respect to the refined and primary data sets using the three different laws.
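The recall and precision defined in this section can be computed with simple set operations. The sketch below assumes that predicted edges and reference sets are stored as sets of unordered protein pairs; the function name, the `negatives` parameter and the convention of returning a precision of 0.0 when nothing is predicted are assumptions made for illustration.

```python
from typing import FrozenSet, Optional, Set, Tuple

Edge = FrozenSet[str]


def evaluate(predicted: Set[Edge], w: Set[Edge],
             negatives: Optional[Set[Edge]] = None) -> Tuple[float, float]:
    """Return (recall, precision) for a set of predicted edges.

    Refined data set: pass w = I and negatives = NI.
    Primary data set: pass w = all known interactions and leave negatives
    as None, so that every predicted edge outside w counts as FP.
    """
    tp = len(predicted & w)
    if negatives is None:
        fp = len(predicted - w)
    else:
        fp = len(predicted & negatives)
    recall = tp / len(w)
    # Assumption: report precision 0.0 when no edge is counted at all.
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

Note that with the refined data set, predicted edges that fall in neither I nor NI are ignored by both TP and FP, as in the definitions above.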
By our laws, we predict an interaction between a pair of proteins (X,Y) only if they contain at least one domain pair (A,B) that is part of at least one ns domain combination. Many interactions in the data set do not contain any such domain pair. Therefore recall is not expected to be high. The best recall is obtained with Law III, γ = 0.1 and k = 3, with respect to both data sets. In the next section we examine the effectiveness of the mediator domain in the interaction between two proteins.
4.3. Mediator Domains
We estimate the effectiveness of the mediator domain C in the interaction of two proteins X and Y that contain (A,B), for each ns domain combination O = (A,B,C).
Considering protein pairs that contain (A,B), the recall and the precision are obtained by:
Recall = TP/|W|
Precision = TP/(TP+FP)
With respect to the refined data set (sets I and NI), we define:
W = {(X,Y) ϵ I | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y)}
TP = the number of predicted edges that are in I.
FP = the number of predicted edges that are in NI.
NW = {(X,Y) ϵ NI | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y)}
And with respect to the primary data set, we define:
W = {(X,Y) | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y) and X and Y interact}
TP = the number of predicted edges that are in the data set.
FP = the number of predicted edges that are not in the data set.
NW = {(X,Y) | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y) and X and Y do not interact}
Tables 5 and 6 show that the mediator domain C plays an effective role in the interaction between two proteins X and Y that contain (A,B), for each ns domain combination O = (A,B,C). For example, according to Table 5, for Law III with γ = 0.05 and k = 4, the precision is 0.9691 while the ratio |W|/(|W|+|NW|) = 0.1297. This means that incorporating C into Law III has improved the precision by 0.9691/0.1297 = 7.4719 fold. That is, assuming the absence of errors in the data sets, a pair of proteins exhibiting an ns domain combination (A,B,C) is, on average, 7.4719 times more likely to interact than a pair of proteins exhibiting the domain pair (A,B). It means that the domains A and B, with (A,B) ϵ (X,Y), can account for the interaction between the proteins X and Y if:
- C is in D(X) ∪ D(Y)
or
- There is a protein Z such that C ϵ D(Z), and both (X,Z) and (Y,Z) interact.
Therefore C is called the “mediator domain” for A and B. According to the above results, ns domain combinations and mediator domains allowed us to predict the interactions between some pairs of proteins with high precision.
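The fold improvement quoted in this section for Law III with γ = 0.05 and k = 4 can be checked with a short calculation. The two input values are the ones reported in the text for Table 5; the variable names are chosen here for illustration, and the baseline is read as the precision obtained when every pair containing (A,B) is predicted to interact.

```python
# Precision of Law III when the mediator domain C is taken into account.
precision_with_c = 0.9691

# Baseline |W| / (|W| + |NW|): fraction of pairs containing (A,B) that interact.
baseline_precision = 0.1297

fold_improvement = precision_with_c / baseline_precision
print(round(fold_improvement, 4))  # 7.4719
```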
5. Discussion
In this manuscript we presented a method for predicting protein interactions using ns domain combinations and mediator domains. The results show that mediator domains have an effective role in the prediction of protein interactions. Using ns domain combinations and mediator domains, we predicted highly reliable interactions. That is, a pair of proteins exhibiting an ns domain combination (A,B,C) is more likely to interact than a pair of proteins exhibiting only the domain pair (A,B).
Acknowledgements
I would like to thank the department of research affairs of Shahid Beheshti University.
Authors’ Contribution
The whole manuscript was prepared by C. Eslahchi.
Financial Disclosure
None declared.
Funding/ Support
University of Shahid Beheshti and IPM.