Department of Computer Sciences, Shahid Beheshti University GC, Tehran, IR Iran

Abstract

Background: Protein-protein interaction data do not directly indicate which domains within the proteins mediate the interactions. Most proteins are multi-domain proteins, and an interaction between two proteins is often determined by pairs of their domains. Most previous studies focus only on interacting domain pairs and do not consider interactions that require the presence of a third domain. Objectives: In this manuscript, we define the concepts of necessary and sufficient triplets of domains and of the mediator domain. Materials and Methods: We approximate these conditions by pragmatic statistical definitions on a set of gold-standard interacting protein pairs and a set of gold-standard non-interacting protein pairs. Results: We introduce a new method for predicting the interaction between two domains using a third domain as a mediator, and we show that the mediator domain has an effective role in the interaction between proteins. Conclusions: Using these concepts, we introduce a method for predicting the interaction between two domains. By evaluating the performance of our method on the yeast protein interaction data set, we show that the mediator domain has an effective role in the interaction between proteins.

1. Background

More than half of eukaryotic proteins are multi-domain proteins (1). It is often assumed that the interaction between two proteins involves the binding of two or more specific domains (2) or the binding of a domain in one protein to short regions (approximately three to eight residues) of the other protein (3). For example, multiple domains of Nkx3.1 are involved in contacting SRF (4). Although more than two domains can be involved in mediating the interaction of two proteins, most previous work has been developed to identify interacting domain pairs, for the purpose of either predicting or explaining protein interactions (5-7). In particular, it has neglected the identification of interactions that require the presence of a third domain. The few exceptions include (8-10). However, while these works have considered domain combinations for predicting protein interactions, they have not evaluated whether these domain combinations are required to mediate protein interactions. Recently, Hou et al. used the concept of a mediating domain in the yeast proteome to construct a domain-mediated protein-protein interaction network (11).

2. Objectives

In this study we find triplets of domains (A, B, C) such that the domain C has an effective role in the interaction of the proteins X and Y containing the domains A and B, respectively. For this purpose, we focus on two related issues: first, identifying those domain pairs that occur frequently in interacting proteins but may not be necessary or sufficient for mediating the interactions of these proteins; and second, characterizing those domain triplets that are necessary and sufficient for mediating the interactions of these proteins. A domain combination (a triplet of domains) is sufficient for mediating the interaction of a pair of co-localized proteins if they interact whenever the domain combination is observed in them.
A domain combination is necessary for mediating the interaction of a pair of interacting proteins if the deletion of any domain in the combination stops the interaction of those proteins.

3. Materials and Methods

This section characterizes domain combinations that are necessary and sufficient for mediating protein interactions. The conditions of being “necessary” and “sufficient” cannot always be determined without additional laboratory experiments. For instance, to determine necessity, one has to delete a domain from a pair of interacting proteins and then test in a laboratory whether the two proteins still interact. We approximate these conditions by pragmatic statistical definitions. Let D(X) be the set of domains of the protein X. For the proteins X and Y and the domains A, B and C, we have (A,B) ϵ (X,Y) if A ϵ D(X) and B ϵ D(Y) (Figure 1). We define (A,B,C) ϵ (X,Y) if (A,B) ϵ (X,Y) and either C ϵ D(X) or C ϵ D(Y) (Figure 2).
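The containment definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the domain identifiers are hypothetical, and the sketch does not exclude the case where C coincides with A or B, since the text states no such restriction.

```python
from itertools import product

def pair_in(a, b, dx, dy):
    """(A,B) in (X,Y): A is a domain of X and B is a domain of Y."""
    return a in dx and b in dy

def triplet_in(a, b, c, dx, dy):
    """(A,B,C) in (X,Y): (A,B) in (X,Y) and C is a domain of X or of Y."""
    return pair_in(a, b, dx, dy) and (c in dx or c in dy)

def triplets_of(dx, dy):
    """Enumerate every triplet (A,B,C) contained in the protein pair (X,Y)."""
    return {(a, b, c) for a, b in product(dx, dy) for c in dx | dy}

# Hypothetical domain sets D(X) and D(Y); the identifiers are made up.
d_x = {"PF00001", "PF00002"}
d_y = {"PF00003"}

print(triplet_in("PF00001", "PF00003", "PF00002", d_x, d_y))  # C lies in D(X)
print(triplet_in("PF00001", "PF00003", "PF99999", d_x, d_y))  # C in neither protein
print(len(triplets_of(d_x, d_y)))  # number of candidate triplets for this pair
```

Representing D(X) as a Python set keeps each membership test constant-time, which matters when every protein pair in a large data set has to be scanned.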

We say that (X,Y) contains (A,B), or that (X,Y) contains (A,B,C), if (A,B) ϵ (X,Y) or (A,B,C) ϵ (X,Y), respectively. Let I be a set of gold-standard interacting protein pairs and NI be a set of gold-standard non-interacting protein pairs. Let O be a domain combination (a triplet of domains). We define I_O = {(X,Y) ϵ I | O ϵ (X,Y)} and NI_O = {(X,Y) ϵ NI | O ϵ (X,Y)}. A domain combination that is observed in only one pair of interacting proteins is not easily justified as a real mediator of the interaction. Therefore it is reasonable to restrict our attention to domain combinations that are observed in at least k interacting protein pairs (|I_O| ≥ k). In this manuscript we consider k ϵ {2,3,4}. Let O be a domain combination that is observed in at least k interacting protein pairs. We define the odds of O by Equation 1, which can be thought of as the probability that O mediates the interaction of the proteins containing it. Since |NI|/|I| is independent of O, we denote it by m in our calculations (the Results section shows that m = 797.09).

$$\mathrm{Odds}(O) = \frac{|I_O| / |I|}{|NI_O| / |NI|} = \frac{|I_O|}{|NI_O|} \cdot \frac{|NI|}{|I|} = m \, \frac{|I_O|}{|NI_O|}$$

Equation 1. The Probability that O Mediates the Interaction of the Containing Proteins

Let μP and σP be the mean and standard deviation of the odds of all domain combinations. We regard μP as the odds expected of a random domain combination. Moreover, for a pair of proteins (X,Y) containing O, let:

$$\mathrm{Prob}(X,Y \mid O) = \mathrm{Precision}(O) = \frac{|I_O|}{|I_O \cup NI_O|}$$

which is the probability of the interaction of a pair of proteins (X,Y) containing O. Let μR and σR be the mean and standard deviation of the precision of all domain combinations. We regard μR as the precision expected of a random domain combination. Consider two thresholds t1 and t2. Let U be the set of domain combinations that:
(i) are observed in at least k pairs of interacting proteins;
(ii) have odds of at least t1;
(iii) have precision of at least t2.
In the next section we discuss how to obtain the thresholds t1 and t2.

A domain combination O is defined to be “sufficient” for mediating the interaction of a pair of proteins (X,Y) if O is contained in (X,Y) and O is in U. We consider a domain combination O to be “necessary” for mediating the interaction of a pair of proteins (X,Y) if there does not exist a domain combination O’ that is “more sufficient” (i.e., has higher precision) for mediating (X,Y) than O. Note that O and O’ need not have any domain in common. Thus the domain combination O is necessary and sufficient for mediating the interaction of a pair of proteins (X,Y) if and only if:

$$O = \arg\max_{O' \in U,\; O' \in (X,Y)} \mathrm{Prob}(X,Y \mid O')$$

Such a domain combination O is also the best explanation for the interaction of (X,Y), and we describe it as a necessary and sufficient domain combination, or “ns domain combination”, of (X,Y). We denote the set of all ns domain combinations by U’.

3.1. Odds and Precision Thresholds (t1 and t2)

In this section we use a statistical approach to obtain the thresholds t1 and t2. These statistical evaluations were carried out on the data set described in the Results section. All of the above formulas depend on k. Figures 3, 4, 5, 6, 7 and 8 present the distributions of the odds and precision of domain combinations for k ϵ {2,3,4}.
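The definitions of odds, precision, the set U and the ns domain combination given at the start of Section 3 can be illustrated with a small Python sketch. All names and the toy data are hypothetical. Two choices are assumptions of this sketch rather than statements from the text: protein pairs are treated as ordered, and the odds are set to infinity when a combination never occurs in NI.

```python
def contains(combo, dx, dy):
    """O = (A,B,C) is contained in (X,Y)."""
    a, b, c = combo
    return a in dx and b in dy and (c in dx or c in dy)

def combo_stats(combo, I, NI, domains):
    """Return (|I_O|, Odds(O), Precision(O)) for one domain combination O."""
    n_i = sum(contains(combo, domains[x], domains[y]) for x, y in I)
    n_ni = sum(contains(combo, domains[x], domains[y]) for x, y in NI)
    m = len(NI) / len(I)
    if n_ni:
        odds = m * n_i / n_ni
    elif n_i:
        odds = float("inf")  # assumption: O never occurs in NI
    else:
        odds = 0.0
    precision = n_i / (n_i + n_ni) if n_i + n_ni else 0.0
    return n_i, odds, precision

def build_u(stats, k, t1, t2):
    """Combinations seen in >= k interacting pairs with odds >= t1 and precision >= t2."""
    return {o for o, (n_i, odds, prec) in stats.items()
            if n_i >= k and odds >= t1 and prec >= t2}

def ns_combination(x, y, u, stats, domains):
    """The member of U contained in (X,Y) with the highest precision, if any."""
    candidates = [o for o in u if contains(o, domains[x], domains[y])]
    return max(candidates, key=lambda o: stats[o][2]) if candidates else None

# Toy data: hypothetical proteins P1..P8 with made-up domains A, B, C, E.
domains = {"P1": {"A", "C"}, "P2": {"B"}, "P3": {"A", "C"}, "P4": {"B"},
           "P5": {"A"}, "P6": {"B", "E"}, "P7": {"A"}, "P8": {"B", "E"}}
I = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
NI = [("P7", "P8"), ("P5", "P8")]
o1, o2 = ("A", "B", "C"), ("A", "B", "E")
stats = {o: combo_stats(o, I, NI, domains) for o in (o1, o2)}
u = build_u(stats, k=2, t1=1.0, t2=0.5)
print(stats[o1], stats[o2])
print(ns_combination("P1", "P2", u, stats, domains))
```

If two candidates have the same precision, `max` returns whichever it meets first, so a real implementation would need an explicit tie-breaking rule; the text does not specify one.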

Table 1 reports some statistical parameters of the odds and precision distributions. As Figures 3, 4, 5, 6, 7 and 8 show, the odds and precision are not normally distributed. To determine the thresholds t1 and t2, let H0 (the null hypothesis) be the assumption that two proteins containing at least one ns domain combination interact, and let H1 (the alternative hypothesis) be the assumption that two proteins containing at least one ns domain combination do not interact. We find t1 and t2 such that the type-I error for H0 does not exceed γ. We obtain t1 and t2 by solving P(Precision(O) ≥ t2 | H0) = γ and P(Odds(O) ≥ t1 | H0) = γ for each data set. In the present study, we consider γ = 0.05, 0.1, and 0.2. A hypothesis test is considered statistically significant if its P-value is less than or equal to the significance level, in which case the null hypothesis is rejected.
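The text does not say how the two equations for t1 and t2 are solved. One possible reading, shown here purely as an assumption, is to take the empirical distribution of the statistic over all domain combinations as a stand-in for its distribution under H0 and to pick the smallest observed value whose upper-tail fraction does not exceed γ. The function name and the sample values are hypothetical.

```python
from bisect import bisect_left

def upper_tail_threshold(values, gamma):
    """Smallest observed value t such that the fraction of values >= t is <= gamma.

    Assumption of this sketch: the empirical distribution of `values`
    approximates the reference distribution in P(statistic >= t | H0) = gamma.
    """
    ordered = sorted(values)
    n = len(ordered)
    for t in ordered:
        tail = (n - bisect_left(ordered, t)) / n  # fraction of values >= t
        if tail <= gamma:
            return t
    return None  # no observed value has a tail that small

# Made-up odds values for 20 domain combinations.
sample_odds = list(range(1, 21))
for gamma in (0.05, 0.1, 0.2):
    print(gamma, upper_tail_threshold(sample_odds, gamma))
```

Because the odds and precision are not normally distributed, an empirical quantile of this kind avoids any parametric assumption, at the cost of depending on the sample at hand.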

Typically, the significance level is taken to be 0.05 or 0.1. To test more domain combinations, we also consider 0.2 as a significance level. The number of ns domain combinations with respect to γ and k is shown in Table 2. In the next section, we define three laws for predicting the interaction between proteins with respect to ns domain combinations.

3.2. Prediction

Let O = (A,B,C) be an ns domain combination and let X, Y and Z be three proteins. An interaction between X and Y is predicted by the following laws:

Law I) It is predicted that X and Y interact if:
1. X and Y have a common localization.
2. O = (A,B,C) ϵ (X,Y) (see Figure 9).

Law II) It is predicted that X and Y interact if:
1. X, Y and Z have a common localization.
2. (A,B) ϵ (X,Y) and C ϵ D(Z).
3. X and Z interact, and Y and Z interact (see Figure 10).

Law III) It is predicted that X and Y interact if Law I or Law II holds.

4. Results

4.1. Dataset

The DIP dataset of yeast protein interactions (http://dip.doe-mbi.ucla.edu/dip/Downlod.cgi) was used. This dataset contains 4928 proteins and 17451 interactions, and these proteins contain 3593 distinct domains. The domains were obtained from http://dip.doe-mbi.ucla.edu/dip/servises.cgi. The data set contains two different types of domains:
1) Domains that are obtained experimentally. There are 1077 such domains in the data set (the prefix of the Pfam codes of these domains is PF).
2) Domains that are obtained by automatic methods. There are 2516 such domains in the data set (the prefix of the Pfam codes of these domains is PB).

We derive a reliable subset I from this dataset by including only those interactions in which the two interacting proteins have (i) a common localization and (ii) a common partner. The localization of a protein is the location of the protein in the cell. This information can be obtained from the Gene Ontology database, which is available at www.genontology.org. Each of these conditions is highly associated with reliable interactions (12).
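Laws I to III of Section 3.2 can be sketched as three small Python predicates. The data structures are assumptions made for illustration: `domains` maps each protein to its set of domains, `loc` maps each protein to a set of localization terms, "common localization" is read as a non-empty intersection of those sets, known interactions are stored as a set of unordered pairs, and the third protein Z is required to differ from X and Y.

```python
def common_loc(loc, *proteins):
    """True if all given proteins share at least one localization term."""
    return bool(set.intersection(*(loc[p] for p in proteins)))

def law_one(x, y, combo, domains, loc):
    a, b, c = combo
    dx, dy = domains[x], domains[y]
    return common_loc(loc, x, y) and a in dx and b in dy and (c in dx or c in dy)

def law_two(x, y, combo, domains, loc, edges):
    a, b, c = combo
    if not (a in domains[x] and b in domains[y]):
        return False
    # Look for a third protein Z that carries C and interacts with both X and Y.
    return any(
        c in domains[z]
        and common_loc(loc, x, y, z)
        and frozenset((x, z)) in edges
        and frozenset((y, z)) in edges
        for z in domains if z not in (x, y)
    )

def law_three(x, y, combo, domains, loc, edges):
    return law_one(x, y, combo, domains, loc) or law_two(x, y, combo, domains, loc, edges)

# Hypothetical toy network.
domains = {"X": {"A"}, "Y": {"B"}, "Z": {"C"}, "W": {"A", "C"}, "V": {"A"}}
loc = {"X": {"nucleus"}, "Y": {"nucleus"}, "Z": {"nucleus"},
       "W": {"nucleus"}, "V": {"cytoplasm"}}
edges = {frozenset(("X", "Z")), frozenset(("Y", "Z"))}
combo = ("A", "B", "C")

print(law_one("X", "Y", combo, domains, loc))           # C is in neither X nor Y
print(law_two("X", "Y", combo, domains, loc, edges))    # Z carries C and bridges X and Y
print(law_three("V", "Y", combo, domains, loc, edges))  # V is not co-localized with Y
```

In this toy network the pair (X, Y) is predicted only through Law II, which is the situation where the mediator domain sits on a third protein.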

Therefore we consider the set I as the gold-standard set of interacting protein pairs. The resulting subset I has 6955 interactions. Subsequently, we constructed a set NI of protein pairs that are assumed to be non-interacting with high probability, as follows: we selected the protein pairs that (i) do not have a common localization and (ii) do not have a common partner. As these protein pairs violate all the key conditions associated with reliable interactions (12), NI is taken as a gold standard of non-interacting protein pairs. The constructed set NI has 5543821 protein pairs. Therefore, in the above calculations, m = |NI|/|I| = 797.09.

4.2. Evaluation of Prediction

To evaluate the performance of our prediction, we use two measures, precision and recall, which are defined by:

Recall = TP/|W|
Precision = TP/(TP+FP)

With respect to the refined data set (sets I and NI), we define:
W = I.
TP = the number of predicted edges that are in I.
FP = the number of predicted edges that are in NI.

With respect to the primary data set, we define:
W = the primary data set.
TP = the number of predicted edges that exist in the data set.
FP = the number of predicted edges that do not exist in the data set.

Tables 3 and 4 report the prediction results with respect to the refined and primary data sets using the three laws.
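The two evaluation settings of Section 4.2 differ only in how false positives are counted, which the following sketch makes explicit. The function and variable names are hypothetical and the edge sets are toy data; only the sizes 6955 and 5543821 come from the text.

```python
def edge(x, y):
    """An unordered protein pair."""
    return frozenset((x, y))

def precision_recall(predicted, w, negatives=None):
    """Recall = TP/|W|, Precision = TP/(TP+FP).

    Refined setting: pass negatives=NI, so FP counts predicted edges in NI.
    Primary setting: leave negatives=None, so FP counts predicted edges not in W.
    """
    tp = len(predicted & w)
    fp = len(predicted - w) if negatives is None else len(predicted & negatives)
    recall = tp / len(w)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return precision, recall

# Toy gold-standard sets and predictions.
I = {edge("a", "b"), edge("c", "d"), edge("e", "f"), edge("g", "h")}
NI = {edge("a", "c"), edge("b", "d")}
predicted = {edge("a", "b"), edge("c", "d"), edge("a", "c"), edge("x", "y")}

refined = precision_recall(predicted, I, NI)
primary = precision_recall(predicted, I)
print(refined, primary)

# The ratio m from the set sizes reported in the text.
m = 5543821 / 6955
print(m)
```

The predicted edge (x, y) belongs to neither I nor NI, so it is ignored in the refined setting but counted as a false positive in the primary setting. This is why the same predictions can score a higher precision on the refined data set.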

Note that, by our laws, we predict an interaction between a pair of proteins (X,Y) only if they contain at least one pair of domains (A,B) that is contained in at least one ns domain combination. Numerous interactions in the data set do not contain any such pair of domains. Therefore the recall is not expected to be high. The best recall is obtained with Law III, γ = 0.1 and k = 3, with respect to both data sets. In the next section we examine the effectiveness of the mediator domain in the interaction between two proteins.

4.3. Mediator Domains

We estimate the effectiveness of the mediator domain C in the interaction of two proteins X and Y that contain (A,B), for each ns domain combination O = (A,B,C). Considering protein pairs that contain (A,B), the recall and the precision are obtained by:

Recall = TP/|W|
Precision = TP/(TP+FP)

With respect to the refined data set (sets I and NI), we define:
W = {(X,Y) ϵ I | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y)}
TP = the number of predicted edges that are in I.
FP = the number of predicted edges that are in NI.
NW = {(X,Y) ϵ NI | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y)}

With respect to the primary data set, we define:
W = {(X,Y) | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y) and X and Y interact}
TP = the number of predicted edges that are in the data set.
FP = the number of predicted edges that are not in the data set.

NW = {(X,Y) | ∃ O = (A,B,C) ϵ U’ s.t. (A,B) ϵ (X,Y) and X and Y do not interact}

Tables 5 and 6 show that the mediator domain C has an effective role in the interaction between two proteins X and Y that contain (A,B), for each ns domain combination O = (A,B,C). For example, according to Table 5, for Law III with γ = 0.05 and k = 4, the precision is 0.9691, while the ratio |W|/(|W|+|NW|) = 0.1297. This means that incorporating C into Law III improves the precision by 0.9691/0.1297 = 7.4719 fold. That is, assuming the absence of errors in the data sets, a pair of proteins exhibiting an ns domain combination (A,B,C) is, on average, 7.4719 times more likely to interact than a pair of proteins exhibiting the domain pair (A,B). It means that the domains A and B, with (A,B) ϵ (X,Y), can account for the interaction between the proteins X and Y if:
- C is in D(X) U D(Y), or
- there is a protein Z such that C ϵ D(Z), and (Y,Z) and (X,Z) interact.
Therefore C is named the “mediator domain” for A and B. According to the above results, ns domain combinations and mediator domains allowed us to predict the interaction between some pairs of proteins with high precision.
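The fold improvement quoted for Table 5 is simply the ratio of the precision obtained with the mediator domain to the baseline ratio |W|/(|W|+|NW|). A short check with the two reported values (the variable names are illustrative only):

```python
precision_with_c = 0.9691  # Law III, gamma = 0.05, k = 4 (Table 5)
baseline = 0.1297          # |W| / (|W| + |NW|): pairs containing (A,B) that interact

fold = precision_with_c / baseline
print(f"{fold:.4f}")  # prints 7.4719
```

The baseline plays the role of the precision one would obtain by predicting an interaction for every protein pair that contains (A,B), without requiring C.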

5. Discussion

In the present manuscript, a method for predicting protein interactions using ns domain combinations and mediator domains is presented. The results show that mediator domains have an effective role in the prediction of protein interactions. Using ns domain combinations and mediator domains, we have predicted highly reliable interactions. That is, a pair of proteins exhibiting an ns domain combination (A,B,C) is more likely to interact than a pair of proteins exhibiting the domain pair (A,B).

Acknowledgements

I would like to thank the department of research affairs of Shahid Beheshti University.

Authors’ Contribution

The whole manuscript has been conducted by C. Eslahchi.

Financial Disclosure

None declared.

Funding/Support

University of Shahid Beheshti and IPM.

References

1. Ekman D, Bjorklund AK, Frey-Skott J, Elofsson A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol. 2005;348(1):231-43.

2. Costa S, Cesareni G. Domains mediate protein-protein interactions and nucleate protein assemblies. Handb Exp Pharmacol. 2008;(186):383-405.

3. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 2005;3(12):e405.

4. Zhang Y, Fillmore RA, Zimmer WE. Structural and functional analysis of domains mediating interaction between the bagpipe homologue, Nkx3.1 and serum response factor. Exp Biol Med (Maywood). 2008;233(3):297-309.

5. Guimaraes KS, Przytycka TM. Interrogating domain-domain interactions with parsimony based approaches. BMC Bioinformatics. 2008;9:171.

6. Raghavachari B, Tasneem A, Przytycka TM, Jothi R. DOMINE: a database of protein domain interactions. Nucleic Acids Res. 2008;36(Database issue):D656-61.

7. Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol. 2007;3(4):e43.

8. Han DS, Kim HS, Jang WH, Lee SD, Suh JK. PreSPI: a domain combination based prediction system for protein-protein interaction. Nucleic Acids Res. 2004;32(21):6312-20.

9. Chua HN, Hugo W, Liu G, Li X, Wong L, Ng SK. A probabilistic graph-theoretic approach to integrate multiple predictions for the protein-protein subnetwork prediction challenge. Ann N Y Acad Sci. 2009;1158:224-33.

10. Li L, Zhao B, Du J, Zhang K, Ling CX, Li SS. DomPep--a general method for predicting modular domain-mediated protein-protein interactions. PLoS One. 2011;6(10):e25528.

11. Hou T, Li N, Li Y, Wang W. Characterization of domain-peptide interaction interface: prediction of SH3 domain-mediated protein-protein interaction network in yeast by generic structure-based models. J Proteome Res. 2012;11(5):2982-95.

12. Chua HN, Wong L. Increasing the reliability of protein interactomes. Drug Discov Today. 2008;13(15-16):652-8.