STRIKE: A Protein–Protein Interaction Classification Approach

2011, Advances in Experimental Medicine and Biology

Chapter 1

PROTEIN–PROTEIN INTERACTION PREDICTION USING HOMOLOGY AND INTERDOMAIN LINKER REGION INFORMATION

Nazar Zaki

Nazar Zaki is an Assistant Professor with the College of Information Technology, UAE University, Al Ain 17555, UAE (phone: +971-50-7332135; fax: +971-3-7626309; e-mail: nzaki@uaeu.ac.ae).

1. INTRODUCTION

One of the central problems in modern biology is to identify the complete set of interactions among proteins in a cell. The structural interaction of proteins and their domains in networks is one of the most basic molecular mechanisms of biological cells. Structural evidence indicates that interacting pairs of close homologs usually interact in the same way. In this chapter, we use both homology and inter-domain linker region knowledge to predict interactions between protein pairs solely from amino acid sequence information. A high-quality core set of 150 yeast proteins obtained from the Database of Interacting Proteins (DIP) was used to test the accuracy of the proposed method. The strongest prediction of the method reached over 70% accuracy. These results show great potential for the proposed method.

1.1 The Importance of Protein–Protein Interaction

The term protein–protein interaction (PPI) refers to the association of protein molecules and the study of these associations from the perspective of biochemistry, signal transduction and networks. PPIs occur at almost every level of cell function: in the structure of sub-cellular organelles, the transport machinery across the various biological membranes, the packaging of chromatin, the network of sub-membrane filaments, muscle contraction, signal transduction, and regulation of gene expression, to name a few [1]. Abnormal PPIs have implications in a number of neurological disorders, including Creutzfeldt-Jakob and Alzheimer's diseases. Because of the importance of PPIs in cell development and disease, the topic has been studied extensively for many years.
A large number of approaches to detect PPIs have been developed. Each of these approaches has strengths and weaknesses, especially with regard to the sensitivity and specificity of the method.

1.2 Current Methods to Predict PPI

One of the major goals in functional genomics is to determine protein interaction networks for whole organisms, and many experimental methods have been applied to this problem. Co-immunoprecipitation is considered the gold standard assay for PPIs, especially when it is performed with endogenous proteins [2]. The yeast two-hybrid screen investigates the interaction between artificial fusion proteins inside the nucleus of yeast [3]. Tandem Affinity Purification (TAP) detects interactions within the correct cellular environment [4]. Quantitative immunoprecipitation combined with knock-down (QUICK) detects interactions among endogenous non-tagged proteins [5]. High-throughput methods have also contributed tremendously to the creation of databases containing large sets of protein interactions, such as the Database of Interacting Proteins (DIP) [6], MIPS [7] and the Human Protein Reference Database (HPRD) [8].

In addition, several in silico methods have been developed to predict PPI based on features such as gene context [9]. These include gene fusion [10], gene neighborhood [11] and phylogenetic profile [12]. However, most of the in silico methods seek to predict functional association, which often implies but is not restricted to physical binding. Despite the availability of these methods, their accuracy and coverage have proven to be limited. Computational approaches remain essential both to assist in the design and validation of experimental studies and to predict interaction partners and detailed structures of protein complexes [13].
1.3 Computational Approaches to Predict PPI

Some of the earliest techniques predict interacting proteins through the similarity of expression profiles [14], the similarity of phylogenetic profiles [12] or phylogenetic trees [15], and the patterns of domain fusion [16]. However, it has been noted that these methods predict PPI in a general sense, meaning joint involvement in a certain biological process, and not necessarily actual physical interaction [17]. Most recent work focuses on employing protein domain knowledge [18-22]. The motivation for this choice is that molecular interactions are typically mediated by a great variety of interacting domains [23]. It is thus logical to assume that the patterns of domain occurrence in interacting proteins provide useful information for training PPI prediction methods [24].

An emerging approach in the protein interactions field is to take advantage of structural information to predict physical binding [25-26]. Although the total number of complexes of known structure is relatively small, it is possible to expand this set by considering evolutionary relationships between proteins. It has been shown that in most cases close homologs (>30% sequence identity) physically interact in the same way with each other. However, conservation of a particular interaction depends on the conservation of the interface between interacting partners [27].

In this chapter, we propose to predict PPI using only sequence information. The proposed method combines homology and structural relationships. Homology relationships are incorporated by measuring the similarity between the two proteins of a pair using pairwise alignment. Structural relationships are incorporated in terms of protein domain and inter-domain linker region information.
We are encouraged by the fact that the compositions of contacting residues in protein sequences are unique, and that incorporating evolutionary and predicted structural information improves the prediction of PPI [28].

2. METHOD

In this work, we present a simple yet effective method to predict PPI solely from amino acid sequence information. An overview of the proposed method is illustrated in Fig. 1-1. It consists of three main steps: (a) extract the homology relationship by measuring regions of similarity that may reflect functional, structural or evolutionary relationships between protein sequences; (b) downsize the protein sequences by predicting and eliminating inter-domain linker regions; (c) scan and detect domain matches in all the protein sequences of interest.

Figure 1-1. Overview of the proposed method

2.1 Similarity Measures between Protein Sequences

The proposed method starts by measuring the sequence similarity between proteins, which reflects their evolutionary and homology relationships. Two protein sequences may interact by means of the amino acid similarities they contain [24]. This work is motivated by the observation that an algorithm such as Smith-Waterman (SW) [29], which measures the similarity score between two sequences by a local gapped alignment, provides a relevant measure of similarity between protein sequences. This similarity incorporates biological knowledge about the evolutionary and structural relationships of proteins [30].

The Smith-Waterman similarity score $SW(x_1, x_2)$ between two protein sequences $x_1$ and $x_2$ is the score of the best local alignment with gaps between the two sequences, computed by the Smith-Waterman dynamic programming algorithm. Let us denote by $\mu$ a possible local alignment between $x_1$ and $x_2$, defined by a number $n$ of aligned residues and by the indices $1 \le i_1 < \dots < i_n \le |x_1|$ and $1 \le j_1 < \dots < j_n \le |x_2|$ of the aligned residues in $x_1$ and $x_2$, respectively.
Let us also denote by $\Pi(x_1, x_2)$ the set of all possible local alignments between $x_1$ and $x_2$, and by $p(x_1, x_2, \mu)$ the score of the local alignment $\mu \in \Pi(x_1, x_2)$ between $x_1$ and $x_2$. The Smith-Waterman score $SW(x_1, x_2)$ between sequences $x_1$ and $x_2$ can then be written as:

$$SW(x_1, x_2) = \max_{\mu \in \Pi(x_1, x_2)} p(x_1, x_2, \mu) \qquad (1)$$

The similarity matrix $M$ can be calculated as follows:

$$M = \begin{pmatrix} SW(x_1, x_1) & SW(x_1, x_2) & \cdots & SW(x_1, x_m) \\ SW(x_2, x_1) & SW(x_2, x_2) & \cdots & SW(x_2, x_m) \\ \vdots & \vdots & & \vdots \\ SW(x_m, x_1) & SW(x_m, x_2) & \cdots & SW(x_m, x_m) \end{pmatrix} \qquad (2)$$

where $m$ is the number of protein sequences. For example, suppose we have the following randomly selected PPI dataset: YDR190C, YPL235W, YDR441C, YML022W, YLL059C, YML011C, YGR281W and YPR021C, represented by $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_6$, $x_7$ and $x_8$ respectively. The interaction between these 8 proteins is shown in Fig. 1-2.

Figure 1-2. The interaction between the randomly selected proteins

Then the SW similarity score matrix $M$ will be calculated as:

      x1    x2    x3    x4    x5    x6    x7    x8
x1     X   465    28    30    25    30    34    29
x2   465     X    30    24    32    33    50    47
x3    28    30     X   553    29    27    32    29
x4    30    24   553     X    29    20    25    40
x5    25    32    29    29     X    24    28    49
x6    30    33    27    20    24     X    25    26
x7    34    50    32    25    28    25     X    36
x8    29    47    29    40    49    26    36     X

In $M$, a higher score may reflect an interaction between two proteins. The scores $SW(x_1, x_2)$ and $SW(x_2, x_1)$ are equal to 465, and $SW(x_3, x_4)$ and $SW(x_4, x_3)$ are equal to 553, which confirms the interaction possibilities. However, $SW(x_5, x_6)$ and $SW(x_6, x_5)$ are equal to 24, and $SW(x_7, x_8)$ and $SW(x_8, x_7)$ are equal to 36, which are not the highest scores in their rows. To correct these errors more biological information is needed, which leads us to the second part of the proposed method.

2.2 Identify and Eliminate Inter-Domain Linker Regions

The results could be further enhanced by incorporating inter-domain linker region knowledge.
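Before moving on, the similarity step of Section 2.1 can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: `sw_score` uses placeholder match/mismatch scores and a linear gap penalty instead of a substitution matrix, and the row-wise maximum in `best_partner` is a hypothetical reading rule for $M$. The set `TRUE_PAIRS` lists the pairs that the discussion of the example treats as interacting.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Eq. 1), with toy scoring parameters."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            sub = match if ca == cb else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            value = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


def similarity_matrix(seqs):
    """Matrix M of Eq. 2: all pairwise SW scores among the given sequences."""
    return [[sw_score(a, b) for b in seqs] for a in seqs]


# Example scores printed in Section 2.1 (diagonal "X" entries stored as 0).
LABELS = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
M = [
    [0, 465, 28, 30, 25, 30, 34, 29],
    [465, 0, 30, 24, 32, 33, 50, 47],
    [28, 30, 0, 553, 29, 27, 32, 29],
    [30, 24, 553, 0, 29, 20, 25, 40],
    [25, 32, 29, 29, 0, 24, 28, 49],
    [30, 33, 27, 20, 24, 0, 25, 26],
    [34, 50, 32, 25, 28, 25, 0, 36],
    [29, 47, 29, 40, 49, 26, 36, 0],
]

# Pairs the text treats as interacting (assumed from the discussion of the example).
TRUE_PAIRS = {("x1", "x2"), ("x3", "x4"), ("x5", "x6"), ("x7", "x8")}


def best_partner(matrix, i):
    """Index j != i with the highest score in row i (hypothetical decision rule)."""
    candidates = [j for j in range(len(matrix)) if j != i]
    return max(candidates, key=lambda j: matrix[i][j])


def predicted_partners(matrix, labels):
    """Map each protein label to the label of its top-scoring partner."""
    return {labels[i]: labels[best_partner(matrix, i)] for i in range(len(labels))}


def is_true_pair(a, b):
    return (a, b) in TRUE_PAIRS or (b, a) in TRUE_PAIRS


if __name__ == "__main__":
    for protein, partner in predicted_partners(M, LABELS).items():
        status = "matches" if is_true_pair(protein, partner) else "differs from"
        print(f"{protein} -> {partner} ({status} the assumed interaction)")
```

With the example scores, only x1, x2, x3 and x4 have their assumed partner as the top-scoring entry of their row. For x5 to x8 the row maximum points elsewhere, which is the shortfall that the linker-region step is meant to address.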
The next step of our algorithm is to predict inter-domain linker regions solely from amino acid sequence information. Our intention here is to identify and eliminate all the inter-domain linker regions from the protein sequences of interest. This step downsizes each protein sequence to a shorter one containing only domain information, with the aim of yielding better alignment scores. The prediction is made using a linker index deduced from a data set of domain/linker segments from the SWISS-PROT database [31]. DomCut, developed by Suyama et al. [32], is employed to predict linker regions among functional domains based on the difference in amino acid composition between domain and linker regions. Following Suyama's method [32], we define the linker index $S_i$ for amino acid residue $i$, calculated as follows:

$$S_i = -\ln\left(\frac{f_i^{Linker}}{f_i^{Domain}}\right) \qquad (3)$$

where $f_i^{Linker}$ is the frequency of amino acid residue $i$ in the linker region and $f_i^{Domain}$ is the frequency of amino acid residue $i$ in the domain region. A negative value of $S_i$ means that the amino acid preferably exists in a linker region. A threshold value is needed to separate linker regions, as shown in Fig. 1-3. Amino acids with a linker score greater than the set threshold value will be eliminated from the protein sequence of interest.

Figure 1-3. An example of linker preference profile generated using DomCut. In this case, linker regions greater than the threshold value 0.093 are eliminated from the protein sequence.

When applying the second part of the method, the matrix $M$ will be calculated as follows:

      x1    x2    x3    x4    x5    x6    x7    x8
x1     X   504    30    30    25    32    34    27
x2   504     X    30    21    32    32    50    36
x3    30    30     X   775    29    24    38    29
x4    30    21   775     X    19    21    53    37
x5    25    32    29    19     X    28    28    24
x6    32    32    24    21    28     X    23    27
x7    34    50    38    53    28    23     X   339
x8    27    36    29    37    24    27   339     X

This $M$ provides clearly stronger evidence for the interaction between proteins $x_7$ and $x_8$, so the result is further enhanced. In the following part of the method, protein domain knowledge will be incorporated in $M$ for better accuracy.

2.3 Detecting Domain Matches and Associated Structural Relationships in Proteins

In this part of the method, protein domain knowledge will be incorporated in $M$. Protein domains are highly informative for predicting PPI as they reflect the potential structural relationships between proteins. In this implementation, we employed ps_scan [33] to scan one or several patterns, rules and profiles from PROSITE against our protein sequences of interest. Running ps_scan on the 8 proteins identifies the following domains:

YDR441C ($x_3$) → PS00103
YML022W ($x_4$) → PS00103
YGR281W ($x_7$) → PS00211, PS50893 and PS50929
YPR021C ($x_8$) → PS50929

This reveals structural relationships between proteins $x_3$ and $x_4$, and between proteins $x_7$ and $x_8$. Based on this relationship, the $SW(x_3, x_4)$ and $SW(x_7, x_8)$ scores will be incremented by a specified value (we have randomly selected a value equal to 300). These results did not add more accuracy in this case; however, they confirmed the interaction possibilities between proteins $x_3$ and $x_4$, and between $x_7$ and $x_8$.

3. EXPERIMENTAL WORK

To test the method, we obtained the PPI data from the Database of Interacting Proteins (DIP). The DIP database catalogs experimentally determined interactions between proteins.
It combines information from a variety of sources to create a single, consistent set of PPIs in Saccharomyces cerevisiae. Knowledge about PPI networks is extracted from the most reliable, core subset of the DIP data [34]. The DIP version we used contains 4749 proteins involved in 15675 interactions for which there is domain information [6]. However, only the high-quality core set of 2609 yeast proteins was considered in this experimental work. This core set is involved in 6355 interactions, which have been determined by at least one small-scale experiment or two independent experiments [35]. Furthermore, we selected proteins that interact with only one protein and are not involved in any other interactions. This process results in a dataset of 150 proteins with 75 positive interactions, as shown in Fig. 1-4. The intention is to design a method capable of predicting a protein's interaction partner, which provides a way to construct PPI networks using only protein sequence information.

Figure 1-4. Dataset of core interaction proteins used in the experimental work

We started our experimental work by measuring the sequence similarity between proteins using the SW algorithm as implemented in FASTA [36]. The default parameters are used: gap opening and extension penalties of 13 and 1, respectively, and the BLOSUM62 substitution matrix. The choice of substitution matrix is not trivial because there is no correct scoring scheme for all circumstances. The BLOSUM matrix is a very commonly used amino acid substitution matrix that depends on data from actual substitutions. This procedure produces the matrix $M_{150 \times 150}$. This matrix was then enhanced by incorporating inter-domain linker region information. In this case, only well-defined domains with sequence lengths ranging from 50 to 500 residues were considered. We skipped all the frequently matching (unspecific) domains. A threshold value of 0.093 is used to separate the linker regions. Any residue whose index is greater than the threshold value is eliminated. This procedure downsized the protein sequences without losing the biological information. In fact, running the SW algorithm on sequences consisting only of domains results in better accuracy. A linker preference profile is generated from the linker index values along an amino acid sequence using a sliding window. A window of size w = 15 was used, which gave the best performance among the window sizes tested.

Furthermore, protein domain knowledge is incorporated in $M_{150 \times 150}$. In this implementation, ps_scan [33] is used to scan one or several patterns from PROSITE against the 150 protein sequences. All frequently matching (unspecific) patterns and profiles are skipped. The matrix $M_{150 \times 150}$ is then used to predict the PPI network.

4. RESULTS AND DISCUSSION

The performance of the proposed method is measured by how well it can predict the PPI network. Prediction accuracy, defined as the ratio of the number of correctly predicted interactions between protein pairs to the total number of interaction and non-interaction possibilities in the network, is the best index for evaluating the performance of a predictor. However, approximately 20% of the data are truly interacting proteins, which leads to a rather unbalanced distribution of interacting and non-interacting cases. Suppose we have 6 proteins and the interacting pairs are 1→2, 3→4 and 5→6; this results in 3 interaction cases out of a total of 15 interaction possibilities (12 non-interactions). To assess our method objectively, another two indices are introduced in this paper, namely specificity and sensitivity, which are commonly used in the evaluation of information retrieval.
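The values reported for these indices can be checked against the confusion counts with a short script. This is a sketch under stated assumptions: the counts are those printed in Table 1-1, and the ratios tp/(tp+fp) and tn/(tn+fn) are used only because they reproduce the printed Sens and Spec columns with the table's own column labels. The textbook definitions instead pair tp with fn and tn with fp, so the fp and fn labels in the table may be swapped. The names `evaluate` and `STAGES` are illustrative.

```python
def evaluate(tp, fp, tn, fn):
    """Return (sens, spec, accuracy in %) using the column labels printed in Table 1-1."""
    total = tp + fp + tn + fn
    sens = tp / (tp + fp)  # assumption: this ratio reproduces the printed "Sens" column
    spec = tn / (tn + fn)  # assumption: this ratio reproduces the printed "Spec" column
    accuracy = 100.0 * (tp + tn) / total
    return round(sens, 4), round(spec, 4), round(accuracy, 2)


# Counts (tp, fp, tn, fn) for the three stages of the method, as printed in Table 1-1.
STAGES = {
    "SW similarity only": (82, 68, 15523, 6677),
    "Inter-domain linker regions removed": (86, 64, 15597, 6603),
    "Structural domain evidence added": (91, 59, 15597, 6603),
}

if __name__ == "__main__":
    n_proteins = 150
    # Each stage's counts sum to the number of ordered protein pairs.
    print("ordered protein pairs:", n_proteins * (n_proteins - 1))
    for stage, counts in STAGES.items():
        sens, spec, acc = evaluate(*counts)
        print(f"{stage}: sens={sens}, spec={spec}, accuracy={acc}%")
```

Each row of counts sums to 150 × 149 = 22350, which suggests that the evaluation runs over ordered protein pairs. Under these assumptions the first-stage accuracy evaluates to 69.82, the value shown in Fig. 1-6.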
A high sensitivity means that many of the interactions that occur in reality are detected by the method. A high specificity indicates that most of the interactions detected by the screen also occur in reality. Sensitivity and specificity are combined measures of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn).

Based on these performance measures, the method was able to achieve encouraging results. Figs. 1-5 and 1-6 summarize the sensitivity, specificity and overall accuracy results for the three stages of the method. The figures show a clear improvement in sensitivity but little change in specificity, which is due to the large number of non-interacting possibilities.

Figure 1-5. Sensitivity and specificity results (two panels, Sensitivity and Specificity, plotted over the 3 steps of the method: using only SW algorithm, eliminating inter-domain linker regions, adding protein domains evidence)

Figure 1-6. Overall accuracy (accuracy in % plotted over the same 3 steps of the method)

The overall performance evaluation results are summarized in Table 1-1.

Table 1-1. Overall performance evaluation

Method stage                     tp    fp    tn       fn      Sens     Spec     Accuracy (%)
Similarity using SW algorithm    82    68    15523    6677    0.5467   0.6992   69.82
Inter-domain linker regions      86    64    15597    6603    0.5733   0.7026   70.17
Structural domain evidence       91    59    15597    6603    0.6067   0.7026   70.19

5. CONCLUSION

In this research work we make use of both homology and structural similarities among domains of known interacting proteins to predict putative protein interaction pairs.
When tested on sample data obtained from the DIP, the proposed method shows great potential and offers a new way to predict PPI. It shows that combining methods that predict domain boundaries or linker regions from different aspects with evolutionary relationships would improve the accuracy and reliability of the prediction as a whole. However, it is difficult to directly compare the accuracy of our proposed method with others because the existing methods use different criteria for assessing predictive power. Moreover, these existing methods use completely different characteristics in the prediction. One immediate direction for future work is to consider the entire PPI network rather than restricting our work to binary interactions. Other future work will focus on employing a more powerful domain linker region identifier such as the profile domain linker index (PDLI) [37].

References

1. I. Donaldson, J. Martin, B. Bruijn, B. Wolting, C. Lay, V. Tuekam, B. Zhang, S. Baskin, B. Bader, G. D. Michalickova, K. Pawson, T. and C. W. Hogue, PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine, BMC Bioinformatics, vol. 4(11), (2003).
2. E. Gharakhanian, J. Takahashi, J. Clever, and H. Kasamatsu, In vitro Assay for Protein-Protein Interaction: Carboxyl-Terminal 40 Residues of Simian Virus 40 Structural Protein VP3 Contain a Determinant for Interaction with VP1, PNAS, 85(18), 6607-6611 (1988).
3. P. L. Bartel and S. Fields, The yeast two-hybrid system. In Advances in Molecular Biology (Oxford University Press, New York, 1997).
4. G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin, A generic protein purification method for protein complex characterization and proteome exploration, Nature Biotechnology, 17, 1030-1032 (1999).
5. M. Selbach and M. Mann, Protein interaction screening by quantitative immunoprecipitation combined with knockdown (QUICK), Nature Methods, 3, 981-983 (2006).
6. L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, D. Eisenberg, The Database of Interacting Proteins: 2004 update, Nucleic Acids Res., 1(32), 449-51 (2004).
7. H. W. Mewes, MIPS: analysis and annotation of proteins from whole genomes, Nucleic Acids Res., 32, 41-44 (2004).
8. S. Peri, Human protein reference database as a discovery resource for proteomics, Nucleic Acids Res., 32, 497–501 (2004).
9. J. Espadaler, Detecting remotely related proteins by their interactions and sequence similarity, Proc. Natl Acad. Sci. USA, 102, 7151–7156 (2005).
10. E. Marcotte, Detecting protein function and protein–protein interactions from genome sequences, Science, 285, 751–753 (1999).
11. T. Dandekar, Conservation of gene order: a fingerprint of proteins that physically interact, Trends Biochem. Sci., 23, 324–328 (1998).
12. M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates, Assigning protein functions by comparative genome analysis: protein phylogenetic profiles, Proc. Natl Acad. Sci. USA, 96, 4285–4288 (1999).
13. A. Szilàgyi, V. Grimm, A. K. Arakaki and J. Skolnick, Prediction of physical protein-protein interactions, Phys. Biol., 2, 1-16 (2005).
14. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, A combined algorithm for genome-wide prediction of protein function, Nature, 402, 83–86 (1999).
15. F. Pazos and A. Valencia, Similarity of phylogenetic trees as indicator of protein-protein interaction, Protein Engineering, 14, 609-614 (2001).
16. J. Enright, I. N. Ilipoulos, C. Kyrpides, and C. A. Ouzounis, Protein interaction maps for complete genomes based on gene fusion events, Nature, 402, 86–90 (1999).
17. D. Eisenberg, E. M. Marcotte, I. Xenarios, and T. O. Yeates, Protein function in the postgenomic era, Nature, 405, 823-826 (2000).
18. J. Wojcik and V. Schachter, Protein-Protein interaction map inference using interacting domain profile pairs, Bioinformatics, 17, 296-305 (2001).
19. W. K. Kim, J. Park, and J. K. Suh, Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair, Genome Informatics, 13, 42-50 (2002).
20. S. K. Ng, Z. Zhang, and S. H. Tan, Integrative approach for computationally inferring protein domain interactions, Bioinformatics, 19, 923-929 (2002).
21. S. M. Gomez, W. S. Noble, and A. Rzhetsky, Learning to predict protein-protein interactions from protein sequences, Bioinformatics, 19, 1875-1881 (2003).
22. C. Huang, F. Morcos, S. P. Kanaan, S. Wuchty, A. Z. Chen, and J. A. Izaguirre, Predicting Protein-Protein Interactions from Protein Domains Using a Set Cover Approach, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1), 78-87 (2007).
23. T. Pawson and P. Nash, Assembly of cell regulatory systems through protein interaction domains, Science, 300, 445-452 (2003).
24. N. M. Zaki, S. Deris and H. Alashwal, Protein Protein Interaction Detection Based on Substring Sensitivity Measure, International J. of Biomedical Sci., 1, 148-154 (2006).
25. P. Aloy and R. B. Russell, InterPreTS: protein interaction prediction through tertiary structure, Bioinformatics, 19, 161–162 (2003).
26. L. Lu, Multiprospector: an algorithm for the prediction of protein–protein interactions by multimeric threading, Proteins, 49, 350–364 (2002).
27. J. Espadaler, O. Romero-Isart, R. M. Jackson and B. Oliva, Prediction of protein–protein interactions using distant conservation of sequence patterns and structure relationships, Bioinformatics, 21, 3360–3368 (2005).
28. O. Keskin, A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications, Protein Sci., 13, 1043–1055 (2004).
29. T. Smith and M. Waterman, Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197 (1981).
30. H. Saigo, J. Vert, N. Ueda and T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics, 20(11), 1682-1689 (2004).
31. A. Bairoch and R. Apweiler, The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Res., 28, 45–48 (2000).
32. M. Suyama and O. Ohara, DomCut: prediction of inter-domain linker regions in amino acid sequences, Bioinformatics, 19, 673-674 (2003).
33. A. Gattiker, E. Gasteiger, A. Bairoch, ScanProsite: a reference implementation of a PROSITE scanning tool, Applied Bioinformatics, 1, 107-108 (2002).
34. I. Xenarios, L. Salwínski, X. J. Duan, P. Higney, S. Kim and D. Eisenberg, DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions, Nucleic Acids Res., 30, 303-305 (2002).
35. C. M. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg, Protein interactions: two methods for assessment of the reliability of high throughput observations, Molecular & Cellular Proteomics, 1, 349-56 (2002).
36. W. R. Pearson, Rapid and sensitive sequence comparisons with FASTP and FASTA, Methods Enzymol., 183, 63-93 (1990).
37. Q. Dong, X. Wang, L. Lin, Z. Xu, Domain boundary prediction based on profile domain linker propensity index, Computational Biology and Chemistry, 30, 127–133 (2006).