Chapter 1
PROTEIN–PROTEIN INTERACTION
PREDICTION USING HOMOLOGY AND INTER-DOMAIN LINKER REGION INFORMATION
Nazar Zaki1
1.
INTRODUCTION
One of the central problems in modern biology is to identify the complete
set of interactions among proteins in a cell. The structural interaction of
proteins and their domains in networks is one of the most basic molecular
mechanisms of biological cells. Structural evidence indicates that
interacting pairs of close homologs usually interact in the same way. In this
chapter, we use both homology and inter-domain linker region knowledge to
predict interactions between protein pairs solely from amino acid sequence
information. A high-quality core set of 150 yeast proteins obtained from the
Database of Interacting Proteins (DIP) was used to test the accuracy of the
proposed method. The strongest prediction of the method reached over 70%
accuracy. These results show great potential for the proposed method.
1.1
The Importance of Protein–Protein Interaction
The term protein-protein interaction (PPI) refers to the association of
protein molecules and the study of these associations from the perspective of
biochemistry, signal transduction and networks. PPIs occur at almost every
level of cell function: in the structure of sub-cellular organelles, the
transport machinery across the various biological membranes, the packaging of
chromatin, the network of sub-membrane filaments, muscle contraction,
signal transduction, and regulation of gene expression, to name a few1.
Abnormal PPIs have implications in a number of neurological disorders,
including Creutzfeldt-Jakob and Alzheimer's diseases. Because of the
importance of PPIs in cell development and disease, the topic has been
studied extensively for many years. A large number of approaches to detect
PPIs have been developed. Each of these approaches has strengths and
weaknesses, especially with regard to the sensitivity and specificity of the
method.
1 Nazar Zaki is an Assistant Professor with the College of Information Technology, UAE University, Al Ain 17555, UAE (phone: +971-50-7332135; fax: +971-3-7626309; e-mail: nzaki@uaeu.ac.ae).
1.2
Current Methods to Predict PPI
One of the major goals in functional genomics is to determine protein
interaction networks for whole organisms, and many experimental methods
have been applied to this problem. Co-immunoprecipitation is considered the
gold standard assay for PPIs, especially when it is performed with
endogenous proteins2. The yeast two-hybrid screen investigates the
interaction between artificial fusion proteins inside the nucleus of yeast3.
Tandem Affinity Purification (TAP) detects interactions within the correct
cellular environment4. Quantitative immunoprecipitation combined with
knock-down (QUICK) detects interactions among endogenous non-tagged
proteins5. High-throughput methods have also contributed greatly to the
creation of databases containing large sets of protein interactions, such as
the Database of Interacting Proteins (DIP)6, MIPS7 and the Human Protein
Reference Database (HPRD)8.
In addition, several in silico methods have been developed to predict PPI
based on features such as gene context9. These include gene fusion10, gene
neighborhood11 and phylogenetic profile12. However, most of the in silico
methods seek to predict functional association, which often implies but is not
restricted to physical binding.
Despite the availability of the methods mentioned above, their accuracy and
coverage have proven to be limited. Computational approaches remain
essential both to assist in the design and validation of experimental
studies and to predict interaction partners and the detailed structures of
protein complexes13.
1.3
Computational Approaches to Predict PPI
Some of the earliest techniques predict interacting proteins through the
similarity of expression profiles14, description of similarity of phylogenetic
profiles12 or phylogenetic trees15, and studying the patterns of domain
fusion16. However, it has been noted that these methods predict PPI in a
general sense, meaning joint involvement in a certain biological process, and
not necessarily actual physical interaction17.
Most recent work focuses on employing protein domain knowledge18-22. The
motivation for this choice is that molecular interactions are typically
mediated by a great variety of interacting domains23. It is thus logical to
assume that the patterns of domain occurrence in interacting proteins
provide useful information for training PPI prediction methods24.
An emerging approach in the protein interactions field is to take
advantage of structural information to predict physical binding25-26. Although
the total number of complexes of known structure is relatively small, it is
possible to expand this set by considering evolutionary relationships between
proteins. It has been shown that in most cases close homologs (>30%
sequence identity) physically interact in the same way with each other.
However, conservation of a particular interaction depends on the
conservation of the interface between interacting partners27.
In this chapter, we propose to predict PPI using only sequence
information. The proposed method combines homology and structural
relationships. Homology relationships are incorporated by measuring the
similarity between protein pairs using pairwise alignment. Structural
relationships are incorporated in terms of protein domain and inter-domain
linker region information. We are encouraged by the fact that the
compositions of contacting residues in protein sequences are unique, and that
incorporating evolutionary and predicted structural information improves the
prediction of PPI28.
2.
METHOD
In this work, we present a simple yet effective method to predict PPI
solely from amino acid sequence information. An overview of the proposed
method is illustrated in Fig. 1. It consists of three main steps: (a) extract
the homology relationship by measuring regions of similarity that may reflect
functional, structural or evolutionary relationships between protein
sequences; (b) downsize the protein sequences by predicting and eliminating
inter-domain linker regions; and (c) scan and detect domain matches in all
the protein sequences of interest.
Figure 1-1. Overview of the proposed method
2.1
Similarity Measures between Protein Sequences
The proposed method starts by measuring the sequence similarity between
proteins, which reflects their evolutionary and homology relationships. Two
protein sequences may interact by means of the amino acid similarities they
contain24. This work is motivated by the observation that an algorithm such
as Smith-Waterman (SW)29, which measures the similarity score between
two sequences by a local gapped alignment, provides a relevant measure of
similarity between protein sequences. This similarity incorporates biological
knowledge about protein evolutionary and structural relationships30.
The Smith-Waterman similarity score SW(x1, x2) between two protein
sequences x1 and x2 is the score of the best local alignment with gaps
between the two protein sequences, computed by the Smith-Waterman
dynamic programming algorithm. Let us denote by µ a possible local
alignment between x1 and x2, defined by a number n of aligned residues
and by the indices 1 ≤ i1 < ... < in ≤ |x1| and 1 ≤ j1 < ... < jn ≤ |x2| of
the aligned residues in x1 and x2 respectively. Let us also denote by
Π(x1, x2) the set of all possible local alignments between x1 and x2, and by
p(x1, x2, µ) the score of the local alignment µ ∈ Π(x1, x2) between x1
and x2. The Smith-Waterman score SW(x1, x2) between sequences x1 and x2
can then be written as:

$$SW(x_1, x_2) = \max_{\mu \in \Pi(x_1, x_2)} p(x_1, x_2, \mu) \qquad (1)$$
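Equation (1) can be illustrated with a minimal Python sketch of the Smith-Waterman recurrence. The match and mismatch scores and the linear gap penalty used here are simplifying assumptions made for readability; they are not the substitution matrix or gap penalties used in the experiments of this chapter.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring (illustrative only): +2 for a match, -1 for a
    mismatch and a linear gap penalty of -2 per gapped position.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0                    # best cell seen so far = max over alignments
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACDEF", "ACDEF")
shared_core = smith_waterman_score("XXACDXX", "YYACDYY")
unrelated = smith_waterman_score("AAA", "CCC")
```

Taking the maximum over all cells of the table corresponds to the maximum over all local alignments µ in Equation (1); clamping each cell at zero is what makes the alignment local rather than global.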
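The similarity matrix M can be calculated as follows:

$$
M = \begin{pmatrix}
SW(x_1, x_1) & SW(x_1, x_2) & \cdots & SW(x_1, x_m) \\
SW(x_2, x_1) & SW(x_2, x_2) & \cdots & SW(x_2, x_m) \\
\vdots & \vdots & \ddots & \vdots \\
SW(x_m, x_1) & SW(x_m, x_2) & \cdots & SW(x_m, x_m)
\end{pmatrix}
\qquad (2)
$$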
where m is the number of the protein sequences. For example, suppose
we have the following randomly selected PPI dataset: YDR190C,
YPL235W, YDR441C, YML022W, YLL059C, YML011C, YGR281W and
YPR021C represented by x1 , x2 , x3 , x4 , x5 , x6 , x7 and x8 respectively.
The interaction between these 8 proteins is shown in Fig. 2.
Figure 1-2. The interaction between the randomly selected proteins
Then the SW similarity score matrix M is calculated as:

        x1    x2    x3    x4    x5    x6    x7    x8
x1       X   465    28    30    25    30    34    29
x2     465     X    30    24    32    33    50    47
x3      28    30     X   553    29    27    32    29
x4      30    24   553     X    29    20    25    40
x5      25    32    29    29     X    24    28    49
x6      30    33    27    20    24     X    25    26
x7      34    50    32    25    28    25     X    36
x8      29    47    29    40    49    26    36     X
In M, a higher score may reflect an interaction between two proteins.
SW(x1, x2) and SW(x2, x1) are equal to 465, and SW(x3, x4) and SW(x4, x3)
are equal to 553; these are the highest scores in their rows, which supports
the interaction possibilities. However, SW(x5, x6) and SW(x6, x5) are equal
to 24, and SW(x7, x8) and SW(x8, x7) are equal to 36, which are not the
highest scores for those proteins. Correcting these errors requires more
biological information, which leads us to the second part of the proposed
method.
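The reading of M described above can be sketched in Python by taking, for each protein, the partner with the highest off-diagonal score. The scores are those of the 8-protein example; the function name, the use of 0 on the diagonal and the pairing x1-x2, x3-x4, x5-x6, x7-x8 (taken from the discussion above) are assumptions of this sketch, not part of the original method description.

```python
LABELS = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

# SW similarity scores of the 8-protein example; the diagonal (X) is set to 0.
M = [
    [0, 465, 28, 30, 25, 30, 34, 29],
    [465, 0, 30, 24, 32, 33, 50, 47],
    [28, 30, 0, 553, 29, 27, 32, 29],
    [30, 24, 553, 0, 29, 20, 25, 40],
    [25, 32, 29, 29, 0, 24, 28, 49],
    [30, 33, 27, 20, 24, 0, 25, 26],
    [34, 50, 32, 25, 28, 25, 0, 36],
    [29, 47, 29, 40, 49, 26, 36, 0],
]

# Interacting pairs as discussed in the text (assumed from that discussion).
KNOWN_PARTNER = {"x1": "x2", "x2": "x1", "x3": "x4", "x4": "x3",
                 "x5": "x6", "x6": "x5", "x7": "x8", "x8": "x7"}


def best_partners(matrix, labels):
    """Map each protein to (highest-scoring other protein, score)."""
    partners = {}
    for i, row in enumerate(matrix):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        best_score, best_j = max(candidates)
        partners[labels[i]] = (labels[best_j], best_score)
    return partners


predicted = best_partners(M, LABELS)
recovered = sum(predicted[p][0] == KNOWN_PARTNER[p] for p in LABELS)
```

With these scores, only x1, x2, x3 and x4 are assigned their discussed partners; x5 to x8 are not, which is the shortcoming the next step of the method addresses.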
2.2
Identify and Eliminate Inter-Domain Linker Regions
The results could be further enhanced by incorporating inter-domain
linker region knowledge. The next step of our algorithm is to predict
inter-domain linker regions solely from amino acid sequence information. Our
intention here is to identify and eliminate all the inter-domain linker
regions from the protein sequences of interest. This step downsizes each
protein sequence to a shorter one containing only domain information, in
order to yield better alignment scores. The prediction is made using a
linker index deduced from a data set of domain/linker segments from the
SWISS-PROT database31. DomCut, developed by Suyama and Ohara32, is employed
to predict linker regions among functional domains based on the difference
in amino acid composition between domain and linker regions. Following
Suyama's method32, we define the linker index Si for amino acid residue i as
follows:

$$S_i = -\ln\left(\frac{f_i^{\mathrm{Linker}}}{f_i^{\mathrm{Domain}}}\right) \qquad (3)$$

where f_i^Linker is the frequency of amino acid residue i in the linker
region and f_i^Domain is the frequency of amino acid residue i in the domain
region. A negative value of Si means that the amino acid preferably exists
in a linker region. A threshold value is needed to separate linker regions,
as shown in Fig. 3. Amino acids with a linker score greater than the set
threshold value are eliminated from the protein sequence of interest.
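As an illustration of Equation (3), the following Python sketch derives a linker index from two sets of segments. The segments are hypothetical toy strings, not SWISS-PROT data, and the choice to skip residues that are absent from either set is an assumption of this sketch rather than a documented DomCut behaviour.

```python
import math
from collections import Counter


def residue_frequencies(segments):
    """Relative frequency of each residue over all given segments."""
    counts = Counter("".join(segments))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def linker_index(linker_segments, domain_segments):
    """S_i = -ln(f_i^Linker / f_i^Domain) for residues seen in both sets."""
    f_linker = residue_frequencies(linker_segments)
    f_domain = residue_frequencies(domain_segments)
    shared = f_linker.keys() & f_domain.keys()
    return {aa: -math.log(f_linker[aa] / f_domain[aa]) for aa in shared}


# Hypothetical toy segments: P is 4 of 8 linker residues, 1 of 8 domain residues.
toy_linkers = ["PPGS", "PSGP"]
toy_domains = ["ALVL", "VLAP"]
index = linker_index(toy_linkers, toy_domains)
```

In this toy case proline is four times more frequent in the linker segments than in the domain segments, so its index is -ln 4, a negative value, in line with the interpretation that negative values indicate a preference for linker regions.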
Figure 1-3. An example of a linker preference profile generated using DomCut. In this case,
linker regions greater than the threshold value 0.093 are eliminated from the protein sequence.
When the second part of the method is applied, the matrix M becomes:

        x1    x2    x3    x4    x5    x6    x7    x8
x1       X   504    30    30    25    32    34    27
x2     504     X    30    21    32    32    50    36
x3      30    30     X   775    29    24    38    29
x4      30    21   775     X    19    21    53    37
x5      25    32    29    19     X    28    28    24
x6      32    32    24    21    28     X    23    27
x7      34    50    38    53    28    23     X   339
x8      27    36    29    37    24    27   339     X
From M, more evidence is now available to confirm the interaction
possibility between proteins x7 and x8 (the score rises from 36 to 339), so
the result is further enhanced. In the following part of the method, protein
domain knowledge is incorporated in M for better accuracy.
2.3
Detecting Domain Matches and Associated
Structural Relationships in Proteins
In this part of the method, protein domain knowledge is incorporated in M.
Protein domains are highly informative for predicting PPI, as they reflect
the potential structural relationships between proteins. In this
implementation, we employed ps_scan33 to scan one or several patterns,
rules and profiles from PROSITE against our protein sequences of interest.
Running ps_scan on the 8 proteins identifies the following domains:
YDR441C (x3) → PS00103
YML022W (x4) → PS00103
YGR281W (x7) → PS00211, PS50893 and PS50929
YPR021C (x8) → PS50929
This reveals structural relationships between proteins x3 and x4, and
between proteins x7 and x8. Based on these relationships, the SW(x3, x4) and
SW(x7, x8) scores are incremented by a specified value (we arbitrarily
selected a value of 300). These results did not improve the accuracy in this
case; however, they confirm the interaction possibilities between proteins
x3 and x4, and between x7 and x8.
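The score increment described above can be sketched as follows. The PROSITE accessions are the ones reported for the 8-protein example and the starting scores come from the matrix obtained after linker elimination; the function name, the dictionary layout and the rule "any shared accession triggers the bonus" are assumptions made for illustration.

```python
from itertools import combinations

# PROSITE matches reported for the 8-protein example.
DOMAINS = {
    "x3": {"PS00103"},
    "x4": {"PS00103"},
    "x7": {"PS00211", "PS50893", "PS50929"},
    "x8": {"PS50929"},
}


def add_domain_evidence(scores, domains, bonus=300):
    """Return a copy of scores with bonus added to pairs sharing a domain.

    scores maps a sorted (protein, protein) tuple to its similarity score.
    """
    updated = dict(scores)
    for a, b in combinations(sorted(domains), 2):
        if domains[a] & domains[b]:          # at least one accession in common
            updated[(a, b)] = updated.get((a, b), 0) + bonus
    return updated


# Scores taken from the matrix obtained after linker elimination.
scores = {("x3", "x4"): 775, ("x7", "x8"): 339, ("x4", "x7"): 53}
boosted = add_domain_evidence(scores, DOMAINS)
```

Only the pairs x3-x4 and x7-x8 share an accession, so only their scores change; the pair x4-x7 keeps its score of 53.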
3.
EXPERIMENTAL WORK
To test the method, we obtained the PPI data from the Database of
Interacting Proteins (DIP). The DIP database catalogs experimentally
determined interactions between proteins. It combines information from a
variety of sources to create a single, consistent set of PPIs in Saccharomyces
cerevisiae. Knowledge about PPI networks is extracted from the most
reliable, core subset of the DIP data34. The DIP version we used contains
4749 proteins involved in 15675 interactions for which there is domain
information6. However, only a high-quality core set of 2609 yeast proteins
was considered in this experimental work. This core set is involved in 6355
interactions, which have been determined by at least one small-scale
experiment or two independent experiments35. Furthermore, we selected only
proteins that interact with exactly one protein and are not involved in any
other interactions. This process results in a dataset of 150 proteins with 75
positive interactions, as shown in Fig. 4. The intention is to design a
method capable of predicting a protein's interaction partner, which provides
a way to construct PPI networks using only protein sequence information.
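The selection of proteins that have exactly one interaction partner can be sketched as a degree filter over an interaction list. The protein identifiers below are hypothetical placeholders rather than DIP entries, and the sketch assumes the input list contains no duplicate pairs.

```python
from collections import Counter


def single_partner_pairs(interactions):
    """Keep interactions in which both proteins occur in no other interaction."""
    degree = Counter()
    for a, b in interactions:
        degree[a] += 1
        degree[b] += 1
    return [(a, b) for a, b in interactions
            if a != b and degree[a] == 1 and degree[b] == 1]


# Hypothetical toy network: P4 has two partners, so P3-P4 and P4-P5 are dropped.
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P4", "P5"), ("P6", "P7")]
selected = single_partner_pairs(toy_interactions)
selected_proteins = {p for pair in selected for p in pair}
```

Every retained interaction contributes exactly two proteins, which is consistent with the 150 proteins and 75 interactions of the dataset described above.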
Figure 1-4. Dataset of core interaction proteins used in the experimental work
We started our experimental work by measuring the sequence similarity
between proteins using the SW algorithm as implemented in FASTA36. The
default parameters are used: gap opening and extension penalties of 13 and
1, respectively, and the BLOSUM62 substitution matrix. The choice of
substitution matrix is not trivial because there is no correct scoring
scheme for all circumstances. BLOSUM is a very commonly used family of amino
acid substitution matrices that depends on data from actual substitutions.
This procedure produces the 150 × 150 matrix M. This matrix was then
enhanced by incorporating inter-domain linker region information. In this
case, only well-defined domains with sequence lengths ranging from 50 to 500
residues were considered. We skipped all the frequently matching
(unspecific) domains. A threshold value of 0.093 is used to separate the
linker regions. Any residue that generates an index greater than the
threshold value is eliminated. This procedure downsizes the protein
sequences without losing the biological information. In fact, running the SW
algorithm on a sequence consisting purely of domains results in better
accuracy. A linker preference profile is generated from the linker index
values along an amino acid sequence using a sliding window. A window of size
w = 15 was used, which gave the best performance among the window sizes
tested.
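The sliding-window profile and the threshold rule can be sketched as below. The removal rule follows the text as stated (a residue whose windowed index is greater than the threshold is eliminated). The per-residue index values in the example are hypothetical and carry no biological meaning, and truncating the window at the sequence ends is an assumption of this sketch.

```python
def linker_profile(sequence, index, window=15):
    """Average the per-residue linker index over a centred sliding window."""
    half = window // 2
    values = [index.get(aa, 0.0) for aa in sequence]
    profile = []
    for pos in range(len(values)):
        lo = max(0, pos - half)                 # window is truncated at the ends
        hi = min(len(values), pos + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


def remove_linker_residues(sequence, index, threshold=0.093, window=15):
    """Drop residues whose windowed index is greater than the threshold."""
    profile = linker_profile(sequence, index, window)
    return "".join(aa for aa, score in zip(sequence, profile)
                   if score <= threshold)


# Hypothetical index: only P contributes; a small window keeps the example short.
toy_index = {"P": 1.0}
toy_profile = linker_profile("MKVLPPPGDEF", toy_index, window=3)
trimmed = remove_linker_residues("MKVLPPPGDEF", toy_index, window=3)
```

In the toy sequence the three prolines and their immediate neighbours exceed the threshold, so the sequence is downsized to the residues on either side of that stretch.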
Furthermore, protein domain knowledge is incorporated in the 150 × 150
matrix M. In this implementation, ps_scan33 is used to scan one or several
patterns from PROSITE against the 150 protein sequences. All frequently
matching (unspecific) patterns and profiles are skipped. The matrix M is then
used to predict the PPI network.
4.
RESULTS AND DISCUSSION
The performance of the proposed method is measured by how well it can
predict the PPI network. Prediction accuracy, defined as the ratio of the
number of correctly predicted protein pairs (interacting and
non-interacting) to the total number of interaction and non-interaction
possibilities in the network, is the best index for evaluating the
performance of a predictor. However, approximately 20% of the data are truly
interacting proteins, which leads to a rather unbalanced distribution of
interacting and non-interacting cases. Suppose we have 6 proteins whose
interacting pairs are 1→2, 3→4 and 5→6. This results in 3 interaction cases
out of a total of 15 interaction possibilities (12 non-interactions).
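The counting in this example can be checked with a short Python sketch; the function name is an illustrative choice.

```python
from math import comb


def pair_counts(n_proteins, n_interactions):
    """Return (possible pairs, interacting pairs, non-interacting pairs)."""
    total = comb(n_proteins, 2)   # unordered pairs of distinct proteins
    return total, n_interactions, total - n_interactions


small_example = pair_counts(6, 3)
full_dataset = pair_counts(150, 75)
```

For 6 proteins this gives 15 possible pairs, 3 interacting and 12 non-interacting, so 3/15 = 20% of the pairs interact. The same count for the 150-protein dataset gives 11175 unordered pairs, of which 75 interact; counting each pair in both directions gives 22350, which equals the sum of the four counts in each row of Table 1-1.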
To assess our method objectively, two further indices are introduced in
this chapter, namely sensitivity and specificity, which are commonly used in
the evaluation of information retrieval. A high sensitivity means that many
of the interactions that occur in reality are detected by the method. A high
specificity indicates that most of the interactions detected by the screen
also occur in reality. Sensitivity and specificity are combined measures
of true positives (tp), true negatives (tn), false positives (fp) and false
negatives (fn), and can be expressed as:

$$\mathrm{Sensitivity} = \frac{tp}{tp + fn}, \qquad \mathrm{Specificity} = \frac{tn}{tn + fp}$$
Based on these performance measures, the method was able to achieve
encouraging results. Figs. 5 and 6 summarize the sensitivity, specificity
and accuracy results for the three stages of the method. The figures
clearly show an improvement in sensitivity but little in specificity,
because of the large number of non-interacting possibilities.
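The three measures can be computed with the standard definitions, as in the sketch below. The counts are those reported for the first stage (SW similarity only) in Table 1-1; reading 68 as the number of missed interactions (fn) and 6677 as the number of spurious predictions (fp) is an interpretation adopted here because it reproduces the reported sensitivity and specificity.

```python
def evaluate(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)                  # detected / truly interacting
    specificity = tn / (tn + fp)                  # rejected / truly non-interacting
    accuracy = (tp + tn) / (tp + fn + tn + fp)    # correct / all pairs
    return sensitivity, specificity, accuracy


# First stage of the method (SW similarity only), counts from Table 1-1.
sens, spec, acc = evaluate(tp=82, fn=68, tn=15523, fp=6677)
sens_rounded = round(sens, 4)
spec_rounded = round(spec, 4)
acc_percent = round(100 * acc, 2)
```

Under this reading the sketch returns a sensitivity of 0.5467, a specificity of 0.6992 and an accuracy of 69.82%, matching the values plotted for the first stage.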
[Figure 1-5: two bar charts, "Sensitivity" and "Specificity", plotted over the 3 steps of the method (using only SW algorithm; eliminating inter-domain linker regions; adding protein domains evidence). Bar values: sensitivity 0.5467, 0.5733 and 0.6067; specificity 0.6992, 0.7026 and 0.7026.]
Figure 1-5. Sensitivity and specificity results
[Figure 1-6: bar chart of accuracy (%) over the same 3 steps of the method. Bar values: 69.82, 70.17 and 70.19.]
Figure 1-6. Overall accuracy
The overall performance evaluation results are summarized in Table 1.
Table 1-1. Overall performance evaluation

Method stage                     tp    fn    tn      fp     Sens     Spec     Accuracy (%)
Similarity Using SW Algorithm    82    68    15523   6677   0.5467   0.6992   69.82
Inter-domain Linker Regions      86    64    15597   6603   0.5733   0.7026   70.17
Structural Domain Evidence       91    59    15597   6603   0.6067   0.7026   70.19

5.
CONCLUSION
In this research work, we make use of both homology and structural
similarities among domains of known interacting proteins to predict putative
protein interaction pairs. When tested on sample data obtained from the
DIP, the proposed method shows great potential and offers a new perspective
on predicting PPI. The results indicate that combining methods that predict
domain boundaries or linker regions from different aspects with evolutionary
relationships improves the accuracy and reliability of the prediction as a
whole. However, it is difficult to compare the accuracy of our proposed
method directly with other existing methods, because they use different
criteria for assessing predictive power. Moreover, these existing methods use
completely different characteristics in the prediction. One immediate
direction for future work is to consider the entire PPI network rather than
restricting our work to binary interactions. Other future work will focus on
employing a more powerful domain linker region identifier, such as the
profile domain linker index (PDLI)37.
References
1. I. Donaldson, J. Martin, B. Bruijn, B. Wolting, C. Lay, V. Tuekam, B. Zhang, S. Baskin,
B. Bader, G. D. Michalickova, K. Pawson, T. and C. W. Hogue, PreBIND and Textomy-mining the biomedical literature for protein-protein interactions using a support vector
machine, BMC Bioinformatics, vol. 4(11), (2003).
2. E. Gharakhanian, J. Takahashi, J. Clever, and H. Kasamatsu, In vitro Assay for
Protein-Protein Interaction: Carboxyl-Terminal 40 Residues of Simian Virus 40 Structural
Protein VP3 Contain a Determinant for Interaction with VP1, PNAS, 85(18), 6607-6611
(1998).
3. P. L. Bartel and S. Fields, The yeast two-hybrid system. In Advances in Molecular
Biology (Oxford University Press, New York, 1997).
4. G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin, A generic
protein purification method for protein complex characterization and proteome
exploration, Nature Biotechnology, 17, 1030-1032 (1999).
5. M. Selbach and M. Mann, Protein interaction screening by quantitative
immunoprecipitation combined with knockdown (QUICK), Nature Methods, 3, 981-983
(2006).
6. L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, D. Eisenberg, The
Database of Interacting Proteins: 2004 update, Nucleic Acids Res., 1(32), 449-51 (2004).
7. H. W. Mewes, MIPS: analysis and annotation of proteins from whole genomes, Nucleic
Acids Res., 32, 41-44 (2004).
8. S. Peri, Human protein reference database as a discovery resource for proteomics,
Nucleic Acids Res., 32, 497–501 (2004).
9. J. Espadaler, Detecting remotely related proteins by their interactions and sequence
similarity, Proc. Natl Acad. Sci. USA, 102, 7151–7156 (2005).
10. E. Marcotte, Detecting protein function and protein–protein interactions from genome
sequences, Science, 285, 751–753 (1999).
11. T. Dandekar, Conservation of gene order: a fingerprint of proteins that physically
interact, Trends Biochem. Sci., 23, 324–328 (1998).
12. M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates,
Assigning protein functions by comparative genome analysis: protein phylogenetic
profiles, Proceedings of National Academy of Sciences, USA, 96, 4285–4288 (1999).
13. A. Szilàgyi, V. Grimm, A. K. Arakaki and J. Skolnick, Prediction of physical
protein-protein interactions, Phys. Biol., 2, 1-16 (2005).
14. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, A
combined algorithm for genome-wide prediction of protein function, Nature, 402, 83–86
(1999).
15. F. Pazos and A. Valencia, Similarity of phylogenetic trees as indicator of protein-protein
interaction, Protein Engineering, 14, 609-614 (2001).
16. J. Enright, I. N. Ilipoulos, C. Kyrpides, and C. A. Ouzounis, Protein interaction maps for
complete genomes based on gene fusion events, Nature, 402, 86–90 (1999).
17. D. Eisenberg, E. M. Marcotte, I. Xenarios, and T. O. Yeates, Protein function in the
post-genomic era, Nature, 405, 823-826 (2000).
18. J. Wojcik and V. Schachter, Protein-Protein interaction map inference using interacting
domain profile pairs, Bioinformatics, 17, 296-305 (2001).
19. W. K. Kim, J. Park, and J. K. Suh, Large scale statistical prediction of protein-protein
interaction by potentially interacting domain (PID) pair, Genome Informatics, 13, 42-50
(2002).
20. S. K. Ng, Z. Zhang, and S. H. Tan, Integrative approach for computationally inferring
protein domain interactions, Bioinformatics, 19, 923-929 (2002).
21. S. M. Gomez, W. S. Noble, and A. Rzhetsky, Learning to predict protein-protein
interactions from protein sequences, Bioinformatics, 19, 1875-1881 (2003).
22. C. Huang, F. Morcos, S. P. Kanaan, S. Wuchty, A. Z. Chen, and J. A. Izaguirre,
Predicting Protein-Protein Interactions from Protein Domains Using a Set Cover
Approach, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1),
78-87 (2007).
23. T. Pawson and P. Nash, Assembly of cell regulatory systems through protein interaction
domains, Science, 300, 445-452 (2003).
24. N. M. Zaki, S. Deris and H. Alashwal, Protein Protein Interaction Detection Based on
Substring Sensitivity Measure, International J. of Biomedical Sci., 1, 148-154 (2006).
25. P. Aloy and R. B. Russell, InterPreTS: protein interaction prediction through tertiary
structure, Bioinformatics, 19, 161–162 (2003).
26. L. Lu, Multiprospector: an algorithm for the prediction of protein–protein interactions by
multimeric threading, Proteins, 49, 350–364 (2002).
27. J. Espadaler, O. Romero-Isart, R. M. Jackson and B. Oliva, Prediction of protein–
protein interactions using distant conservation of sequence patterns and structure
relationships, Bioinformatics, 21, 3360–3368 (2005).
28. O. Keskin, A new, structurally nonredundant, diverse data set of protein–protein
interfaces and its implications, Protein Sci., 13, 1043–1055 (2004).
29. T. Smith and M. Waterman, Identification of common molecular subsequences, J.
Mol. Bio., 147, 195-197 (1981).
30. H. Saigo, J. Vert, N. Ueda and T. Akutsu, Protein homology detection using string
alignment kernels, Bioinformatics, 20(11), 1682-1689 (2004).
31. A. Bairoch and R. Apweiler, The SWISS-PROT protein sequence database and its
supplement TrEMBL in 2000, Nucleic Acids Res., 28, 45–48 (2000).
32. M. Suyama and O. Ohara, DomCut: prediction of inter-domain linker regions in amino
acid sequences, Bioinformatics, 19, 673-674 (2003).
33. A. Gattiker, E. Gasteiger, A. Bairoch, ScanProsite: a reference implementation of a
PROSITE scanning tool, Applied Bioinformatics, 1, 107-108 (2002).
34. I. Xenarios, L. Salwínski, X. J. Duan, P. Higney, S. Kim and D. Eisenberg, DIP, the
Database of Interacting Proteins: a research tool for studying cellular networks of protein
interactions, Nucleic Acids Research, Oxford University Press, 30, 303-305 (2002).
35. C. M. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg, Protein interactions: two
methods for assessment of the reliability of high throughput observations, Molecular &
Cellular Proteomics, 1, 349-56 (2002).
36. W. R. Pearson, Rapid and sensitive sequence comparisons with FASTP and FASTA,
Methods Enzymol., 183, 63-93 (1985).
37. Q. Dong, X. Wang, L. Lin, Z. Xu, Domain boundary prediction based on profile domain
linker propensity index, Computational Biology and Chemistry, 30, 127–133 (2006).