Lecture 101
10/9/2013
Need to find a more efficient method.
Sacrifice the certainty of an optimum alignment for the certainty of a good alignment obtained faster.
Feng-Doolittle algorithm
Does all pairwise alignments and scores them.
Converts pairwise scores to distances:
$D = -\log S_{eff} = -\log\left[\frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}\right]$
$S_{obs}$ = pairwise alignment score
$S_{rand}$ = expected score for a random alignment
$S_{max}$ = average of the self-alignment scores of the two sequences
10/9/2013 Bioinformatics I Fall 2002 5
As evolutionary distance increases, $S_{obs}$ falls toward $S_{rand}$ and $S_{eff}$ goes down; this is why the log is used to scale $S_{eff}$, so that the distance is roughly linear with evolutionary distance.
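The distance conversion can be sketched in a few lines of Python. This is a minimal illustration, not the Feng-Doolittle implementation: the function name and the example scores are hypothetical, and the natural logarithm is an assumption, since the slide only says "log".

```python
import math

def feng_doolittle_distance(s_obs, s_rand, s_max):
    """Convert a pairwise score to a distance: D = -log(S_eff).

    S_eff = (S_obs - S_rand) / (S_max - S_rand).
    Natural log is an assumption; the slide does not give the base.
    """
    if s_max <= s_rand:
        raise ValueError("S_max must exceed S_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("S_obs must exceed S_rand for a finite distance")
    return -math.log(s_eff)

# Hypothetical scores: a closely related pair and a distant pair.
d_close = feng_doolittle_distance(s_obs=90.0, s_rand=10.0, s_max=100.0)
d_far = feng_doolittle_distance(s_obs=30.0, s_rand=10.0, s_max=100.0)
```

A pair whose observed score is near the self-alignment average gets a distance near zero; a pair whose score is near the random expectation gets a large distance.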
Once the distances have been calculated, construct a guide tree (more in the phylogeny class); the tree tells what order to group the sequences in.
Sequences can be aligned with sequences or with groups; groups can be aligned with groups.
Sequence-sequence alignments: dynamic programming.
Sequence-group alignments: all possible pairwise alignments between the sequence and the members of the group are tried; the highest-scoring pair determines how the sequence is aligned to the group.
Group-group alignments: all possible pairwise alignments of sequences between the groups are tried; the highest-scoring pair determines how the groups are aligned.
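The "try every cross pair, keep the best" rule can be sketched as follows. This is a simplified illustration under stated assumptions: the scoring scheme (match +1, mismatch -1, linear gap cost -2) and the function names are hypothetical, and group members are passed in without their gap characters.

```python
from itertools import product

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def best_pair(group_a, group_b):
    """Return (score, seq_a, seq_b) for the highest-scoring cross-group pair.

    A single sequence is treated as a group with one member.
    """
    return max((nw_score(a, b), a, b) for a, b in product(group_a, group_b))

# Sequence-group case: one sequence against a two-member group.
score, seq, partner = best_pair(["GCGCT"], ["GCGAT", "TTTTT"])
```

The alignment of the winning pair would then dictate where gaps are placed when the sequence (or group) is merged into the other group.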
Example
[Figure: example with sequences Seq1 through Seq5; surviving labels: Alignment 1, Alignment 3]
Notice that this method does not guarantee the optimum alignment; just a good one.
Gaps are preserved from alignment to alignment: once a gap, always a gap
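A small sketch of what "once a gap, always a gap" means in practice: when merging a group requires a new gap, the gap is inserted as a whole column, so gaps already present in the group are never removed or moved. The helper name and the example rows are hypothetical.

```python
def insert_gap_column(group, position):
    """Insert a gap character at the same position in every aligned row."""
    return [row[:position] + "-" + row[position:] for row in group]

# The gap already present in the first row is kept; a new column is added.
group = ["AC-T", "ACGT"]
widened = insert_gap_column(group, 2)
```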
In-class exercise
Run Gap on all combinations of the sequences in multalign.rsf; penalize end gaps as before; use a gap penalty of 6 and an extension penalty of 2.
Record the quality score of each pairwise comparison.
Start refining the alignment:
Use structural info if you have it.
Find patterns if you don't.
Use the amino acid structure handout from the beginning of class for substitution decisions!
ClustalW
Most widely used multiple alignment method.
Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex, and gives generally superior results.
The ad hoc nature of the program can be mysterious.
Advantageous differences
Gap penalties vary locally:
  By observed frequency (in database) after each residue.
  By simple structure prediction: lower gap penalties in probable loop regions.
  By proximity to existing gaps: higher gap penalties when within 8 residues of an existing gap.
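To make the idea of a position-dependent gap penalty concrete, here is a toy sketch. It is not ClustalW's actual formula: the multipliers `near_gap_factor` and `loop_factor` are invented for illustration, and only the 8-residue window comes from the slide.

```python
def local_gap_open_penalty(base, position, gap_positions, loop_positions,
                           window=8, near_gap_factor=2.0, loop_factor=0.5):
    """Adjust a base gap-opening penalty for one alignment position.

    The factor values are hypothetical; only the window of 8 residues
    is taken from the lecture text.
    """
    penalty = base
    # Lower the penalty in probable loop regions.
    if position in loop_positions:
        penalty *= loop_factor
    # Raise the penalty close to (but not at) an existing gap.
    if any(0 < abs(position - g) <= window for g in gap_positions):
        penalty *= near_gap_factor
    return penalty

near_gap = local_gap_open_penalty(10.0, 5, gap_positions=[3], loop_positions=set())
in_loop = local_gap_open_penalty(10.0, 50, gap_positions=[3], loop_positions={50})
```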
Advantages, cont.
The choice of substitution matrix changes depending on the distance computed for the guide tree (substitution matrix families).
Profile construction (more later).
Sequences in profiles are weighted depending on the evolutionary distance computed for the guide tree: more similar sequences get less weight than less similar sequences.
In-class exercise II
Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time; this is done in Alignment Setup.
After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., "gap20" or "blosum").
Compare the alignment results obtained with the different parameters changed.
MultAlin
MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments.
It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step.
Because this leads to cycles of tree building and alignment, MultAlin can take a long time to run. It stops once the overall alignment score stops improving.
Introduction to profiles
A profile is a way to take into account the variability in characters at each position in a multiple sequence alignment (MSA).
As you look along a column of an MSA, you can see different characters.
A profile is a way to preserve the information about the differences observed within each column.
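One simple way to picture a profile is as a table of per-column character frequencies. The sketch below is only that simplest form; profile programs typically add weighting and substitution-matrix information on top. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

def build_profile(msa):
    """Return one dict per column mapping character -> relative frequency."""
    if not msa or any(len(row) != len(msa[0]) for row in msa):
        raise ValueError("all rows of the alignment must have the same length")
    n = len(msa)
    return [{ch: count / n for ch, count in Counter(column).items()}
            for column in zip(*msa)]

# Toy alignment of four sequences; '-' marks a gap.
msa = ["ACGT",
       "ACGA",
       "AGGT",
       "A-GT"]
profile = build_profile(msa)
```

Here the first column is fully conserved, while the second column records that C, G, and a gap were all observed, which is exactly the information a single consensus sequence would throw away.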
Profile-type programs
Again, the idea is to make a profile, then search for more sequences that fit that profile.
Examples: ProfileSearch, BLOCKS, etc.
This can be combined with statistical methods such as expectation maximization (MEME) and the Gibbs sampler. Both of these start from random starting points and use algorithms that converge on an endpoint.
Now select ProfileSearch; use the Query profiles pulldown to select the profilemake (.prf) output you just created; this profile will be the input for the Profile Search.
The default database to search is SwissProt; we'll leave it at the default, but you can use the Search set option to change or add to the databases searched (this is a local search).
Choose ProfileSegments; this will show local alignments between some number of the returned sequences and the profile.
Choose run; this search will take a long time, perhaps 20 minutes.
PAM matrices
Started with alignments of very closely related proteins; each pair of sequences was at least 85% identical.
Then used the idea of parsimony (least number of changes) to build phylogenetic trees.
[Figure: example phylogenetic tree; surviving sequence labels: gcgctgfki, gcgctlfki, asgctafkl, acactafkl]
For each amino acid, find the frequency $F_{i,j}$ with which it is (reciprocally) substituted by every other amino acid; for example, $F_{g,a} = 3$.
Compute the relative mutability $m_i$ of each amino acid: the relative mutability is the number of times the amino acid is substituted by any other amino acid in the phylogenetic tree, divided by the number of mutations that could have affected the residue (times a scaling factor of 100).
For example, for $m_a$:
number of times a substituted = 4
number of mutations in entire tree x 2 = 12
frequency of a = 10/63 = 0.159
$m_a = \frac{4}{12 \times 0.159 \times 100} = 0.0209$
Compute the mutation probability $M_{ij}$ of each pair of amino acids:
$M_{ij} = \frac{m_j F_{ij}}{\sum_i F_{ij}}$
$\sum_i F_{ij}$ is the total number of substitutions involving a in the phylogenetic tree (4 in this example).
$M_{g,a} = 0.0209 \times \frac{3}{4} \approx 0.0156$ (notice $j$ refers to a)
Divide the mutation probability by the frequency of occurrence $f_i$ of residue $i$, and take the log; $f_i$ in this example is $f_g = 10/63 = 0.1587$:
$R_{g,a} = \log_{10}(0.0156 / 0.1587) = -1.01$
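The worked example can be re-run numerically. This sketch makes two assumptions: the factor missing from the $M_{g,a}$ line is $F_{g,a} / \sum_i F_{i,a} = 3/4$ (from $F_{g,a} = 3$ and the four substitutions of a), and the logarithm is base 10, which is what produces a value of magnitude 1.01. Variable names are illustrative. Because the code does not round intermediate values, the trailing digits differ slightly from the slide.

```python
import math

# Counts taken from the lecture's example tree.
times_a_substituted = 4        # substitutions involving a
mutations_times_two = 12       # mutations in the entire tree, times 2
freq_a = 10 / 63               # frequency of a
freq_g = 10 / 63               # frequency of g
F_ga = 3                       # substitutions between g and a

# Relative mutability of a (scaling factor 100 in the denominator).
m_a = times_a_substituted / (mutations_times_two * freq_a * 100)

# Mutation probability; the 3/4 factor is an assumption (see above).
M_ga = m_a * F_ga / times_a_substituted

# Log-odds entry; base 10 is an assumption consistent with the slide's value.
R_ga = math.log10(M_ga / freq_g)
```

The result is negative, meaning a g/a substitution is observed less often than chance alone would predict in this toy tree.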
Then defined a PAM1 (1 point accepted mutation) matrix to be one where the expected number of mutations overall was 1% (that's why the factor of 100 is used).
The entries in the whole matrix generated by the above method give the PAM1 matrix; this supposes that an amount of evolutionary time goes by that lets this amount of change happen.
To get longer times, multiply the PAM1 mutation probability matrix by itself as many times as the number of time units you are interested in; e.g., PAM250 is PAM1 raised to the 250th power.
So for PAM, higher numbers indicate longer evolutionary time.
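Raising a mutation probability matrix to a power can be illustrated with a toy two-letter alphabet. The matrix below is made up (it is not Dayhoff's PAM1); it only shares the property that each residue has a 1% chance of changing per time unit. Plain Python lists are used to keep the sketch dependency-free.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical 2-letter "PAM1": 99% chance of staying the same per unit.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]
pam250_toy = mat_pow(pam1_toy, 250)
```

After 250 units the diagonal has dropped from 0.99 to roughly one half, showing how repeated multiplication models accumulated change, including repeated and reverse substitutions at the same site.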
BLOSUM matrices
We'll turn to the BLOSUM substitution matrices, because they were made using a special kind of profile called a block: areas of alignments that contain no gaps.
We want to assign a score that gives a measure of the relative likelihood that the sequences are related as opposed to unrelated.
So, we consider models of each case, assign a probability to the alignment in each case, and then take a ratio.
We need to introduce some notation; don't be frightened!
$x_i$ is the $i$th symbol in sequence $x$; $y_j$ is the $j$th symbol in sequence $y$.
The symbols come from an alphabet: 20 letters for proteins, 4 letters for DNA. Letters from the alphabet are written $a$, $b$.
Random model R
Letter $a$ occurs independently with frequency $q_a$, so the probability of the two sequences is the product of the probabilities of each character.
$P(x,y|R)$ reads "the probability of sequences $x$ and $y$ given $R$".
$P(x,y|R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Match model M
Pairs of residues occur with a joint probability $p_{ab}$ (the probability that $a$ and $b$ were each derived independently from a common ancestor $c$).
The probability of the whole alignment is the product of the individual joint probabilities.
$P(x,y|M) = \prod_i p_{x_i y_i}$
Remember, we said we were going to take the ratio of these probabilities (M/R). For an ungapped alignment both sequences have the same length, so every product runs over the aligned positions $i$:
$\frac{P(x,y|M)}{P(x,y|R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
This is known as the odds ratio: the product of the joint probabilities over the individual probabilities.
To make it additive, we make it the log-odds ratio by taking the logarithm; to get the score, we add all the log-odds ratios:
$S = \sum_i s(x_i, y_i)$, where $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$
This is the log likelihood ratio that the pair a,b is aligned as opposed to unaligned.
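The additive log-odds score can be checked on a toy example. The two-letter alphabet and all probabilities below are invented for illustration; the only requirement is that the joint probabilities $p_{ab}$ sum to 1 and have the $q_a$ as their marginals.

```python
import math

def log_odds_score(x, y, p, q):
    """Sum s(a,b) = log(p_ab / (q_a * q_b)) over the aligned positions."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(math.log(p[(a, b)] / (q[a] * q[b])) for a, b in zip(x, y))

# Hypothetical background frequencies and joint (match-model) probabilities.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

identical = log_odds_score("AAB", "AAB", p, q)   # three matching pairs
mixed = log_odds_score("AAB", "ABB", p, q)       # one mismatching pair
```

Pairs that occur together more often than chance predicts (here the identical pairs) contribute positive terms, and pairs that occur together less often contribute negative terms, so related sequences accumulate a higher total.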
BLOSUM matrices
Derived from the BLOCKS database: aligned, ungapped regions from protein families.
Sequences in each block were clustered by percentage of identical residues (> L%).
Then calculated frequencies $A_{ab}$ of observing $a$ in one cluster aligned with $b$ in a different cluster, correcting for the sizes of the clusters by weighting each pairing by $1/(n_1 n_2)$.
Estimated $q_a$ = the fraction of pairings that included an $a$.
Estimated $p_{ab}$ = the fraction of pairings of $a$ with $b$ (from all observed pairings).
Then $s(a,b)$ is the log-odds ratio $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$. (These values are actually scaled and rounded.)
Thus the number after BLOSUM (e.g., BLOSUM62) is the clustering threshold L: sequences more than 62% identical were grouped into one cluster, so the log-odds scores come from pairings between clusters.
For BLOSUM, a smaller number means less similarity, and hence we presume longer evolutionary time.
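The estimation steps can be sketched on a single toy block. This is a simplification under stated assumptions: the clustering step and the $1/(n_1 n_2)$ weighting are omitted (every sequence is treated as its own cluster), scores are left unscaled and unrounded in base 2, and the block itself is invented.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_log_odds(block):
    """Estimate q_a, p_ab and s(a,b) = log2(p_ab / (q_a q_b)) from one block.

    Simplified: no clustering, no cluster-size weighting, no scaling.
    """
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            # Split each observed pairing symmetrically between (a,b) and (b,a).
            pair_counts[(a, b)] += 0.5
            pair_counts[(b, a)] += 0.5
    total = sum(pair_counts.values())
    p = {pair: count / total for pair, count in pair_counts.items()}
    q = Counter()
    for (a, _b), prob in p.items():
        q[a] += prob   # marginal frequency of a among all pairings
    s = {(a, b): math.log2(prob / (q[a] * q[b])) for (a, b), prob in p.items()}
    return dict(q), p, s

# Hypothetical ungapped block of four sequences over a 2-letter alphabet.
q, p, s = blosum_log_odds(["AB", "AB", "AB", "AA"])
```

In this toy block the conserved pairings A/A and B/B receive positive scores and the A/B pairing a negative one, mirroring the pattern of real BLOSUM entries.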