Module 3 CSE3069 (Bioinformatics)
School of Engineering
Introduction to Bioinformatics
CSE 3069
V Semester 2023-24
Module 3
DNA sequence analysis
Dynamic Programming
Components of Alignment
1] Matches
2] Mismatches
3] Gaps
String 1: WEAREHUMANS
String 2: WEARENOTHUMANZ
Continued…
Scoring Scheme:
A1: +1+0-1+1-1+1 = 1
A2: +1+1+1+0-1+1 = 3
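The two totals above are consistent with a scheme of +1 per match, 0 per mismatch and -1 per gap. That mapping is inferred from the terms and is not stated in the deck. Under that assumption, a minimal sketch of column-by-column scoring could look like this. The function name `score_alignment` and the two pairs of aligned rows are hypothetical, chosen only so that the column terms reproduce A1 and A2.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1, gap_char="_"):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap_char or y == gap_char:
            total += gap          # a residue paired with a null position
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# Hypothetical rows that reproduce the column terms of A1 and A2.
a1 = score_alignment("AC_GTA", "AGTG_A")   # +1 +0 -1 +1 -1 +1
a2 = score_alignment("ACGT_A", "ACGATA")   # +1 +1 +1 +0 -1 +1
print(a1, a2)
```

The alignment with the higher total (A2 here) is the preferred one under this scheme.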
Global vs. Local Alignment
• Global Alignment:
• Local Alignment:
Needleman and Wunsch Algorithm
Steps:
• Traceback
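The Needleman and Wunsch procedure is usually described as three steps: initialisation of the matrix, matrix fill, and traceback. The sketch below follows that outline for global alignment. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions, since the scoring scheme used in the deck's example is not recoverable, so the alignment it returns need not be the one shown on the slides.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1, gap_char="_"):
    """Global alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Step 1: initialisation. First row and column hold cumulative gap costs.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: matrix fill. Each cell is the best of diagonal, up and left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Step 3: traceback from the bottom-right cell to the top-left cell.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append(gap_char)
            i -= 1
        else:
            a.append(gap_char)
            b.append(t[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, row1, row2 = needleman_wunsch("TGGTG", "ATCGT")
print(score)
print(row1)
print(row2)
```

When several alignments share the optimal score, the traceback above prefers the diagonal move, so only one of the optimal alignments is reported.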
Example 1
Continued…
Step 3: Traceback
Continued…
_TGGTG
ATCGT_
Smith-Waterman algorithm
Steps:
• Traceback
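The Smith-Waterman procedure differs from global alignment in two ways: no cell is allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below illustrates this for local alignment. The scoring values (match +1, mismatch -1, linear gap -2) and the example sequences are assumptions made for illustration.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Local alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Initialisation: first row and column stay at zero.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Matrix fill: as in global alignment, but floored at zero.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the maximum cell, stop when a zero is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append(gap_char)
            i -= 1
        else:
            a.append(gap_char)
            b.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("GGGACGTCCC", "TTACGTAA"))
```

Only the best-scoring local region is returned. The flanking residues that do not align well are simply left out, which is the practical difference from a global alignment.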
Example
Global Alignments
Local Alignments
Examples
Example 1
Example 2
Sequence Alignment
Types
Pairwise sequence alignment
Continued…
• Pairwise sequence alignment compares only
two sequences at a time.
• The optimal alignment is the one with the best SCORE.
• A pairwise alignment consists of a series of
paired bases, one base from each sequence.
There are three types of pairs:
(1) Matches = the same nucleotide appears in
both sequences.
(2) Mismatches = different nucleotides are found in
the two sequences.
(3) Gaps = a base in one sequence and a null base
in the other.
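The three pair types can be counted directly from an aligned pair of rows. The sketch below does this for the alignment `_TGGTG` / `ATCGT_` from the traceback slide. The function name `classify_pairs` is hypothetical, and `_` is assumed to be the gap symbol, as in that slide.

```python
def classify_pairs(row_a, row_b, gap_char="_"):
    """Count matches, mismatches and gaps in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for x, y in zip(row_a, row_b):
        if x == gap_char or y == gap_char:
            counts["gaps"] += 1         # base paired with a null base
        elif x == y:
            counts["matches"] += 1      # same nucleotide in both rows
        else:
            counts["mismatches"] += 1   # different nucleotides
    return counts


print(classify_pairs("_TGGTG", "ATCGT_"))
# {'matches': 3, 'mismatches': 1, 'gaps': 2}
```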
Multiple sequence alignment
FASTA
Continued…
Step 1:
Find all k-length identities (k-words shared by the
two sequences), then find locally similar regions by
selecting those dense with k-word identities (i.e.
many k-words, without too many gaps between them).
The best ten initial regions are used.
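Step 1 can be illustrated by indexing the k-words of one sequence, looking up each k-word of the other, and grouping the hits by diagonal (query position minus target position). The sketch below is a simplification: it ranks diagonals by raw hit count, whereas FASTA itself also accounts for the spacing between hits. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kword_diagonal_hits(query, target, k=2):
    """Count shared k-words on each diagonal (query index - target index)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)      # k-word -> positions in query

    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1            # hits on one diagonal line up
    return dict(diagonals)


def best_diagonals(query, target, k=2, top=10):
    """Return up to `top` diagonals with the most k-word hits."""
    hits = kword_diagonal_hits(query, target, k)
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)[:top]


print(best_diagonals("GGACGT", "ACGTGG"))
# [(2, 3), (-4, 1)]
```

A diagonal with many hits corresponds to a gap-free stretch where the two sequences share many k-words, which is what an "initial region" is built from.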
Step 2:
The initial regions are re-scored along their
lengths by applying a substitution matrix in the
usual way. Optimally scoring subregions are
identified.
Continued…
Step 3:
Create an alignment of the trimmed initial regions
using dynamic programming. Regions whose score is
too low are not included.
Step 4:
Optimize the alignment from Step 3 using the
Smith-Waterman algorithm.
Basic Local Alignment Search Tool
Continued…
Step 1:
• The sequence is optionally filtered to remove
low-complexity regions that are not useful for
producing meaningful sequence alignments.
Step 2:
• A list of words of length 3 in the query protein
sequence is made, starting with positions 1, 2,
and 3; then 2, 3, and 4; and so on, until the last 3
available positions in the sequence are reached.
(The word length is 11 for DNA sequences and 3 for
programs that translate DNA sequences.)
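The word list in Step 2 is a sliding window over the query. A minimal sketch, assuming a hypothetical helper `query_words` and a made-up protein fragment:

```python
def query_words(seq, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


print(query_words("PQGEFG"))
# ['PQG', 'QGE', 'GEF', 'EFG']
```

A sequence of length L yields L - w + 1 words, so passing `w=11` gives the DNA word list in the same way.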
Continued…
Step 3:
• Using the BLOSUM62 substitution scores, the
query sequence words from Step 2 are evaluated for
an exact match with a word in any database
sequence. The words are also scored against
every other combination of three amino acids,
the objective being to find the scores for
aligning the query word with any other
three-letter word found in a database sequence.
Continued…
Step 4:
• A cutoff score called the neighborhood word score
threshold (T) is selected to reduce the number
of possible matches to the most significant
ones. For example, if this cutoff score T is 13,
only the words that score above 13 are kept.
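Steps 3 and 4 amount to scoring the query word against every possible word of the same length and keeping those above T. The sketch below shows the mechanism with a deliberately simple toy scoring function (+5 for identical residues, -1 otherwise) and a four-letter alphabet. These are assumptions for illustration and are not BLOSUM62 values. With the full 20-letter amino acid alphabet there would be 20^3 = 8000 candidate words per position.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (NOT BLOSUM62): +5 if identical, else -1."""
    return 5 if a == b else -1


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Return {candidate: score} for candidates scoring above threshold."""
    kept = {}
    for letters in product(alphabet, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, letters))
        if total > threshold:      # "above T", as worded in Step 4
            kept["".join(letters)] = total
    return kept


words = neighborhood("PQG", alphabet="PQGE", threshold=8)
print(len(words), words["PQG"], words["PEG"])
# 10 15 9
```

Raising the threshold shrinks the list and speeds up the search, at the cost of missing weaker seeds.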
Continued…
Step 5:
• The above procedure is repeated for each three-letter
word in the query sequence. If about 50 words are
retained per position, then for a sequence of
250 amino acids the total number of words to
search for is approximately 50 × 250 = 12,500.
Step 6:
• The remaining high-scoring words that comprise
possible matches to each three-letter position
in the query sequence are organized into an
efficient search tree for comparing them rapidly
to the database sequences.
Continued…
Step 7:
• Each database sequence is scanned for an exact
match to one of the 50 words corresponding to
the first query sequence position, then to the words
for the second position, and so on. If a match is
found, this match is used to seed a possible
ungapped alignment between the query and
database sequences.
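The organising and scanning steps can be sketched with a plain dictionary standing in for the search tree (a substitution made here for brevity, not the structure BLAST itself uses). Each retained word maps to the query positions it belongs to, and each word of the database sequence is looked up once. All names and the toy word lists below are hypothetical.

```python
def build_lookup(words_by_position):
    """Map each retained word to the list of query positions it came from."""
    lookup = {}
    for q_pos, words in words_by_position.items():
        for word in words:
            lookup.setdefault(word, []).append(q_pos)
    return lookup


def find_seeds(lookup, db_seq, w=3):
    """Return (query_position, database_position) for every exact word hit."""
    seeds = []
    for j in range(len(db_seq) - w + 1):
        for q_pos in lookup.get(db_seq[j:j + w], ()):
            seeds.append((q_pos, j))
    return seeds


# Toy word lists for the first two query positions (0-based).
lookup = build_lookup({0: ["PQG", "PEG"], 1: ["QGE"]})
print(find_seeds(lookup, "AAPEGEQGE"))
# [(0, 2), (1, 6)]
```

Each seed marks a diagonal on which an ungapped alignment may be attempted.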
Continued…
Step 8:
• In the original BLAST method, an attempt was
made to extend an alignment from the matching
words in each direction along the sequences,
continuing for as long as the score continued to
increase. The extension process in each direction
was stopped when the accumulated score
stopped increasing and had just begun to fall a
small amount below the best score found for
shorter extensions. At this point, a larger stretch
of sequence (called the HSP or high-scoring
segment pair), which has a larger score than the
original word, may have been found.
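The extension rule in Step 8 can be sketched as a running score that is tracked against the best score seen so far, stopping once the score has fallen a set amount below that best. The sketch extends to the right only; extension to the left is symmetric. The scoring values (+5 match, -4 mismatch), the drop-off of 10, and the sequences are assumptions for illustration.

```python
def extend_right(query, subject, q_start, s_start, w=3,
                 match=5, mismatch=-4, drop=10):
    """Ungapped extension of a seed; returns (best_score, best_length)."""
    def column(i, j):
        return match if query[i] == subject[j] else mismatch

    # Score of the seed word itself.
    score = sum(column(q_start + k, s_start + k) for k in range(w))
    best, best_len = score, w

    i, j = q_start + w, s_start + w
    while i < len(query) and j < len(subject):
        score += column(i, j)
        if score > best:
            best, best_len = score, i - q_start + 1   # new best extension
        elif best - score >= drop:
            break                                     # fallen too far below best
        i += 1
        j += 1
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0))
# (35, 7)
```

The returned length covers the segment up to the best score, so the trailing mismatches that triggered the stop are not part of the reported HSP.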
Continued…
Step 9:
• The next step is to determine whether each HSP
score found by one of the above methods is
greater in value than a cutoff score S. A suitable
value for S is determined empirically by
examining the range of scores found by
comparing random sequences, and by choosing
a value that is significantly greater. The
high-scoring segment pairs (HSPs) matched in the entire
database are identified and listed.
Continued…
Step 10:
• BLAST next determines the statistical
significance of each HSP score. A probability
that two random sequences, one the length of
the query sequence and the other the entire
length of the database (which is approximately
equal to the sum of the lengths of all of the
database sequences), could achieve the HSP
score is calculated.
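The significance calculation is commonly expressed through the Karlin-Altschul expectation E = K·m·n·e^(-λS), where m is the query length, n the database length and S the HSP score, with the probability of at least one such hit given by P = 1 - e^(-E). The sketch below evaluates these two formulas. The default values of `lam` and `K` are placeholders for illustration only, since the real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(score, query_len, db_len, lam=0.318, K=0.134):
    """Expected number of chance HSPs with at least this score.

    lam and K are illustrative placeholders, not values from the deck.
    """
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance HSP, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


e = expect_value(score=60, query_len=250, db_len=50_000_000)
print(e, p_value(e))
```

A higher HSP score or a smaller database gives a smaller E, which is why the same score can be significant in one search and not in another.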
Continued…
Step 11:
• Sometimes, two or more HSP regions that can
be made into a longer alignment will be found,
thereby providing additional evidence that the
query and database sequences are related. In
such cases, a combined assessment of the
significance will be made.
Continued…
Step 12:
• Smith-Waterman local alignments are shown for
the query sequence with each of the matched
sequences in the database. Earlier versions of
BLAST produced only ungapped alignments that
included the initially found HSP. If two HSPs
were found, two separate alignments were
produced because the two regions could not be
aligned without gaps.
Progressive Alignment Method
Clustal W
End of Module 3