
Module 3 CSE3069 (Bioinformatics)

The document discusses various techniques for DNA sequence analysis including dynamic programming, global and local sequence alignment, and algorithms like Needleman-Wunsch, Smith-Waterman, BLAST and FASTA. It provides examples and steps to perform sequence alignment and explains uses of pairwise and multiple sequence alignment.

Uploaded by

Asmi Tanzaen H N
Copyright
© All Rights Reserved

PRESIDENCY UNIVERSITY, BENGALURU

School of Engineering

Department of Computer Science & Engineering

Introduction to Bioinformatics
CSE 3069

V Semester 2023-24
Module 3
DNA sequence analysis
Dynamic Programming

• A method for solving a complex problem by breaking it down into a collection of simpler sub-problems.

• Each sub-problem is solved just once and its solution is stored, ideally in a memory-based data structure.

• The next time the same sub-problem occurs, the previously computed solution is simply looked up instead of being recomputed, saving computation time at the expense of a modest expenditure in storage space.
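As a minimal sketch of the store-and-look-up idea, the Python function below computes the length of the longest common subsequence of two strings. The choice of problem and the names `lcs_length` and `solve` are illustrative assumptions, not part of the module; `functools.lru_cache` plays the role of the memory-based data structure.

```python
from functools import lru_cache


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""

    @lru_cache(maxsize=None)  # stores the answer to every (i, j) sub-problem
    def solve(i: int, j: int) -> int:
        # Sub-problem: LCS length of the suffixes a[i:] and b[j:].
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + solve(i + 1, j + 1)
        # Otherwise skip one character from either string and keep the better result.
        return max(solve(i + 1, j), solve(i, j + 1))

    return solve(0, 0)


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 (the subsequence GTAB)
```

Each (i, j) pair is solved at most once, so the work grows with len(a) × len(b) rather than exponentially.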
Continued…

• Breaks a larger problem down into smaller sub-problems or tasks.

• Solves each sub-problem in order to solve the bigger problem.

• In sequence analysis, it is a computational method for finding the optimal alignment between two sequences.

• The method compares every character in the two sequences and generates an alignment.
Components of Alignment

1] Matches
2] Mismatches
3] Gaps

String 1: WEAREHUMANS
String 2: WEARENOTHUMANZ

WEARE HUMANS
WEARE NOTHUMANZ

WEARE___HUMANS
WEARENOTHUMANZ
Continued…

Scoring Scheme:

+1 for every match
-1 for every mismatch
0 for gaps

A1: +1 + 0 - 1 + 1 - 1 + 1 = 1

A2: +1 + 1 + 1 + 0 - 1 + 1 = 3
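The scoring scheme can be applied mechanically to any gapped alignment. The sketch below assumes the two aligned rows have equal length and that a gap is written as "_"; the function name `score_alignment` and its parameters are illustrative, not taken from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=0, gap_char="_"):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap_char or y == gap_char:
            total += gap        # a residue paired with a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# 10 matches, 3 gap columns and 1 mismatch (S/Z) under +1 / -1 / 0:
print(score_alignment("WEARE___HUMANS", "WEARENOTHUMANZ"))  # 9
```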
Global vs. Local Alignment

• Global Alignment:
Aligns both sequences end to end.
Example: Needleman-Wunsch algorithm

• Local Alignment:
Aligns stretches of sequence with the highest density of matches.
Example: Smith-Waterman algorithm
Needleman and Wunsch Algorithm

Steps:

• Initialize an (N+1) x (M+1) matrix, where M and N are the lengths of the sequences to be aligned.

• Fill the matrix from the upper left corner to the lower right corner in a recursive fashion (using the scoring scheme).

• Traceback from the lower right corner to the upper left corner to recover the alignment.
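The three steps can be sketched in Python as follows. The scoring values (+1 match, -1 mismatch, -1 gap), the "_" gap character and the name `needleman_wunsch` are assumptions made for illustration; the scoring scheme used in the module's own worked example is not recoverable from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # Step 1: initialise the (n+1) x (m+1) matrix; the first row and column
    # hold the cost of aligning a prefix against gaps only.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: fill from the upper left to the lower right corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Step 3: trace back from the lower right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("TGGTG", "ATCGT"))
```

When several alignments share the optimal score, this sketch returns only one of them; which one depends on the order in which the three moves are tested during traceback.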
Example 1

Step 2: Matrix Filling

Step 3: Traceback

_TGGTG

ATCGT_
Smith-Waterman algorithm

Steps:

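• Initialize an (N+1) x (M+1) matrix, where M and N are the lengths of the sequences to be aligned.

• Fill the matrix from the upper left corner to the lower right corner in a recursive fashion (using the scoring scheme), replacing any negative score with zero.

• Traceback from the highest-scoring cell until a cell with score zero is reached.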
Example

To find the best local alignment of the sequences “ACCTAAGG” and “GGCTCAATCA”, using +2 for a match, -1 for a mismatch, and -2 for a gap:

We first make matrix T (as in N-W).
The 0th row and 0th column of T are filled with zeroes.
The recurrence relation is then used to fill the matrix T.
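A minimal Python sketch of this local alignment, using the scoring values stated in the example (+2 match, -1 mismatch, -2 gap). The function name `smith_waterman`, the "_" gap character and the tie-breaking order in the traceback are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Same recurrence as the global case, but never below zero.
            T[i][j] = max(0,
                          T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("ACCTAAGG", "GGCTCAATCA"))  # (6, 'CT_AA', 'CTCAA')
```

The result aligns CT_AA with CTCAA: four matches (+8) and one gap (-2) give the score 6.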
Global Alignments

• Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.)

• A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming.
Continued…

• Attempts to align the entire sequence using as many characters as possible, up to both ends of each sequence.

• Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.

• The Needleman-Wunsch algorithm is used to produce global alignments between pairs of DNA or protein sequences.
Local Alignments

• Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

• The Smith-Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

• Stretches of sequence with the highest density of matches are aligned.
Continued…

• Generates one or more islands of matches or sub-alignments in the aligned sequences.

• Suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

• The Smith-Waterman algorithm is used to produce local alignments between pairs of DNA or protein sequences.
Examples

Example 1

Example 2

Sequence Alignment

• Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

• It involves identifying the correct location of deletions and insertions that have occurred in either of the two lineages since the divergence from a common ancestor.
Types

Based on the number of sequences being compared, sequence alignment is of two types:
• Pairwise Alignment
• Multiple Sequence Alignment
Pairwise sequence alignment

• Used to find the best-matching piecewise (local or global) alignments of two query sequences.

• Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query).

• The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.
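Of the three methods, the dot-matrix method is the simplest to illustrate: every cell where the two sequences carry the same character is marked, and diagonal runs of marks indicate similar regions. The sketch below is a bare-bones version without the window and stringency filtering usually applied in practice; the names `dot_matrix` and `render` are illustrative.

```python
def dot_matrix(a, b):
    """Boolean grid: cell [i][j] is True where a[i] equals b[j]."""
    return [[x == y for y in b] for x in a]


def render(a, b):
    """Plain-text picture of the dot matrix, with b across the top."""
    rows = ["  " + " ".join(b)]
    for x, row in zip(a, dot_matrix(a, b)):
        rows.append(x + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


print(render("TGGTG", "ATCGT"))
```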
Continued…
• Pairwise sequence alignment only compares two sequences at a time.
• Optimality is based on SCORE.
• A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) Matches = the same nucleotide appears in both sequences.
(2) Mismatches = different nucleotides are found in the two sequences.
(3) Gaps = a base in one sequence and a null base in the other.
Continued…

• The algorithms used are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.

• BLAST (Basic Local Alignment Search Tool)
Multiple sequence alignment

• Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time.

• Multiple alignment methods try to align all of the sequences in a given query set.

• Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.
Continued…

• Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes.

• Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.

• Multiple sequence alignments are computationally difficult to produce, and most formulations of the problem lead to NP-complete combinatorial optimization problems.
Continued…

• Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

• Examples include ClustalW, PROBCONS (probabilistic consistency-based multiple alignment of amino acid sequences), and MUSCLE (MUltiple Sequence Comparison by Log-Expectation).
FASTA

• FASTA is a pairwise sequence alignment tool that compares nucleotide or protein sequences against an existing database. The name also refers to a text-based sequence format that can be read and written with a text editor or word processor.

• It performs a similarity search of protein or nucleotide query sequences against the database and can be used to find functional and evolutionary relationships between the sequences.
Continued…

• FASTA produces local alignment scores to compare the query sequence with every sequence in the database.

• Sequences in FASTA format are generally obtained by different methods, including DNA sequencing methods (the Sanger method and the Maxam-Gilbert method) and protein sequencing methods (the Edman degradation reaction and mass spectrometry).
Continued…

Step 1:
Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between). The best ten initial regions are used.

Step 2:
The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.
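The k-word lookup in Step 1 can be sketched as below. Identical k-words whose positions differ by the same offset lie on the same diagonal of a dot matrix, so a diagonal with many hits marks a locally similar region. This is a simplified illustration: it only counts hits per diagonal and does not score the gaps between hits or select the best ten regions. The name `kword_hits_by_diagonal` and the default k = 2 are assumptions.

```python
from collections import defaultdict


def kword_hits_by_diagonal(query, subject, k=2):
    """Count identical k-words per diagonal (subject position minus query position)."""
    # Index every k-word of the query by its start positions.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Look up each k-word of the subject and tally the diagonal of every hit.
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return dict(diagonals)


print(kword_hits_by_diagonal("ACGT", "GGACGT"))  # {2: 3}
```

Here all three query words (AC, CG, GT) are found two positions further along in the subject, so they fall on the single diagonal 2.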
Continued…

Step 3:
Create an alignment of the trimmed initial regions using dynamic programming. Regions whose score is too low are not included.

Step 4:
Optimize the alignment from Step 3 using the Smith-Waterman algorithm.
Basic Local Alignment Search Tool

• The BLAST algorithm was developed as a new way to perform a sequence similarity search, using an algorithm that is faster than FASTA while being as sensitive.
• The BLAST system can be accessed through the Internet (http://www.ncbi.nlm.nih.gov/) as a Web site and through a BLAST E-mail server.
• The WU-BLAST programs perform similarity searches using the same methods as NCBI-BLAST and produce gapped local alignments.
• The statistical methods used to evaluate sequence similarity scores are different, and thus WU-BLAST and NCBI-BLAST can produce different results.
Continued…

• The BLAST algorithm increases the speed of sequence alignment by searching first for common words or k-tuples in the query sequence and each database sequence.

• BLAST confines the search to the words that are the most significant, whereas FASTA searches for all possible words of the same length.

• For the BLAST algorithm, the word length is fixed at 3 (formerly 4) for proteins and 11 for nucleic acids (3 if the sequences are translated in all six reading frames).
Steps used by the BLAST algorithm

Step 1:
• The sequence is optionally filtered to remove low-complexity regions that are not useful for producing meaningful sequence alignments.

Step 2:
• A list of words of length 3 in the query protein sequence is made starting with positions 1, 2, and 3; then 2, 3, and 4, etc.; until the last 3 available positions in the sequence are reached (word length 11 for DNA sequences, 3 for programs that translate DNA sequences).
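The word list of Step 2 is a sliding window over the query. A minimal sketch, with the function name `query_words` chosen for illustration and the query taken from the worked figure later in this module:

```python
def query_words(seq, w=3):
    """All overlapping words of length w, starting at positions 1, 2, 3, ..."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


print(query_words("LPPQGLL"))  # ['LPP', 'PPQ', 'PQG', 'QGL', 'GLL']
```

A query of length L yields L - w + 1 words; pass `w=11` for DNA queries.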
Continued…

Step 3:
• Using the BLOSUM62 substitution scores, the query sequence words from Step 2 are evaluated for an exact match with a word in any database sequence. The words are also evaluated for matches with any other combination of three amino acids, the objective being to find the scores for aligning the query word with any other three-letter word found in a database sequence.

• There are a total of 20 × 20 × 20 = 8000 possible match scores for this one sequence position.

Continued…

Step 4:
• A cutoff score called the neighborhood word score threshold (T) is selected to reduce the number of possible matches to the most significant ones. For example, if this cutoff score T is 13, only the words that score above 13 are kept.

• The list of possible matching words is thereby shortened from all 8000 possible words to approximately the 50 highest-scoring ones.
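Steps 3 and 4 can be sketched as an enumeration of all 8000 three-letter words followed by a threshold filter. To keep the example self-contained, it uses a placeholder scoring function (`toy_score`: +5 for identical residues, -1 otherwise) instead of the real BLOSUM62 matrix, so the number of surviving words differs from the roughly 50 quoted above. All names here are illustrative assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(x, y):
    """Placeholder substitution score, NOT BLOSUM62: +5 if identical, else -1."""
    return 5 if x == y else -1


def neighborhood(word, threshold, score=toy_score):
    """Map every candidate word scoring above the threshold to its score."""
    hits = {}
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        s = sum(score(q, c) for q, c in zip(word, cand))
        if s > threshold:
            hits["".join(cand)] = s
    return hits


print(len(neighborhood("PQG", threshold=13)))  # 1: only PQG itself scores 15
print(len(neighborhood("PQG", threshold=8)))   # 58: PQG plus 57 one-mismatch words
```

Passing a real substitution matrix lookup as the `score` argument would reproduce the behaviour described in the steps above.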
Continued…

Step 5:
• The above procedure is repeated for each three-letter word in the query sequence. For a sequence of length 250 amino acids, the total number of words to search for is approximately 50 × 250 = 12,500.

Step 6:
• The remaining high-scoring words that comprise possible matches to each three-letter position in the query sequence are organized into an efficient search tree for comparing them rapidly to the database sequences.
Continued…

Step 7:
• Each database sequence is scanned for an exact match to one of the 50 words corresponding to the first query sequence position, for the words to the second position, and so on. If a match is found, this match is used to seed a possible ungapped alignment between the query and database sequences.
Continued…

Step 8:
• In the original BLAST method, an attempt was made to extend an alignment from the matching words in each direction along the sequences, continuing for as long as the score continued to increase. The extension process in each direction was stopped when the accumulated score stopped increasing and had just begun to fall a small amount below the best score found for shorter extensions. At this point, a larger stretch of sequence (called the HSP or high-scoring segment pair), which has a larger score than the original word, may have been found.
Continued…

L P   P Q G   L L     QUERY SEQUENCE
M P   P E G   L L     DATABASE SEQUENCE
      <WORD>          THREE-LETTER WORD FOUND INITIALLY
      7 2 6           BLOSUM62 scores, word score = 15
<---          --->    EXTENSION TO LEFT, EXTENSION TO RIGHT
2 7   7 2 6   4 4
<----- HSP ------>    HSP SCORE = 9 + 15 + 8 = 32
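The ungapped extension of Step 8 can be sketched in Python and checked against this figure. The pair scores in `FIGURE_SCORES` are the BLOSUM62 values shown in the figure; the fallback score of -4 for any other pair, the drop-off limit `x_drop` and all function names are assumptions made for illustration.

```python
# Pair scores read from the figure (BLOSUM62 values for these residue pairs).
FIGURE_SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
                 ("G", "G"): 6, ("L", "L"): 4}


def pair_score(x, y, default=-4):
    """Look up a pair in either order; 'default' is a placeholder for pairs not in the figure."""
    return FIGURE_SCORES.get((x, y), FIGURE_SCORES.get((y, x), default))


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Walk from (qi, si) in direction step; return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += pair_score(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, w=3, x_drop=2):
    """Extend a word hit in both directions; return (HSP score, query start, query end)."""
    seed = sum(pair_score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left + right, q_start - l_len, q_start + w + r_len


# Word PQG/PEG starts at index 2 in both sequences.
print(extend_seed("LPPQGLL", "MPPEGLL", 2, 2))  # (32, 0, 7)
```

The seed contributes 15, the left extension 9 and the right extension 8, matching the HSP score of 32 in the figure.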
Continued…

Step 9:
• The next step is to determine whether each HSP score found by one of the above methods is greater in value than a cutoff score S. A suitable value for S is determined empirically by examining the range of scores found by comparing random sequences, and by choosing a value that is significantly greater. The high-scoring pairs (HSPs) matched in the entire database are identified and listed.
Continued…

Step 10:
• BLAST next determines the statistical
significance of each HSP score. A probability
that two random sequences, one the length of
the query sequence and the other the entire
length of the database (which is approximately
equal to the sum of the lengths of all of the
database sequences), could achieve the HSP
score is calculated.

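One standard way to express this significance is the Karlin-Altschul expectation value, E = K·m·n·e^(-λS), where m is the query length, n the total database length and S the HSP score; the probability of at least one such chance hit is then 1 - e^(-E). The slides do not state this formula, and the K and λ values in the usage line below are arbitrary placeholders rather than parameters of any real scoring system.

```python
import math


def expected_hits(score, query_len, db_len, K, lam):
    """Expected number of chance HSPs scoring at least 'score' (E = K*m*n*exp(-lam*S))."""
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of finding at least one such HSP by chance: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Placeholder parameters, for illustration only.
e = expected_hits(score=32, query_len=250, db_len=1_000_000, K=0.1, lam=0.3)
print(e, p_value(e))
```

A higher HSP score gives an exponentially smaller E, while a longer query or larger database increases it proportionally.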
Continued…

Step 11:
• Sometimes, two or more HSP regions that can
be made into a longer alignment will be found,
thereby providing additional evidence that the
query and database sequences are related. In
such cases, a combined assessment of the
significance will be made.

Continued…

Step 12:
• Smith-Waterman local alignments are shown for
the query sequence with each of the matched
sequences in the database. Earlier versions of
BLAST produced only ungapped alignments that
included the initially found HSP. If two HSPs
were found, two separate alignments were
produced because the two regions could not be
aligned without gaps.

Progressive Alignment Method

• It is the most widely used approach for multiple alignments and is also known as the hierarchical or tree method.

• Developed by Paulien Hogeweg and Ben Hesper in 1984.

• Progressive alignment builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
Continued…

• All progressive alignment methods require two stages:

• A first stage in which the relationships between the sequences are represented as a tree, called a guide tree.

• A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
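The "most similar pair first" ordering behind the guide tree can be illustrated with a small sketch. It ranks sequence pairs by a crude position-by-position identity, which is only a stand-in for the pairwise alignment scores and tree-building methods that real progressive aligners use; the names `identity` and `joining_order` and the example sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions over the shorter length (crude similarity measure)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def joining_order(seqs):
    """Pairs of sequence names, sorted from most similar to least similar."""
    pairs = combinations(sorted(seqs), 2)
    return sorted(pairs, key=lambda p: identity(seqs[p[0]], seqs[p[1]]), reverse=True)


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCA"}
print(joining_order(seqs))  # [('s1', 's2'), ('s2', 's3'), ('s1', 's3')]
```

Here s1 and s2 (7 of 8 positions identical) would be aligned first, and s3 would be added to that pair afterwards.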
Clustal W

• The Clustal series of programs is widely used in molecular biology.

• They are used for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees.

• Clustal W aligns a pair of sequences, then aligns the next sequence onto the first pair.

• The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments.
Continued…

• Uses alignment scores to produce a phylogenetic tree.

• Aligns the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.

End of Module 3
