Module 3 CSE3069 (Bioinformatics)

The document discusses various techniques for DNA sequence analysis including dynamic programming, global and local sequence alignment, and algorithms like Needleman-Wunsch, Smith-Waterman, BLAST and FASTA. It provides examples and steps to perform sequence alignment and explains uses of pairwise and multiple sequence alignment.

Uploaded by

Asmi Tanzaen H N

PRESIDENCY UNIVERSITY, BENGALURU

School of Engineering

Department of Computer Science & Engineering

Introduction to Bioinformatics
CSE 3069

V Semester 2023-24
Module 3
DNA sequence analysis
Dynamic Programming

• Method for solving a complex problem by breaking it down into a collection of simpler sub-problems.

• Solving each of these sub-problems just once and storing their solutions, ideally using a memory-based data structure.

• The next time the same sub-problem occurs, instead of re-computing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a modest expenditure in storage space.

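The look-up idea can be sketched in Python. This is a minimal illustration and not part of the original slides: the function name `lcs_length` and the example strings are assumptions, and the longest-common-subsequence problem is chosen only because it is a small sequence-comparison task with overlapping sub-problems.

```python
from functools import lru_cache


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""

    @lru_cache(maxsize=None)  # stores each (i, j) result after the first call
    def solve(i: int, j: int) -> int:
        # Sub-problem: LCS length of the suffixes a[i:] and b[j:].
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + solve(i + 1, j + 1)
        # Both branches share many sub-problems; the cache avoids re-solving them.
        return max(solve(i + 1, j), solve(i, j + 1))

    return solve(0, 0)


print(lcs_length("AGCAT", "GAC"))  # 2
```

Without the cache the same `(i, j)` pair would be recomputed many times; with it, each pair is solved once and then looked up.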
Continued…

• Breaking down a larger problem into smaller sub-problems or tasks.

• Solving each sub-problem in order to solve the bigger problem.

• In sequence analysis, dynamic programming is a computational method for finding the optimal alignment between two sequences.

• This method compares every character in the two sequences and generates an alignment.

Components of Alignment

1] Matches
2] Mismatches
3] Gaps

String 1: WEAREHUMANS
String 2: WEARENOTHUMANZ

Without gaps:

WEAREHUMANS
WEARENOTHUMANZ

With gaps:

WEARE___HUMANS
WEARENOTHUMANZ

Continued…

Scoring Scheme:

+1 for every match

-1 for every mismatch
0 for gaps

A1: +1+0-1+1-1+1 = 1

A2: +1+1+1+0-1+1 = 3
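The scoring scheme can be applied column by column to any pair of equal-length aligned strings. The sketch below is an illustration under assumptions: the function name `score_alignment`, the use of `_` as the gap symbol, and the choice to pad the shorter string with trailing gaps are not from the slides. It scores the two WEAREHUMANS arrangements from the "Components of Alignment" slide; the A1 and A2 sums are left exactly as given.

```python
def score_alignment(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = 0,
                    gap_char: str = "_") -> int:
    """Sum the per-column scores of two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            total += gap          # a residue paired with a gap
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# No internal gap (shorter string padded at the end, an assumption):
print(score_alignment("WEAREHUMANS___", "WEARENOTHUMANZ"))  # -1
# Gap inserted opposite "NOT":
print(score_alignment("WEARE___HUMANS", "WEARENOTHUMANZ"))  # 9
```

The gapped arrangement scores higher because ten letters line up as matches and only the final S/Z pair is a mismatch.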
Global vs. Local Alignment

• Global Alignment:

Aligns both sequences end to end.

Example: Needleman-Wunsch algorithm

• Local Alignment:

Aligns stretches of sequence with the highest density of matches.

Example: Smith-Waterman algorithm

Needleman and Wunsch Algorithm

Steps:

• Initialize an (N+1) × (M+1) matrix, where N and M are the lengths of the two sequences to be aligned.

• Fill the matrix from the upper-left corner to the lower-right corner using the recurrence relation (based on the scoring scheme).

• Trace back from the lower-right corner to recover the alignment.

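The three steps can be sketched in Python as follows. This is a minimal sketch, not the code behind the slide figures: the function name, the default scores (+1 match, -1 mismatch, -1 per gap), the `-` gap symbol, and the tie-breaking order in the traceback are all assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # Step 1: initialise the (n+1) x (m+1) matrix with cumulative gap scores.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: fill from the upper-left to the lower-right corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Step 3: trace back from the lower-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

When several alignments share the best score, a different tie-breaking order in the traceback would return a different, equally optimal alignment.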
Example 1

[Worked-example figures (matrix initialization, filling, and traceback) are not reproduced here.]

Step 2: Matrix Filling

Step 3: Traceback

Resulting alignment:

_TGGTG
ATCGT_

Smith-Waterman algorithm

Steps:

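• Initialize an (N+1) × (M+1) matrix, where N and M are the lengths of the two sequences to be aligned, with the first row and first column set to zero.

• Fill the matrix from the upper-left corner to the lower-right corner using the recurrence relation (based on the scoring scheme), replacing any negative score with zero.

• Trace back from the highest-scoring cell until a cell with score zero is reached.
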
Example

To find the best local alignment of sequences “ACCTAAGG” and “GGCTCAATCA”, using +2 for a match, -1 for a mismatch, and -2 for a gap:

We first make matrix T (as in N-W):

The 0th row and 0th column of T are filled with zeroes.
The recurrence relation is then used to fill the matrix T.

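The example can be reproduced with a short Python sketch. The function name, the `-` gap symbol, and the tie-breaking order in the traceback are assumptions of this sketch; the scores (+2, -1, -2) and the two sequences are those given in the example.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            T[i][j] = max(0,                     # start a new local alignment
                          T[i - 1][j - 1] + s,   # match or mismatch
                          T[i - 1][j] + gap,     # gap in b
                          T[i][j - 1] + gap)     # gap in a
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ACCTAAGG", "GGCTCAATCA"))  # (6, 'CT-AA', 'CTCAA')
```

With these scores the best local alignment pairs CT-AA with CTCAA: four matches (+8) and one gap (-2) give a score of 6.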
Global Alignments

• Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.)

• A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming.

Continued…

• Attempts to align the entire sequence using as many characters as possible, up to both ends of each sequence.

• Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.

• The Needleman-Wunsch algorithm is used to produce global alignments between pairs of DNA or protein sequences.

Local Alignments

• Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

• The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

• Stretches of sequence with the highest density of matches are aligned.

Continued…

• Generates one or more islands of matches or sub-alignments in the aligned sequences.

• Suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

• The Smith-Waterman algorithm is used to produce local alignments between pairs of DNA or protein sequences.

Examples

[Figures for Example 1 and Example 2 are not reproduced here.]

Sequence Alignment

• Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

• It involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since the divergence from a common ancestor.

Types

Based on the number of sequences being compared, alignment is of two types:
• Pairwise Alignment
• Multiple Sequence Alignment

Pairwise sequence alignment

• Used to find the best-matching piecewise (local or global) alignments of two query sequences.

• Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query).

• The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.

Continued…
• Pairwise sequence alignment compares only two sequences at a time.
• Optimality is based on the alignment score.
• A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) Matches = the same nucleotide appears in both sequences.
(2) Mismatches = different nucleotides are found in the two sequences.
(3) Gaps = a base in one sequence and a null base in the other.

Continued…

• The algorithms used are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.

• BLAST (Basic Local Alignment Search Tool)

Multiple sequence alignment

• Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time.

• Multiple alignment methods try to align all of the sequences in a given query set.

• Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.

Continued…

• Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes.

• Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.

• Multiple sequence alignments are computationally difficult to produce, and most formulations of the problem lead to NP-complete combinatorial optimization problems.

Continued…

• Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

• Examples include ClustalW, PROBCONS (probabilistic consistency-based multiple alignment of amino acid sequences), and MUSCLE (MUltiple Sequence Comparison by Log-Expectation).

FASTA

• FASTA is a pairwise sequence alignment tool that compares nucleotide or protein sequences against an existing database. The name also refers to a text-based sequence format that can be read and written with a text editor or word processor.

• It carries out a sequence similarity search of protein and nucleotide sequences against the database and can be used to find the functional and evolutionary relationships between the sequences.

Continued…

• FASTA produces local alignment scores to compare the query sequence with every sequence in the database.

• Sequences stored in FASTA format are generally obtained by different methods, including DNA sequencing (the Sanger method and the Maxam-Gilbert method) and protein sequencing (the Edman degradation reaction and mass spectrometry).

Continued…

Step 1:
Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between). The best ten initial regions are used.

Step 2:
The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.

Step 3:
Create an alignment of the trimmed initial regions using dynamic programming. Regions whose scores are too low are not included.

Step 4:
Optimize the alignment from Step 3 using the Smith-Waterman algorithm.

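Step 1 can be illustrated with a small Python sketch. It only counts shared k-words per diagonal and ranks the diagonals; it is not the FASTA program's implementation, and the function names, the default k = 2, and the use of a simple hit count as the density measure are assumptions.

```python
from collections import defaultdict


def kword_diagonals(query: str, subject: str, k: int = 2) -> dict:
    """Count identical k-words on each diagonal (offset = query pos - subject pos)."""
    # Index every k-word of the subject by its start positions.
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)

    # Each shared k-word is a hit on the diagonal i - j.
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            hits[i - j] += 1
    return dict(hits)


def best_diagonals(query: str, subject: str, k: int = 2, top: int = 10) -> list:
    """Return up to `top` (diagonal, hit count) pairs, densest first."""
    counts = kword_diagonals(query, subject, k)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


print(best_diagonals("GGACGT", "ACGT"))  # [(2, 3)]
```

Diagonals with many hits correspond to the locally similar regions that the later steps re-score and join.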
Basic Local Alignment Search Tool

• The BLAST algorithm was developed as a new way to perform a sequence similarity search by an algorithm that is faster than FASTA while being as sensitive.
• Access to this BLAST system is possible through the Internet (http://www.ncbi.nlm.nih.gov/) as a Web site and through a BLAST E-mail server.
• The WU-BLAST programs perform similarity searches using the same methods as NCBI-BLAST and produce gapped local alignments.
• The statistical methods used to evaluate sequence similarity scores are different, and thus WU-BLAST and NCBI-BLAST can produce different results.

Continued…

• The BLAST algorithm increases the speed of sequence alignment by searching first for common words or k-tuples in the query sequence and each database sequence.

• BLAST confines the search to the words that are the most significant, whereas FASTA searches for all possible words of the same length.

• For the BLAST algorithm, the word length is fixed at 3 (formerly 4) for proteins and 11 for nucleic acids (3 if the sequences are translated in all six reading frames).

Steps used by the BLAST algorithm

Step 1:
• The sequence is optionally filtered to remove low-complexity regions that are not useful for producing meaningful sequence alignments.

Step 2:
• A list of words of length 3 in the query protein sequence is made, starting with positions 1, 2, and 3; then 2, 3, and 4; and so on, until the last 3 available positions in the sequence are reached (word length 11 for DNA sequences, 3 for programs that translate DNA sequences).

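The word list of Step 2 is a sliding window over the query. A minimal Python sketch follows; the function name `query_words` and the example peptide are assumptions chosen for illustration.

```python
def query_words(seq: str, w: int = 3) -> list:
    """All overlapping words of length w, in order of starting position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


words = query_words("LPPQGLL")
print(words)       # ['LPP', 'PPQ', 'PQG', 'QGL', 'GLL']
print(len(words))  # 5, i.e. len(seq) - w + 1
```

A query of length L therefore yields L - w + 1 words, for example 248 words for a 250-residue protein with w = 3.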
Continued…

Step 3:
• Using the BLOSUM62 substitution scores, the query sequence words from step 2 are evaluated for an exact match with a word in any database sequence. The words are also evaluated for matches with any other combination of three amino acids, the objective being to find the scores for aligning the query word with any other three-letter word found in a database sequence.

• There are a total of 20 × 20 × 20 = 8000 possible match scores for this one sequence position.

Continued…

Step 4:
• A cutoff score called the neighborhood word score threshold (T) is selected to reduce the number of possible matches to the most significant ones. For example, if this cutoff score T is 13, only the words that score above 13 are kept.

• The list of possible matching words is thereby shortened from all 8000 possible words to approximately the 50 highest-scoring ones.

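The thresholding idea can be sketched in Python. To stay self-contained, the sketch uses a toy scoring function (+4 for identical residues, -1 otherwise) in place of BLOSUM62, so the numbers it produces are not BLAST's; the names `toy_score` and `neighborhood` and the threshold value are assumptions for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(x: str, y: str) -> int:
    """Stand-in for a substitution matrix: NOT BLOSUM62."""
    return 4 if x == y else -1


def neighborhood(word: str, threshold: int, score=toy_score) -> list:
    """Candidate words whose score against `word` is above `threshold`."""
    kept = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(q, c) for q, c in zip(word, candidate))
        if total > threshold:
            kept.append(("".join(candidate), total))
    # Highest-scoring words first, ties broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


print(len(AMINO_ACIDS) ** 3)   # 8000 candidate three-letter words
kept = neighborhood("PQG", threshold=6)
print(len(kept), kept[0])      # 58 ('PQG', 12)
```

With the toy scores only the exact word and the 57 words differing at one position pass the cutoff; with a real substitution matrix the surviving set depends on which substitutions score well.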
Continued…

Step 5:
• The above procedure is repeated for each three-letter word in the query sequence. For a sequence of length 250 amino acids, the total number of words to search for is approximately 50 × 250 = 12,500.

Step 6:
• The remaining high-scoring words that comprise possible matches to each three-letter position in the query sequence are organized into an efficient search tree for comparing them rapidly to the database sequences.

Continued…

Step 7:
• Each database sequence is scanned for an exact match to one of the 50 words corresponding to the first query sequence position, for the words to the second position, and so on. If a match is found, this match is used to seed a possible ungapped alignment between the query and database sequences.

Continued…

Step 8:
• In the original BLAST method, an attempt was made to extend an alignment from the matching words in each direction along the sequences, continuing for as long as the score continued to increase. The extension process in each direction was stopped when the accumulated score stopped increasing and had just begun to fall a small amount below the best score found for shorter extensions. At this point, a larger stretch of sequence (called the HSP or high-scoring segment pair), which has a larger score than the original word, may have been found.

Continued…

Query sequence:     L P   P Q G   L L
Database sequence:  M P   P E G   L L
BLOSUM62 scores:    2 7   7 2 6   4 4

The three-letter word found initially is PQG (query) aligned with PEG (database); word score = 7 + 2 + 6 = 15.

Extension to the left adds 2 + 7 = 9, and extension to the right adds 4 + 4 = 8.

HSP score = 9 + 15 + 8 = 32

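The extension example can be retraced with a short Python sketch. It is an illustration under assumptions, not BLAST's implementation: the function names and the `drop_off` parameter (how far the running score may fall below the best before extension stops) are hypothetical, and the score table holds only the five BLOSUM62 entries that this example needs.

```python
# Only the BLOSUM62 entries used in the example above.
SCORES = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2,
          ("G", "G"): 6, ("L", "L"): 4}


def pair_score(x: str, y: str) -> int:
    """Symmetric look-up; raises KeyError for pairs not in the small table."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def best_extension(pairs, drop_off: int):
    """Return (gain, length) of the best-scoring prefix of `pairs`."""
    best_gain = gain = best_len = 0
    for n, (x, y) in enumerate(pairs, start=1):
        gain += pair_score(x, y)
        if gain > best_gain:
            best_gain, best_len = gain, n
        elif best_gain - gain > drop_off:
            break  # score has fallen too far below the best; stop extending
    return best_gain, best_len


def extend_hit(query: str, subject: str, q_pos: int, s_pos: int,
               word_len: int = 3, drop_off: int = 5) -> int:
    """Ungapped extension of a word hit in both directions; returns the HSP score."""
    word = sum(pair_score(query[q_pos + k], subject[s_pos + k])
               for k in range(word_len))
    right = zip(query[q_pos + word_len:], subject[s_pos + word_len:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    return best_extension(left, drop_off)[0] + word + best_extension(right, drop_off)[0]


print(extend_hit("LPPQGLL", "MPPEGLL", q_pos=2, s_pos=2))  # 32
```

The result, 9 + 15 + 8 = 32, matches the HSP score in the example.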
Continued…

Step 9:
• The next step is to determine whether each HSP score found by one of the above methods is greater in value than a cutoff score S. A suitable value for S is determined empirically by examining the range of scores found by comparing random sequences, and by choosing a value that is significantly greater. The high-scoring pairs (HSPs) matched in the entire database are identified and listed.

Continued…

Step 10:
• BLAST next determines the statistical significance of each HSP score. A probability that two random sequences, one the length of the query sequence and the other the entire length of the database (which is approximately equal to the sum of the lengths of all of the database sequences), could achieve the HSP score is calculated.

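For reference, this significance calculation is commonly expressed with Karlin-Altschul statistics. The slides do not give the formula, so the symbols are assumptions of this note: m is the query length, n is the total database length, S is the HSP score, and K and λ are constants that depend on the scoring system and residue composition.

```latex
E = K \, m \, n \, e^{-\lambda S}, \qquad P = 1 - e^{-E}
```

Here E is the expected number of HSPs with score at least S that would arise by chance, and P is the probability of finding at least one such HSP.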
Continued…

Step 11:
• Sometimes, two or more HSP regions that can be made into a longer alignment will be found, thereby providing additional evidence that the query and database sequences are related. In such cases, a combined assessment of the significance will be made.

Continued…

Step 12:
• Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database. Earlier versions of BLAST produced only ungapped alignments that included the initially found HSP. If two HSPs were found, two separate alignments were produced because the two regions could not be aligned without gaps.

Progressive Alignment Method

• It is the most widely used approach for multiple alignments and is also known as the hierarchical or tree method.

• Developed by Paulien Hogeweg and Ben Hesper in 1984.

• Progressive alignment builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.

Continued…

• All progressive alignment methods require two stages:

• A first stage in which the relationships between the sequences are represented as a tree, called a guide tree.

• A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.

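The first stage can be sketched in Python. This is a simplified illustration and not the method of any particular program: it assumes a k-mer (Jaccard) distance instead of alignment-based distances and an average-linkage clustering to build the tree, and all function names and example sequences are hypothetical.

```python
from itertools import combinations


def kmer_set(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """1 - Jaccard similarity of the two k-mer sets (0 = same k-mers)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def guide_tree(seqs: dict, k: int = 2):
    """Join the closest clusters first; returns the tree as nested tuples of names."""
    clusters = {name: [name] for name in seqs}  # tree node -> member names

    def average_distance(pair):
        left, right = pair
        total = sum(kmer_distance(seqs[x], seqs[y], k)
                    for x in clusters[left] for y in clusters[right])
        return total / (len(clusters[left]) * len(clusters[right]))

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=average_distance)
        clusters[(left, right)] = clusters.pop(left) + clusters.pop(right)
    return next(iter(clusters))


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
print(guide_tree(seqs))  # ('s3', ('s1', 's2'))
```

In the second stage the sequences would be aligned in the order given by the tree: here s1 with s2 first, then s3 added to that pair.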
Clustal W

• The Clustal series of programs is widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees.

• It aligns a pair of sequences, then aligns the next sequence onto the first pair.

• The most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments.

Continued…

• Uses alignment scores to produce a phylogenetic tree.

• Aligns the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.

End of Module 3
