Fasta Sequence Database

FASTA is a program for rapid alignment of protein and DNA sequences. It looks for matching sequence patterns called k-tuples and attempts to build local alignments from these matches. Due to its speed and sensitivity, FASTA is useful for sequence database searches. The FASTA algorithm finds similarities in four steps: identifying k-tuples, evaluating matches, extending alignments, and performing local alignment. An example showed FASTA identifying excellent matches between test and database sequences.

Uploaded by

Klevis Xhyra
Copyright
© All Rights Reserved

FASTA SEQUENCE DATABASE

ALDO LISI
NAIL SPAHIJA
KLEVIS XHYRA
INTRODUCTION
Bioinformatics is an emerging field that applies
computer science to biological study.
These slides discuss the search techniques used in
FASTA and an example of a FASTA database
search.
The search technique uses a heuristic
approach to sequence alignment.
FASTA
FASTA is a program for rapid alignment of pairs of
protein and DNA sequences.

Rather than comparing individual residues in the two
sequences, FASTA instead looks for matching sequence
patterns or words, called k-tuples, and then attempts to
build a local alignment based upon these word matches.

Due to the relatively high speed and sensitivity of the
algorithm, FASTA is very useful for sequence database searches.
FASTA vs BLAST
FASTA is comparable in algorithm and in reliability to BLAST but is
considerably slower in speed.

FASTA was found to be more sensitive at finding related protein
sequences than BLAST when using the BLOSUM62 scoring matrix.

Two studies have shown that neither FASTA nor BLAST is as good at
finding protein families in sequence databases as exhaustive local
homology searches by dynamic programming.

FASTA, BLAST and other searches may be performed on.

The Blocks database should be searched with test sequences to find
proteins that share amino acid motifs with a test protein sequence.
FASTA Algorithm
FASTA compares an input DNA or protein sequence to all of the
sequences in a target sequence database, and then reports the best
matched sequences and local alignments of these matched sequences with
the input sequence.
The input sequence and database are usually in FASTA format.

FASTA finds sequence similarities between the test sequence and each
database sequence in four steps:
1. first, the ten best words or k-tuples in each sequence pair are located;
2. second, the k-tuples are evaluated using a symbol comparison table, and
the highest scoring regions are identified and used to rank the database
sequences;
3. third, longer regions of identity are generated by joining initial regions
with scores greater than a certain threshold and rescoring the joined
regions using a gap penalty; and
4. fourth, an optimal local alignment is performed between the input test
sequence and the best scoring database sequences.
FASTA Algorithm
In the initial search for regions of
similarity, FASTA uses a computer
method known as hash coding.
In this method, a lookup table
showing the positions of each
sequence word of length k, called a
k-tuple, is constructed for each
sequence.
The relative positions of each word
in the two sequences are then calculated
by subtracting the position in the first
sequence from that in the second.
Words that have the same offset
position reveal a region of alignment
between the two sequences.
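The lookup-table idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the FASTA source code: the function names build_lookup and offset_counts and the two short test sequences are made up for the example. It builds the word table for the first sequence, then counts how many shared k-tuples fall on each offset (position in the second sequence minus position in the first).

```python
from collections import Counter, defaultdict


def build_lookup(seq, k):
    """Map each k-tuple to the list of positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def offset_counts(seq1, seq2, k):
    """Count shared k-tuples per offset (position in seq2 minus position in seq1)."""
    table = build_lookup(seq1, k)
    counts = Counter()
    for j in range(len(seq2) - k + 1):
        # .get() avoids adding empty entries to the defaultdict
        for i in table.get(seq2[j:j + k], ()):
            counts[j - i] += 1
    return counts


# Hypothetical sequences chosen only for illustration.
counts = offset_counts("HARFYAAQIVL", "VDMAAQIA", 2)
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)
```

Here the words AA, AQ and QI all share the same offset, so that offset collects three hits and marks a candidate region of alignment.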
FASTA Algorithm
The number of comparisons increases
linearly in proportion to average
sequence length.
In contrast, the time taken in dot
matrix and dynamic programming
methods increases as the square of the
average sequence length.
The k-tuple length is user-defined and
is usually 1 or 2 for protein sequences.
For nucleic acid sequences, the k-tuple
is 5-20. It is much longer because short
k-tuples are much more common due to
the 4-letter alphabet of nucleic acids.
The larger the k-tuple chosen, the more
rapid, but less thorough, the database search.
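A rough calculation shows why the alphabet size matters. The sketch below assumes equally frequent, independent letters, which real sequences do not satisfy, and the function names are made up for the example. It computes how many distinct k-tuples exist for a given alphabet and how many word matches two random sequences would share by chance.

```python
def possible_words(alphabet_size, k):
    """Number of distinct k-tuples over an alphabet of the given size."""
    return alphabet_size ** k


def expected_chance_hits(alphabet_size, k, len1, len2):
    """Expected k-tuple matches between two random sequences.

    Assumes uniform, independent letters (a simplification).
    """
    words1 = len1 - k + 1
    words2 = len2 - k + 1
    return words1 * words2 / possible_words(alphabet_size, k)


# Protein (20 letters) versus nucleic acid (4 letters) word spaces.
for k in (2, 6):
    print(k, possible_words(20, k), possible_words(4, k))
```

With k = 2 a protein word is one of 400 possibilities, while a DNA word is one of only 16, so short DNA words match by chance far more often and a longer k-tuple is needed to keep the search selective.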
Significance of FASTA matches
A major focus of the package is the
calculation of accurate similarity statistics,
so that biologists can judge whether an
alignment is likely to have occurred by
chance, or whether it can be used to infer
homology. The FASTA package is available
from the University of Virginia and the
European Bioinformatics Institute.

The FASTA file format used as input for this
software is now widely used by other
sequence database search tools (such as
BLAST) and sequence alignment programs
(Clustal, T-Coffee, etc.).
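In the FASTA file format, each record starts with a header line beginning with '>' followed by one or more lines of sequence. A minimal reader might look like the sketch below; the function name parse_fasta and the two sample records are made up for the example, and real tools handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, for illustration only.
example = """>seq1 test protein
MKTAYIAK
QRQISFVK
>seq2 test DNA
ATGGCC
"""
records = parse_fasta(example)
print(records)
```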
Implementations of FASTA
There are several implementations of the FASTA algorithm:

FASTA - compares a protein sequence to another protein sequence or a
protein library, or a DNA sequence to another DNA sequence or to a
DNA sequence library.

TFASTA - compares a protein sequence to a DNA sequence or DNA
sequence library by translating each DNA sequence into all 6 possible
reading frames and then comparing each frame to the protein
sequence.

LFASTA - identifies one or more regions of similarity between two
sequences.

PLFASTA - presents a dot matrix plot of regions of sequence similarity
between two sequences.
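The six-frame translation that TFASTA relies on can be illustrated as follows. This is a sketch under stated assumptions, not TFASTA's own code: it uses the standard genetic code, marks stop codons with '*', writes 'X' for any codon containing a letter other than A, C, G or T, and drops incomplete trailing codons. The function names are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[frame:]) for strand in strands for frame in range(3)]


print(six_frames("ATGGCC"))
```

Each of the six translated strings can then be compared with the protein query in the same way as an ordinary protein sequence.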
Example of FASTA database search
An example of a database search for matches between the amino acid sequence of the E.
coli DinH product and the Swiss-Prot sequence database using the GCG implementation of
FASTA is shown below. Shown first are portions of a histogram listing the range of init1 and
initn scores obtained with every sequence in the database.

Higher numbers have been cut off on the right of the histogram. The number of sequences
within a particular range of init1 scores is shown by '-' (none shown here), the number of
initn scores in each range by '+', and the number of init1 and initn scores, when they are
equal, by '='. The sequences giving the highest scoring matches, over 80, are immediately apparent.
Example of FASTA database search
In the next stage of analysis, the high scoring sequences are aligned with the test sequence
using a dynamic programming method to find an optimal local alignment. The similarity
scores of these alignments are calculated and are listed as the 'opt' score. The sequences
giving the highest scores with the test sequence are then listed in descending order of
scores, and the local homology alignments of these sequences with the test sequence are
then shown.
In the alignment, a '|' character between amino acid pairs indicates identity, and a ':'
indicates a pair often found in alignments of other related proteins. As may be seen, a
previously unidentified, but excellent, alignment was obtained between the E. coli DinH
protein and the B. subtilis spoIIIA protein.
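The dynamic programming step that produces an optimal local alignment score can be sketched with the Smith-Waterman recurrence. The version below is a simplified illustration, not the scoring FASTA itself applies: it assumes a fixed match and mismatch score instead of a symbol comparison table, a linear gap penalty, and hypothetical parameter values, and it returns only the best score, not the alignment.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences, for illustration only.
print(local_alignment_score("AAQI", "MAAQIA"))
```

Because every cell of the table is filled, the running time grows with the product of the two sequence lengths, which is why this step is applied only to the best scoring database sequences.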
Example of FASTA database search
LINKS

https://www.ctu.edu.vn/~dvxe/Bioinformatic%20course/manuals/blast/blastmanual/fasta.htm
https://www.ebi.ac.uk/
https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=fasta-I20180608-113446-0745-87498455-p2m
The interpretation of the significance of results depends on several
factors, such as word size, the length of the sequences being aligned, the
gap penalties and the alignment scoring system used.

FASTA is better for translated DNA-protein comparison and DNA
database searches because it calculates a single alignment that
allows frame shifts.

By treating forward-reading frames as a single sequence, FASTA
makes it much easier to produce high-quality alignments that extend
the length of the protein sequence, resulting in improved sensitivity.