Fasta Sequence Database
ALDO LISI
NAIL SPAHIJA
KLEVIS XHYRA
INTRODUCTION
Bioinformatics is an emerging field that applies
computer science to the study of biology.
These slides discuss the search techniques used in
FASTA and an example of a FASTA database
search.
The search technique in FASTA uses a heuristic
approach to sequence alignment.
FASTA
FASTA is a program for rapid alignment of pairs of
protein and DNA sequences.
Two studies have shown that both FASTA and BLAST are not as good at
finding protein families in sequence databases as exhaustive local
homology searches by dynamic programming.
FASTA finds sequence similarities between the test sequence and each
database sequence in four steps:
1. first, the ten best words or k-tuples in each sequence pair are located;
2. second, the k-tuples are evaluated using a symbol comparison table and
the highest scoring regions are identified and used to rank the database
sequences;
3. third, longer regions of identity are generated by joining initial regions
with scores greater than a certain threshold and rescoring these regions
using a gap penalty; and
4. fourth, an optimal local alignment is performed between the input test
sequence and the best scoring database sequences.
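The second step in the list above scores regions without gaps. A minimal Python sketch of that idea follows. It assumes a simple match/mismatch scheme in place of a real symbol comparison table, and the function name and score values are hypothetical, chosen only for illustration.

```python
def best_diagonal_score(query, target, offset, match=5, mismatch=-4):
    """Best ungapped segment score on one diagonal.

    The diagonal is the set of position pairs (i, j) with j - i == offset,
    where i indexes the query and j indexes the target.
    The match and mismatch values are illustrative assumptions.
    """
    best = 0
    running = 0
    for i in range(len(query)):
        j = i + offset
        if 0 <= j < len(target):
            step = match if query[i] == target[j] else mismatch
            # Restart the segment whenever the running score drops below zero.
            running = max(0, running + step)
            best = max(best, running)
    return best


print(best_diagonal_score("HARFYAAQIVL", "VDMAAQIA", offset=-2))
```

In this toy case the diagonal at offset -2 contains the shared run AAQI, so the best segment is four matches long.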
FASTA Algorithm
In the initial search for regions of
similarity, FASTA uses a computer
method known as hash coding.
In this method, a lookup table
showing the positions of each
sequence word of length k, called a k-
tuple, is constructed for each
sequence.
The relative positions of each word
in the two sequences are then calculated
by subtracting the position in the first
sequence from that in the second.
Words that have the same offset
position reveal a region of alignment
between the two sequences.
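The lookup-table idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not the FASTA implementation: the function names and the two short example sequences are assumptions made for the demonstration.

```python
from collections import Counter, defaultdict


def build_lookup(seq, k):
    """Map each k-tuple to the list of positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def offset_counts(first, second, k):
    """Count shared k-tuples per offset.

    The offset is the position in the second sequence minus the
    position in the first sequence, as in the description above.
    """
    table = build_lookup(first, k)
    counts = Counter()
    for j in range(len(second) - k + 1):
        for i in table.get(second[j:j + k], ()):
            counts[j - i] += 1
    return counts


counts = offset_counts("HARFYAAQIVL", "VDMAAQIA", k=2)
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)  # prints: -2 3
```

Here the words AA, AQ and QI all share the offset -2, which points to one region of alignment between the two sequences.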
The number of comparisons increases
linearly in proportion to average
sequence length.
In contrast, the time taken in dot
matrix and dynamic programming
methods increases as the square of the
average sequence length.
The k-tuple length is user-defined and
is usually 1 or 2 for protein sequences.
For nucleic acid sequences, the k-tuple
is 5-20; it is much longer because short
k-tuples are much more common by chance in
the 4-letter alphabet of nucleic acids. The
larger the k-tuple chosen, the more rapid,
but less thorough, the database search.
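A rough calculation shows why short k-tuples are so common in nucleic acid sequences. The sketch below assumes a random sequence in which every letter is equally likely and independent, which real sequences are not. The function name is hypothetical.

```python
def expected_chance_hits(seq_len, k, alphabet_size):
    """Expected number of times one specific k-tuple occurs by chance.

    Assumes a random sequence with uniform, independent letters.
    """
    positions = seq_len - k + 1
    return positions / alphabet_size ** k


# One specific 2-tuple in a sequence of length 1000:
print(expected_chance_hits(1000, 2, 4))   # DNA, 4 letters: 62.4375
print(expected_chance_hits(1000, 2, 20))  # protein, 20 letters: 2.4975
# A longer DNA word is far less likely to match by chance:
print(expected_chance_hits(1000, 6, 4))
```

Under these assumptions a given 2-tuple turns up about 25 times more often in DNA than in protein, so DNA searches need longer words to keep chance matches rare.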
Significance of FASTA matches
A major focus of the package is the
calculation of accurate similarity statistics,
so that biologists can judge whether an
alignment is likely to have occurred by
chance, or whether it can be used to infer
homology. The FASTA package is available
from the University of Virginia and the
European Bioinformatics Institute.
Example of FASTA database search
Higher numbers have been cut off on the right of the histogram. The number of sequences
within a particular range of init1 scores is shown by '-' (none shown here), the number of
initn scores in each range by '+', and the number of init1 and initn scores, when they are equal, by
'='. The sequences giving the highest scoring matches, over 80, are immediately apparent.
In the next stage of analysis, the high scoring sequences are aligned with the test sequence
using a dynamic programming method to find an optimal local alignment. The similarity
scores of these alignments are calculated and are listed as the 'opt' score. The sequences
giving the highest scores with the test sequence are then listed in descending order of
scores, and the local homology alignments of these sequences with the test sequence are
then shown.
In the alignment, a '|' character between amino acid pairs indicates identity, and a ':' indicates a pair
often found in alignments of other related proteins. As may be seen, a previously
unidentified, but excellent, alignment was obtained between the E. coli DinH protein and
the B. subtilis spoIIIA protein.
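The dynamic programming step that produces the 'opt' score can be illustrated with a score-only local alignment in the style of Smith-Waterman. This is a sketch under stated assumptions: it uses hypothetical match, mismatch and linear gap values instead of a real scoring matrix, and it fills the full matrix, so it should not be read as the procedure FASTA itself follows.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score between a and b (score only).

    Uses a linear gap penalty; all score values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("HARFYAAQIVL", "VDMAAQIA"))
```

The running time grows with the product of the two sequence lengths, which is why this step is applied only to the best scoring database sequences.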
LINKS
https://www.ctu.edu.vn/~dvxe/Bioinformatic%20course/manuals/blast/blastmanual/fasta.htm
https://www.ebi.ac.uk/
https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=fasta-I20180608-113446-0745-87498455-p2m
The interpretation of the significance of results depends on several
factors, such as word size, the length of the sequences being aligned, the
gap penalties, and the alignment scoring system used.