Bioinformatics Tutorial
2. Before the era of bioinformatics, biological experiments were carried out in two ways.
What are the two ways? Briefly explain each. (4m)
In vitro studies: experiments carried out outside the living organism, e.g. in a test tube or culture dish.
In vivo studies: experiments carried out inside the living organism.
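5. Who has been hailed by a director of the National Center for Biotechnology Information (NCBI) as
the “mother and father of bioinformatics”, and why? (2m)
Margaret Oakley Dayhoff, because she pioneered the use of computers to collect and compare protein
sequences, including compiling the first comprehensive protein sequence database (the Atlas of Protein
Sequence and Structure).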
7. What are the differences between bioinformatics and computational biology? (4m)
2. There are two common types of biological data. List the two types and give one example
of each. (4m)
Sequences, e.g. DNA
Annotations, e.g. gene function
6. What are the two differences between text format and FASTA format? (4m)
Text format                            | FASTA format
No additional annotation can be added  | A header line beginning with > can be added at the start of each new sequence
Common extensions: .txt, .seq          | Common extension: .fasta
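To make the FASTA layout concrete, here is a minimal Python sketch that splits FASTA text into (header, sequence) pairs. The function name parse_fasta and the two sample records are invented for this illustration; it assumes each record starts with a > header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, made up for this example.
example = """>seq1 hypothetical example record
ATGCGT
ACGT
>seq2 another made-up record
GGCCTA
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

A plain text file holding only the residues would have no > line, so a parser like this would have nothing to attach a name or description to.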
9. What are the two differences between primary and derivative sequence databases? Give
one example of primary and derivative sequence databases. (6m)
Primary databases                         | Derivative databases
Original submissions by experimentalists  | Derived from primary data
Content controlled by the submitter       | Content controlled by a third party (NCBI)
Example: GenBank                          | Example: RefSeq
10. For each of the following accession numbers, state the type of molecular
sequence that you will get. (4m)
a. NM_15392: RNA (mRNA)
b. NT_030059: DNA (genomic contig)
c. X02775: DNA
d. NP_52280: Protein
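The reasoning behind these answers is a prefix lookup, which can be sketched in Python as follows. The table covers only a small subset of RefSeq prefixes, and the rule for unprefixed accessions (one letter plus five digits treated as a nucleotide record) is a simplification assumed for this example. The function name molecule_type is hypothetical.

```python
import re

# A small, non-exhaustive subset of RefSeq accession prefixes.
REFSEQ_PREFIXES = {
    "NM_": "RNA (mRNA)",
    "NR_": "RNA (non-coding)",
    "NP_": "Protein",
    "NT_": "DNA (genomic contig)",
    "NC_": "DNA (complete genomic molecule)",
}


def molecule_type(accession):
    """Guess the molecule type from the shape of an accession number."""
    for prefix, kind in REFSEQ_PREFIXES.items():
        if accession.startswith(prefix):
            return kind
    # Assumption for this sketch: one letter followed by five digits is
    # treated as a GenBank/EMBL/DDBJ nucleotide accession.
    if re.fullmatch(r"[A-Z]\d{5}", accession):
        return "DNA (nucleotide record)"
    return "unknown"


for acc in ["NM_15392", "NT_030059", "X02775", "NP_52280"]:
    print(acc, "->", molecule_type(acc))
```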
11. What information will you get when you look up a gene in UniGene? (2m)
UniGene displays information about the abundance of a transcript (expressed gene), as
well as its regional distribution of expression
12. List three search strategies that can help you find the information
you need in PubMed. (3m)
Try the PubMed tutorial.
Identify the key concepts.
Determine alternative terms for these concepts, if needed.
The process of locating equivalent regions of sequences to maximize their similarity. Two sequences are
directly compared, position by position.
2. Give two advantages and two disadvantages of using a dot plot in sequence alignment. (4m)
Advantages: It is useful for identifying long regions of strong similarity. It produces a plot that is easy to
make and interpret.
  A B R A C A D A B R A C A D A B R A
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
C         •             •
A •     •   •   •     •   •   •     •
D             •             •
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
C         •             •
A •     •   •   •     •   •   •     •
D             •             •
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
Which stretch of these sequences aligns best with the other (write down the sequence that
shows similarity)? (1m)
ABRACADABRACADABRA
What can you conclude from the plot about these two sequences? (1m)
The sequence contains repeats, because more than one diagonal appears in the same region of the sequence.
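A dot plot of this kind is simple to generate. The Python sketch below places a dot wherever the row letter equals the column letter, and then counts matches along a shifted diagonal to show where the repeats come from. The function names dot_plot and diagonal_identity are made up for this illustration.

```python
def dot_plot(seq_a, seq_b, match="•"):
    """Return the lines of a text dot plot: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for a in seq_a:
        cells = [match if a == b else " " for b in seq_b]
        lines.append((a + " " + " ".join(cells)).rstrip())
    return lines


def diagonal_identity(seq, offset):
    """Count matches on the diagonal that compares seq[i] with seq[i + offset]."""
    pairs = list(zip(seq, seq[offset:]))
    matches = sum(a == b for a, b in pairs)
    return matches, len(pairs)


sequence = "ABRACADABRACADABRA"
for line in dot_plot(sequence, sequence):
    print(line)

# Diagonals shifted by the repeat length are fully matched.
print(diagonal_identity(sequence, 7))
```

For this sequence the diagonal at offset 7 matches at every position, which is the second unbroken diagonal seen in the plot.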
4. Define the following (3m):
a. Identity: an exact match between two nucleotides or amino acids at the same aligned position.
b. Similarity: the degree of resemblance between two sequences when they are compared.
c. Homolog: a sequence that resembles another sequence because the two share a
common ancestor.
Orthologs                                                        | Paralogs
Similar sequences in two different organisms                     | Similar sequences that have arisen within the same organism
The similarity arose through speciation (the formation of new    | The similarity arose through a gene duplication event
and distinct species in the course of evolution)                 |
Tutorial 4 Total: 43
6. Write the formula to calculate the percentage of positive and identities. (2m)
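One common way to state the two formulas, matching how BLAST summarises an alignment, is: percentage of identities = (number of identical positions / alignment length) x 100, and percentage of positives = (number of positions with a positive substitution score, identities included / alignment length) x 100. The Python sketch below applies both to the middle line of a BLAST pairwise alignment, where a letter marks an identity and + marks a non-identical position with a positive score. The function name and the example midline are hypothetical.

```python
def percent_identities_and_positives(midline):
    """Compute (% identities, % positives) from a BLAST alignment midline.

    Letters mark identical positions, '+' marks non-identical positions with
    a positive substitution score, and spaces mark everything else.
    The alignment length is taken to be the length of the midline.
    """
    length = len(midline)
    identities = sum(ch.isalpha() for ch in midline)
    positives = identities + midline.count("+")
    return 100 * identities / length, 100 * positives / length


# Made-up midline: 7 identities, 2 further positives, 1 mismatch or gap.
example_midline = "MK+LV A+GE"
identity_pct, positive_pct = percent_identities_and_positives(example_midline)
print(f"Identities: {identity_pct:.0f}%  Positives: {positive_pct:.0f}%")
```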
7. Give two differences between raw score and bit score. (4m)
9. Above is a short protein sequence from chicken. Is a low-complexity region present in this
sequence? If yes, write down the sequence of the low-complexity region. (2m)
Yes, SSSSSSSSSSSSSSSSSS
10. Why do low-complexity regions need to be filtered out? (2m)
A low-complexity region behaves as if it were "sticky": it pulls out many sequences that
are not truly related.
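As a rough illustration of what filtering means, the Python sketch below finds long single-residue runs and masks them with X so they cannot contribute to matches. This is a crude stand-in chosen for this example, not the algorithm BLAST itself uses for low-complexity filtering. The function names, the run-length threshold of 6 and the example sequence are all assumptions.

```python
import re


def _run_pattern(min_length):
    # One residue followed by at least (min_length - 1) copies of itself.
    return re.compile(r"(.)\1{%d,}" % (min_length - 1))


def homopolymer_runs(sequence, min_length=6):
    """Return (start_index, run) for every single-residue run of at least min_length."""
    return [(m.start(), m.group(0)) for m in _run_pattern(min_length).finditer(sequence)]


def mask_runs(sequence, min_length=6, mask_char="X"):
    """Replace each long single-residue run with mask_char, keeping the length."""
    return _run_pattern(min_length).sub(
        lambda m: mask_char * len(m.group(0)), sequence
    )


# Hypothetical protein fragment containing a serine run like the one in question 9.
fragment = "MKV" + "S" * 18 + "GLP"
print(homopolymer_runs(fragment))
print(mask_runs(fragment))
```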
14. A BLAST search has been done to predict the function of a human query protein. The
alignments of the best hits are given above.
a. Which hit is statistically more significant? Explain. (3m)
Hit 2, because it has a lower E-value (5e-11), a higher score (167), a higher
percentage of identities (76%) and a higher percentage of positives (84%).
b. Which of the two hits do you think is most likely to be a true homolog? (3m)
Hit 1, because it lies outside the low-complexity region, whereas Hit 2
lies within it. The E-value of Hit 1 is below 0.01, which is low enough to be
considered significant, so homology can be inferred.
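15. We have determined the genome sequence of a bacterium. How can we use BLAST to
identify protein-coding genes in this genome if we only have access to protein sequence
databases? (1m)
BLASTX, which translates the nucleotide query in all six reading frames and searches the translations
against a protein database.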
Tutorial 5. Total mark: 25
1. A BLAST search was done to predict the function of a short protein segment from chicken. The top 10 hits
are given above.
a) Can you predict the function of this protein based on this output? Justify your answer.
(2m)
No, because all hits have large E-values, so none of them is statistically significant.
b) What can you do to improve this search? (2m)
Turn on the low-complexity filter, increase the length of the query sequence, and use PSI-BLAST.
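2. A BLASTP search has not returned any hits at all. Would it be useful to do a PSI-BLAST using
the same settings as the original BLASTP? (2m)
No. PSI-BLAST builds its position-specific scoring matrix (PSSM) from the hits of the initial search,
so with no hits there is nothing to build a PSSM from.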
4. List all the programs that can be used to conduct multiple sequence alignment (4m)
Clustal Omega
T-Coffee
DIALIGN
MUSCLE
5. If you have 4 different sequences and want to align them using a pairwise alignment
approach, how many pairwise alignments are needed to find the similarity between the 4
sequences? Show your calculation. (2m)
n(n-1)/2 = 4(4-1)/2
= 6
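The same count can be checked by listing the pairs explicitly. This short Python sketch uses itertools.combinations from the standard library; the sequence names and the function name are placeholders invented for the example.

```python
from itertools import combinations


def pairwise_alignment_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Placeholder names standing in for the 4 sequences.
names = ["seq1", "seq2", "seq3", "seq4"]
pairs = list(combinations(names, 2))

for first, second in pairs:
    print(first, "vs", second)
print(len(pairs), pairwise_alignment_count(len(names)))
```

The count grows quadratically with the number of sequences, so 10 sequences already need 45 pairwise alignments.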
6. Why does the Feng-Doolittle method follow the rule "once a gap, always a gap" when making a multiple
sequence alignment? (2m)
Keeping the initial gap choices reflects trust that those gaps are the most believable ones. The rule ensures
that gaps placed between the most closely related sequences are preserved in the multiple sequence
alignment.
7. The alignment above is part of a multiple alignment of six protein sequences (human, chimpanzee,
mouse, rat, shark and chicken). Amino acids are shaded according to their conservation and
physicochemical properties.
a. List all conserved positions. (3m)
2, 11, 16, 17, 18, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 44, 50
b. If the first sequence is from human and the third is from rat, which one is chimpanzee and which
one is mouse? (2m)
The second is chimpanzee and the fourth is mouse.
c. Is the shark sequence (5th sequence) closer to human than the chicken sequence is? Use
identity (mismatch = 0, match = 1) to calculate the distance. (4m)
Yes.
Chicken: Mismatch: 19 x 0 = 0
         Match:    23 x 1 = 23
         Score = 23
Shark:   Mismatch: 26 x 0 = 0
         Match:    24 x 1 = 24
         Score = 24
The shark sequence has the higher identity score with human (24 versus 23).
Tutorial 6
Phylogenetic (total:18)
2. What is a phylogenetic tree made of? Draw a simple phylogenetic tree showing its parts.
(4)
A phylogenetic tree is made of branches, nodes, terminals (leaves), and a root.
5. Below are examples of two types of phylogenetic tree. Label A and B, and give a reason for your answer.
(4)
[Figure: two example trees, labelled A and B]
A is a cladogram and B is a phylogram. A cladogram shows only the branching order; its branch lengths are
meaningless. A phylogram also shows branch lengths, which represent real distances such as
genetic distance.