
Bioinformatics Tutorial


Tutorial 1 (15 marks)

1. Define bioinformatics. (2m)

The use of computer tools to manage, analyze and manipulate large sets of biological data,
including text data, phylogenetic trees, gene expression profiles and biochemical pathways.

2. Before the era of bioinformatics, biological experiments were carried out in two ways.
What are the two ways? Briefly explain each. (4m)
In vitro studies – experiments carried out outside the living organism.
In vivo studies – experiments carried out inside the living organism.

3. A bioinformatics study is called ....... (1m)

An in silico study

4. The term ‘Bioinformatics’ was coined by _____________ (1m)


Paulien Hogeweg

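5. Who has been hailed by a director of the National Center for Biotechnology Information
(NCBI) as the “mother and father of bioinformatics”, and why? (2m)
Margaret Oakley Dayhoff, because she pioneered the field: she compiled the first comprehensive
collection of protein sequences (the Atlas of Protein Sequence and Structure) and developed the
PAM substitution matrices.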

6. Why is bioinformatics considered a dry lab? (1m)

Because the experiments are done only with computer software, not with chemicals or laboratory
instruments.

7. What are the differences between bioinformatics and computational biology? (4m)

Bioinformatics:
• Research, development, or application of computational tools and approaches for expanding
the use of biological, medical or health data, including those to acquire, store, organize,
archive, analyze or visualize such data.
• Carried out by biologists, who are called bioinformaticians or bioinformaticists.

Computational biology:
• Development and application of data-analytical and theoretical methods, mathematical modeling
and computational simulation techniques to the study of biological, behavioral and social systems.
• Carried out by computational biologists, who are computer scientists, mathematicians,
statisticians and engineers.

Tutorial 2 (database) Total: 43 marks


1. What is a database? (2m)
A collection of related data that is stored in a computer in such a way that a computer user
can easily find it.

2. There are two common types of biological data. List them, with one example of each type. (4m)
Sequences, e.g. DNA
Annotations, e.g. gene function

3. List four different types of data format, with an example of each. (8m)

• Sequence
– e.g. Text
• Sequence annotation
– e.g. GenBank
• Aligned sequences
– e.g. MSF (Multiple Sequence Format)
• Protein structural data
– e.g. PDB

4. What is a flat file? (2m)

The elementary format underlying the information held in DDBJ/EMBL/GenBank.

5. List the three major parts of a flat file. (3m)
• The header, which contains the information (descriptors) that applies to the entire record
• The features, which are the annotations on the record
• The nucleotide sequence itself

6. What are the two differences between text format and FASTA format? (4m)

Text format:
• No additional annotation can be added.
• Common extensions: .txt, .seq

FASTA format:
• Annotation can be added on a header line that begins with > at the start of each new sequence.
• Common extension: .fasta
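The difference can be made concrete with a minimal sketch that reads FASTA-formatted text, where every record starts with a > header line followed by one or more sequence lines. The function name parse_fasta and the two sample records are made up for illustration; this is not a complete FASTA parser.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example (not taken from any database).
example = """>seq1 demo DNA fragment
ATGCGT
ACGT
>seq2 another fragment
GGCCTA
"""
records = parse_fasta(example)
```

A plain text sequence file would contain only the sequence lines, so there would be nothing for the header branch to pick up.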

7. Expand the following (4m):


a. NCBI: The National Center for Biotechnology Information
b. EST: Expressed Sequence Tags
c. RefSeq: Reference Sequences
d. CDD: Conserved Domain Database

8. If you want literature information, what is the best website? (1m)

PubMed

9. What are the two differences between primary and derivative sequence databases? Give
one example of each. (6m)

Primary databases:
• Original submissions by experimentalists
• Content controlled by the submitter
• Example: GenBank

Derivative databases:
• Derived from primary data
• Content controlled by a third party (NCBI)
• Example: RefSeq

10. From the following accession numbers, determine the type of molecular sequence that you
will get (4m):
a. NM_15392: RNA (mRNA)
b. NT_030059: DNA (genomic contig)
c. X02775: DNA
d. NP_52280: Protein
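The reasoning behind these answers is that the accession format itself hints at the molecule type. The sketch below is a simplified illustration: the prefix table covers only a small subset of RefSeq prefixes, the two pattern rules for accessions without an underscore are a simplification of the primary-database formats, and the function name molecule_type is an assumption made for this example.

```python
import re

# RefSeq prefixes and the molecule type they indicate (a small subset).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "XM_": "mRNA (predicted)",
    "NP_": "protein",
    "XP_": "protein (predicted)",
    "NT_": "genomic DNA (contig)",
    "NC_": "genomic DNA (complete molecule)",
}


def molecule_type(accession):
    """Guess the molecule type from the accession format (simplified)."""
    acc = accession.strip().upper().split(".")[0]  # drop any version suffix
    if acc[:3] in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[acc[:3]]
    # Primary nucleotide records: 1 letter + 5 digits, or 2 letters + 6 digits.
    if re.fullmatch(r"[A-Z]\d{5}|[A-Z]{2}\d{6}", acc):
        return "nucleotide (primary database record)"
    # Primary protein records: 3 letters + 5 digits.
    if re.fullmatch(r"[A-Z]{3}\d{5}", acc):
        return "protein (primary database record)"
    return "unknown"
```

For example, molecule_type("NT_030059") falls into the RefSeq branch, while molecule_type("X02775") is matched by the one-letter, five-digit nucleotide pattern.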

11. What information will you get when you look up a gene in UniGene? (2m)
UniGene displays information about the abundance of a transcript (expressed gene), as
well as its regional distribution of expression.

12. List three search strategies that can help you find the information you need in
PubMed. (3m)
• Try the PubMed tutorial.
• Identify the key concepts.
• Determine alternative terms for these concepts, if needed.

Tutorial 3 (Total: 18 marks)

1. What is a sequence alignment?(2m)

The process of locating equivalent regions of sequences to maximize their similarity. Two sequences are
directly compared, position by position.

2. Give two advantages and two disadvantages of using a dot plot in sequence alignment (4m).

Advantages: Useful for identifying long regions of strong similarity. It produces a plot, which is
easy to make and interpret.

Disadvantages: No statistical analysis. It does not provide a precise alignment.


3. For the following two sequences, construct a simple dot plot using a grid (or squared paper).
Place each sequence along one axis, and place a dot in the plot for each identical pair of
nucleotides (3m).
ABRACADABRACADABRA
ABRACADABRACADABRA

  A B R A C A D A B R A C A D A B R A
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
C         •             •
A •     •   •   •     •   •   •     •
D             •             •
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
C         •             •
A •     •   •   •     •   •   •     •
D             •             •
A •     •   •   •     •   •   •     •
B   •             •             •
R     •             •             •
A •     •   •   •     •   •   •     •
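The rule used to fill the grid (a dot wherever the row letter equals the column letter) can be sketched in a few lines. The function name dot_plot and its layout choices (two characters per column, blank cells left as spaces) are assumptions made for this illustration.

```python
def dot_plot(seq_x, seq_y, dot="•", blank=" "):
    """Return the rows of a dot plot: seq_x along the top, seq_y down the side."""
    rows = ["  " + " ".join(seq_x)]  # header row with the column letters
    for y in seq_y:
        # One cell per column: a dot where the two letters are identical.
        cells = [dot if x == y else blank for x in seq_x]
        rows.append((y + " " + " ".join(cells)).rstrip())
    return rows


seq = "ABRACADABRACADABRA"
plot = dot_plot(seq, seq)
print("\n".join(plot))
```

Because a sequence is compared with itself here, the main diagonal is always filled; the additional parallel diagonals are what reveal the repeats.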

Which stretch of these sequences can be aligned best to each other (write down the sequence that
shows similarity)? (1m)
ABRACADABRACADABRA
What can you conclude from the plot about these two sequences? (1m)
The sequence contains repeats, because more than one diagonal appears in the same region of the
sequence.

4. Define the following (3m):
a. Identity: an exact match between two nucleotides or amino acids at the same aligned position.
b. Similarity: the degree of resemblance between two sequences when they are compared.
c. Homolog: a sequence that resembles another sequence because the two share a common
ancestor.

5. Give two differences between orthologs and paralogs (4m).

Orthologs:
• Similar sequences in two different organisms.
• The similarity arose through speciation (the formation of new and distinct species in the
course of evolution).

Paralogs:
• Similar sequences that arose within the same organism.
• The similarity arose through a gene duplication event.

Tutorial 4 (Total: 43 marks)

1. Expand BLAST (1m)

Basic Local Alignment Search Tool

2. What is the fundamental purpose of BLAST searching? (2m)
To understand the relatedness of a query sequence of interest to other known protein or DNA
sequences.
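3. BLAST is a collection of five programs for different combinations of query and database
sequences. Describe briefly all five BLAST programs. (10m)
• blastn: nucleotide query against a nucleotide database
• blastp: protein query against a protein database
• blastx: nucleotide query, translated in all six reading frames, against a protein database
• tblastn: protein query against a nucleotide database translated in all six reading frames
• tblastx: translated nucleotide query against a translated nucleotide database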

4. Briefly explain the steps needed to perform BLAST search. (4m)


1. Specify sequence of interest (query)
2. Select the BLAST program
3. Choose the database to search
4. Choose optional parameters

Then click “BLAST”


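5. Above is an alignment from a BLAST result. Define positives and identities. (2m)
Positives: % similarity, i.e. the percentage of aligned positions that are either identical or similar
Identities: % identical matches, i.e. the percentage of aligned positions with exactly the same residue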

6. Write the formula to calculate the percentage of positives and identities. (2m)

Positives (%) = (Identical matches + Similar matches) / (Total length of the aligned region) x 100

Identities (%) = (Identical matches) / (Total length of the aligned region) x 100
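As a worked example of the two formulas, the sketch below computes both percentages from match counts. The function name and the counts (38 identical, 4 similar, aligned length 50) are hypothetical values chosen for illustration, not figures from the alignment in the question.

```python
def identities_and_positives(identical, similar, aligned_length):
    """Return (identities %, positives %) for one aligned region."""
    identities = 100 * identical / aligned_length
    # Positives count identical matches plus similar (conservative) matches.
    positives = 100 * (identical + similar) / aligned_length
    return identities, positives


# Hypothetical counts: 38 identical and 4 similar positions over 50 aligned positions.
identities, positives = identities_and_positives(38, 4, 50)
```

Positives can never be lower than identities, because every identical match is also counted as a positive.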

7. Give two differences between raw score and bit score. (4m)

Raw score:
• Calculated from the substitution matrix and the gap penalty parameters that are chosen.
• Raw scores are not comparable between different searches.

Bit score:
• Calculated from the raw score by normalizing with the statistical variables of the scoring system.
• Bit scores are comparable between different searches.
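The normalization can be sketched with the standard Karlin-Altschul relations: bit score S' = (lambda x S - ln K) / ln 2, and E-value = m x n x 2^(-S'), where S is the raw score, m the query length and n the database length. The lambda and K values below are illustrative assumptions only; the values that apply to a real search are printed in its BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score with the statistical parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


# Assumed, illustrative parameters (not taken from the tutorial).
LAMBDA, K = 0.267, 0.041
bits = bit_score(100, LAMBDA, K)
e_value = expect_value(bits, query_length=250, database_length=1_000_000)
```

Since lambda and K are absorbed into the bit score, two bit scores from searches that used different scoring systems can be compared directly, which is not true of the raw scores.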

8. What is a low-complexity region? (2m)

A region whose sequence has an unusually biased amino acid or nucleotide composition, which
can complicate sequence similarity searching.

9. Above is a short protein sequence from chicken. Is a low-complexity region present in this
sequence? If yes, write down the sequence of the low-complexity region. (2m)
Yes, SSSSSSSSSSSSSSSSSS

10. Why do low-complexity regions need to be filtered out? (2m)
A low-complexity region behaves as if it is "sticky": it pulls out many sequences that
are not truly related.

11. Name a program that masks low-complexity regions in nucleotide sequences. (1m)

DustMasker

12. Name a program that masks low-complexity regions in protein sequences. (1m)
SEG
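To illustrate what masking does, here is a deliberately naive sketch that finds runs of a single repeated residue and replaces them with X. It is not the SEG or DustMasker algorithm, which score compositional complexity in sliding windows; the function names, the minimum run length of 6 and the example sequence are all assumptions made for illustration.

```python
import re


def homopolymer_runs(sequence, min_length=6):
    """Return (start, end, residue) for each run of one repeated residue."""
    # (.) captures one residue; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(sequence)]


def mask_runs(sequence, min_length=6, mask_char="X"):
    """Replace every detected run with the mask character."""
    masked = list(sequence)
    for start, end, _ in homopolymer_runs(sequence, min_length):
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Hypothetical protein with a serine run, similar in spirit to question 9.
protein = "MKTAY" + "S" * 10 + "GLVPR"
masked = mask_runs(protein)
```

Once the run is masked, the search can no longer seed matches inside it, so the "sticky" hits described in question 10 disappear.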
13. Given that the word size (k) is 3, how does BLAST read this sequence? (3m)
MKKKSLALVLATGMA
As overlapping words of three residues, moving one position at a time:
MKK, KKK, KKS, KSL, SLA, LAL, ALV, LVL, VLA, LAT, ATG, TGM, GMA.
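The word list can be reproduced with a sliding window of width k that advances one residue at a time. The function name words is an assumption for this sketch; it shows only how the query is split, not the later steps of a BLAST search.

```python
def words(sequence, k=3):
    """Return all overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


query_words = words("MKKKSLALVLATGMA", 3)
```

A sequence of length L yields L - k + 1 words, so this 15-residue query gives 13 words.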

14. A BLAST search has been done to predict the function of a human query protein. The
alignments of the best hits are given above.
a. Which hit is statistically more significant? Explain. (3m)
Hit 2, because it has a lower E-value (5e-11), a higher score (167), a higher
percentage of identities (76%) and a higher percentage of positives (84%).
b. Which of the two hits do you think is most likely to be a true homolog? (3m)
Hit 1, because its alignment lies outside the low-complexity region, whereas the
alignment of Hit 2 lies within it. The E-value of Hit 1 is below 0.01, which is low
enough to be considered significant, so homology can be inferred.
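15. We have determined the genome sequence of a bacterium. How can we use BLAST to
identify protein-coding genes in this genome if we only have access to protein sequence
databases? (1m)
Use blastx, which translates the nucleotide query in all six reading frames and compares the
translations against a protein database.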
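The choice of program follows from the type of the query and the type of the database, which can be written as a small lookup. The dictionary, the function name choose_blast and the translate_both flag are assumptions made for this sketch; only the program names themselves are BLAST's own.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}


def choose_blast(query_type, database_type, translate_both=False):
    """Pick the BLAST program for a query/database combination."""
    if translate_both:
        # tblastx compares translated nucleotides on both sides.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


# The bacterial genome is a nucleotide query; only protein databases are available.
program = choose_blast("nucleotide", "protein")
```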
Tutorial 5 (Total: 25 marks)

1. A BLAST search is done to predict the function of a short protein segment from chicken. The
top 10 hits are given above.
a) Can you predict the function of this protein based on this output? Justify your answer.
(2m)
No, because all the hits have large E-values, so none of them is statistically significant.
b) What can you do to improve this search? (2m)
Turn on the low-complexity region filter, increase the length of the query sequence and use
PSI-BLAST.

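2. A BLASTP search has not returned any hits at all. Would it be useful to do a PSI-BLAST using
the same settings as the original BLASTP? (2m)
No. PSI-BLAST builds its position-specific scoring matrix (PSSM) from the hits of the first
BLASTP round, so with no hits there is nothing to build the PSSM from.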

3. What is multiple sequence alignment?(2m)


Multiple sequence alignment is an alignment of more than two sequences.

4. List all the programs that can be used to conduct multiple sequence alignment (4m)
Clustal Omega
T-Coffee
DIALIGN
MUSCLE

5. If you have 4 different sequences and you want to align them using the pairwise alignment
approach, how many pairwise alignments are needed to find the similarity between the 4
sequences? Show your calculation. (2m)
Number of pairwise alignments = n(n-1)/2 = 4(4-1)/2 = 6
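The count can be checked by listing every unordered pair explicitly. The sequence labels seq1 to seq4 are placeholders, and pairwise_count is a name chosen for this sketch.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["seq1", "seq2", "seq3", "seq4"]  # placeholder labels
pairs = list(combinations(names, 2))      # every pair, each counted once
```

The number of pairs grows quadratically, which is one reason progressive methods are used for multiple sequence alignment instead of treating every pair independently.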
6. Why does the Feng-Doolittle method follow the rule “once a gap, always a gap” when making a
multiple sequence alignment? (2m)
Keeping the initial gap choices trusts that those gaps are the most believable, because they were
introduced when the most closely related sequences were aligned. The rule ensures that gaps
between the most closely related sequences are preserved in the final multiple sequence alignment.

7. The alignment above is part of a multiple alignment of six protein sequences (human, chimpanzee,
mouse, rat, shark and chicken). Amino acids are shaded according to their conservation and
physicochemical properties.
a. List all conserved positions. (3m)
2, 11, 16, 17, 18, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 44, 50
b. If the first sequence is from human and the third is from rat, which one is chimpanzee and which
one is mouse? (2m)
The second is chimpanzee and the fourth is mouse.
c. Is the shark sequence (5th sequence) closer to human than the chicken sequence is? Use
identity (mismatch = 0, match = 1) to calculate the distance. (4m)
Yes, because the shark sequence has the higher identity score against human (24 versus 23).
Chicken: Mismatch: 19 x 0 = 0
Match: 23 x 1 = 23
Total = 23
Shark: Mismatch: 26 x 0 = 0
Match: 24 x 1 = 24
Total = 24
Tutorial 6 (Phylogenetics) Total: 18 marks

1. What is a phylogenetic tree? (2m)

A diagram that illustrates the evolutionary relationships among species, genes, or proteins.

2. What is a phylogenetic tree made of? Draw a simple phylogenetic tree showing its parts. (4m)
A phylogenetic tree is made of branches, nodes, terminals (leaves) and a root.

3. Define outgroup. (2m)

A lineage that is known to be more distantly related to the species (or DNA/protein sequences)
being studied than they are to one another.

4. Use the tree above to answer the following questions.

a. A common ancestor for both species C and E could be at position number __4__ (1m)
b. The two currently living species that are most closely related to each other are __C and D__ (2m)
c. Which of these living species is considered as an outgroup? (3m)
E and A

5. Below are examples of two types of phylogenetic tree. Label A and B, and give a reason for
your answer. (4m)

A B

A is a cladogram and B is a phylogram. A cladogram shows only the branching order, so its branch
lengths are meaningless. A phylogram also shows branch lengths, which represent real distances
such as genetic distance.
