Introduction to Bioinformatics Online Course : IBT

Module 1: Introduction to databases and resources

(Session 3)

Part II
GenBank flat file

Introduction to Bioinformatics online course: IBT


Bioinformatics Resources & Databases: Abeir Shalaby
How to identify genes using bioinformatics tools

Sequencing: Obtain the DNA or RNA sequence data from the organism of interest through wet lab research using various sequencing technologies, such as Sanger sequencing or next-generation sequencing (NGS). Sequencing is done first to obtain the sequence of the gene of interest.

Preprocessing: Clean and process the sequence data to remove any errors or artefacts introduced during sequencing, as well as any sequences that do not align with the reference genome or transcriptome.

Gene identification and annotation: Annotate the genome or transcriptome to identify potential genes and their features, such as coding regions, exons, introns, promoters, and regulatory elements. This can be done using tools such as Ensembl, NCBI's RefSeq, or the UCSC Genome Browser.

A sequence record is called 'annotated' when biological information is added and linked to a position in the sequence. The annotated information is represented in the sequence features table.
Flat File Storage Data Formats (GenBank format vs EMBL format)

Entrez search

Saving the record in a different format from the "Send to" menu

1- Header information
• LOCUS - A short mnemonic name for the entry. The line contains the locus name (usually the accession number), length of the molecule, type of molecule (DNA or RNA), a three-letter GenBank division code that loosely reflects taxonomy (for example PRI for primates), and the date the record was last modified.
• DEFINITION - description of the sequence
• ACCESSION - accession number is a unique, unchanging code assigned to each entry
• VERSION - primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI
• KEYWORDS - gene description
• SOURCE - common name of the organism or the name most frequently used in the literature
• ORGANISM - formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines)
• REFERENCE - articles containing data reported in this entry
• AUTHORS - authors of the citation
• TITLE - full title of citation
• JOURNAL - journal name, volume, year, and page numbers of the citation
• MEDLINE - Medline unique identifier for a citation
• PUBMED - PubMed unique identifier for a citation
• REMARK - relevance of a citation to an entry
• COMMENT - in RefSeq records, includes the term REFSEQ and identifies the record status. It can also hold cross-references to other sequence entries, comparisons to other collections, notes of changes in LOCUS names, and other remarks.
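The LOCUS line described above can be split into its fields with a few lines of Python. This is only a minimal sketch: the function name `parse_locus_line` and the example lines are illustrative assumptions, and the parser relies on whitespace splitting rather than the fixed column positions of the official format.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into its whitespace-separated fields."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a recognisable LOCUS line")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],      # "bp" for nucleotide records
        "molecule": tokens[4],  # e.g. DNA, mRNA
        # Assumption: a topology word (linear/circular) is present
        # only when the line has 8 tokens.
        "topology": tokens[5] if len(tokens) == 8 else None,
        "division": tokens[-2],  # three-letter division code
        "date": tokens[-1],      # date of last modification
    }


# Illustrative LOCUS lines (assumed for the example, not taken from the slides)
old_style = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
new_style = "LOCUS       XX000001     1200 bp    mRNA    linear   PRI 01-JAN-2020"

record = parse_locus_line(old_style)
print(record["name"], record["length"], record["division"], record["date"])
print(parse_locus_line(new_style)["topology"])
```

For real records a dedicated parser (for example the one in Biopython) is the safer choice, because it follows the column layout of the format.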
Find the status of RefSeq and sequence information

NG_017006 is a contig, so it can contain more than one gene.

Use Highlight Sequence Features; you will see 2 genes.

Find the revision history

2- Features of the sequence

• gene features contain introns, exons and UTRs (untranslated regions)

• mRNA features contain exons and UTRs

• CDS (coding sequence) features contain only the coding parts of exons (so they start at ATG and end at the stop codon)

How to extract the features of the sequence

1- The feature table

2- The Highlight Sequence Features tab

3- The Graphics tab

1- Extract the features from the feature table

• The feature section starts at the line beginning with "FEATURES"
• SOURCE - contains information about organism, mapping, chromosome, tissue alignment, clone identification
• CDS - instructions on how to join sequences together to make an amino acid sequence from the given coordinates. Includes cross-references to other databases
• GENE Feature - a segment of DNA identified by a name
• RNA Feature - used to annotate RNA on genomic sequence (for example: mRNA, tRNA, rRNA)
• CCDS - the Consensus CDS project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality
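To make the layout of the feature table concrete, the sketch below reads a small excerpt and collects each feature key, its location and its qualifiers. The excerpt, the gene names `geneA` and `geneB`, and the function `parse_feature_table` are all hypothetical. The parser is deliberately simple: it assumes every qualifier fits on one line and ignores the fixed column positions of the real format.

```python
EXAMPLE_TABLE = """\
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
     gene            687..3158
                     /gene="geneA"
     CDS             687..3158
                     /gene="geneA"
                     /codon_start=1
     gene            complement(3300..4037)
                     /gene="geneB"
"""


def parse_feature_table(text):
    """Return a list of features, each with key, location and qualifiers."""
    features = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("FEATURES"):
            continue
        if stripped.startswith("/"):
            # Qualifier line: /name=value, attached to the latest feature
            if features:
                name, _, value = stripped[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Feature line: key followed by its location
            key, location = stripped.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


features = parse_feature_table(EXAMPLE_TABLE)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"])
```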
Where is my target sequence located?

The location operator
Features can be complete, partial on the 5' end, partial on the 3' end, and/or on the complementary strand. The join and order operators can also be used: join(...) joins the listed spans into one continuous sequence, while order(...) only states the order in which they occur.
Examples:
1) A complete feature is simply written as n..m. Example: 687..3158
The feature extends from base 687 through base 3158 in the sequence shown.
2) < indicates partial on the 5' end. Example: <1..888
The feature starts before the first sequenced base and continues to and includes base 888.
3) > indicates partial on the 3' end. Example: 1..>888
The feature starts at the first sequenced base and continues beyond base 888.
4) complement(...) indicates that the feature is on the complementary strand. Example: complement(3300..4037)
The feature extends from base 3300 through base 4037 but is actually on the complementary strand. It is therefore read in the opposite direction on the reverse-complement strand, starting at the base complementary to 4037.
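The location examples above can be decoded mechanically. The following is a minimal sketch, not a complete implementation of the location syntax: the function `parse_location` is an assumed name, and it only handles spans of the form n..m with optional < and >, optionally wrapped in join(...) or order(...) and in one outer complement(...). The join example in the usage lines is hypothetical.

```python
import re

# One span such as 687..3158, <1..888 or 1..>888
SPAN = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(text):
    """Decode a simple GenBank location string into a dictionary."""
    text = text.replace(" ", "")
    strand = 1
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1  # feature lies on the complementary strand
        text = text[len("complement("):-1]
    operator = None
    for name in ("join", "order"):
        if text.startswith(name + "(") and text.endswith(")"):
            operator = name
            text = text[len(name) + 1:-1]
    spans = []
    partial_start = partial_end = False
    for part in text.split(","):
        match = SPAN.match(part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        less, start, greater, end = match.groups()
        # On the forward strand "<" means 5' partial and ">" means 3' partial
        partial_start = partial_start or less == "<"
        partial_end = partial_end or greater == ">"
        spans.append((int(start), int(end)))
    return {
        "spans": spans,
        "strand": strand,
        "operator": operator,
        "partial_start": partial_start,
        "partial_end": partial_end,
    }


for example in ("687..3158", "<1..888", "1..>888",
                "complement(3300..4037)", "join(1..100,200..300)"):
    print(example, parse_location(example))
```

Anything outside this subset, such as single-base locations or complement(...) nested inside join(...), raises a ValueError instead of being guessed at.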

Partial at its 5' end & codon_start

The /codon_start qualifier has a valid value of 1, 2 or 3. It gives the position, counted from the first nucleotide of the CDS, of the first base of the first complete codon.
BLAST places the single-letter AA codes in the middle of the complete codons.
We have 2 situations:
1- Complete CDSs: there is no need to indicate codon_start on complete CDSs, as the translation always begins at the first nucleotide of the interval, which starts a complete codon (coding triplet). The default is codon_start=1. Note that this value is counted from the start of the CDS, so it is not the same thing as ORF 1 of the whole sequence.

2- CDSs that are partial at their 5' or 3' end: to translate correctly, a CDS that begins with an incomplete codon (lacking the first nucleotide, or the first and the second nucleotides, of the codon) needs codon_start.
In this case codon completion determines the reading frame for translating a 5' or 3' partial CDS into protein. GenBank uses the term "codon_start" as a synonym for the reading frame in this case.
• For example, if nucleotide 2 begins the first complete codon of protein x in the CDS, the codon_start is 2.

• Case 1 - AA code placed on the 2nd nucleotide: reading frame (codon_start) is 1
Explanation: BLAST places the single-letter AA codes in the middle of the complete codons. In this case, nucleotides 1, 2, and 3 represent a complete codon. The translation therefore starts with nucleotide 1.
     CDS             <1..18
                     /codon_start=1
                     /transl_table=1
                     /translation="FGCRR"

• Case 2 - AA code placed on the 3rd nucleotide: reading frame (codon_start) is 2
Explanation: The translation skips the first base of the sequence to start at the first complete codon (nucleotides 2, 3, and 4).
     CDS             <1..22
                     /codon_start=2
                     /transl_table=1
                     /translation="SAAEDK"

• Case 3 - AA code placed on the 4th nucleotide: reading frame (codon_start) is 3
Explanation: The translation skips the first two nucleotides of the sequence to start at the first complete codon (bases 3, 4, and 5).
     CDS             <1..26
                     /codon_start=3
                     /transl_table=1
                     /translation="RLQKINK"

nucleotide sequence:                     ttcggctgcagaagataaataaataa
translated amino acid sequence, case 1:  F G C R R *
translated amino acid sequence, case 2:  S A A E D K *
translated amino acid sequence, case 3:  R L Q K I N K *
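The three cases can be reproduced with a short script. This is a sketch under stated assumptions: `translate` is a hypothetical helper, the codon table is the Standard Code (transl_table=1) written in the compact TCAG ordering, and each case takes the first 18, 22 or 26 bases of the slide's sequence to match the CDS intervals <1..18, <1..22 and <1..26.

```python
from itertools import product

BASES = "TCAG"
# Standard Code (transl_table=1): amino acids for TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, codon_start=1):
    """Translate a CDS, skipping codon_start - 1 bases of an incomplete codon."""
    seq = cds.upper()[codon_start - 1:]
    protein = []
    # Step through complete codons only; trailing bases are ignored
    for i in range(0, len(seq) - 2, 3):
        protein.append(STANDARD_CODE[seq[i:i + 3]])
    return "".join(protein)


SEQ = "ttcggctgcagaagataaataaataa"

print(translate(SEQ[:18], codon_start=1))  # case 1, CDS <1..18
print(translate(SEQ[:22], codon_start=2))  # case 2, CDS <1..22
print(translate(SEQ[:26], codon_start=3))  # case 3, CDS <1..26
```

The output includes the terminal `*` for the stop codon, which the /translation qualifier leaves out.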

Each of the three figures shows a pairwise BLAST alignment with the CDS feature display. Query aligns to Subject from base 1. Lack of initiation codon (ATG) indicates a 5' partial CDS. The first complete codon (underlined in red) on Query is:
• bases 1, 2, and 3, with the AA residue "A" in the middle of the codon. Query's reading frame is 1.
• bases 2, 3, and 4, with the AA residue "L" in the middle of the codon. Query's reading frame is 2.
• bases 3, 4, and 5, with the AA residue "G" in the middle of the codon. Query's reading frame is 3.

The genetic codes and their translation table numbers:
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
•1. The Standard Code (transl_table=1)
•2. The Vertebrate Mitochondrial Code (transl_table=2)
•3. The Yeast Mitochondrial Code (transl_table=3)
•4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)
•5. The Invertebrate Mitochondrial Code (transl_table=5)
•6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (transl_table=6)
•9. The Echinoderm and Flatworm Mitochondrial Code
•10. The Euplotid Nuclear Code
•11. The Bacterial, Archaeal and Plant Plastid Code
•12. The Alternative Yeast Nuclear Code
•13. The Ascidian Mitochondrial Code
•14. The Alternative Flatworm Mitochondrial Code
•16. Chlorophycean Mitochondrial Code (transl_table=16)
•21. Trematode Mitochondrial Code (transl_table=21)
•22. Scenedesmus obliquus Mitochondrial Code
•23. Thraustochytrium Mitochondrial Code
•24. Rhabdopleuridae Mitochondrial Code
•25. Candidate Division SR1 and Gracilibacteria Code
•26. Pachysolen tannophilus Nuclear Code
•27. Karyorelict Nuclear Code
•28. Condylostoma Nuclear Code
•29. Mesodinium Nuclear Code
•30. Peritrich Nuclear Code
•31. Blastocrithidia Nuclear Code
•33. Cephalodiscidae Mitochondrial UAA-Tyr Code (transl_table=33)
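The choice of transl_table matters because the same codon can be read differently. As a small illustration, the sketch below compares the Standard Code (transl_table=1) with the Vertebrate Mitochondrial Code (transl_table=2). Both tables are written in the compact TCAG ordering, and the helper names `codon_table` and `differences` are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... in TCAG order
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def codon_table(transl_table):
    """Build a codon -> amino acid dictionary for one translation table."""
    return {
        "".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), TABLES[transl_table])
    }


def differences(table_a, table_b):
    """Return the codons that the two tables translate differently."""
    a, b = codon_table(table_a), codon_table(table_b)
    return {codon: (a[codon], b[codon]) for codon in a if a[codon] != b[codon]}


for codon, (standard, mito) in sorted(differences(1, 2).items()):
    print(codon, standard, "->", mito)
```

Only four codons differ between these two tables, yet translating a vertebrate mitochondrial CDS with the Standard Code would introduce false stop codons at TGA and read through the real AGA/AGG stops.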
Advanced search by primary organism [porgn] to avoid synthetic constructs

Activate the Homo sapiens filter in the Results by taxon section. Look at the source annotation to see if you find Homo sapiens.

You see that there are two organisms annotated:
• synthetic construct: this is what the sequence is - a construct made by scientists. This is called the primary organism.
• Homo sapiens: the sequence in the construct is originally a human sequence. This is called the secondary organism.

So, searching the Organism field will search the Organism field in the general annotation but also the /organism fields in the feature annotation.

How should you do the search to return human sequences without synthetic constructs? Search the primary organism field with the [porgn] tag instead of the general Organism field.
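The same primary-organism search can also be written as a query string for NCBI's E-utilities. The sketch below only builds the request URL and makes no network call. The helper `build_esearch_url`, the choice of the nuccore database and the retmax value are assumptions for illustration; the [porgn] tag is the one introduced above.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Build an ESearch URL for the given Entrez query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Restrict the search to records whose primary organism is human
query = "Homo sapiens[porgn]"
url = build_esearch_url(query)
print(url)
```

In practice the query would usually be combined with other terms, for example a gene name, using AND.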
2- Extract the features with the Highlight Sequence Features tab

Misc- features

3- Sequence format in GenBank

1- GenBank format
The sequence data begin on the line immediately below ORIGIN. The sequence is written in rows; each row has 60 letters divided equally into 6 columns, and the sequence ends with //

2- FASTA file format
Standard text-based format for storing nucleotide/protein sequence information
• First line contains metadata
• starts with >
• standardized within given database
Introduction to Bioinformatics online course: IBT


Bioinformatics Resources & Databases: Abeir Shalaby
• Convert the GenBank sequence format to FASTA format and clean your sequence using a nucleic acid massager:
http://www.mathaddict.net/dnatranslate3.htm
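The conversion performed by such web tools can be sketched in a few lines: read the rows between ORIGIN and //, drop the position numbers and spaces, and write the letters under a > header. The function `origin_to_fasta`, the header text and the tiny example record are hypothetical and only illustrate the idea.

```python
def origin_to_fasta(flatfile_text, header, width=70):
    """Extract the sequence below ORIGIN and return it in FASTA format."""
    in_sequence = False
    letters = []
    for line in flatfile_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True  # sequence rows start on the next line
            continue
        if line.startswith("//"):
            break  # end of the record
        if in_sequence:
            # Keep letters only: this removes position numbers and spaces
            letters.extend(ch for ch in line if ch.isalpha())
    sequence = "".join(letters).upper()
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(rows)


EXAMPLE_RECORD = "\n".join([
    "ORIGIN",
    "        1 ttcggctgca gaagataaat aaataa",
    "//",
])

fasta = origin_to_fasta(EXAMPLE_RECORD, "example_sequence")
print(fasta)
```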

https://www.bioinformatics.org/sms2/genbank_feat.html

3- Extract the features with the Graphics Sequence Viewer

We have 3 genes in this genomic region
1- the green lines = genes
2- the black line = genomic region
3- the violet line = mRNA
4- the red line = protein
5- the bold black boxes = exons
6- the direction of the arrows determines the direction of transcription
7- the black arrows at the ends = partial at its 5' end > or partial at its 3' end <

From the Tools menu, translate your sequence as text

GenBank Nucleotide Flat File Format

• Header
• LOCUS - A short mnemonic name for the entry. The line contains the locus name (usually the accession number), length of the molecule, type of molecule (DNA or RNA), a three-letter GenBank division code that loosely reflects taxonomy, and the date the record was last modified.
• DEFINITION - description of the sequence
• ACCESSION - accession number is a unique, unchanging code assigned to each entry
• VERSION - primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI
• KEYWORDS - gene description
• SOURCE - common name of the organism or the name most frequently used in the literature
• ORGANISM - formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines)
• REFERENCE - articles containing data reported in this entry
• AUTHORS - authors of the citation
• TITLE - full title of citation
• JOURNAL - journal name, volume, year, and page numbers of the citation
• MEDLINE - Medline unique identifier for a citation
• PUBMED - PubMed unique identifier for a citation
• REMARK - relevance of a citation to an entry
• COMMENT - in RefSeq records, includes the term REFSEQ and identifies the record status. It can also hold cross-references to other sequence entries, comparisons to other collections, notes of changes in LOCUS names, and other remarks.

• Features
• SOURCE - contains information about organism, mapping, chromosome, tissue alignment, clone identification
• CDS - instructions on how to join sequences together to make an amino acid sequence from the given coordinates. Includes cross-references to other databases
• GENE Feature - a segment of DNA identified by a name
• RNA Feature - used to annotate RNA on genomic sequence (for example: mRNA, tRNA, rRNA)

• Sequence