Sequence and Structure Retrieval

Uploaded by

juhiyaadav

Aim: To retrieve DNA, RNA, and protein sequences and structures
from biological databases and to create various datasets.

Bioinformaticians store biological data (DNA, RNA, and proteins) in a digitized
format, namely in databases. They have developed these databases for
the global submission, maintenance, access, and sharing of data on
biomolecules. The databases keep the data in a structured manner.

The databases contain several types of biological data, such as DNA, RNA, and
protein sequences, structural information, gene expression data, molecular
interaction data, mutation data, phenotypic data, information about metabolic
pathways, and taxonomic information about biological organisms.

Databases can be classified into primary (archival), secondary (curated), and
composite databases.

• Primary Databases: These databases are constructed from data
collected in laboratory experiments. After the experiments, the data are
validated and analyzed before being uploaded to the databases. They are
classified by the type of biological molecule:
o Nucleic acid databases (GenBank, EMBL, DDBJ, NDB)
o Protein databases (PIR, Swiss-Prot, TrEMBL, PDB)
o Metabolic pathway databases (KEGG, EcoCyc, MetaCyc)
o Small molecule databases (PubChem, DrugBank, ZINC, CSD)

• Secondary Databases: These databases are built on primary biological
databases and add further information. They comprise data derived by
analyzing the primary data available in the primary databases, and are
often referred to as curated databases. Secondary databases often draw
upon information from numerous sources, including other databases
(primary and secondary), controlled vocabularies, and the scientific
literature.

• Composite Databases: These databases store data from different primary
databases, thus obviating the need to search multiple primary databases
for nucleotide sequences, protein sequences, protein structures, etc.
Examples of composite databases are:
1. nrdb (non-redundant database) combines and stores sequences from
GenBank (CDS translations), PDB, Swiss-Prot, PIR, and PRF.
2. INSD (International Nucleotide Sequence Database) stores nucleotide
sequences from EMBL, GenBank, and DDBJ.
3. UniProt (universal protein sequence database) is a collection of protein
sequences from PIR-PSD, Swiss-Prot, and TrEMBL.

NCBI
The National Center for Biotechnology Information (NCBI) is part of the United
States National Library of Medicine (NLM), a branch of the National Institutes
of Health (NIH). NCBI houses a series of databases relevant to the basic and
applied life sciences and is an important resource for bioinformatics tools and
services. Major databases include GenBank for DNA sequences and PubMed,
a bibliographic database for biomedical literature. The GenBank sequence
database is an open-access collection of publicly available nucleotide
sequences and their protein translations. GenBank can be searched in several
ways, for example by accession number or by using gene or protein names as
keywords.
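
The two search styles mentioned above (by keyword and by accession number) can also be expressed as queries to NCBI's E-utilities ESearch service. The sketch below only builds the query URLs and makes no network request. The function name `build_esearch_url` and the default `retmax` value are illustrative choices that do not appear in this handout.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL that queries one Entrez database.

    db     -- Entrez database name, e.g. "gene" or "nuccore"
    term   -- search expression, optionally with field tags in brackets
    retmax -- maximum number of record IDs to return (illustrative default)
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Keyword search: gene name restricted to human records.
keyword_url = build_esearch_url(
    "gene", "AQP3[Gene Name] AND Homo sapiens[Organism]"
)

# Accession search in the nucleotide database.
accession_url = build_esearch_url("nuccore", "NG_007476.1[Accession]")

print(keyword_url)
print(accession_url)
```

Opening either URL in a browser should return a list of matching record identifiers, which is roughly what the web search box does behind the scenes.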

EMBL-EBI
The EMBL (European Molecular Biology Laboratory) Nucleotide Sequence
Database is a comprehensive database of DNA and RNA sequences submitted
directly by researchers and genome sequencing groups or collected from
the scientific literature and patent applications. The database is produced in
collaboration with DDBJ and GenBank; it is maintained and distributed at the
European Bioinformatics Institute (EBI) and constitutes Europe’s primary
nucleotide sequence resource.

PDB
PDB (Protein Data Bank) is a repository for 3D structural data of proteins and
nucleic acids obtained by X-ray crystallography or NMR spectroscopy. The
Research Collaboratory for Structural Bioinformatics (RCSB) PDB provides a
variety of tools and resources for studying the structures of biological
macromolecules and their relationships to sequence, function, and disease.

Retrieval of DNA, RNA, protein sequences from NCBI


1. Open NCBI (www.ncbi.nlm.nih.gov).
2. In the drop-down menu, select “All Databases”.
3. Type the name of the gene whose sequence or structure is required in the
search box, and press the search button. [AQP3]
4. Click on the Gene result “AQP3 – aquaporin 3 (Gill blood group) – Homo
sapiens (human)”.
5. Expandable sections for the gene of interest are displayed.
6. Click on the section labelled “NCBI Reference Sequences (RefSeq)”.

For Retrieval of Nucleotide Sequence (FASTA & GenBank formats)

o Under the Genomic subhead, next to NG_007476.1 RefSeqGene, click on the
hyperlink FASTA or GenBank to get the sequence in the required format.

o For Nucleotide FASTA Sequence: To download the sequence, click on
the hyperlink “Send to:”, select “Complete Record”, choose the
destination “File”, and download in “FASTA” format.

o For Nucleotide GenBank Sequence: To download the sequence, click
on the hyperlink “Send to:”, select “Complete Record”, choose the
destination “File”, and download in “GenBank” format.
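
The same two downloads can be scripted through NCBI's E-utilities EFetch service, where the `rettype` parameter selects the output format. This is a minimal sketch, not part of the original procedure. The helper names `build_efetch_url` and `download_record` are made up for illustration, and only the URL builder runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# EFetch "rettype" values for the two formats used in this practical
# (nucleotide database only).
RETTYPE = {"fasta": "fasta", "genbank": "gb"}


def build_efetch_url(accession, fmt, db="nuccore"):
    """Return an EFetch URL for one record in FASTA or GenBank format."""
    if fmt not in RETTYPE:
        raise ValueError("fmt must be 'fasta' or 'genbank'")
    params = {
        "db": db,
        "id": accession,
        "rettype": RETTYPE[fmt],
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def download_record(accession, fmt, path):
    """Fetch one record over the network and write it to `path`."""
    with urlopen(build_efetch_url(accession, fmt)) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


# URLs equivalent to the two "Send to: File" downloads for the RefSeqGene record.
print(build_efetch_url("NG_007476.1", "fasta"))
print(build_efetch_url("NG_007476.1", "genbank"))
```

Calling `download_record("NG_007476.1", "fasta", "AQP3_gene.fasta")` would save the file locally; the file name here is only an example.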

For Retrieval of RNA and Protein Sequence (FASTA format)

Under the mRNA and Protein subhead,

o Click on the hyperlink NM_001318144.2 to get the mRNA sequence.
o Click on the hyperlink NP_001305073.1 to get the protein sequence.
o Download the FASTA sequence for the mRNA in the same way as
described for downloading the FASTA file for the nucleotide sequence.
o Download the FASTA sequence for the protein in the same way as
described for downloading the FASTA file for the nucleotide sequence.
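
The accession numbers used above follow the RefSeq prefix convention, in which the first three characters indicate the molecule type. A small lookup, sketched below, makes that convention explicit. The table covers only some common prefixes, and the function name `molecule_type` is an illustrative choice.

```python
# Common RefSeq accession prefixes and the molecule type they denote.
# This is a partial list; other prefixes exist.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete chromosome or genome)",
    "NG_": "genomic (gene region)",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XR_": "predicted non-coding RNA",
    "XP_": "predicted protein",
}


def molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# The three AQP3 records retrieved in the steps above.
for acc in ("NG_007476.1", "NM_001318144.2", "NP_001305073.1"):
    print(acc, "->", molecule_type(acc))
```

Accessions that do not follow the RefSeq pattern, such as plain GenBank accessions, fall through to "unknown" in this sketch.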

Retrieval of Protein Structure from NCBI

7. Select “Structure” in the drop-down menu, type the name of the protein
whose structure is required in the search bar, and click the search
button. Clicking on the desired search result shows the structure summary
of the protein.
8. Download the structure in PDB format.

EMBL-EBI

Retrieval of DNA, RNA, protein sequences and structure from EMBL-EBI

1. Open the database EMBL-EBI (https://www.ebi.ac.uk).
2. Type the name of the gene of interest in the search box with ALL
selected in the drop-down menu, and click the search button.
3. The search returns results grouped under Genomes and metagenomes, Nucleotide
sequences, Protein sequences, etc. The full list of categories is shown on the left of the page.
4. Scroll down to find the desired data (DNA, RNA, or protein).
5. To retrieve the FASTA sequence, click on “in FASTA format”.
6. Copy the FASTA sequence, paste it into Notepad, and save the file.
7. To retrieve a protein structure, go to https://www.ebi.ac.uk and click on the “About EBI
search” hyperlink that appears under the search box.
8. Scroll down to Collaborations and click on PDBe.
9. The protein structure can be searched for by name or PDB ID.
Retrieval of protein structure from PDB


1. Open the database PDB (https://www.rcsb.org).
2. Type in the search box the protein name or PDB ID and click search.
3. The search results appear. The structure can be downloaded in various
formats, including PDB format.
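
When the PDB ID is already known, the coordinate file can be fetched directly from the RCSB file server instead of through the search page. The sketch below builds the download URL for a given ID and format. The function name `rcsb_download_url` is illustrative, and the ID 6KXW is the one used later in the exercises.

```python
# RCSB file-download prefix; entries are served as <ID>.pdb or <ID>.cif.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/"

# File formats handled by this sketch (legacy PDB and mmCIF).
SUPPORTED_FORMATS = ("pdb", "cif")


def rcsb_download_url(pdb_id, fmt="pdb"):
    """Return the download URL for a PDB entry in the requested format."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB ID")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("fmt must be 'pdb' or 'cif'")
    return RCSB_DOWNLOAD_URL + pdb_id + "." + fmt


print(rcsb_download_url("6kxw"))          # PDB format
print(rcsb_download_url("6KXW", "cif"))   # mmCIF format
```

Very large structures may be available only in mmCIF format, in which case the `.pdb` URL may not resolve.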
END EXERCISES
Exercise 1: Retrieve DNA, RNA, Protein Sequence and Protein Structure for provided
protein using NCBI

a. DNA FASTA for the gene AQP3
b. RNA FASTA for the gene AQP3
c. Protein FASTA for the gene AQP3
d. Protein Structure and PDB ID for AQP7

Exercise 2: Retrieve DNA and Protein Sequences for the HIV-1 env gene using NCBI in
GenBank format

a. DNA GenBank for env gene of HIV-1
b. Protein GenBank for env gene of HIV-1
c.
Exercise 3 : Retrieve sequence in FASTA format for provided accession numbers and
determine the sequence type (DNA/RNA/mRNA/Protein) and name of the gene/protein.

a. NG_059281.1
b. NM_000518.5
c. NP_000509.1

Exercise 4: Write two differences between FASTA and GenBank format.

Exercise 5: What use can researchers make of the obtained sequences? Explain one
such application in detail.

Exercise 6: Retrieve DNA, RNA and protein sequence from EMBL-EBI

a. DNA FASTA for the gene actin
b. RNA FASTA for the gene actin
c. Protein FASTA for the gene actin
d. Protein Structure and PDB ID for actin

Exercise 7: Retrieve protein structure using the PDB ID 6KXW from PDB.

Exercise 8: Create datasets containing a minimum of 5 sequences for each of the following:

a. DNA sequence dataset in FASTA format
b. RNA sequence dataset in FASTA format
c. Protein sequence dataset in FASTA format