
Data Mining & Sequence Retrieval Practical

The document outlines a practical session on data mining and sequence retrieval, focusing on databases for nucleotide and protein sequences such as NCBI, Uniprot, and PDB. It details experiments aimed at retrieving sequences in FASTA format, analyzing protein functions, and obtaining structural information. Additionally, it includes an assignment for students to explore a specific protein using the Uniprot database and gather various related data.

Uploaded by

h220265n

Data Mining & Sequence Retrieval
Practical Session 1

Mr. C. Mawere
MTech Bioinformatics (JNTUH, India)
BTech Biotechnology (CUT, Zimbabwe)
Databases
Primary Nucleotide Repository
• NCBI (http://www.ncbi.nlm.nih.gov)
• EMBL (http://www.ebi.ac.uk/embl)
• DDBJ (http://www.ddbj.nig.ac.jp/)

Primary Protein Repository
• PIR (http://pir.georgetown.edu)
• Swissprot/Uniprot (http://www.ebi.ac.uk/swissprot)
• Protein Data Bank (http://www.rcsb.org/pdb)

Secondary ‘pattern’ databases
Secondary database   Primary source          Stored information
PROSITE              SWISS-PROT              Regular expressions (patterns)
PRINTS               SWISS-PROT/TrEMBL       Aligned motifs (fingerprints)
Pfam                 SWISS-PROT/TrEMBL       Hidden Markov Models (HMMs)
Profiles             SWISS-PROT              Weight matrices (profiles)
BLOCKS               PRINTS/InterPro/Domo    Weighted motifs (blocks)
IDENTIFY             PRINTS/InterPro         Permissive regular expressions
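
To make the "regular expressions (patterns)" entry concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_sites` are hypothetical, the test sequence is made up, and only the common syntax elements are handled (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` terminus anchors outside brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        count = ""
        repeat = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if repeat:                       # x(2) or x(2,4) become {2} or {2,4}
            element, count = repeat.group(1), "{" + repeat.group(2) + "}"
        if element == "x":               # 'x' means any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # {P} means any residue except P
        else:
            core = element               # a single residue or a [ST]-style set
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> list:
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(sequence)]
```

For example, the N-glycosylation pattern `N-{P}-[ST]-{P}` becomes `N[^P][ST][^P]`, and `find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTQ")` reports a single site starting at position 3 of that made-up sequence.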

An Integrated Database
An integrated database acts as the data store for multiple
applications and thus integrates data across those applications,
e.g. UniProt, NCBI.
Database Records
A typical database record contains:
• The header, which includes the sequence description, source
organism, literature references, locus field, accession number and
taxonomic classification.

• The feature table, which describes all features in the record,
such as coding sequences, exons, repeats and promoters for
nucleotide sequences, and domains, structure elements and binding
sites for protein sequences. If the table includes a coding
DNA sequence (CDS), a link to the translated protein sequence is
also given there.

• The sequence itself, which is often the part most easily analyzed
by computer.
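
The three parts of a record can be pictured as a small data structure. The sketch below is only an illustration: the class names `Record` and `Feature`, their fields and the toy values in the usage note are all hypothetical and do not reflect how NCBI or Uniprot actually store records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """One row of the feature table (hypothetical layout)."""
    kind: str          # e.g. "CDS", "exon", "domain", "binding site"
    start: int         # 1-based position, inclusive
    end: int           # 1-based position, inclusive
    note: str = ""


@dataclass
class Record:
    """Header fields, feature table and sequence of one database record."""
    accession: str
    description: str
    organism: str
    features: List[Feature] = field(default_factory=list)
    sequence: str = ""

    def feature_sequence(self, feature: Feature) -> str:
        """Cut the stretch of sequence that a feature annotates."""
        return self.sequence[feature.start - 1:feature.end]
```

With a made-up record whose sequence is `GGATGAAATAGCC` and a CDS feature spanning positions 3 to 11, `feature_sequence` returns `ATGAAATAG`.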
Example of a database record
Experiment- 1
AIM: To retrieve a protein or DNA sequence in FASTA format
from the NCBI database and analyze the obtained data.

SOFTWARE USED: Internet access, NCBI database.

PROCEDURE:
STEP 1: Open a web browser and type the web address of the
required NCBI database.

STEP 2: Explore the database and analyze the information
available for the protein or DNA sequence.

STEP 3: Save the output in FASTA format in a separate folder.
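
The same retrieval can also be scripted rather than done in the browser. The sketch below is one possible approach, not part of the practical itself: it builds a request to the NCBI E-utilities `efetch` service, saves the returned FASTA text, and parses FASTA text into a dictionary. The helper names (`efetch_fasta_url`, `save_fasta`, `parse_fasta`) are hypothetical, and the accession in the usage note is only an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build the efetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def save_fasta(accession: str, path: str, db: str = "nuccore") -> None:
    """Download a record and write it to a local FASTA file (needs internet)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w") as handle:
        handle.write(text)


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}
```

For instance, `save_fasta("NC_045512.2", "sars_cov_2.fasta")` would request the SARS-CoV-2 reference genome from the nucleotide database; use `db="protein"` for protein accessions. Reading the saved file back through `parse_fasta` gives the header and sequence length for the analysis step.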


Data Mining & Sequence Retrieval Output
NCBI Repository
Database Search
Database Search (Cont)
Searching for COVID-19 Delta variant
Search Results
Database Record for SARS-CoV-2
Note: A CDS is a sequence of nucleotides that corresponds to the
sequence of amino acids in a protein. A typical CDS starts with ATG
and ends with a stop codon. A CDS can be a subset of an open reading
frame (ORF).
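
The rule in the note can be written as a quick check. This is a minimal sketch assuming the standard genetic code stop codons (TAA, TAG, TGA); the function name `looks_like_cds` is hypothetical, and the extra requirements that the length be a multiple of three and that no stop codon occur before the last codon are assumptions added here, not statements from the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def looks_like_cds(sequence: str) -> bool:
    """Return True if a DNA string has the shape of a typical CDS."""
    sequence = sequence.upper()
    # Need at least a start and a stop codon, in whole codons.
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(codon in STOP_CODONS for codon in codons[1:-1])
    return starts_with_atg and ends_with_stop and not has_internal_stop
```

Under these assumptions `ATGAAATAG` passes the check, while `ATGTAAAAATAG` fails because a stop codon appears before the end.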
FASTA Format Record
Saving the DNA sequence in FASTA
Experiment- 2
AIM: For a given protein sequence, find its function, structural relevance
and annotation using Uniprot / UniprotKB.

SOFTWARE USED: Internet access.

PROCEDURE:
STEP 1: Open the Uniprot database at www.uniprot.org.

STEP 2: Enter the protein ID in the search tab and click on Find.

STEP 3: Click on the protein name displayed on the results page.

STEP 4: Observe the function and structural information of the
protein sequence.
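
The lookup in Steps 1 to 4 can also be done programmatically. The sketch below is a hedged illustration: it assumes the Uniprot REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.<format>` and assumes that the JSON entry lists its annotations under `comments`, with function text in entries whose `commentType` is `FUNCTION`. The helper names are hypothetical; verify the field names against a real entry before relying on them.

```python
import json
from urllib.request import urlopen

UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def uniprot_url(accession: str, fmt: str = "fasta") -> str:
    """Build the URL of one UniprotKB entry in the given format."""
    return f"{UNIPROT_REST}/{accession}.{fmt}"


def fetch_uniprot(accession: str, fmt: str = "fasta") -> str:
    """Download one entry as text (needs internet access)."""
    with urlopen(uniprot_url(accession, fmt)) as response:
        return response.read().decode("utf-8")


def function_texts(entry: dict) -> list:
    """Collect function annotations from a parsed JSON entry.

    Assumes the layout comments -> commentType == "FUNCTION" -> texts -> value.
    """
    texts = []
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "FUNCTION":
            texts.extend(item.get("value", "") for item in comment.get("texts", []))
    return texts
```

For human ACE2 (accession Q9BYF1), `fetch_uniprot("Q9BYF1")` would return the FASTA record, and `function_texts(json.loads(fetch_uniprot("Q9BYF1", "json")))` would list the function annotation if the assumed layout holds.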
Uniprot Database
Uniprot Search Output
Uniprot Entry/Record

Function of ACE2, as shown in the figure above


Taxonomy of ACE2
Expression of ACE2
Interaction of ACE2
Interaction of ACE2 (Cont)
Structural Relevance of ACE2
Structure of ACE2 (Cont)
Sequence Status of ACE2
Annotation info of ACE2
Annotation of ACE2 (Cont)

NB: Topology refers to the way in which constituent parts are interrelated or arranged.
Experiment- 3
AIM: For a given protein, find the PDB code, release date,
resolution, classification and PubMed citation from the PDB structure
database.

SOFTWARE USED: Internet access, PDB database.

PROCEDURE:
STEP 1: Open the PDB database at www.pdb.org.

STEP 2: Enter the protein ID, molecule name or author name in the
search tab and click on Find.

STEP 3: Click on the relevant PDB code displayed on the results page.
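
Once a PDB code is known, the items named in the aim can be pulled from the entry's JSON description. The following is a sketch under assumptions: it uses the RCSB Data API entry endpoint, and the JSON field names (`rcsb_accession_info`, `rcsb_entry_info`, `struct_keywords`, `rcsb_primary_citation`) are assumptions about the response layout that should be checked against a real entry. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

RCSB_ENTRY = "https://data.rcsb.org/rest/v1/core/entry"


def entry_url(pdb_id: str) -> str:
    """Build the Data API URL for one PDB entry."""
    return f"{RCSB_ENTRY}/{pdb_id.upper()}"


def fetch_entry(pdb_id: str) -> dict:
    """Download and parse one entry description (needs internet access)."""
    with urlopen(entry_url(pdb_id)) as response:
        return json.load(response)


def summarize_entry(entry: dict) -> dict:
    """Pick out the fields asked for in the aim; missing fields become None."""
    citation = entry.get("rcsb_primary_citation", {})
    return {
        "pdb_id": entry.get("rcsb_id"),
        "release_date": entry.get("rcsb_accession_info", {}).get("initial_release_date"),
        "resolution": entry.get("rcsb_entry_info", {}).get("resolution_combined"),
        "classification": entry.get("struct_keywords", {}).get("pdbx_keywords"),
        "pubmed_id": citation.get("pdbx_database_id_PubMed"),
        "citation_title": citation.get("title"),
    }
```

A call such as `summarize_entry(fetch_entry("6M0J"))` (6M0J being one spike RBD/ACE2 complex) would then return the release date, resolution, classification and citation details in one dictionary, provided the assumed field names match.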
PDB Database Search
PDB Database Search (Cont)
PDB Output of SARS-CoV-2 ID
SARS-CoV-2 complex & ACE2
Literature Citation
Release Date of ACE2-SARS-CoV-2
Assignment on Data Mining
Given the protein ID P0DTC2, and using the Uniprot database:
1. Find the function of the protein.

2. Retrieve the protein sequence in FASTA format.

3. Describe its interactions.

4. Identify any one of its protein structure IDs, with its resolution,
classification, release date and literature citations.

5. Identify any drugs that can be used to inhibit this protein.

6. Describe the pharmacokinetics of any of the listed drugs.
