Data Mining & Sequence Retrieval Practical

The document outlines a practical session on data mining and sequence retrieval, focusing on databases for nucleotide and protein sequences such as NCBI, Uniprot, and PDB. It details experiments aimed at retrieving sequences in FASTA format, analyzing protein functions, and obtaining structural information. Additionally, it includes an assignment for students to explore a specific protein using the Uniprot database and gather various related data.

Uploaded by

h220265n

Data Mining & Sequence Retrieval
Practical Session 1

Mr. C. Mawere
MTech Bioinformatics (JNTUH, India)
BTech Biotechnology (CUT, Zimbabwe)
Databases
Primary Nucleotide Repository
• NCBI (http://www.ncbi.nlm.nih.gov)
• EMBL (http://www.ebi.ac.uk/embl)
• DDBJ (http://www.ddbj.nig.ac.jp/)

Primary Protein Repository
• PIR (http://pir.georgetown.edu)
• Swissprot/Uniprot (http://www.ebi.ac.uk/swissprot)
• Protein Data Bank (http://www.rcsb.org/pdb)

Secondary ‘pattern’ databases
Secondary database   Primary source           Stored information
PROSITE              SWISS-PROT               Regular expressions (patterns)
PRINTS               SWISS-PROT/TrEMBL        Aligned motifs (fingerprints)
Pfam                 SWISS-PROT/TrEMBL        Hidden Markov Models (HMMs)
Profiles             SWISS-PROT               Weight matrices (profiles)
BLOCKS               PRINTS/InterPro/Domo     Weighted motifs (blocks)
IDENTIFY             PRINTS/InterPro          Permissive regular expressions

An Integrated Database
An integrated database is a database that acts as the data store for multiple applications and thus integrates data across those applications, e.g. UNIPROT, NCBI.
Database Records
A typical database record contains:
• The header, which includes the sequence description, source organism, literature references, locus field, accession number and taxonomic classification.

• The feature table, which describes all features in the record: coding sequences, exons, repeats, promoters, etc. for nucleotide sequences, and domains, structural elements, binding sites, etc. for protein sequences. If the table includes a coding DNA sequence (CDS), it also links to the translated protein sequence.

• The sequence itself, which is often the part most easily analyzed by computer.
Example of a database record
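The three parts of a record listed above (header, feature table, sequence) can be pulled apart programmatically. The sketch below assumes a GenBank-style flat file, in which a `FEATURES` line opens the feature table, an `ORIGIN` line opens the sequence and `//` closes the record. The function name and the toy record are hypothetical and written only for illustration; real records carry many more header fields.

```python
def split_flat_file_record(text):
    """Split a GenBank-style record into (header lines, feature lines, sequence)."""
    header, features, seq_parts = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):    # start of the feature table
            section = "features"
        elif line.startswith("ORIGIN"):    # start of the sequence block
            section = "sequence"
        elif section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            # Sequence lines carry a position number and spaces; keep letters only.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    return header, features, "".join(seq_parts).upper()


# Hypothetical toy record, written by hand for illustration only.
TOY_RECORD = "\n".join([
    "LOCUS       TOY0001      12 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Hypothetical toy record.",
    "ACCESSION   TOY0001",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..12",
    "ORIGIN",
    "        1 atggcctaat ga",
    "//",
])

header, features, sequence = split_flat_file_record(TOY_RECORD)
print(len(header), "header lines;", len(features), "feature lines;", sequence)
```

For real work, an established parser is a safer choice than a hand-written splitter like this one.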
Experiment- 1
AIM: To retrieve a protein or DNA sequence in FASTA format from the NCBI database and analyze the obtained data.

SOFTWARE USED: Internet access, NCBI database.

PROCEDURE:
STEP 1: Open a web browser and type the web address of the required NCBI database.

STEP 2: Explore the database and analyze the information available for the PROTEIN or DNA sequence in the database.

STEP 3: Save the output into a separate folder in FASTA format.
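The same three steps can also be scripted instead of clicked through. The sketch below uses the NCBI E-utilities `efetch` endpoint to request a record in FASTA format and write it to a folder. The helper names, the output path and the accession in the commented example (NC_045512.2, the SARS-CoV-2 reference genome) are assumptions chosen for illustration; the practical itself does not specify an accession.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build an efetch URL asking for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA text for an accession (needs network access)."""
    with urlopen(build_efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


def save_fasta(text, path):
    """Write FASTA text to a file, creating the target folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


# Example call (assumed accession and folder; uncomment to run online):
# save_fasta(fetch_fasta("NC_045512.2"), "sequences/NC_045512.2.fasta")
```

Passing `db="protein"` requests a protein record instead of a nucleotide one.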

Data Mining & Sequence Retrieval Output
NCBI Repository
Database Search
Database Search (Cont)
Searching for COVID-19 Delta variant
Searching Results
Database Record for SARS-COV-2
Note: A CDS (coding sequence) is a sequence of nucleotides that corresponds to the sequence of amino acids in a protein. A typical CDS starts with ATG and ends with a stop codon. A CDS can be a subset of an open reading frame (ORF).
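The rule in the note can be turned into a quick check. The sketch below assumes the standard genetic code (stop codons TAA, TAG, TGA) and a DNA string read in frame from its first base. The function name is hypothetical, and the test for internal stop codons is an extra condition added here, not something stated in the note.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def looks_like_cds(seq):
    """Return True if seq starts with ATG, ends with a stop codon,
    has a length divisible by three and no in-frame internal stop."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    starts_ok = codons[0] == "ATG"
    ends_ok = codons[-1] in STOP_CODONS
    no_internal_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return starts_ok and ends_ok and no_internal_stop


print(looks_like_cds("ATGGCCTAA"))
print(looks_like_cds("GCCGCCTAA"))
```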
FASTA Format Record
Saving the DNA sequence in FASTA
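A FASTA record is a header line beginning with `>` followed by one or more lines of sequence. The minimal sketch below reads and writes that layout; the function names, the 60-character line width and the toy record are assumptions for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    parts = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)
    return {h: "".join(chunks) for h, chunks in parts.items()}


def format_fasta(header, seq, width=60):
    """Format one record as FASTA text, wrapping the sequence at `width`."""
    lines = [">" + header]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Hypothetical toy record for illustration only.
toy = format_fasta("toy_seq example record", "ATGGCCTAA" * 10, width=30)
print(toy)
print(parse_fasta(toy))
```

Wrapping the sequence at a fixed width keeps the file readable; parsers join the wrapped lines back into one string.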
Experiment- 2
AIM: For a given protein sequence, find the function, structural relevance and annotation using Uniprot / UniprotKB.

SOFTWARE USED: Internet access.

PROCEDURE:
STEP 1: Open the Uniprot database at www.uniprot.org.

STEP 2: Enter the protein ID in the search tab and click on Find.

STEP 3: Click on the protein name displayed on the result page.

STEP 4: Observe the protein function and structural information for the protein sequence.
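The lookup in Steps 1 to 4 can also be done through the UniProt REST service, which serves an entry as JSON or FASTA at `rest.uniprot.org/uniprotkb/<accession>.<format>`. The sketch below is a hedged illustration: the helper names are hypothetical, the accession in the commented example (Q9BYF1, human ACE2) is an assumption not given in the procedure, and the JSON field names (`comments`, `commentType`, `texts`, `value`) reflect the UniProtKB JSON layout as understood here and should be checked against a live entry.

```python
import json
from urllib.request import urlopen


def uniprot_url(accession, fmt="json"):
    """URL of a UniProtKB entry in the given format ('json' or 'fasta')."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"


def fetch_entry(accession):
    """Download one UniProtKB entry as a dict (needs network access)."""
    with urlopen(uniprot_url(accession)) as response:
        return json.load(response)


def function_texts(entry):
    """Collect the free-text FUNCTION annotations from an entry dict."""
    texts = []
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "FUNCTION":
            texts.extend(t.get("value", "") for t in comment.get("texts", []))
    return texts


# Hypothetical, heavily trimmed entry used only to show the expected shape.
sample_entry = {
    "comments": [
        {"commentType": "FUNCTION", "texts": [{"value": "Example function text."}]},
        {"commentType": "SUBUNIT", "texts": [{"value": "Example interaction text."}]},
    ]
}
print(function_texts(sample_entry))

# Example call (assumed accession; uncomment to run online):
# print(function_texts(fetch_entry("Q9BYF1")))
```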
Uniprot Database
Uniprot Search Output
Uniprot Entry/Record

Function of ACE-2 shown in figure above

Taxonomy of ACE2
Expression of ACE2
Interaction of ACE2
Interaction of ACE2 (Cont)
Structural Relevance of ACE2
Structure of ACE2 (Cont)
Sequence Status of ACE2
Annotation info of ACE2
Annotation of ACE2 (Cont)

NB: Topology refers to the way in which constituent parts are interrelated or arranged.
Experiment- 3
AIM: For a given protein, find the PDB code, release date, resolution, classification and PubMed citation from the PDB structure database.

SOFTWARE USED: Internet access, PDB database.

PROCEDURE:
STEP 1: Open the PDB database at www.pdb.org.

STEP 2: Enter the protein ID, molecule name or author name in the search tab and click on Find.

STEP 3: Click on the relevant PDB code displayed on the result page.
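The items named in the aim (release date, resolution, classification, PubMed citation) are also exposed by the RCSB PDB Data API, which returns one JSON document per entry. The sketch below is an illustration under assumptions: the helper names are hypothetical, the sample dict is made up to show the shape, and the field names used are those of the RCSB core entry schema as understood here; verify them against a real response before relying on them.

```python
import json
from urllib.request import urlopen


def entry_url(pdb_id):
    """URL of the RCSB Data API document for one PDB entry."""
    return "https://data.rcsb.org/rest/v1/core/entry/" + pdb_id.upper()


def fetch_pdb_entry(pdb_id):
    """Download the entry document as a dict (needs network access)."""
    with urlopen(entry_url(pdb_id)) as response:
        return json.load(response)


def summarize_entry(entry):
    """Pick out the fields asked for in the aim; missing ones become None."""
    accession = entry.get("rcsb_accession_info", {})
    info = entry.get("rcsb_entry_info", {})
    citation = entry.get("rcsb_primary_citation", {})
    resolutions = info.get("resolution_combined") or [None]
    return {
        "pdb_id": entry.get("rcsb_id"),
        "release_date": accession.get("initial_release_date"),
        "resolution_angstrom": resolutions[0],
        "classification": entry.get("struct_keywords", {}).get("pdbx_keywords"),
        "pubmed_id": citation.get("pdbx_database_id_pub_med"),
        "citation_title": citation.get("title"),
    }


# Made-up entry with placeholder values, for illustration only.
sample = {
    "rcsb_id": "1ABC",
    "rcsb_accession_info": {"initial_release_date": "2000-01-01T00:00:00Z"},
    "rcsb_entry_info": {"resolution_combined": [2.0]},
    "struct_keywords": {"pdbx_keywords": "EXAMPLE CLASS"},
    "rcsb_primary_citation": {"pdbx_database_id_pub_med": 1, "title": "Example title"},
}
print(summarize_entry(sample))
```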
PDB Database Search
PDB Database Search (Cont)
PDB Output of SARS-COV-2 ID
SARS-COV-2 complex & ACE2
Literature Citation
Release Date of ACE2-SARS-COV-2
Assignment on Data Mining
Given the protein ID: P0DTC2 and using the Uniprot database:
1. Find the function of the protein.
2. Retrieve the protein sequence in FASTA format.
3. Describe its interactions.
4. Identify any of its protein structure IDs, with resolution, classification, release date and literature citations.
5. Identify any drugs that can be used to inhibit this protein.
6. Describe the pharmacokinetics of any of the listed drugs.
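For item 4, a UniProt entry lists its linked structures as cross-references, so the structure IDs can be collected from the entry's JSON before looking each one up in the PDB. The sketch below is a hedged starting point, not a worked answer: the function name is hypothetical, the sample values are placeholders, and the field names (`uniProtKBCrossReferences`, `database`, `id`, `properties`) reflect the UniProtKB JSON layout as understood here.

```python
def pdb_cross_references(entry):
    """Return (PDB id, method, resolution) tuples from a UniProtKB entry dict."""
    rows = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "PDB":
            continue  # skip cross-references to other databases
        props = {p.get("key"): p.get("value") for p in xref.get("properties", [])}
        rows.append((xref.get("id"), props.get("Method"), props.get("Resolution")))
    return rows


# Made-up, trimmed entry with placeholder values, for illustration only.
sample_entry = {
    "uniProtKBCrossReferences": [
        {
            "database": "PDB",
            "id": "1ABC",
            "properties": [
                {"key": "Method", "value": "X-ray"},
                {"key": "Resolution", "value": "2.00 A"},
            ],
        },
        {"database": "Pfam", "id": "PF00000", "properties": []},
    ]
}
print(pdb_cross_references(sample_entry))
```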
