Lecture_3

The document discusses various data retrieval systems used in bioinformatics, including Entrez/GQuery, DBGET/LinkDB, and SRS, which allow users to search multiple databases simultaneously. It also covers the formats for biological sequences such as FASTA and GenBank, along with tools for retrieving protein sequences and structures, specifically highlighting the UniProt and Protein Data Bank resources. The presentation emphasizes the importance of these systems and formats in accessing and analyzing biological data efficiently.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/390534932

DATA RETRIEVAL (Bioinformatics) ARSHIA NAZIR Lecturer in Biology

Presentation · April 2025


1 author:

Arshia Nazir
University of the Punjab

All content following this page was uploaded by Arshia Nazir on 06 April 2025.


DATA RETRIEVAL
(Bioinformatics)
ARSHIA NAZIR
Lecturer in Biology
Data Retrieval
 Data retrieval from different databases requires a search capability using a data retrieval system (tool).

 Some common data retrieval systems are Entrez/GQuery, DBGET/LinkDB, Sequence Retrieval System (SRS), and retrieval system from EMBL-EBI.

 Retrieval systems are capable of simultaneously searching multiple linked databases in response to a single search query.

1) Entrez/GQuery
 Entrez (GQuery, or global query; http://www.ncbi.nlm.nih.gov/sites/gquery) is a user-friendly, versatile, text-based search and retrieval system developed by the NCBI.

 It searches linked databases using a single word or a combination of words entered as the search term.

 Thus, Entrez provides a global query system and forms a web of connections among the databases.

 Depending on the database selected for search and retrieval, the primary source of some of the retrieved entries may be other related but specialized databases. Examples include the Nucleotide, RefSeq, EST, GSS, and Gene databases.

 Without a data retrieval system, it is not possible to search multiple databases simultaneously by entering the search term only once; each database has to be searched separately.

 The simultaneous search capability and all-in-one display of results from multiple databases make NCBI Entrez (GQuery) a user-friendly search and retrieval system for general users.

Accessed on 5th April, 2025
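The idea of entering one search term and querying several databases can also be sketched in code. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` endpoint is used in place of the web interface (the slides describe only the web interface). The helper name `build_esearch_urls` and the example term are illustrative. The code only builds one search URL per database and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities esearch endpoint, not mentioned in the slides.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_urls(term, databases, retmax=5):
    """Return {database: esearch URL} for a single search term (hypothetical helper)."""
    urls = {}
    for db in databases:
        params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
        urls[db] = EUTILS_ESEARCH + "?" + urlencode(params)
    return urls


if __name__ == "__main__":
    # One term, several Entrez databases (E-utilities database names).
    for db, url in build_esearch_urls("insulin", ["nuccore", "gene", "protein"]).items():
        print(db, url)
```

Each URL can be opened in a browser or fetched with any HTTP client; the same term is reused for every database, which mirrors the single-query behaviour described above.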
2) DBGET/LinkDB
 DBGET/LinkDB (https://www.genome.jp/en/gn_dbget.html) is an integrated text-based
search and retrieval system for major biological databases at GenomeNet.

 GenomeNet is the Japanese network of database and computational services for genome
research and related biomedical research.

 It is operated by the Kyoto University Bioinformatics Center.

 DBGET searches and extracts entries from a wide range of molecular biology databases, and LinkDB searches and computes links between entries in different databases.

 DBGET/LinkDB is currently in a new development phase for the integration of both GenomeNet databases and outside databases.
Accessed on 5th April, 2025
3) SRS and EMBL-EBI
 The Sequence Retrieval System (SRS) is a tool for accessing and querying biological databases, particularly those in flat-file or text formats.

 SRS was originally developed by Etzold and Argos at the European Molecular Biology Laboratory (EMBL) in the early 1990s.

 Examples of the flat-file or text-format databases it works with are the EMBL nucleotide sequence databank, the SwissProt protein sequence databank, and Prosite.

 SRS is a homogeneous interface, developed at the European Bioinformatics Institute (EBI), to over 80 biological databases.
Accessed on 5th April, 2025

Access to multiple databases

FASTA Format
 A large number of bioinformatics databases contain explicit descriptions of proteins
or nucleic acids in the form of a sequence. For instance, nucleic acids are described
by the sequence of the constituent bases (ATGC). Protein sequences are described as
the sequence of amino acid building blocks of the protein (MNHGF).

 While the alphabet of sequences has been standardized, the actual formatting of
the sequence in text files differs from database to database.

 Sequence formats differ mostly in the layout and formatting of lines of sequence
codes.
 FASTA is one of the simplest and most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs.

 It has a single definition line that begins with a greater-than sign (>), also called a right angle bracket, followed by a sequence name.

 Sometimes extra information, such as a gi number or comments, is given; it is separated from the sequence name by a “|” symbol.

 The extra information is optional and is ignored by sequence analysis programs. The plain sequence, in standard one-letter symbols, starts on the second line.
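The layout described above (a definition line starting with ">", optional "|"-separated extra fields, then the sequence) can be read with a few lines of code. This is a minimal sketch, not a full FASTA parser. The function name `parse_fasta` and the example record are made up for illustration. Following the slide's description, it treats the first field as the sequence name.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, extra_fields, sequence) tuples."""
    records = []
    name, extra, chunks = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if name is not None:
                records.append((name, extra, "".join(chunks)))
            fields = line[1:].split("|")
            name = fields[0].strip()
            extra = [field.strip() for field in fields[1:]]
            chunks = []
        elif name is not None:
            # Sequence lines are concatenated; line breaks carry no meaning.
            chunks.append(line)
    if name is not None:
        records.append((name, extra, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = ">demo_seq|made-up comment\nATGCATGC\nGGCC\n"
    print(parse_fasta(example))
```

The sequence may span several lines in the file; the sketch joins them into one string, which is what most analysis programs do when reading the format.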
GenBank Format
 Searching GenBank effectively with the text-based method requires an understanding of the GenBank sequence format.
 GenBank is a relational database. However, the search output for sequence
files is produced as flat files for easy reading. The resulting flat files contain
three sections – Header, Features, and Sequence entry.
 The Header section describes the origin of the sequence, identification of the
organism, and unique identifiers associated with the record.
 The “Features” section includes annotation information about the gene and
gene product, as well as regions of biological significance reported in the
sequence, with identifiers and qualifiers.
 The “Source” field provides the length of the sequence, the scientific name of
the organism, and the taxonomy identification number.
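The three-part layout of a GenBank flat file can be illustrated by splitting a record on its section keywords. This is a minimal sketch, assuming the usual flat-file markers: the header runs from the LOCUS line up to the FEATURES line, the sequence follows the ORIGIN line, and "//" ends the record. The function name `split_genbank_record` and the miniature record are made up for illustration and are not a real GenBank entry.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into header, features, and sequence."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
            continue  # the ORIGIN keyword line itself holds no sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    letters = [ch for line in sections["sequence"] for ch in line if ch.isalpha()]
    return {
        "header": "\n".join(sections["header"]),
        "features": "\n".join(sections["features"]),
        "sequence": "".join(letters).upper(),
    }


# Made-up miniature record, for illustration only.
EXAMPLE = """LOCUS       DEMO0001                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demo sequence.
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atgcatgcat gc
//
"""

if __name__ == "__main__":
    record = split_genbank_record(EXAMPLE)
    print(record["sequence"])
```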
FASTA format
Retrieval of DNA/mRNA Sequences
 Information about an mRNA or gene can be retrieved by selecting the
“Nucleotide” database of NCBI.

 A search using the mRNA or gene name in the Nucleotide database retrieves many records.

 The Nucleotide database can be searched in different ways to narrow the search, such as by using the accession or GI number, or even the names of the authors of a submission.

 Currently, the GenBank nucleotide record provides a link to graphics of the sequence.
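Retrieval by accession number can also be done without the web page. The following is a minimal sketch, assuming NCBI's E-utilities `efetch` endpoint with the `nuccore` database (neither is named in the slides). The helper `build_nucleotide_fetch_url` is hypothetical, and the accession NM_000207 is used only as an example identifier. The code builds the URL and does not download anything.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities efetch endpoint, not mentioned in the slides.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_nucleotide_fetch_url(accession, rettype="fasta"):
    """Return an efetch URL for one Nucleotide record (hypothetical helper).

    rettype "fasta" requests FASTA text; "gb" requests the GenBank flat file.
    """
    if rettype not in ("fasta", "gb"):
        raise ValueError("rettype must be 'fasta' or 'gb' in this sketch")
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Example accession chosen for illustration.
    print(build_nucleotide_fetch_url("NM_000207", "fasta"))
    print(build_nucleotide_fetch_url("NM_000207", "gb"))
```

Requesting both return types for the same accession is a quick way to compare the FASTA and GenBank formats side by side.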
Details about gene symbols, full name and functions
Position of the gene on human chromosome
 The cursor can be held on one track at a time so that the information about that track appears in the drop-down box (Figure on the Next Slide).

 In the graphics, the green boxes represent exons in the RefSeq mRNA sequence, while the green lines with arrows show the introns.

 In GenBank, the human insulin gene (1431 bp) was found to consist of 3 exons and 2 introns.

Exons
Gene and Protein Sequences for Human Insulin Pre-protein
Retrieval of Protein Sequence
UniProt
 The Universal Protein Resource (UniProt) provides a stable, comprehensive,
freely accessible, central resource on protein sequences and functional
annotation.

 The UniProt Consortium is a collaboration between the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR), and the Swiss Institute of Bioinformatics (SIB).

 UniProt is updated and distributed every eight weeks, and can be accessed online for searches or download at http://www.uniprot.org.
 It has four components, each optimized for a different use:
 The UniProt Knowledgebase (UniProtKB) is an expertly curated database, a central access point for integrated protein information with cross-references to multiple sources. UniProtKB comprises two sections: UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is automatically annotated and not reviewed.
 The UniProt Archive (UniParc) is a comprehensive sequence repository,
reflecting the history of all protein sequences.
 UniProt Reference Clusters (UniRef) merge closely related sequences based
on sequence identity to speed up searches.
 The UniProt Metagenomic and Environmental Sequences (UniMES)
database is a repository specifically developed for the newly expanding area of
metagenomic and environmental data.
1. Accessing the UniProt Search
• Go to the UniProt website (https://www.uniprot.org/)
• Locate the search bar at the top of the page.
• Select "UniProtKB" from the dropdown menu to the left of the search box.
2. Performing a Search
By Accession ID:
Enter the UniProt accession ID (e.g., P05067) directly into the search box.
By Sequence:
You can search for a protein sequence using a peptide sequence or a portion of the sequence.
By Keywords:
Use keywords related to the protein, organism, or function to find relevant entries.
Advanced Search:
Use the advanced search options to filter your results based on various criteria, such as sequence length,
organism, or features.
3. Retrieving Sequences
• Once you have performed your search and have a list of entries, click the "Download" button on the
query result page.
• Select the desired download format (e.g., Flat Text, XML, RDF/XML, tab-delimited, Excel, or
FASTA).
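The search-and-download steps above have a scripted counterpart. This is a minimal sketch, assuming the UniProt REST service at rest.uniprot.org (the slides describe only the website). The helper names `entry_fasta_url` and `search_fasta_url` are hypothetical, and the accession P05067 is the example used above. The code only builds URLs and does not contact UniProt.

```python
from urllib.parse import urlencode

# Assumption: the UniProt REST base URL for UniProtKB, not mentioned in the slides.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def entry_fasta_url(accession):
    """URL of the FASTA file for one UniProtKB accession (hypothetical helper)."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def search_fasta_url(query, size=10):
    """URL of a keyword search whose results are returned as FASTA (hypothetical helper)."""
    params = {"query": query, "format": "fasta", "size": size}
    return f"{UNIPROT_REST}/search?" + urlencode(params)


if __name__ == "__main__":
    print(entry_fasta_url("P05067"))
    print(search_fasta_url("insulin"))
```

The first helper corresponds to a search by accession ID followed by a FASTA download; the second corresponds to a keyword search with FASTA chosen as the download format.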
Possible Download Formats for Data
Gene Annotation
Tools accessed through UniProt
Gene and Protein Sequences for Human Insulin Pre-protein
Retrieval of Protein Structure
Protein Data Bank
 The Protein Data Bank (https://www.rcsb.org/) is the single worldwide archive
of structural data of biological macromolecules.
 The Protein Data Bank (PDB) was established at Brookhaven National Laboratory (BNL) in 1971 as an archive for biological macromolecular crystal structures.
 Today depositors to the PDB have varying expertise in the techniques of X-ray
crystal structure determination, NMR, cryoelectron microscopy and theoretical
modeling.
 In October 1998, the management of the PDB became the responsibility of the
Research Collaboratory for Structural Bioinformatics (RCSB).
1. Access the RCSB PDB Website
 Go to the RCSB PDB website (https://www.rcsb.org/).
2. Search for the Protein Structure
 By PDB ID: If you know the PDB ID (e.g., 1HE8), type it into the search bar and click
"Search".
 By Protein Name: You can also search by protein name or other keywords.
 Advanced Search: If you have specific criteria (e.g., ligand, sequence, author), use the
advanced search options.
3. Locate the Structure Summary Page
 Once you find the protein, click on the structure entry to go to its summary page.
4. Download the PDB File
 On the structure summary page, find the "Download Files" option.
 Select "PDB format" to download the structure file.
 Save the file to your computer.
5. Further Exploration
 The RCSB PDB website offers various tools for visualizing, analyzing, and exploring protein
structures.
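The download step can be sketched in code as well. This is a minimal sketch, assuming the RCSB file server at files.rcsb.org serves PDB-format files by ID (the slides describe only the website's "Download Files" option). The helper names `pdb_file_url` and `chain_ids` are hypothetical. The second helper reads the chain identifier from ATOM records, which in the fixed-column PDB format sits in column 22.

```python
# Assumption: the RCSB file download base URL, not mentioned in the slides.
PDB_DOWNLOAD = "https://files.rcsb.org/download"


def pdb_file_url(pdb_id):
    """URL of the PDB-format file for one entry (hypothetical helper)."""
    return f"{PDB_DOWNLOAD}/{pdb_id.upper()}.pdb"


def chain_ids(pdb_text):
    """Return the sorted chain identifiers found in ATOM records of PDB-format text."""
    chains = set()
    for line in pdb_text.splitlines():
        # In the fixed-column PDB format, the chain ID is column 22 (index 21).
        if line.startswith("ATOM") and len(line) > 21:
            chains.add(line[21])
    return sorted(chains)


if __name__ == "__main__":
    # 1HE8 is the example PDB ID used in the steps above.
    print(pdb_file_url("1HE8"))
```

After saving the file from that URL, passing its text to `chain_ids` lists the chains present, for example to check that a structure contains both an A and a B chain.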
PDB Home Page
PDB Results for Human Insulin
Structure Summary
Structure of Human Insulin Chain A + B
Download Options
Human Insulin Chain A (red) and Chain B (green)
Protein Annotation
Experimental Details About Insulin
Sequences of Human Insulin Chain A + B
Details of Genome Mapping
Thanks
