
Lecture_3

The document discusses various data retrieval systems used in bioinformatics, including Entrez/GQuery, DBGET/LinkDB, and SRS, which allow users to search multiple databases simultaneously. It also covers the formats for biological sequences such as FASTA and GenBank, along with tools for retrieving protein sequences and structures, specifically highlighting the UniProt and Protein Data Bank resources. The presentation emphasizes the importance of these systems and formats in accessing and analyzing biological data efficiently.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/390534932

DATA RETRIEVAL (Bioinformatics) ARSHIA NAZIR Lecturer in Biology

Presentation · April 2025


Arshia Nazir
University of the Punjab

All content following this page was uploaded by Arshia Nazir on 06 April 2025.



DATA RETRIEVAL
(Bioinformatics)
ARSHIA NAZIR
Lecturer in Biology
Data Retrieval
• Data retrieval from different databases requires a search capability provided by a data retrieval system (tool).

• Some common data retrieval systems are Entrez/GQuery, DBGET/LinkDB, the Sequence Retrieval System (SRS), and the retrieval system from EMBL-EBI.

• Retrieval systems can search multiple linked databases simultaneously in response to a single search query.


1) Entrez/GQuery
• Entrez (GQuery, or global query; http://www.ncbi.nlm.nih.gov/sites/gquery) is a user-friendly, versatile, text-based search and retrieval system developed by the NCBI.

• It searches linked databases using a single word or a combination of words entered as the search term.

• Thus, Entrez provides a global query system and forms a web of connections among the databases.

• Depending on the database selected for search and retrieval, the primary source of some of the retrieved entries may be other related but specialized databases, for example the Nucleotide, RefSeq, EST, GSS, and Gene databases.
• Without a data retrieval system, it is not possible to search multiple databases simultaneously by entering the search term only once; each database has to be searched separately.

• The simultaneous search capability and the all-in-one display of results from multiple databases make NCBI Entrez (GQuery) a user-friendly search and retrieval system for general users.


Accessed on 5th April, 2025
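The idea of sending one search term to several linked databases can also be sketched programmatically. The example below is a minimal sketch, not part of the original slides: it assumes the NCBI E-utilities `esearch.fcgi` endpoint and the database names `nucleotide`, `protein`, and `gene`, and it only builds the request URLs (no network call is made). The function name `esearch_urls` is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; not given in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_urls(term, databases, retmax=5):
    """Build one ESearch URL per database for the same search term."""
    urls = {}
    for db in databases:
        # The same term is reused for every database, mirroring a global query.
        query = urlencode({"db": db, "term": term, "retmax": retmax})
        urls[db] = f"{EUTILS_BASE}/esearch.fcgi?{query}"
    return urls


if __name__ == "__main__":
    for db, url in esearch_urls("human insulin", ["nucleotide", "protein", "gene"]).items():
        print(db, url)
```

Each URL can then be opened in a browser or fetched with any HTTP client; the point is that the search term is entered once and reused for every database.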
2) DBGET/LinkDB
• DBGET/LinkDB (https://www.genome.jp/en/gn_dbget.html) is an integrated text-based search and retrieval system for major biological databases at GenomeNet.

• GenomeNet is the Japanese network of database and computational services for genome research and related biomedical research.

• It is operated by the Kyoto University Bioinformatics Center.

• DBGET searches and extracts entries from a wide range of molecular biology databases, and LinkDB searches and computes links between entries in different databases.

• DBGET/LinkDB is currently in a new development phase for the integration of both GenomeNet databases and outside databases.
Accessed on 5th April, 2025
3) SRS and EMBL-EBI
• The Sequence Retrieval System (SRS) is a tool for accessing and querying biological databases, particularly those in flat-file or text formats.

• SRS was originally developed by Etzold and Argos at the European Molecular Biology Laboratory (EMBL) in the early 1990s.

• It is designed to work with flat-file or text-format databases, such as the EMBL nucleotide sequence databank, the SwissProt protein sequence databank, and Prosite.

• SRS provides a homogeneous interface to over 80 biological databases and has been developed at the European Bioinformatics Institute (EBI).
Accessed on 5th April, 2025

Access to multiple databases


FASTA Format
• A large number of bioinformatics databases contain explicit descriptions of proteins or nucleic acids in the form of a sequence. For instance, nucleic acids are described by the sequence of the constituent bases (ATGC). Protein sequences are described as the sequence of amino acid building blocks of the protein (MNHGF).

• While the alphabet of sequences has been standardized, the actual formatting of the sequence in text files differs from database to database.

• Sequence formats differ mostly in the layout and formatting of lines of sequence codes.
• FASTA is one of the simplest and most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs.

• It has a single definition line that begins with a right angle bracket (>) followed by a sequence name.

• Sometimes extra information, such as a GI number or comments, is given; it is separated from the sequence name by a “|” symbol.

• The extra information is optional and is ignored by sequence analysis programs. The plain sequence, in standard one-letter symbols, starts on the second line.
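The layout described above (a definition line starting with ">", optional extra fields separated by "|", and the sequence on the following lines) can be read with a few lines of Python. This is a minimal sketch for illustration only; the function names `parse_fasta` and `split_definition` and the toy record are assumptions, not part of the original slides.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_definition(header):
    """Split a definition line into the sequence name and optional extra fields."""
    fields = [field.strip() for field in header.split("|")]
    return fields[0], fields[1:]


if __name__ == "__main__":
    toy = ">seq1|toy comment\nATGC\nGGTA\n"  # made-up record
    for header, sequence in parse_fasta(toy):
        print(split_definition(header), sequence)
```

For real work, an established parser such as the one in Biopython is usually a better choice; the sketch only makes the structure of the format explicit.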
GenBank Format
• Searching GenBank effectively with the text-based method requires an understanding of the GenBank sequence format.
• GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. The resulting flat files contain three sections: Header, Features, and Sequence entry.
• The Header section describes the origin of the sequence, the identification of the organism, and the unique identifiers associated with the record.
• The “Features” section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers.
• The “Source” field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number.
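The three-section layout of a GenBank flat file can be illustrated with a small splitter. The sketch below assumes that the Features section starts at the `FEATURES` keyword, that the sequence starts at `ORIGIN`, and that the record ends with `//`; the record itself is a made-up toy example, and the names `split_genbank` and `extract_sequence` are hypothetical helpers.

```python
# Made-up record for illustration only; it is not a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001   12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atgcatgcat gc
//
"""


def split_genbank(record):
    """Split a flat-file record into header, features, and sequence lines."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith("ORIGIN"):
            current = sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}


def extract_sequence(sequence_lines):
    """Join the residues of the sequence section, dropping numbering and spaces."""
    letters = []
    for line in sequence_lines[1:]:  # skip the ORIGIN line itself
        letters.extend(ch for ch in line if ch.isalpha())
    return "".join(letters).upper()


if __name__ == "__main__":
    sections = split_genbank(TOY_RECORD)
    print(len(sections["header"]), "header lines")
    print(extract_sequence(sections["sequence"]))
```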
FASTA format
Retrieval of DNA/mRNA Sequences
• Information about an mRNA or gene can be retrieved by selecting the “Nucleotide” database of NCBI.

• A search using the mRNA or gene name in the Nucleotide database retrieves many records.

• The Nucleotide database can be searched in different ways to focus the search more narrowly, such as by using the accession or GI number, or even the names of the authors of a submission.

• Currently, the GenBank nucleotide record provides a link to a graphical view of the sequence.
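Retrieval by accession number can also be scripted. The following is a minimal sketch that assumes the NCBI E-utilities `efetch.fcgi` endpoint with `rettype` set to `fasta` or `gb`; the accession `NM_000207` (a RefSeq record for human insulin mRNA) is chosen here for illustration and is not named in the slides. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# EFetch endpoint of the NCBI E-utilities (assumed; not given in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build an EFetch URL for one Nucleotide record ('fasta' or 'gb')."""
    query = urlencode(
        {"db": "nucleotide", "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_text(url):
    """Download the record as text (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    url = efetch_url("NM_000207")  # example accession, chosen for illustration
    print(url)
    # print(fetch_text(url))  # uncomment to download the FASTA record
```

Searching by accession returns a single record, which is why it narrows the search far more than a gene name does.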
Details about gene symbols, full name and functions
Position of the gene on human chromosome
• The cursor can be held on one track at a time so that the information about that track appears in the drop-down box (figure on the next slide).

• In the graphics, the green boxes represent exons in the RefSeq mRNA sequence, while the green lines with arrows show the introns.

• In GenBank, the human insulin gene (1431 bp) was found to consist of 3 exons and 2 introns.


Exons
Gene and Protein Sequences for Human Insulin Pre-protein
Retrieval of Protein Sequence
UniProt
• The Universal Protein Resource (UniProt) provides a stable, comprehensive, freely accessible, central resource on protein sequences and functional annotation.

• The UniProt Consortium is a collaboration between the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR), and the Swiss Institute of Bioinformatics (SIB).

• UniProt is updated and distributed every three weeks, and can be accessed online for searches or download at http://www.uniprot.org.
• It has four components, each optimized for a different use:
• The UniProt Knowledgebase (UniProtKB) is an expertly curated database, a central access point for integrated protein information with cross-references to multiple sources. UniProtKB comprises two sections: UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is automatically annotated and not reviewed.
• The UniProt Archive (UniParc) is a comprehensive sequence repository, reflecting the history of all protein sequences.
• UniProt Reference Clusters (UniRef) merge closely related sequences based on sequence identity to speed up searches.
• The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for the newly expanding area of metagenomic and environmental data.
1. Accessing the UniProt Search
• Go to the UniProt website (https://www.uniprot.org/)
• Locate the search bar at the top of the page.
• Select "UniProtKB" from the dropdown menu to the left of the search box.
2. Performing a Search
By Accession ID:
Enter the UniProt accession ID (e.g., P05067) directly into the search box.
By Sequence:
You can search for a protein sequence using a peptide sequence or a portion of the sequence.
By Keywords:
Use keywords related to the protein, organism, or function to find relevant entries.
Advanced Search:
Use the advanced search options to filter your results based on various criteria, such as sequence length,
organism, or features.
3. Retrieving Sequences
• Once you have performed your search and have a list of entries, click the "Download" button on the
query result page.
• Select the desired download format (e.g., Flat Text, XML, RDF/XML, tab-delimited, Excel, or
FASTA).
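The accession-based retrieval described in the steps above can also be done without the web interface. The sketch below assumes the UniProt REST address pattern `https://rest.uniprot.org/uniprotkb/<accession>.<format>` and a small set of format suffixes; it only builds the URL and checks that the accession has the expected shape. The helper name `uniprot_entry_url` is hypothetical.

```python
import re

# REST base address (assumed here; the slides only give the website address).
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"

# Pattern for the shape of a UniProtKB accession such as P05067.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# A few download formats, named by file suffix (an assumed subset).
FORMATS = {"fasta", "txt", "xml", "json"}


def uniprot_entry_url(accession, fmt="fasta"):
    """Build the download URL for one UniProtKB entry in the given format."""
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"not a UniProtKB accession: {accession!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{UNIPROT_REST}/{accession}.{fmt}"


if __name__ == "__main__":
    print(uniprot_entry_url("P05067"))         # FASTA sequence
    print(uniprot_entry_url("P05067", "txt"))  # flat-text entry
```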
Possible Download Formats for Data
Gene Annotation
Tools accessed through UniProt
Gene and Protein Sequences for Human Insulin Pre-protein
Retrieval of Protein Structure
Protein Data Bank
• The Protein Data Bank (https://www.rcsb.org/) is the single worldwide archive of structural data of biological macromolecules.
• The Protein Data Bank (PDB) was established at Brookhaven National Laboratory (BNL) in 1971 as an archive for biological macromolecular crystal structures.
• Today, depositors to the PDB have varying expertise in the techniques of X-ray crystal structure determination, NMR, cryo-electron microscopy, and theoretical modeling.
• In October 1998, the management of the PDB became the responsibility of the Research Collaboratory for Structural Bioinformatics (RCSB).
1. Access the RCSB PDB Website
• Go to the RCSB PDB website (https://www.rcsb.org/).
2. Search for the Protein Structure
• By PDB ID: If you know the PDB ID (e.g., 1HE8), type it into the search bar and click "Search".
• By Protein Name: You can also search by protein name or other keywords.
• Advanced Search: If you have specific criteria (e.g., ligand, sequence, author), use the advanced search options.
3. Locate the Structure Summary Page
• Once you find the protein, click on the structure entry to go to its summary page.
4. Download the PDB File
• On the structure summary page, find the "Download Files" option.
• Select "PDB format" to download the structure file.
• Save the file to your computer.
5. Further Exploration
• The RCSB PDB website offers various tools for visualizing, analyzing, and exploring protein structures.
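The download step (step 4) can also be scripted when the PDB ID is already known. This is a minimal sketch that assumes the RCSB file server address pattern `https://files.rcsb.org/download/<ID>.pdb` and the classic four-character PDB ID (a digit followed by three letters or digits); the helper names are hypothetical and nothing is downloaded unless `download_pdb` is called.

```python
import os
import re
from urllib.request import urlretrieve


def pdb_download_url(pdb_id):
    """Build the download URL for a structure file in PDB format."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs: one digit (1-9) followed by three letters or digits.
    if not re.fullmatch(r"[1-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def download_pdb(pdb_id, directory="."):
    """Save the PDB file into the given directory (requires network access)."""
    url = pdb_download_url(pdb_id)
    target = os.path.join(directory, os.path.basename(url))
    urlretrieve(url, target)
    return target


if __name__ == "__main__":
    print(pdb_download_url("1he8"))
    # download_pdb("1HE8")  # uncomment to save 1HE8.pdb in the current directory
```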
PDB Home Page
PDB Results for Human Insulin
Structure Summary
Structure of Human Insulin Chain A + B
Download Options
Human Insulin Chain A (red) and Chain B (green)
Protein Annotation
Experimental Details About Insulin
Sequences of Human Insulin Chain A + B
Details of Genome Mapping
Thanks
