Lecture 5 Information Retrieval From Databases

The document discusses various types of biological databases and information retrieval systems. It describes primary, secondary, and derived databases as well as different retrieval systems like Entrez and SRS. It also covers topics like sequence formats, conversion between formats, and addressing issues like redundancy and errors in databases.

Uploaded by

Veer khade

Information retrieval from biological databases & Sequence formats
Lecture-5
by
Dr. Aditya Kumar Padhi, Ph.D.
Laboratory for Computational Biology & Biomolecular Design (LCBD),
School of Biochemical Engineering, IIT (BHU)
Contents
• Pitfalls of biological databases

• Information retrieval systems for biological data

• Major types of retrieval systems

• Entrez

• Sequence retrieval system (SRS)

• Alternative sequence formats

• FASTA

• Conversion of sequence formats

• Conclusion
Types of biological databases

Broad classes: primary databases, secondary databases, derived databases

• Nucleotide sequence databases: (1) NCBI-GenBank, (2) DDBJ, (3) EMBL

• Protein sequence databases: (1) Swissprot, (2) PIR, (3) GenePept

• Protein structure databases: (1) PDB, (2) EBI-MSD, (3) MMDB

• Domain and motif databases: (1) Prosite, (2) Blocks, (3) COG

• Structure databases: (1) SCOPe, (2) CATH

• Gene expression databases: (1) GEO, (2) GXD, (3) MGED

• Metabolic pathway databases: (1) KEGG, (2) PathDB, (3) EMP

• Specialized databases: (1) TGI, (2) GSOB, (3) GPCRD
Pitfalls of biological databases
1) Overreliance

• One major drawback associated with biological databases is an overdependence on sequence
information and related annotations, without understanding the reliability of the information.

• There are many errors in sequence databases (mostly due to a lack of good
quality sequencing techniques in earlier times).

• These types of errors can be passed on to other databases, causing the propagation of errors.
Pitfalls of biological databases
2) Redundancy

• The second major drawback is the high level of redundancy in the primary sequence
databases.

• These errors can also be passed on to other databases, causing the propagation of errors.

• Annotations of genes can also occasionally be false or incomplete.

• Some of these errors cause frameshifts that make whole gene identification difficult or protein
translation impossible.

• There is tremendous duplication of information in the databases, for various reasons.

• Causes include:

(1) repeated submission of identical or overlapping sequences by the same or different authors,
(2) revision of annotations,
(3) poor database management that fails to detect redundancy.
Steps taken to eliminate redundancy
❖ By NCBI

• The NCBI has created a non-redundant database, called RefSeq, in which identical sequences
from the same organism and associated sequence fragments are merged into a single entry.

• Protein sequences derived from the same DNA sequences are explicitly linked as related entries.

• Sequence variants from the same organism with very minor differences, which may well be caused
by sequencing errors, are treated as distinctly related entries.

• This carefully curated database can be considered a secondary database.

❖ By SWISS-PROT

• The SWISS-PROT database also has minimal redundancy for protein sequences compared to most
other databases.

• If conflicts exist between various sequencing reports, they are indicated in the feature table of the
corresponding SWISS-PROT entry.
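
The merging step described above for RefSeq (identical sequences from the same organism collapsed into a single entry) can be sketched roughly as follows. This is a minimal illustration only, not NCBI's actual curation pipeline: the record layout, the function name `merge_identical`, and the accession strings are all hypothetical.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records that share organism and sequence into one merged entry.

    `records` is a list of (accession, organism, sequence) tuples.
    The layout and the sample accessions below are hypothetical.
    """
    groups = defaultdict(list)
    for accession, organism, sequence in records:
        # Compare sequences case-insensitively so "atg" and "ATG" match.
        key = (organism, sequence.upper())
        groups[key].append(accession)
    return [
        {"organism": org, "sequence": seq, "accessions": sorted(accs)}
        for (org, seq), accs in groups.items()
    ]

records = [
    ("ACC001", "Homo sapiens", "ATGGCC"),
    ("ACC002", "Homo sapiens", "atggcc"),   # duplicate submission
    ("ACC003", "Mus musculus", "ATGGCC"),   # same sequence, different organism
]
merged = merge_identical(records)
```

Here the two Homo sapiens submissions collapse into one entry that keeps both accessions, while the Mus musculus record stays separate because the organism differs.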
Steps taken to eliminate erroneous annotations
• Often, the same gene sequence is found under different names resulting in multiple entries.

• Conversely, unrelated genes bearing the same name are found in the databases.

• To alleviate the problem of naming genes, genes and proteins are being reannotated using a common,
controlled vocabulary for describing a gene or protein.

• The goal is to provide a consistent and unambiguous naming system for all genes and proteins. A
prominent example of such a system is Gene Ontology.
Steps taken to eliminate erroneous annotations

• Errors in an annotation can be particularly damaging because the large majority of new sequences are
assigned functions based on similarity with sequences in the databases that are already annotated.

• Some of these errors can be corrected at the informatics level by studying the protein domains and
families. However, others eventually have to be corrected using experimental work.
Information retrieval from databases
• The most popular retrieval systems for biological databases are

(1) Entrez

(2) Sequence Retrieval System (SRS)

• These provide access to multiple databases for the retrieval of integrated search results.

• Complex queries in a database often require the use of Boolean operators.

• Join a series of keywords using logical terms such as AND, OR, and NOT to indicate relationships.

o AND means that the search result must contain both words;

o OR means to search for results containing either word or both;

o NOT excludes results containing the word that follows it.

• Most search engines of public biological databases use some form of this Boolean logic.
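
The three Boolean operators map directly onto set operations, which a small sketch can make concrete. The keyword index below is hypothetical (invented keywords and record IDs); real retrieval systems are far more elaborate, so treat this only as an illustration of the logic.

```python
# Hypothetical index: keyword -> set of record IDs containing that keyword.
index = {
    "kinase": {1, 2, 3, 5},
    "human": {2, 3, 4},
    "mouse": {3, 5, 6},
}

def search_and(a, b):
    """Records containing both keywords (set intersection)."""
    return index[a] & index[b]

def search_or(a, b):
    """Records containing either keyword or both (set union)."""
    return index[a] | index[b]

def search_not(a, b):
    """Records containing keyword a but not keyword b (set difference)."""
    return index[a] - index[b]

both = search_and("kinase", "human")
either = search_or("human", "mouse")
kinase_not_mouse = search_not("kinase", "mouse")
```

In this sketch, "kinase AND human" narrows the result, "human OR mouse" widens it, and "kinase NOT mouse" keeps the kinase records while dropping any that also mention mouse.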
Entrez
• The NCBI developed and maintains Entrez, a biological database retrieval system.

• It is a gateway that allows text-based searches for a wide variety of data, including annotated
genetic sequence information, structural information, and literature.

• The key feature of Entrez is its ability to integrate information, which comes from
cross-referencing between NCBI databases based on preexisting and logical relationships between
individual entries.

• Users do not have to visit multiple databases located in disparate places.

• For example, on a nucleotide sequence page, one can find cross-referencing links to
o the translated protein sequence,
o genome mapping data,
o the related PubMed literature information, and
o protein structures if available.
Entrez
• The Entrez search engine has four main options/features to help narrow the search.

1) Limits: This restricts the search to a subset of a particular database, such as a particular
field (e.g., author or publication date) or a particular type of data (e.g., chloroplast
DNA/RNA).

2) Preview/Index: This connects different searches with the Boolean operators and uses a
string of logically connected keywords to perform a new search. The search can also be
limited to a particular search field (e.g., gene name or accession number).

3) History: This provides a record of the previous searches so that the user can review,
revise, or combine the results of earlier searches.

4) Clipboard: This stores search results for later viewing for a limited time. To store
information in the Clipboard, the “Send to Clipboard” function should be used.
PubMed
• One of the databases accessible from Entrez is a biomedical literature database known as
PubMed.

• This contains abstracts and in some cases full-text articles from more than 5,000 journals.

• For a complex search, a user can use the Boolean operators or a combination of the Limits
and Preview/Index features.

• PubMed uses a list of tags for literature searches. The search terms can be specified by
the tags, which are joined by Boolean operators.
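
As a rough illustration of tagged terms joined by Boolean operators, the sketch below assembles a query string. The helper names `tagged` and `combine` are hypothetical, and the tags used ([TI] for title, [DP] for publication date) are assumed from PubMed's search field tags, since the slides do not list them. The code only builds the text of a query and does not contact PubMed.

```python
def tagged(term, tag):
    """Format one search term with a PubMed-style field tag, e.g. angiogenin[TI]."""
    return f"{term}[{tag}]"

def combine(operator, *parts):
    """Join query parts with a Boolean operator and wrap them in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"

# Title contains "angiogenin" AND publication year is 2019 OR 2020.
query = combine(
    "AND",
    tagged("angiogenin", "TI"),
    combine("OR", tagged("2019", "DP"), tagged("2020", "DP")),
)
```

Wrapping each combined group in parentheses keeps the intended grouping explicit when AND and OR are mixed in one query.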
Alternate sequence formats
FASTA (named after the FASTA alignment program, short for "FAST-All"):

• In addition to the GenBank format, there are many other sequence formats.

• FASTA is one of the simplest and the most popular sequence formats because it contains plain
sequence information that is readable by many bioinformatics analysis programs.

• It has a single definition line that begins with a greater-than symbol (>) followed by a sequence name.
Sometimes, extra information such as a protein or nucleotide ID or comments can be given, separated
by a "|" symbol.

>1B1I_1|Chain A|HYDROLASE ANGIOGENIN|Homo sapiens (9606)
QDNSRYTHFLTQHYDAKPQGRDDRYCESIMRRRGLTSPCKDINTFIHGNKRSIKAICENKNG
NPHRENLRISKSSFQVTTCKLHGGSPWPPCQYRATAGFRNVVVACENGLPVHLDQSIFRRP

• The extra information is considered optional and is ignored by sequence analysis programs.

• The plain sequence in standard 1-letter symbols starts in the second line. Each line of sequence data
is limited to 60-80 characters.
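
The layout described above (one definition line starting with ">", optional "|"-separated fields, then wrapped sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not a full-featured parser; the function name `parse_fasta` is hypothetical, and the input is the angiogenin record shown on this slide.

```python
def parse_fasta(text):
    """Yield (name, extra_fields, sequence) for each record in FASTA text."""
    name, extra, chunks = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines inside a record
        if line.startswith(">"):
            if name is not None:
                yield name, extra, "".join(chunks)
            # The first "|" field is the sequence name; the rest is optional.
            fields = line[1:].split("|")
            name, extra, chunks = fields[0], fields[1:], []
        elif name is not None:
            chunks.append(line)  # sequence lines are concatenated
    if name is not None:
        yield name, extra, "".join(chunks)

example = """>1B1I_1|Chain A|HYDROLASE ANGIOGENIN|Homo sapiens (9606)
QDNSRYTHFLTQHYDAKPQGRDDRYCESIMRRRGLTSPCKDINTFIHGNKRSIKAICENKNG
NPHRENLRISKSSFQVTTCKLHGGSPWPPCQYRATAGFRNVVVACENGLPVHLDQSIFRRP
"""
records = list(parse_fasta(example))
```

Joining the wrapped lines back into one string is the key step: the line breaks in a FASTA file are only for readability and carry no biological meaning.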
Alternate sequence formats
Abstract Syntax Notation One (ASN.1):

• ASN.1 is a data mark-up language with a structure specifically designed for accessing
relational databases.

• It describes sequences with each item of information in a sequence record separated by tags
so that each sub-portion of the sequence record can be easily added to relational tables and
later extracted.

• Though more difficult for people to read, this format makes it easy for computers to filter
and parse the data.

• This format also facilitates the transmission and integration of data between databases.
GenBank

NCBI GenBank/GenPept format showing the three major components of a sequence file.
Conversion of sequence formats

• In sequence analysis and phylogenetic analysis, there is a frequent need to convert between
sequence formats.

• One popular computer program for sequence format conversion is "EMBOSS Seqret".

• It recognizes sequences in almost any format and writes a new file in an alternative format.
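
To give a feel for what a format converter does internally, here is a minimal sketch that turns one simplified GenBank-style flat-file record into FASTA. It is not EMBOSS Seqret and makes no claim about how Seqret works; the function name `genbank_to_fasta` and the sample record (including its LOCUS name) are hypothetical, and only the LOCUS line, the ORIGIN block, and the closing "//" are handled.

```python
import re
import textwrap

def genbank_to_fasta(text, width=60):
    """Convert one simplified GenBank-style record to FASTA text."""
    name = None
    in_origin = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]          # second token is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True                # sequence lines follow
        elif line.startswith("//"):
            break                           # end of record
        elif in_origin:
            # Drop position numbers and spaces, keep only sequence letters.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    if name is None:
        raise ValueError("no LOCUS line found")
    sequence = "".join(chunks).upper()
    return f">{name}\n" + "\n".join(textwrap.wrap(sequence, width)) + "\n"

# Hypothetical record, for illustration only.
record = """LOCUS       DEMO0001      24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""
fasta = genbank_to_fasta(record)
```

The conversion discards the annotation and keeps only the name and the sequence, which is why going from a rich format to FASTA is easy while the reverse direction cannot restore the lost annotation.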
Sequence retrieval system

• SRS is a retrieval system maintained by the EBI, which is comparable to NCBI Entrez.

• It is not as integrated as Entrez but allows the user to query multiple databases
simultaneously.

• It also offers direct access to certain sequence analysis applications such as sequence
similarity search and Clustal sequence alignment.

• The search results contain the query sequence and sequence annotation as
well as links to literature, metabolic pathways, and other biological
databases.
Conclusion
• Various solutions exist to correct annotations and reduce redundancy, for example, merging
redundant sequences into a single entry or storing highly redundant sequences in a
separate database.

• NCBI databases accessible through Entrez are among the most integrated databases.

• Effective information retrieval involves the use of Boolean operators.

• Entrez has additional user-friendly features to help conduct complex searches.

• One can use NCBI-specific field qualifiers to conduct searches.

• To retrieve sequence information from NCBI GenBank, an understanding of the format
of GenBank sequence files is necessary.

• FASTA is the most widely used sequence format.


Thank you
