Msczo 603
(MSCZO-603)
DEPARTMENT OF ZOOLOGY
SCHOOL OF SCIENCES
UTTARAKHAND OPEN UNIVERSITY
Phone No. 05946-261122, 261123
Toll free No. 18001804025
Fax No. 05946-264232, E-mail: info@uou.ac.in
http://uou.ac.in
DR. NEERA KAPOOR
PROFESSOR & HEAD
DEPARTMENT OF ZOOLOGY, SCHOOL OF SCIENCES
IGNOU MAIDAN GARHI, NEW DELHI
DR. A.K. DOBRIYAL
PROFESSOR & HEAD
DEPARTMENT OF ZOOLOGY, BGR CAMPUS PAURI
HNB SRINAGAR GARHWAL
PROGRAMME COORDINATOR
DR. PRAVESH KUMAR (ASSOCIATE PROFESSOR)
DEPARTMENT OF ZOOLOGY
SCHOOL OF SCIENCES, UTTARAKHAND OPEN UNIVERSITY
HALDWANI, NAINITAL, UTTARAKHAND
EDITOR
1.1 Introduction
1.2 Scope and applications of bioinformatics
1.3 Primary, secondary and composite database
1.3.1 Nucleotide sequence database
1.3.2 Protein sequence database
1.3.3 Gene expression database and structural database
1.4 Summary
1.5 Terminal questions and answers
1.6 References
1.1 INTRODUCTION
Animal bioinformatics
Plant bioinformatics
9- It is specially used in human genome sequencing where large sets of data are being
handled.
10- Bioinformatics plays a major role in the research and development of the biomedical
field.
11- Bioinformatics uses computational coding for several applications that involve finding
gene and protein functions and sequences, developing evolutionary relationships, and
analyzing the three-dimensional shapes of proteins.
12- Research works on genetic diseases and microbial diseases depend heavily on bioinformatics.
e- Metagenomics: The study of genetic material recovered directly from environmental and
living-organism samples.
f- Transcriptomics: The study of the complete set of RNA transcripts produced by the genome (the transcriptome).
g- Phylogenetics: The study of the evolutionary relationships between groups of animals and human
beings.
i- Systems biology: Mathematical modelling, analysis and visualization of large sets of
biological data.
j- Structural analysis: Modeling that determines the effects of physical loads on physical
structures.
k- Molecular modeling: The designing and defining of molecular structures by way of
computational chemistry.
l- Pathway analysis: Software-based analysis that describes related proteins within the metabolic
pathways of the body.
Biological Databases- Importance
One of the hallmarks of modern genomic research is the generation of enormous amounts
of raw sequence data.
As the volume of genomic data grows, sophisticated computational methodologies are
required to manage the data deluge.
Thus, the very first challenge in the genomics era is to store and handle the staggering
volume of information through the establishment and use of computer databases.
A biological database is a large, organized body of persistent data, usually associated
with computerized software designed to update, query, and retrieve components of the
data stored within the system.
A simple database might be a single file containing many records, each of which includes
the same set of information.
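As a minimal sketch of this idea, the records of such a single-file database can be represented and indexed in Python; the accession numbers and field names below are invented for illustration:

```python
import csv, io

# A single file of uniform records, loaded and indexed for retrieval.
# Accession numbers and field names are invented for illustration.
records_file = """accession,organism,length
X00001,Homo sapiens,1542
X00002,Mus musculus,980
"""

# Index the records by accession number so entries can be retrieved quickly.
db = {row["accession"]: row for row in csv.DictReader(io.StringIO(records_file))}

print(db["X00002"]["organism"])   # Mus musculus
```

Indexing by a unique key (here the accession number) is what lets a database retrieve one entry without scanning the whole file.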
Databases act as a storehouse of information.
Databases are used to store and organize data in such a way that information can be
retrieved easily via a variety of search criteria.
It allows knowledge discovery, which refers to the identification of connections between
pieces of information that were not known when the information was first entered. This
facilitates the discovery of new biological insights from raw data.
Secondary databases have become the molecular reference library over the
past decade or so, providing a wealth of information on just about any gene or gene
product that has been investigated by the research community.
It helps to solve cases where many users want to access the same entries of data.
Allows the indexing of data.
It helps to remove redundancy of data.
Example: A few popular databases are GenBank from NCBI (National Center for Biotechnology
Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein
Information Resource.
1- Primary database
a- Primary databases are also called archival databases.
b- They are populated with experimentally derived data such as nucleotide sequence, protein
sequence or macromolecular structure.
c- Experimental results are submitted directly into the database by researchers, and the data
are essentially archival in nature.
d- Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
Examples-
ENA, GenBank and DDBJ (nucleotide sequence)
Array Express Archive and GEO (functional genomics data)
Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)
2- Secondary database-
a- Secondary databases comprise data derived from the results of analyzing primary data.
b- Secondary databases often draw upon information from numerous sources, including
other databases (primary and secondary), controlled vocabularies and the scientific
literature.
c- They are highly curated, often using a complex combination of computational algorithms
and manual analysis and interpretation to derive new knowledge from the public record of
science.
Examples-
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole genome sequences)
3. Composite Databases:
1- The data entered in these types of databases are first compared and then filtered based
on desired criteria.
2- The initial data are taken from the primary databases and are then merged together
according to the desired criteria.
Examples of Composite Databases are as follows.
DDBJ (DNA Data Bank of Japan) is in Japan.
e- All three accept nucleotide sequence submissions and then exchange new and updated
data on a daily basis to achieve optimal synchronization between them.
f- These three databases are primary databases, as they house original sequence data.
g- They collaborate with the Sequence Read Archive (SRA), which archives raw reads from
a. GenBank
The GenBank sequence database is an open-access, annotated collection of all publicly
available nucleotide sequences and their protein translations. This database is produced and
maintained by the National Center for Biotechnology Information (NCBI) as part of
the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC databases receive sequences
produced in laboratories throughout the world from more than 100,000 distinct organisms.
GenBank has become an important database for research in biological fields and has grown in
recent years at an exponential rate by doubling roughly every 18 months.
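GenBank records are distributed as flat files in which each header line begins with a keyword (LOCUS, DEFINITION, ACCESSION, and so on). A minimal, illustrative parser for such a header might look like this; the record text is a shortened, invented example, not a real GenBank entry:

```python
# Minimal parser for the header of a GenBank-style flat-file record.
# The record below is a shortened, invented example for illustration only.
record = """LOCUS       AB000001     1542 bp    DNA     linear   PRI 01-JAN-2000
DEFINITION  Homo sapiens example gene, complete cds.
ACCESSION   AB000001
ORGANISM    Homo sapiens
//"""

fields = {}
for line in record.splitlines():
    # Keywords occupy the first 12 columns; "//" terminates the record.
    if line[:12].strip() and not line.startswith("//"):
        key = line[:12].strip()
        fields[key] = line[12:].strip()

print(fields["ACCESSION"])   # AB000001
```

A real parser (e.g. in Biopython) also handles continuation lines and the FEATURES table, but the keyword-in-fixed-columns layout shown here is the core of the format.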
b. EMBL (European Molecular Biology Laboratory)
The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a
comprehensive collection of primary nucleotide sequences maintained at the European
Bioinformatics Institute (EBI). Data are received from genome sequencing centers, individual
scientists and patent offices.
c. DDBJ (DNA Data Bank of Japan)
It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is
the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from
Japanese researchers, it can accept data from contributors from any other country.
2. Secondary databases of nucleotide sequences
a- Many of the secondary databases are simply sub-collections of sequences culled from one
or the other of the primary databases such as GenBank or EMBL.
b- There is also usually a great deal of value addition, in terms of annotation, software and
the use of a relational DBMS.
c- A unique characteristic of the PIR-PSD is its classification of protein sequences based on
the superfamily concept.
d- The sequence in PIR-PSD is also classified based on homology domain and sequence
motifs.
e- Homology domains may correspond to evolutionary building blocks, while sequence
motifs represent functional sites or conserved regions.
f- The classification approach allows a more complete understanding of sequence-function-
structure relationships.
b. SWISS-PROT
a- The other well-known and extensively used protein database is SWISS-PROT. Like the
PIR-PSD, this curated protein sequence database also provides a high level of
annotation.
b- The data in each entry can be considered separately as core data and annotation.
c- The core data consists of the sequences entered in common single letter amino acid code,
and the related references and bibliography. The taxonomy of the organism from which
the sequence was obtained also forms part of this core information.
d- The annotation contains information on the function or functions of the protein, post-translational modifications, domains and sites, and other features.
c. PDB (Protein Data Bank)
b- The Protein Data Bank contains not only proteins but also all biologically important
molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the
antibiotic gramicidin, and complexes of proteins and nucleic acids.
c- The database holds data derived mainly from three sources: structures determined by X-
ray crystallography, NMR experiments, and molecular modeling.
c- The information corresponding to each entry in PROSITE is of two forms: the
patterns and the related descriptive text.
b. PRINTS:
In the PRINTS database, the protein sequence patterns are stored as fingerprints. A
fingerprint is a set of motifs or patterns rather than a single one.
The information contained in a PRINTS entry may be divided into three sections. In addition
to the entry name, accession number and number of motifs, the first section contains cross-links
to other databases that have more information about the characterized family.
The second section provides a table showing how many of the motifs that make up the
fingerprint occur in each of the sequences in that family.
The last section of the entry contains the actual fingerprints that are stored as multiple aligned
sets of sequences; the alignment is made without gaps. There is, therefore, one set of aligned
sequences for each motif.
c. MHCPep:
MHCPep is a database comprising over 13000 peptide sequences known to bind the Major
Histocompatibility Complex of the immune system.
Each entry in the database contains not only the peptide sequence, which may be 8 to 10
amino acids long, but also information on the specific MHC molecules to which it binds, the
experimental method used to assay the peptide, the degree of activity and binding affinity
observed, the source protein that, when broken down, gave rise to the peptide, the positions
along the peptide where it anchors on the MHC molecule, and references and cross-links to
other information.
d. Pfam
a- Pfam contains protein family profiles built using hidden Markov models (HMMs).
b- An HMM models the pattern as a series of match, insert or delete states, with scores
assigned for an alignment to move from one state to another.
c- Each family or pattern defined in Pfam consists of four elements. The first is the
annotation, which has information on the source used to make the entry, the method used
and some numbers that serve as figures of merit.
d- The second is the seed alignment, which is used to bootstrap the rest of the sequences of
the family into the full alignment.
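As an illustrative toy example (not the actual HMMER implementation used by Pfam), an alignment can be viewed as a path through match (M), insert (I) and delete (D) states whose score is the sum of transition scores; all numbers below are invented:

```python
# Toy illustration of the HMM idea behind Pfam profiles: an alignment is a
# path through match (M), insert (I) and delete (D) states, and its score is
# the sum of (invented) transition scores along the path.
transition = {
    ("M", "M"):  2, ("M", "I"): -3, ("M", "D"): -3,
    ("I", "M"):  1, ("I", "I"): -1,
    ("D", "M"):  1, ("D", "D"): -1,
}

def path_score(path):
    """Score a state path such as 'MMIMM' by summing its transition scores."""
    return sum(transition[(a, b)] for a, b in zip(path, path[1:]))

print(path_score("MMMM"))   # 6: three M->M transitions
print(path_score("MMIMM"))  # 2: the insertion (M->I, I->M) is penalized
```

Real profile HMMs additionally attach per-position emission probabilities to the match and insert states; the penalized transitions shown here are what make gaps cost less in some positions of a family than in others.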
DNA MICROARRAY
RNA Sequencing
RNA sequencing (RNA-Seq) allows for quantitative determination of RNA expression levels.
The method features an advantage over microarrays in that it provides coverage of the entire
genome, including the various single-nucleotide polymorphisms (SNPs). In this method, RNA is
extracted from cells, and the mRNA is isolated. In some cases, the mRNA is fragmented at this
stage. The mRNA is then reverse transcribed into cDNA and then, if necessary, fragmented to
lengths compatible with the sequencing system. Once all the fragments are sequenced, the
transcripts (or reads) are assembled into genes. Although it is possible to assemble the
transcriptome de novo, it is usually more efficient to align the reads to a reference genome or
reference transcripts. As RNA-Seq is quantitative, a direct comparison between experiments can
be made.
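Because raw RNA-Seq read counts depend on both transcript length and sequencing depth, quantitative comparison typically uses a normalized measure such as TPM (transcripts per million). A minimal sketch of the TPM calculation, with invented gene names, counts and lengths:

```python
# TPM (transcripts per million) normalization: correct for transcript length,
# then for sequencing depth, so expression values are comparable between
# experiments. Gene names, counts and lengths are invented for illustration.
counts  = {"geneA": 500, "geneB": 1000, "geneC": 250}    # mapped reads
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 500}   # transcript lengths (bp)

# Step 1: reads per kilobase of transcript (length normalization).
rpk = {g: counts[g] / (lengths[g] / 1000) for g in counts}

# Step 2: rescale so the values sum to one million (depth normalization).
scale = sum(rpk.values()) / 1e6
tpm = {g: rpk[g] / scale for g in rpk}

# All three genes come out equal here: their counts are proportional to length.
print({g: round(v) for g, v in tpm.items()})
```

Normalizing length before depth (as TPM does, unlike the older RPKM) guarantees that the values in every sample sum to one million, which is what makes direct cross-experiment comparison meaningful.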
IN SITU HYBRIDIZATION
In situ hybridization (ISH) provides high-resolution gene expression information within the
context of its natural location within an organ or organism. ISH uses a labeled cDNA fragment
(i.e., a probe) to locate a specific nucleic acid sequence in a portion or section of a tissue (in situ). The
basic steps in ISH include cell permeabilization, hybridization of the labeled probe, and detection
of the probe, thereby revealing the location of the mRNA of interest. This process can be adapted
to a large-scale system, and the results are often presented in databases such as MGI and GENSAT.
i. Before deciding to synthesize a new compound the database could be used to check how
many compounds with a particular chemical composition have been reported.
ii. After synthesizing and indexing the unit cell of a material the database can be searched to
see if a material with the same or a similar unit cell is already known.
iii. If a material is found in the database with a similar unit cell to the new material then its
structure may be close enough (i.e. same symmetry and similar unit cell contents) to be
used as the starting model for the Rietveld refinement of the new material.
iv. To verify the results of a structure refinement the database can be consulted to find
structures that have comparable bond distances, bond angles or coordination
environments to the new structure.
The structures in the databases have been solved using X-ray, neutron and electron diffraction
techniques on samples that are generally single crystals but, with the advances in structure
solution using powder diffraction data, may also be powders. There are some entries whose
structures are predicted from computational modeling and some determined using NMR
spectroscopy; these entries generally occur for protein samples.
1.4 SUMMARY
1.6 REFERENCES
Xiong J. (2006). Essential Bioinformatics. Texas A & M University. Cambridge
University Press.
Arthur M. Lesk (2014). Introduction to Bioinformatics. Oxford University Press. Oxford,
United Kingdom.
https://www.ebi.ac.uk/training/online/course/bioinformatics-terrified-2018/primary-and-
secondary-databases
https://www.omicsonline.org/scholarly/bioinformatics-databases-journals-articles-ppts-
list.php
https://www.ncbi.nlm.nih.gov/books/NBK44933/
https://sta.uwi.edu/fst/dms/icgeb/documents/1910NucleotideandProteinsequencedatabasesDGL3.pdf
https://www.nature.com/subjects/protein-databases
UNIT 2: DATABASE AND SEARCH TOOL
CONTENTS
2.1 Objectives
2.2 Introduction
2.3 Computational tools and biological databases
2.3.1 National Centre for Biotechnology Information (NCBI)
2.3.2 European Bioinformatics Institute (EBI)
2.3.3 EMBL Nucleotide Sequence Database
2.3.4 DNA Data Bank of Japan (DDBJ)
2.3.5 Swiss-Prot
2.4 Summary
2.5 Terminal Questions and Answers
2.1 OBJECTIVES
After studying this module, you shall be able to:
Determine orthologs and paralogs for a protein of interest and assign putative function.
Determine, when a new bacterial genome is sequenced, how many of its genes have related genes in other species.
Determine whether a genome contains specific types of proteins.
Determine the identity of a DNA or protein sequence, for example that of a clinical pathogen.
Determine whether a particular variant has been described before. Many pathogens, especially
viruses, mutate rapidly, and we would like to know whether we are dealing with a new strain.
2.2 INTRODUCTION
Bioinformatics is an interdisciplinary field that develops methods and software tools for
understanding biological data. As an interdisciplinary field of science, bioinformatics combines
computer science, statistics, mathematics, and engineering to analyze and interpret biological
data. Bioinformatics has been used for in silico analyses of biological queries using mathematical
and statistical techniques. Bioinformatics derives knowledge from computer analysis of
biological data. These can consist of the information stored in the genetic code, but also
experimental results from various sources, patient statistics, and scientific literature. Research in
bioinformatics includes method development for storage, retrieval, and analysis of the data.
Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using
techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry,
physics, and linguistics. It has many practical applications in different areas of biology and
medicine.
The National Center for Biotechnology Information (NCBI) is part of the United States National
Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is
located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by
Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an
important resource for bioinformatics tools and services. Major databases include GenBank for
DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other
databases include the NCBI Epigenomics database. All these databases are available online
through the Entrez search engine. NCBI was directed by David Lipman, one of the original
authors of the BLAST sequence alignment program and a widely respected figure in
bioinformatics. He also led an intramural research program, including groups led by Stephen
Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa
Przytycka, and Zhiyong Lu. David Lipman stood down from his post in May 2017.
GenBank
NCBI has had responsibility for making available the GenBank DNA sequence database since
1992. GenBank coordinates with individual laboratories and other sequence databases such as
those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan
(DDBJ).
Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI
provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D
protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference
Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates
with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI
assigns a unique identifier (taxonomy ID number) to each species of organism. The NCBI has
software tools that are available by WWW browsing or by FTP. For example, BLAST is a
sequence similarity searching program. BLAST can do sequence comparisons against the
GenBank DNA database in less than 15 seconds.
NCBI Bookshelf
The NCBI Bookshelf is a collection of freely accessible, downloadable online versions of selected biomedical books.
BLAST
BLAST is an algorithm used for calculating sequence similarity between biological sequences
such as nucleotide sequences of DNA and amino acid sequences of proteins. BLAST is a
powerful tool for finding sequences similar to the query sequence within the same organism or in
different organisms. It searches the query sequence on NCBI databases and servers and posts the
results back to the person's browser in the chosen format. Input sequences to BLAST are mostly
in FASTA or GenBank format, while output can be delivered in a variety of formats such as
HTML, XML and plain text. HTML is the default output format for NCBI's web page. Results
for NCBI-BLAST are presented in graphical format with all the hits found, a table with
sequence identifiers for the hits together with scoring-related data, and the alignments for the
sequence of interest against the hits received, with the corresponding BLAST scores for these hits.
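The first stage of a BLAST-style search is finding short exact "words" shared between the query and a database sequence, which are then extended into alignments. A toy sketch of the seeding step only (sequences invented; word size 3, as commonly used for proteins):

```python
# Toy version of the BLAST seeding stage: index all length-w words of the
# query, then scan the subject for exact word matches ("hits" to be extended).
# The sequences below are invented example protein fragments.
def seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

print(seeds("MKVLHT", "AAMKVLG"))   # [(0, 2), (1, 3)]
```

Real BLAST also seeds on high-scoring inexact words (via a scoring matrix) and then extends each seed in both directions, but this indexing step is why it is so much faster than aligning the query against every database sequence in full.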
ENTREZ
The Entrez Global Query Cross-Database Search System is used at NCBI for all the major
databases such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy,
Complete Genomes, OMIM, and several others. Entrez is both an indexing and a retrieval
system holding data from various sources for biomedical research. NCBI distributed the first
version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank, protein
sequences from SWISS-PROT, translated GenBank, PIR and PRF, and PDB, together with
associated abstracts and citations from PubMed. Entrez is specially designed to integrate data
from several different sources, databases and formats into a uniform information model and
retrieval system that can efficiently retrieve the relevant references, sequences and structures.
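Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only constructs an ESearch URL; no request is sent, and the query term is an invented example:

```python
from urllib.parse import urlencode

# Building an NCBI E-utilities ESearch query URL. Only the URL is constructed
# here (no network request is made); the search term is an invented example.
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "nucleotide", "term": "BRCA1[Gene] AND human[Organism]", "retmax": 5}
url = base + "?" + urlencode(params)
print(url)
```

Fetching the URL would return a list of matching record identifiers, which can then be passed to EFetch to retrieve the records themselves; NCBI asks heavy users to register for an API key and respect its rate limits.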
GENE
Gene has been implemented at NCBI to characterize and organize the information about genes. It
serves as a major node in the nexus of genomic map, expression, sequence, protein function,
structure and homology data. A unique Gene ID is assigned to each gene record that can be
followed through revision cycles. Gene records for known or predicted genes are established
here and are demarcated by map positions or nucleotide sequence. Gene has several advantages
over its predecessor, LocusLink, including better integration with other databases in NCBI,
broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez
system.
PROTEIN
The Protein database maintains text records for individual protein sequences, derived from many
different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank, PDB and
UniProtKB/Swiss-Prot. Protein records are present in different formats, including FASTA and
XML, and are linked to other NCBI resources. Protein provides relevant data to the users,
such as genes, DNA/RNA sequences, biological pathways, expression and variation data, and
literature. It also provides pre-determined sets of similar and identical proteins for each
sequence as computed by BLAST. The Structure database of NCBI contains 3D coordinate
sets for experimentally-determined structures in PDB that are imported by NCBI. The Conserved
Domain database (CDD) of protein contains sequence profiles that characterize highly conserved
domains within protein sequences. It also has records from external resources like SMART and
Pfam. There is another database in Protein known as the Protein Clusters database, which
contains sets of protein sequences that are clustered according to the maximum alignments
between the individual sequences, as calculated by BLAST.
Pubchem database
PubChem database of NCBI is a public resource for molecules and their activities against
biological assays. PubChem is searchable and accessible by Entrez information retrieval system.
Additionally, EMBL-EBI hosts training programmes that teach scientists the fundamentals of
working with biological data and promote the plethora of bioinformatics tools available for
their research, both EMBL-EBI-based and non-EMBL-EBI-based.
BIOINFORMATIC SERVICES
One of the roles of the EMBL-EBI is to index and maintain biological data in a set of databases,
including Ensembl (housing whole genome sequence data), UniProt (protein sequence and
annotation database) and Protein Data Bank (protein and nucleic acid tertiary structure database).
A variety of online services and tools is provided, such as Basic Local Alignment Search Tool
(BLAST) or Clustal Omega sequence alignment tool, enabling further data analysis.
BLAST
BLAST is an algorithm for the comparison of biomacromolecule primary structure, most often
the nucleotide sequences of DNA/RNA and the amino acid sequences of proteins, stored in the
bioinformatics databases. The search ranks the
available sequences against the query by a scoring matrix such as BLOSUM 62. The highest
scoring sequences represent the closest relatives of the query, in terms of functional and
evolutionary similarity.
The database search by BLAST requires input data to be in a correct format (e.g. FASTA,
GenBank, PIR or EMBL format). Users may also designate the specific databases to be searched,
select the scoring matrices to be used, and set other parameters prior to the tool run. The best
hits in the BLAST results are ordered according to their calculated E value (the expected number
of similarly or higher-scoring hits found in the database by chance).
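The E value comes from the Karlin–Altschul statistics underlying BLAST: the expected number of chance hits with score at least S is E = K·m·n·e^(−λS), where m and n are the query and database lengths. The parameter values below are invented for illustration, not real BLAST constants:

```python
import math

# Karlin-Altschul statistics behind the BLAST E value: the expected number of
# chance hits scoring at least S is E = K * m * n * exp(-lambda * S).
# K and lam below are invented illustrative values, not real BLAST parameters.
K, lam = 0.13, 0.32
m, n = 300, 1_000_000            # query length and total database length

Es = [K * m * n * math.exp(-lam * S) for S in (40, 60, 80)]
for S, E in zip((40, 60, 80), Es):
    print(f"S={S}: E={E:.3g}")   # E drops sharply as the score S rises
```

The exponential dependence on S is why a modest increase in alignment score turns an unremarkable hit into a highly significant one, and why E values (unlike raw scores) also grow with database size.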
CLUSTAL OMEGA
Clustal Omega is a multiple sequence alignment (MSA) tool that makes it possible to find an
optimal alignment of at least three and at most 4,000 input DNA or protein sequences. The Clustal
Omega algorithm employs two profile Hidden Markov models (HMMs) to derive the final
alignment of the sequences. The output of Clustal Omega may be visualized as a guide tree
(the phylogenetic relationship of the best-pairing sequences) or ordered by the mutual sequence
similarity between the queries. The main advantage of Clustal Omega over other MSA tools
(e.g. MUSCLE, ProbCons) is its efficiency while maintaining significant accuracy of the results.
ENSEMBL
Based at EMBL-EBI, Ensembl is a database organized around genomic data, maintained
by the Ensembl Project. Tasked with the continuous annotation of the genomes of model
organisms, Ensembl provides researchers a comprehensive resource of relevant biological
information about each specific genome. The annotation of the stored reference genomes is
automatic and sequence-based. Ensembl encompasses a publicly available genome database
which can be accessed via a web browser. The stored data can be interacted with using a
graphical UI, which supports the display of data in multiple resolution levels from karyotype,
through individual genes, to nucleotide sequence.
Originally centered on vertebrate animals as its main field of interest, since 2009 Ensembl
provides annotated data regarding the genomes of plants, fungi, invertebrates, bacteria and other
species, in the sister project Ensembl Genomes. As of 2020, the various Ensembl project
databases together house over 50,000 reference genomes.
PDB
UniProt
UniProt is an online repository of protein sequence and annotation data, distributed in UniProt
Knowledgebase (UniProt KB), UniProt Reference Clusters (UniRef) and UniProt Archive
(UniParc) databases. Originally conceived as the individual ventures of EMBL-EBI and the
Swiss Institute of Bioinformatics (SIB) (together maintaining Swiss-Prot and TrEMBL) and of
the Protein Information Resource (PIR) (housing the Protein Sequence Database), the increase
in global protein data generation led to their collaboration in the creation of UniProt in 2002.
The protein entries stored in UniProt are cataloged by a unique UniProt identifier. The
annotation data collected for each entry are organized in logical sections (e.g. protein
function, structure, expression, sequence or relevant publications), allowing a coordinated
overview about the protein of interest. Links to external databases and original sources of data
are also provided. In addition to standard search by the protein name/identifier, UniProt webpage
houses tools for BLAST searching, sequence alignment or searching for proteins containing
specific peptides.
The main missions of the Service Programme of the EBI centre on building, maintaining and
providing biological databases and information services to support data deposition and
exploitation. In this respect a number of databases are operated, namely the EMBL Nucleotide
Sequence Database (EMBL-Bank), the Protein Databases (SWISS-PROT and TrEMBL), the
Macromolecular Structure Database (MSD) and Array Express for gene expression data plus
several other databases many of which are produced in collaboration with external groups.
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is the European
member of the tripartite International Nucleotide Sequence Database Collaboration
DDBJ/EMBL/GenBank. Main data sources are large-scale genome sequencing centers,
individual scientists and the European Patent Office (EPO). Direct submissions to EMBL-Bank
are complemented by daily data exchange with collaborating databases DDBJ (Japan) and
GenBank (USA).
The EMBL database is growing rapidly as a result of major genome sequencing efforts. Within a
12-month period the database size has increased from about 6.7 million entries comprising
8,255 million nucleotides (Release 63, June 2000) to over 12 million entries and 12,820 million
nucleotides (Release 67, June 2001). During the same period the number of organisms
represented in the database has risen by more than 30% to over 75,000 species.
Databases at EBI
Nucleotide databases
a. European Nucleotide Archive (ENA): ENA receives nucleotide data from a variety of
sources, including small-scale sequencing studies, sequencing centers and the INSDC partners
(i.e. GenBank and DDBJ). In order to better manage the sequencing resources, ENA has been
divided into several sub-databases, such as
C.EGA: The European Genome Phenome Archive (EGA) stores data from studies that are
carried out with an objective to understand the linkages between genotype and phenotype,
especially from biomedical research. This database is analogous to the dbGaP database at NCBI.
Such data may have been generated from genome-wide association studies (GWAS). As the
studies and datasets generally deal with disorders such as cancer, coronary artery defects,
hypertension, rheumatoid arthritis and diabetes, strict control over submission and public
access is implemented on ethical grounds (as the database contains information about patients
and subjects taking part in the studies) to prevent misuse of data.
D. ENA- Genome: This database contains the completed genome sequence data from a
variety of organisms such as:
Archaea and archaeal viruses
Bacteria
Eukaryotes
Organelles
Phages
Plasmids
Viroids
EMBL-EBI developed the Ensembl Genomes tool to browse, analyse and visualize
genome sequencing data. Currently, there are close to 350 completed genome sequences
available for browsing, analysis and downloading. The sequence analysis tools at the Ensembl
Genomes server provide tools for analysis at all levels of genome organization, such as whole
genome, chromosome, genome segment, gene and transcript level. The genome visualization and
analysis tool at ENSEMBL genome also provides links to molecular function, gene ontology,
protein summary and structure tables.
e. Several other databases, such as the Immuno Polymorphism Database (IPD) (including
IMGT/HLA, IMGT/LIGM, IPD-MHC, IPD-KIR, etc.), metagenomics and patent data resources,
are also part of the nucleotide resources at EMBL-EBI. The IMGT/HLA database is the
nucleotide sequence database for the human major histocompatibility complex (HLA). This database is a part of the
International Immuno Genetics Project (IMGT) and the data has been subdivided into the
following five classes of alleles of HLA (http://www.ebi.ac.uk/ipd/imgt/hla/stats.html):
IPD-HPA is the database for human platelet antigens. IPD-KIR is the database for human killer-
cell immunoglobulin-like receptors and contains information about 614 alleles. EBI Metagenomics
contains sequence information from microflora samples that have been collected from various
environments. Some examples include core gut microflora, aquatic microflora from
Antarctica, glaciers, ocean samples, meat samples and so on. The metagenome sequences are
analyzed to reveal the frequency of predicted CDSs (coding DNA sequences), their GO (Gene
Ontology) annotation, and putative proteins with biochemical, cellular and molecular functions.
History
DDBJ was established in the year 1986 at the National Institute of Genetics (NIG), Japan with
support from the Japanese Ministry of Education, Culture, Sports, Science and Technology
(MEXT). Later on, for its efficient functioning, the Center for Information Biology (CIB) was
established at NIG in 1995. In 2004, NIG was made a member of the Research Organization of
Information and Systems.
Roles of DDBJ
As a member of the INSDC, the primary objective of DDBJ is to collect sequence data from
researchers all over the world and to issue a unique accession number for each entry. The data
collected from the submitters are made publicly available, and anyone can access the data
through the data retrieval tools available at DDBJ. Data submitted at DDBJ, EMBL or NCBI
are exchanged every day; therefore, at any given time these three databases contain the same data.
Activities of DDBJ
Collection of sequences
The sequences collected from the submitters are stored in the form of an entry in the database.
Each entry consists of a nucleotide sequence, author information, reference, organism from
which the sequence is determined, properties of the sequence etc.
Retrieval of data is as important as submission and one of the main objectives of any database is
to provide the users with the required information. Any database contains enormous amount of
information and retrieving the required information is also a tricky task which depends on right
use of search strings. DDBJ hosts a number of tools for data retrieval such as getentry (database
retrieval by unique identifiers) and All-round Retrieval of Sequence and Annotation (ARSA). The
unique identifier required for retrieval through getentry can be an accession number, a gene name
etc. Data can be retrieved from DDBJ using getentry as follows:
1. Open the getentry search page (http://www.ddbj.nig.ac.jp/searches-e.html).
2. Type the accession number in the search box and click on Search.
2.3.5 SWISS-PROT
A biological database can be defined as biological information stored in an electronic format that
can be easily accessed throughout the world. Databases can be classified into various categories
depending upon data type, data source, maintainer status etc. A variety of databases contain
nucleotide and/or protein sequence data pertinent to specific genes. Protein databases are specific
to protein sequences. There are three important publicly accessible protein databases: the Protein
Information Resource (PIR), Swiss-Prot and the Protein Data Bank (PDB). Whereas PIR and
Swiss-Prot contain protein sequences, PDB is a structural database of biomolecules. PIR is
considered a primary database, whereas Swiss-Prot falls into the secondary database category.
The aim of this chapter is to explain the Swiss-Prot database and strategies to retrieve information
from it. Some of the tools and databases that are linked to each entry will also be discussed
briefly.
HISTORY
Swiss-Prot is an annotated protein sequence database which was formulated and managed by
Amos Bairoch in 1986. It was established collaboratively by the Department of Medical
Biochemistry at the University of Geneva and European Molecular Biology Laboratory (EMBL).
Later it shifted to European Bioinformatics Institute (EBI) in 1994 and finally in April 1998, it
became a part of Swiss Institute of Bioinformatics (SIB) (Bairoch and Apweiler, 1998). In 1996,
TrEMBL was added as an automatically annotated supplement to the Swiss-Prot database (Bairoch
and Apweiler, 1996). Since 2002, it has been maintained by the UniProt consortium, and
information about a protein sequence can be accessed via the UniProt website
(http://www.uniprot.org/) (Apweiler et al., 2004). The Universal Protein Resource (UniProt) is the
most comprehensive protein sequence catalogue, formed jointly by EBI, SIB and PIR (UniProt
Consortium, 2009).
FEATURES
The Swiss-Prot database is characterized by its high-quality annotation, which comes at the price
of lower coverage. It provides information about the function of a protein, its domain structure,
post-translational modifications (PTMs) etc.; in other words, it provides comprehensive
information about a specific protein. Swiss-Prot is curated to make it non-redundant and therefore
contains only one entry per protein. As a result, Swiss-Prot is much smaller than DNA sequence
databases. Figure 1 shows the development of the size of this database. The high-quality
annotation and minimum redundancy distinguish Swiss-Prot from other protein sequence
databases.
1. High-Quality Annotation: It is achieved through manually creating the protein sequence
entries. It is processed through six stages:
a. Sequence curation: In this step, identical sequences are extracted through a BLAST search and
the sequences from the same gene and same organism are then incorporated into a single entry.
This makes sure that the sequence is complete, correct and ready for further curation steps.
b. Sequence analysis: Sequence features such as domains and modification sites are predicted
computationally and verified by curators.
c. Literature curation: Experimental findings reported in the scientific literature are extracted and
added to the entry.
d. Family based curation: Putative homologs are determined by Reciprocal Blast searches and
phylogenetic resources which are further evaluated, curated, annotated and propagated across
homologous proteins to ensure data consistency.
e. Evidence attribution: All information incorporated to the sequence entry during manual
annotation is linked to the original source so that users can trace back the origin of data and
evaluate it.
f. Quality assurance, integration and update: Each completely annotated entry undergoes
quality assurance before integration into Swiss-Prot and is updated as new data become
available.
2. Minimum redundancy: During manual annotation, all entries belonging to the same gene and
the same organism are merged into a single entry containing the complete information. This
results in minimal redundancy.
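The merging rule can be sketched as a grouping step; the record fields and values below are hypothetical:

```python
# Sketch of redundancy removal: records for the same gene and the same
# organism are merged into one entry (field names and values are hypothetical).
records = [
    {"gene": "INS", "organism": "Homo sapiens", "note": "isoform 1"},
    {"gene": "INS", "organism": "Homo sapiens", "note": "isoform 2"},
    {"gene": "Ins1", "organism": "Mus musculus", "note": "isoform 1"},
]

merged = {}
for rec in records:
    key = (rec["gene"], rec["organism"])            # one entry per gene+organism
    merged.setdefault(key, []).append(rec["note"])  # pool the information

print(len(merged))                      # 2 entries remain
print(merged[("INS", "Homo sapiens")])  # ['isoform 1', 'isoform 2']
```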
3. Integration with other Databases: Swiss-Prot is presently cross-referenced to more than 50
specialized databases. This extensive interlinking allows Swiss-Prot to play a major role as a
connecting link between various biological databases.
4. Documentation: Swiss-Prot Database contains a large number of index files and specialized
document files.
2.4 SUMMARY
Databases are a source of the vast amount of information generated from various sequencing
projects. There are numerous kinds of databases available on the web, but for protein sequence
analysis, PIR, Swiss-Prot and PDB are the most relevant.
2.5 REFERENCES
Apweiler R, Bairoch A, Wu CH (2004). Protein sequence databases. Current Opinion in
Chemical Biology 8(1): 76-80.
Bairoch A, Apweiler R (1996). The SWISS-PROT protein sequence data bank and its new
supplement TrEMBL. Nucleic Acids Research 24(1): 21-25.
Bairoch A, Apweiler R (1998). The SWISS-PROT protein sequence data bank and its
supplement TrEMBL in 1998. Nucleic Acids Research 26(1): 38-42.
UniProt Consortium (2009). The Universal Protein resource (UniProt). Nucleic Acids
Research 37: D169-D174.
"Background | European Bioinformatics Institute". Ebi.ac.uk. 16 May 2018. Retrieved 29
October 2019.
"Jobs at EMBL-EBI". Retrieved 20 June 2016.
"Scientific report" (PDF). www.embl.de. 2017. Retrieved 29 October 2019.
BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences (2018).
Annual Report, p. 6. Retrieved 26 March 2020.
"NCBI BLAST at EMBL-EBI". www.ebi.ac.uk. Retrieved 3 November 2021.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment
search tool. Journal of Molecular Biology 215(3): 403-410.
doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.
Wheeler D, Bhagwat M (2007). BLAST QuickStart. Humana Press. PMID 17993672.
"Clustal Omega at EMBL-EBI". www.ebi.ac.uk. Retrieved 3 November 2021.
"Clustal Omega Documentation at EMBL-EBI". www.ebi.ac.uk. Retrieved 3 November
2021.
Sievers F, Higgins DG (2018). Clustal Omega for making accurate alignments of many
protein sequences. Protein Science 27(1): 135-145. doi:10.1002/pro.3290. PMC 5734385.
PMID 28884485.
"Ensembl homepage". www.ensembl.org. Retrieved 3 November 2021.
Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. (2021).
Ensembl 2021. Nucleic Acids Research 49(D1): D884-D891. doi:10.1093/nar/gkaa942.
PMC 7778975. PMID 33137190.
"About the Ensembl Project". www.ensembl.org. Retrieved 3 November 2021.
Protein Data Bank: the single global archive for 3D macromolecular structure data (2019).
Nucleic Acids Research 47(D1): D520-D528. doi:10.1093/nar/gky949. PMC 6324056.
PMID 30357364.
"About PDBe". www.ebi.ac.uk. Retrieved 3 November 2021.
"About UniProt". www.uniprot.org. Retrieved 3 November 2021.
UniProt: the universal protein knowledgebase in 2021 (2021). Nucleic Acids Research
49(D1): D480-D489. doi:10.1093/nar/gkaa1100. PMC 7778908. PMID 33237286.
2. What are the major domains under which NCBI databases and tools are organized?
6. Which database stores genomic structural variation information? What is the comparable
database at NCBI?
3.1 Objectives
3.2 Introduction
3.6 Summary
Know about using the Sequence Similarity search tools: BLAST and FASTA.
3.2 INTRODUCTION
The course of evolution proceeds in small incremental stages i.e. instead of large-scale
disruptions that span entire genomes, evolution favors small variations spread throughout the
genome. Of course, it is difficult to actually define the physical boundaries of such variations,
but since the majority of the changes are small, it is possible for us to detect similar regions
within the genome through alignment. We also presume that regions that share considerable
levels of similarity as measured through alignment must have shared ancestry or a common
evolutionary history. Such regions are termed homologous sequences. Homology can be further
sub-divided into orthology and paralogy, which denote evolutionary history shared through
speciation or through duplication, respectively. A note of caution: two sequences can also share
high similarity without sharing recent ancestry. Such sequences are termed xenologs and are
generally acquired through horizontal gene transfer. An alignment attempts to create a matrix of
rows and columns where
each row denotes a sequence and each column is occupied by similar characters derived from
each sequence, or a gap. Pairwise alignment aligns two sequences at a time, whereas multiple
sequence alignment (MSA) attempts to align more than two sequences. If several sequences are
derived from organisms having a common shared ancestry or evolutionary history, we expect
that these sequences will exhibit similarity but will not be exactly identical, i.e. we expect to
find similar characters or residues along with some differences. The differences or
dissimilarities encountered are a result of mutational events; the greater the time since common
ancestry, the greater the number of accumulated mutations and therefore the greater the number
of dissimilar residues. The number of changes is therefore directly proportional to evolutionary
time. Alignment tools will therefore try to generate the matrix such that there are as many
identical and/or similar residues as possible. It may be worthwhile to point out that in case a
mutational event or events lead to insertions or deletions, gaps have to be introduced into the
alignment.
Given the complexity involved, because of the length and types of changes observed in
sequences, it is impractical to derive alignments manually and we have to rely on various
algorithms and software for an automated alignment process.
Pairwise alignment employs two distinct strategies for alignment or similarity searching, termed
local and global. Local alignment tries to identify similar and dissimilar residues over a short
block of sequences with maximal identity, whereas global alignment tries to identify an average
identity over the entire length of the sequences. The local alignment algorithm was developed
by Temple Smith and Michael Waterman (1981), whereas Saul Needleman and Christian
Wunsch (1970) developed the algorithm for global alignment. Sequence alignments employ
matrices to find an optimal alignment, and the following section will introduce you to some of
the matrices commonly used.
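The global strategy of Needleman and Wunsch can be sketched as a dynamic-programming recurrence; the scores used here (match +1, mismatch -1, gap -1) are illustrative choices, and only the optimal score is computed, not the traceback:

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: fill the full dynamic-programming matrix and
    return the optimal global alignment score (traceback omitted)."""
    m, n = len(a), len(b)
    # F[i][j] = best score aligning a[:i] with b[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap            # leading gaps in b
    for j in range(1, n + 1):
        F[0][j] = j * gap            # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s,   # (mis)match
                          F[i-1][j] + gap,   # gap in b
                          F[i][j-1] + gap)   # gap in a
    return F[m][n]

# best global alignment of GATTACA and GATCA is GAT--CA (5 matches, 2 gaps)
print(global_score("GATTACA", "GATCA"))
```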
Analysis of sequence data is one of the major challenges of computational biology and is the
first step towards understanding the molecular basis of development and adaptation. Several
types of analysis can be performed, ranging from:
Functional information
RNA analysis
Expression analysis
Structure
Functional information
Protein level
Domain finding
Structure prediction
Evolution
Function
Genome level
Comparative genomics
Genome annotation
DNA sequence analysis constitutes one of the major applications of bioinformatics. Some of the
basic objectives of performing sequence analyses are:
Sequence retrieval
Exploring residues (nucleotides and amino acids) that are important for structure and
function
Central to the process of searching a database for similar sequences and retrieving them are
concepts of homology that are derived from evolutionary relationships. DNA data can be used
to retrieve similar sequences that diverged up to 600 million years ago! Sequences can be
retrieved from the NCBI database by using the identity of the sequence in the form of an
accession number. GenBank-formatted files contain detailed annotation along with the
associated sequence. Sequence similarity search is performed using a suite of tools called the
Basic Local Alignment Search Tool (BLAST). Two distinct types of sequence similarity
searches can be performed: local and global. Needleman and Wunsch developed the global
alignment algorithm (1970), whereas Michael Waterman and Temple Smith co-developed the
Smith-Waterman alignment algorithm for local alignment (1981). Global alignment attempts to
find an optimal alignment over the entire length of the sequences, whereas local alignment
identifies the best-matching short blocks.
Sequence similarity searches, performed via alignment, are a measure of relatedness, i.e.
sequences that are evolutionarily closely related will align over larger distances; in other words,
similarity is a function of evolutionary relatedness. Similarity searches carried out against
subject sequences in the database are based on pairwise alignment, i.e. between two sequences
at a time. One of the two sequences (the query) remains fixed, while the subject sequence
retrieved from the database changes. Similarity being a function of evolutionary relationship,
sequence alignments can also be employed to evaluate molecular phylogeny via multiple
sequence alignment.
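The local strategy can be sketched with the Smith-Waterman recurrence, which differs from the global one in clamping cell scores at zero (so an alignment may start anywhere) and reporting the best cell in the matrix; the scores here (match +2, mismatch -1, gap -1) are illustrative:

```python
def local_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman: score of the best-scoring local block of a and b."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i-1] == b[j-1] else mismatch
            # a local alignment may start afresh anywhere, hence the 0
            F[i][j] = max(0,
                          F[i-1][j-1] + s,
                          F[i-1][j] + gap,
                          F[i][j-1] + gap)
            best = max(best, F[i][j])
    return best

# the shared internal block "GGTT" dominates the local score
print(local_score("AAAGGTTCCC", "TTTGGTTAAA"))
```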
Once we have retrieved a number of sequences using BLAST (see earlier chapters on this), we
may wish to know which residues are conserved across all the sequences and may thereby be of
functional importance. Multiple sequence alignment can be employed to answer such questions.
Multiple sequences are compared with each other in a pairwise manner so as to arrive at an
output in which all the similar or identical residues from the various sequences appear in the
same column. Gaps are introduced to arrive at an optimal multiple alignment.
There are five different methods to perform multiple sequence alignment, with some
representative software in parentheses:
a. Exact method
b. Progressive method (Clustal)
c. Iterative method (MUSCLE, MAFFT)
d. Consensus method
e. Structure-based method (Expresso)
The most common tool for multiple sequence alignment is Clustal, which can either be used as
a web-based service or be downloaded from http://www.clustal.org/. It employs progressive
alignment to perform an MSA. Clustal first creates a global pairwise alignment for all sequence
pairs with alignment/similarity scores, then starts the MSA with the two sequences with the
highest score and progressively adds more and more sequences to complete the alignment.
Along with the MSA, Clustal also generates a guide tree depicting the relationship among the
sequences analysed.
Clustal can be downloaded from www.clustal.org and installed on any computer. Also prepare a
text file containing FASTA-formatted sequences for alignment; such sequences could have been
identified through BLASTN or BLASTP. The output can be viewed in Clustal itself, or the
alignment file can be opened using any text editor such as Notepad.
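The first step of the progressive strategy, scoring all sequence pairs and starting from the most similar pair, can be sketched as follows; a simple column-wise percent identity over equal-length toy sequences stands in for Clustal's actual pairwise alignments:

```python
from itertools import combinations

# Toy sequences of equal length (invented), so column-wise identity
# suffices for illustration; Clustal itself performs full pairwise alignments.
seqs = {
    "seq1": "ATGCCGTA",
    "seq2": "ATGCCGTT",
    "seq3": "ATGACGGA",
    "seq4": "TTGACCTA",
}

def identity(a, b):
    """Fraction of identical columns between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Score every pair; the best-scoring pair seeds the progressive alignment.
scores = {(p, q): identity(seqs[p], seqs[q])
          for p, q in combinations(seqs, 2)}
first_pair = max(scores, key=scores.get)
print(first_pair)   # the two most similar sequences are aligned first
```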
PAM, or Point Accepted Mutations (also sometimes expanded as Percent Accepted Mutations),
was developed by Margaret Dayhoff and published in 1978. She and her co-workers analyzed
71 families belonging to 34 super-families of evolutionarily related proteins by comparing their
sequences over their entire lengths (diverged over various time scales) and observed 1572
changes. These changes were tabulated and used to create a 20x20 matrix (for the 20 amino
acids). The observation that one amino acid could be substituted by another prompted the
authors to use the term Point Accepted Mutation, to signify that the substituted amino acid is
accepted by natural selection. This is deemed to be an outcome of two distinct processes: the
occurrence of a mutation, and its acceptance by the species through natural selection.
BLOSUM matrix: One of the drawbacks associated with the PAM matrix was the fact that the
PAM250 matrix was generated by multiplying PAM1 matrix to itself 250 times and thus
although it is meant for long evolutionary scale, it is only an approximation derived by
compounding values obtained over short evolutionary time scale. Additionally, an unintended
major drawback was the fact that in 1978 and earlier, very few protein sequences were available
and therefore, PAM matrices were based on a very small set of proteins of similar nature and
thus may not represent the entire spectrum of amino acid changes or substitutions. Steven
Henikoff and Jorja G. Henikoff in 1992 published a new and updated amino acid substitution
matrix termed as BLOSUM matrix. The matrix was derived from analysis of BLOCKS database
of protein domains (blocks.fhcrc.org/) which contains ungapped multiple aligned regions of
domains or most conserved segments of proteins. Henikoff and Henikoff (1992) used nearly
2000 ungapped aligned sequences from more than 500 groups of related proteins to devise the
BLOCKS SUBSTITUTION MATRIX or BLOSUM matrix. The BLOSUM matrix was found to
be more accurate and closer to observed changes than PAM matrix because a. The use of large
and evolutionary diverse dataset meant more realistic estimation of substitution probabilities, and
b. Probabilities were based on conserved domains that are under greater selection pressure and
thereby reflecting true estimation of substitution rates. Like PAM matrices, Henikoff and
Henikoff also computed the log-odd ratio of substitution probabilities for varying evolutionary
time scale to generate several matrices such as BLOSUM 45, BLOSUM 50, BLOSUM 62,
BLOSUM 80, and BLOSUM 90. The number in a BLOSUM matrix reflects the minimum
percent identity of the blocks from which it was built; matrices with higher numbers are derived
from more conserved blocks and are thereby more suited for analysis of closely related protein
sequences. BLOSUM62 is now the default matrix used by several sequence similarity tools
including BLAST.
A comparison of PAM and BLOSUM reveals that PAM is based on global alignments of
proteins whereas BLOSUM is based on local alignments of conserved domains. The numbering
schemes of PAM and BLOSUM also run in opposite directions: a lower number in PAM means
less divergence, while in BLOSUM it means less conservation; a higher number in PAM
denotes high divergence, whereas in BLOSUM it means a high level of conservation. PAM
matrices are preferred for global alignment of proteins whereas BLOSUM matrices are
preferred for local alignment.
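The compounding described above (PAM250 obtained by multiplying PAM1 by itself 250 times) can be sketched with a toy two-letter alphabet; the probability values below are invented purely for illustration:

```python
# Toy illustration of PAM extrapolation on a 2-letter alphabet.
# Row i, column j = probability that letter i is replaced by letter j
# in one PAM unit of evolutionary time (values invented for illustration).
PAM1 = [[0.99, 0.01],
        [0.02, 0.98]]

def matmul(A, B):
    """Plain matrix multiplication for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n):
    """PAM-n obtained by compounding PAM1 n times, as Dayhoff did."""
    M = PAM1
    for _ in range(n - 1):
        M = matmul(M, PAM1)
    return M

PAM2 = pam(2)
# each row still sums to 1, i.e. it remains a probability matrix
print(round(sum(PAM2[0]), 10))
```

The same compounding applied 250 times yields the toy analogue of PAM250, which illustrates why the long-range matrix is only an extrapolation of short-range observations.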
3.6 SUMMARY
The concept of sequence alignment, which estimates similarity or relatedness, is based on the
fundamental principles of evolution. Before attempting to perform sequence alignment, it is
imperative to understand that mutations occur at the level of DNA, with non-coding regions
accumulating mutations at a higher rate than coding regions, and that not all mutations lead to
an alteration in the amino acid sequence. Given this background, the information content of
DNA and proteins is thus variable. Orthologous sequences are likely to share more similarity
than paralogs because of their evolutionary history. Sequence similarity and alignment can be
assessed either in a pairwise manner or using multiple sequence alignment. The objective of
alignment is to create a matrix of rows and columns: the rows represent the taxonomic units,
with the objective of placing similar or identical sequence data in a single column. While
creating the alignments, a unitary matrix is used to compute substitution scores for DNA,
whereas PAM and BLOSUM matrices are employed to compute mutational probability indices
in the case of proteins. Tools such as BLASTN and BLASTP are used for pairwise sequence
alignment, and CLUSTAL, MUSCLE, MAFFT and Expresso are employed for multiple
sequence alignment. The output generated by multiple sequence alignment can be further
viewed either as an alignment file (using any text viewer such as WordPad or Notepad) or as a
tree file using TreeView.
a. Accession number
b. Query
c. Subject
6. Which of the following accumulate mutation at higher rate: Non-coding or coding DNA?
REFERENCES
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic Local Alignment
Search Tool. J. Mol. Biol. 215: 403-410.
Pevsner J (2009). Bioinformatics and Functional Genomics, 2nd Edition. Wiley-Blackwell.
Dayhoff MO, Schwartz RM, Orcutt BC (1978). A model of evolutionary change in
proteins. In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352.
Henikoff S, Henikoff J (1992). Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919.
Edgar RC (2004). MUSCLE: a multiple sequence alignment method with reduced time
and space complexity. BMC Bioinformatics 5:113. doi:10.1186/1471-2105-5-113.
Katoh K, Misawa K, Kuma K, Miyata T (2002). MAFFT: a novel method for rapid
multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:
3059-3066.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H,
Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG
(2007). Clustal W and Clustal X version 2.0. Bioinformatics 23(21): 2947-2948.
SEQUENCE ANALYSIS
CONTENTS
4.1 Objectives
4.2 Introduction
4.9 Summary
4.1 OBJECTIVES
4.2 INTRODUCTION
The National Center for Biotechnology Information (NCBI) was established in 1988 at the
National Institutes of Health (NIH) as part of the National Library of Medicine (NLM), and is
located at Bethesda, Maryland, USA. This association of NCBI with NIH and NLM is reflected
in its web address (www.ncbi.nlm.nih.gov). NCBI was set up to collate information, create
databases, conduct research in the field of molecular biology (especially on biomedical data)
and develop computational tools. Since then, the databases and computational tools have
expanded to include diverse organisms, including plants, so as to encompass not only data from
the biomedical field but also agriculture, food and other plant-derived resources. NCBI has now
emerged as the primary source of free public-access data encompassing a wide range of
disciplines: literature, sequence information, expression profile data, protein sequence and
structure, chemical structure and bioassays, and taxonomy; in addition, NCBI has developed a
variety of analysis tools that are available for free download and use.
d. developing and maintaining collaborations with academia, industry and other governmental
agencies at national and international level through visitors program
e. fostering scientific communication through sponsoring and organizing meetings, workshops
and lectures
The resources at NCBI are categorized into major groups and following are some of the broad
sets of various databases and tools developed, curated and hosted at NCBI:
Submissions:
Genbank: BankIt
Genbank: Barcode
Genbank: Sequin
SNP submission
BioProject Submission
Databases:
DNA and RNA (RefSeq, Nucleotide, EST, GSS, WGS, PopSet, Trace Archive, SRA); Proteins
(Reference sequences, GenPept, UniProt/Swiss-Prot, PRF, PDB, Protein Clusters, GEO, Structure,
UniGene, CDD); Genomes (Map Viewer, Genome Workbench, Plant Genome Central, Genome
GEO-BLAST
Genetic codes
ORF finder
Splign
Others:
Entrez is the single-point database search and retrieval system that allows a user to query all the
databases hosted at NCBI from one search box.
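Entrez can also be queried programmatically through NCBI's E-utilities interface; the sketch below only constructs an esearch query URL (the search term is illustrative, and no network request is made):

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service at NCBI.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build an Entrez esearch URL (the query itself is not executed here)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return BASE + "?" + urlencode(params)

# Illustrative query: nucleotide records for a gene/organism combination.
url = esearch_url("nucleotide", "matK[Gene] AND Oryza[Organism]")
print(url)
```

Fetching the URL would return an XML list of matching record identifiers, which can then be passed to the companion efetch service to retrieve the records themselves.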
NCBI relies on submission of accurately annotated and curated data submitted by the research
community. The data can be grouped into two major types - sequence and non-sequence. The
diverse types and categories of data hosted at NCBI require that these are deposited into one of
the many databases in an appropriate format with annotations. The following section will
introduce you to the several forms of biological data and the submission gateways at NCBI.
Submission of sequence data:
The field of computational biology has experienced tremendous, almost exponential, growth on
account of the deluge of nucleotide sequence data. This in turn has been helped by
advancements in automated sequencing capabilities, vastly improved sequencing chemistry and
greatly reduced cost.
The sequence data generated under a variety of research objectives or goals, such as sequencing
projects (genes or whole genomes), population genetics studies, sequence variation or barcoding
projects, and others, can be submitted to one of the following databases:
i. Genbank
iii. dbSNP
iv. dbVar
v. GEO
Sequence types
Using BankIt: It is a web-based tool that is preferred for submission of single or a small set
of sequences, and has relatively simple features for annotation. The submitter has to register at
NCBI and after filling the requisite details can deposit the sequence/s using the web-based tool.
Using Sequin: Sequin is the preferred standalone option for submission if the sequences to be
submitted are large in number or carry complex annotations. Pre-submission work can be done
offline, i.e. even in the absence of a network connection.
The sequence file prepared using Sequin Program allows users to view the sequence and its
associated features in Genbank and Graphical view, in addition to several other formats.
Loci commonly used for barcoding include chloroplast maturase K (matK), tRNA-Lysine
(trnK), the large subunit of RuBisCO (rbcL), and the ITS region from nuclear rDNA. The
inherent sequence conservation and variability in such loci aid researchers in species
identification. As per the policy of NCBI, all sequences originating from mitochondrial oxidase
loci can be submitted through the DNA-barcode submission tool, while the rest of the sequences
need to be submitted through the BankIt tool.
Batch submission: Sequences that have been generated through high-throughput sequencing
projects, such as single-pass sequencing of cDNAs (EST), genomic survey sequences (GSS) and
genomic mapping projects (STS), are to be submitted through FTP (file transfer protocol) or
other batch submission routes. Genome sequencing projects have to be first registered with
NCBI for allotment of a project ID. Small genomes such as chloroplast, mitochondrial, plasmid,
phage and viral genomes do not require registration and can be submitted using the Sequin tool.
Large prokaryotic genome sequences have to be formatted as a FASTA file followed by adding
annotation features. Annotation requires that the following information must be included:
i. Genes
In addition to these mandatory annotation features, optional features can also be submitted such
The sequence can typically be prepared for submission through Sequin or using tbl2asn and
then submitted.
Eukaryotic genomes need FASTA formatted sequence files to be annotated with the following
features mandatorily:
Genes
mRNA features
Transcript ID
The annotation can be prepared using Sequin and submitted using Genome Submission tool.
NCBI also accepts deposition of information on sequence-based reagents such as primer pairs,
siRNAs and probe sequences into the Probe database. Such information must be accompanied
by a unique probe identifier, name and probe type. In addition, optional information on the
target can also be provided.
High-throughput sequences derived from transcript survey sequence assemblies and
metagenomic studies are to be deposited in the Transcriptome Shotgun Assembly (TSA)
archive. Non-sequence data include information from clinical studies, chemical substances,
structure and bioassay data, manuscripts etc. The Gene Expression Omnibus (GEO) database is
meant for deposition and cataloguing of a variety of functional genomics and quantitative data
generated via high-throughput technologies. Some examples are:
mRNA sequencing
ChIP-sequencing
Methyl sequencing
GEO accepts microarray data that have been generated using microarray chips manufactured by
Affymetrix, Agilent, NimbleGen and Illumina, and also custom arrays made by users. The user
needs to provide:
i. Array or sequencer
ii. Array template or array design i.e. the identity of the spots
iv. Protocol
vi. Raw and/or processed data file of intensity values or sequence counts
The experiments must have been performed following the Minimum Information About a
Microarray Experiment (MIAME) guidelines
(http://www.ncbi.nlm.nih.gov/geo/info/MIAME.html):
Sample annotation
Experimental design
Array annotation
RT-PCR experiments
Protein Array
also be accepted)
Protocol
Sample information
Fold-change data
dbGaP (database of Genotypes and Phenotypes) collates data originating from several studies
that have analysed the relation between genotype and phenotype, including genome-wide
association studies (GWAS), medical sequencing, molecular diagnostic studies and also other
genetic studies that are non-clinical in nature. As some of these data may have confidentiality
and ethical aspects, including the identity of participants in clinical studies, a data access
committee (DAC) and a data use certification (DUC) covering ethical treatment, biosafety
approval and confidentiality are required.
As most GWAS deal with understanding the relationship between genetic factors and disease,
the submitter must deposit de-identified subject identity and consent information for each
subject that has participated in the study, genetic data such as sequence and/or array
information, and phenotype data. Phenotype data may consist of body site, histological type etc.
PubChem database accepts information about chemical substance, structure and bio-activity and
for ease of usage has been further sub-divided into PcSubstance, PcCompound and PcBioAssay
database. Before submission, a user has to register at PubChem with an option to open a test or a
deposition account. A test account allows users to first validate all steps of submission and
format of submission without actually depositing and releasing the data; a deposition account on
the other hand will allow the user to deposit and release the information into the database.
In order to successfully deposit into PubChem database, the user should provide the following
information:
Biological properties
Chemical reactions
Metabolic pathway
Physical properties
Toxicological information
Original and novel findings that have been peer-reviewed by subject experts and accepted for
publication in a research journal can be deposited in the PubMed Central database using the
NIH Manuscript Submission system. Each data item (sequence, literature, microarray, structure,
genome sequence, primer etc.) that is deposited at NCBI is allotted a unique identifier in the
form of an accession number.
What is the relationship between the sequence similarity and structure similarity in
biological proteins?
Proteins with high sequence identity and high structural similarity tend to possess functional
similarity and evolutionary relationships, yet examples of proteins deviating from this general
trend also exist. The DNA sequence provides the code for the amino acid sequence; the amino
acid sequence determines the structure of the protein, which in turn affects the function of the
protein.
Introduction
Mutation is the basis of evolution driven by the process of selection. All life forms are expected
to be part of a tree of life, which should be able to explain their origin and evolution. Practically,
this may not happen due to extinction of species and further complications arising from ways by
which organisms can acquire genes (e.g. lateral transfer of genes). Phylogenetics exploits
available comparative information to generate trees which can explain evolution. Traditionally,
morphological features were used to compare organisms and generate trees. More recently,
molecular sequences have been used for comparisons among species, helping in defining
species, families and higher groups. Repeated substitutions at the same site may, however, lead
to loss of phylogenetic information. DNA sequences comprise coding and non-coding regions
that have differing rates of evolution. The rate of evolution also depends on the type of
organism.
Comparison of sequences can only be done after aligning them. Without alignment it is very
difficult to decide which nucleotide/amino acid should be compared with which one (homology).
Protein-coding sequences show two types of changes: synonymous and non-synonymous. A
synonymous change does not result in a change in the encoded amino acid.
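Whether a change is synonymous can be checked by translating the codon before and after the change; the mini codon table below covers only the codons used in the example (the full genetic code has 64 codons):

```python
# Minimal codon table covering only the codons used below; the standard
# genetic code has 64 entries.
CODON_TABLE = {
    "CTT": "Leu", "CTC": "Leu",   # third-position change: synonymous
    "CAT": "His",                 # second-position change: non-synonymous
}

def classify(codon_before, codon_after):
    """Synonymous if both codons encode the same amino acid."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    return "synonymous" if before == after else "non-synonymous"

print(classify("CTT", "CTC"))  # synonymous
print(classify("CTT", "CAT"))  # non-synonymous
```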
Positive and negative selection
Traditionally, any change which is favored by natural selection is said to be under positive
selection; it is favored because it helps in the survival of the organism. Similarly, any trait which
is not favored by natural selection is normally eliminated and is said to be under negative
selection. The same kinds of selection also operate on molecular sequences. It is common for
genes to undergo duplication. A duplicated copy of a gene is free to undergo mutation and
create variation. This variation goes through positive/negative selection and often leads to
neo-functionalization, i.e. new genes with new functions.
Understanding Trees
Cladograms vs Phylograms
Trees fall under two categories: cladograms and phylograms. A cladogram provides only
information about the relationships between organisms, while a phylogram also provides a
measure of the amount of evolutionary change, seen in its branch lengths. Branch length
therefore has no meaning in a cladogram but does in a phylogram.
The root of a tree denotes the ultimate common ancestor and provides direction in time. At times this information is not available, so both types of algorithms exist: those that apply a common-ancestor hypothesis and those that do not. A common way to decide the root of a tree is to use an outgroup: a taxon from a group closely related to the ingroup, which comprises the taxa under study. Another way to identify the root is to use the midpoint of the longest branch as the rooting point.
Tree Terminology
Trees are described in terms of branches and nodes. Terminal branches represent Operational Taxonomic Units (OTUs), i.e. the taxa under study. When two terminal branches are directly connected to each other, they are called sister branches. If two lineages (branches) originate from one internal node, this is called a bifurcation or dichotomy. If more than two branches emerge from one internal node, this is called a polytomy and the tree is said to be multifurcating.
Various methods have been proposed for building a phylogenetic tree. We consider only three here: distance-based methods (UPGMA and NJ), maximum parsimony (MP) and maximum likelihood (ML).
Distance Method
Distance-based methods start by calculating pairwise distances between sequences based on pairwise alignment. These distances form a distance matrix, which is used to generate the tree. The best-known methods for generating a tree from this matrix are the Unweighted Pair Group Method with Arithmetic mean (UPGMA) and Neighbor Joining (NJ). Distance-based methods are fast but overlook a substantial amount of the information in a multiple sequence alignment.
UPGMA is a progressive clustering method. All the sequences are first considered in calculating the matrix. The closest taxa are grouped, the matrix is recalculated treating this group as a single node, and the taxa with the minimum distance are again grouped. This continues until only two groups remain, which are then connected. UPGMA assumes that the rate of nucleotide or amino acid substitution is constant, so that branch lengths reflect actual dates of divergence. This assumption is often not true, and UPGMA is no longer popular; NJ is now the method of choice for distance-based trees. An NJ tree is not rooted, because the method does not assume a constant rate of evolution, but it can be rooted using an outgroup.
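The UPGMA procedure described above can be sketched in a few lines of Python. The distances below are hypothetical, and the implementation is a bare-bones illustration of average-linkage clustering, not production code:

```python
# Bare-bones UPGMA sketch with hypothetical distances (not real data).
# Clusters are merged greedily by smallest average pairwise distance.
def upgma(names, dist):
    """names: taxon labels; dist: {(a, b): distance} for every pair."""
    trees = {n: (n, 1) for n in names}      # cluster id -> (subtree, leaf count)
    d = {frozenset(k): v for k, v in dist.items()}
    next_id = 0
    while len(trees) > 1:
        pair = min(d, key=d.get)            # closest pair of clusters
        a, b = tuple(pair)
        merged = ((trees[a][0], trees[b][0]), trees[a][1] + trees[b][1])
        label = f"C{next_id}"
        next_id += 1
        # average-linkage update of distances to the merged cluster
        for c in trees:
            if c not in pair:
                dc = (trees[a][1] * d[frozenset({a, c})] +
                      trees[b][1] * d[frozenset({b, c})]) / merged[1]
                d[frozenset({label, c})] = dc
        d = {k: v for k, v in d.items() if not (k & pair)}
        del trees[a], trees[b]
        trees[label] = merged
    return next(iter(trees.values()))[0]

tree = upgma(["A", "B", "C"],
             {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0})
print(tree)  # A and B (distance 2) are grouped before C joins
```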
Corrections: Observed distances are not always a good measure of evolutionary distance, because they do not take into account hidden changes due to multiple hits at the same site. Common corrections are the Jukes-Cantor and Kimura two-parameter models. The Jukes-Cantor one-parameter model assumes that each nucleotide is free to convert to the others with equal rates for transitions and transversions, so any nucleotide has an equal chance of converting to any of the other three. In reality the transition rate is usually higher than the transversion rate; the Kimura two-parameter model adjusts pairwise distances taking the transition/transversion ratio into account. Various other, more complex models have also been proposed.
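Both corrections can be written directly from their standard formulas; here p, P and Q are proportions observed in a pairwise alignment:

```python
import math

# Jukes-Cantor correction: p = observed proportion of differing sites.
def jukes_cantor(p):
    return -0.75 * math.log(1 - 4 * p / 3)

# Kimura 2-parameter correction:
# P = proportion of transitions, Q = proportion of transversions.
def kimura_2p(P, Q):
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# With 10% observed differences the corrected distance is slightly larger,
# reflecting hidden multiple hits:
print(round(jukes_cantor(0.10), 4))  # -> 0.1073
```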
Maximum Parsimony
Parsimony-based methods work on the principle of choosing the most parsimonious tree; maximum parsimony seeks to minimize the number of evolutionary changes. It works as follows:
Identify informative sites in the dataset, i.e. sites that favour some groupings of taxa over others.
Construct trees. All possible trees are constructed and evaluated; the score of each tree is the number of changes it requires, and the tree requiring the fewest changes is chosen.
Distance and maximum parsimony methods are often criticized for their lack of a statistical approach. Both have criteria for selecting trees but cannot calculate the probability that one tree, rather than another, is the true tree. Various methods have been proposed to overcome this drawback; two of them are the likelihood and Bayesian approaches.
In simple terms, the likelihood can be considered as the probability assigned to the dataset (the observed characters, such as nucleotides) under a particular hypothesis (a tree and a model of evolution). In a way this is similar to maximum parsimony, because each tree is assigned a score, but here the score is a likelihood based on statistical analysis. The best tree is the one with the highest probability under a particular model of how changes occur. Both maximum parsimony and maximum likelihood are computationally exhaustive exercises and hence slow. Whereas likelihood methods calculate the probability of observing the data given a hypothesis, the Bayesian method calculates the probability of the hypothesis given the data.
The consistency index (CI) measures how well a character fits a tree. It is calculated by dividing the minimum possible number of steps by the observed number of steps. If the minimum number of steps equals the observed number, the character has a CI of 1.0.
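The consistency index described above is a one-line computation:

```python
# Consistency index: CI = minimum possible steps / observed steps on the tree.
def consistency_index(min_steps, observed_steps):
    return min_steps / observed_steps

print(consistency_index(4, 4))  # no homoplasy -> 1.0
print(consistency_index(4, 8))  # homoplasy halves the index -> 0.5
```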
Possible reasons for these inconsistencies include disparities in evolutionary rates among lineages, uneven taxonomic sampling, and a single explosive radiation of the major eukaryotic taxa.
Introduction
Genomics uses computational techniques to analyse and interpret biological problems; the major research efforts in the area are outlined below. The total DNA present in a given cell is called the genome. In most cells the genome is packed into two sets of chromosomes, one set of maternal and one of paternal inheritance. In humans these chromosomes comprise about 3 billion base pairs of DNA. The four nucleotides (letters) that make up DNA are A, T, G and C. Just as the letters of the alphabet form the words of a book that tell a story, so do the four bases A, T, G and C in our genomes.
Genomics is the study of the genomes that make up the genetic material of an organism. Genome studies include sequencing the complete DNA of a genome as well as gene annotation, to understand the structural and functional aspects of the genome. Genes are the parts of the genome that carry the instructions to make molecules, such as proteins, that are responsible for both the structural and the functional aspects of our cells. The first organism to be completely sequenced was Haemophilus influenzae in 1995, which led to the sequencing of many more organisms.
The Human Genome Project (HGP) was a global effort undertaken by the U.S. Department of Energy and the National Institutes of Health, with the primary goal of determining the complete genome sequence in a human cell. It also aimed at identifying and mapping the genes of the human genome.
Some key findings of the draft (2001) and complete (2004) human genome sequences included:
2. Gene expression studies helped us understand some human diseases and disorders.
6. It was estimated that only 483 targets in the human body accounted for all the pharmaceutical drugs then on the market.
1. Genome variation among individuals in a population can lead to new ways to diagnose and treat disease.
2. Genome studies provide insights into human disease biology.
3. Genome studies have applications in health care and agriculture. Understanding genome sequences can help identify unique and critical genes involved in the pathogenesis of invading microorganisms, and thus identify novel drug targets for new therapeutic interventions. Increasing knowledge of plant genomes can reduce costs in agriculture, for example by reducing the need for pesticides or by identifying factors that help plants develop under stress.
4. HGP studies also included research on the ethical, legal and social implications (ELSI) of genome research.
The genome sequence and the mapped genes are stored in databases freely available on the Internet. The National Center for Biotechnology Information (NCBI) is a repository of gene and protein sequences, stored in databases such as GenBank. This large volume of biological data is then analysed using computer programs specially written to study the structural and functional aspects of genes and genomes.
Prediction Methods
Computational prediction of genes is one of the major areas of research in bioinformatics, as finding genes by traditional molecular biology techniques is a time-consuming process. Two classes of methods are generally adopted for distinguishing genes from non-genes in a genome: similarity (homology) based searches and ab initio prediction.
Gene discovery in prokaryotic genomes is less time-consuming than prediction of protein-coding regions in higher eukaryotes, owing to the absence of intervening sequences (introns) in prokaryotic genes.
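As a toy illustration of the ab initio idea in prokaryotes, an open-reading-frame scan looks for start-to-stop stretches in each reading frame. This sketch scans the forward strand only; real gene finders use far richer statistical models:

```python
# Toy ab initio scan: find ORFs (ATG ... stop) on the forward strand only.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    orfs = []
    for frame in range(3):               # the three forward reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":    # candidate start codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append(seq[i:j + 3])
                        break            # stop at the first in-frame stop codon
            i += 3
    return orfs

print(find_orfs("GGATGAAATTTTAAGG"))  # -> ['ATGAAATTTTAA']
```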
Comparative genomics is the analysis and comparison of the genomes of two or more different organisms, undertaken to gain a better understanding of how a species has evolved.
One of the most widely used sequence similarity tools available in the public domain is the Basic Local Alignment Search Tool (BLAST), a set of programs designed to perform similarity searches between a query sequence and sequence databases.
Functional genomics attempts to study gene functions and interactions. It seeks to address questions about the function of DNA at the level of genes, of RNA at the level of transcripts, and of proteins at the level of the proteome.
Pharmacogenomics
Pharmacogenomics studies how genes affect an individual's response to drugs and helps in the creation of personalized medicine: drugs created and designed on the basis of an individual's genetic makeup, with applications in diseases such as tuberculosis.
The advancement of molecular biology has been driven principally by the capability to sequence DNA. Over the past eight years, massively parallel sequencing platforms have transformed the field by reducing the cost of sequencing by more than two-fold. Previously, automated Sanger sequencing was the technique used to sequence the genomes of several organisms. In contrast, NGS platforms rely on the massively parallel sequencing of millions of DNA fragments from a single sample, which enables the sequencing of an entire genome in less than a day. The speed, accessibility and cost of the newer sequencing technologies have opened up large-scale applications extending well beyond genome sequencing. The most regularly used NGS platforms in research and diagnostic labs today have been the Life Technologies Ion Torrent Personal Genome Machine (PGM), the Illumina MiSeq, and the Roche 454 Genome Sequencer. NGS platforms rapidly generate sequencing reads on the gigabase scale, so NGS data analysis poses a major challenge: it can be time-consuming and requires advanced skills to extract the maximum accurate information from the sequence data. A massive computational effort, along with in-depth biological knowledge, is needed to interpret this data.
Proteins are linear polymers of amino acids joined by peptide bonds. Every protein adopts a unique three-dimensional structure, its native state, and it is this native 3D structure that enables the protein to carry out its biological activity. Proteins play key roles in almost all biological processes in a cell and are important for the maintenance and structural integrity of the cell.
There are four levels of protein structure. The primary structure of a protein is its linear sequence of amino acids. The patterns of local conformation within the polypeptide are referred to as secondary structure; the two most common types of secondary structure are α-helices and β-sheets, connected by loop regions. The tertiary structure is the overall three-dimensional arrangement of these elements as the protein folds into its native state. The quaternary structure is the structure of a multimeric protein and the interactions of its subunits. The figure illustrates this hierarchy of protein structure.
Experimental determination of the tertiary structure of proteins involves X-ray crystallography and NMR. In addition, computational techniques are exploited to predict the native structures of proteins. There has been exponential growth of both biological sequence and structure data, mainly due to the genome sequencing projects underway in different countries around the world. As of October 2013 there were 94,540 structures in the Protein Data Bank.
There are three different computational approaches to protein 3D structure prediction:
1. Homology Modeling
Homologous proteins share very similar structures because, during the course of evolution, structures are more conserved than sequences. A model is built on the basis of a good alignment between the query sequence and a template of known structure. In general, a model can be predicted when the sequence identity is more than 30%; highly homologous sequences will generate more accurate models.
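The 30% rule of thumb presupposes a percent-identity calculation over an alignment. A minimal sketch, assuming two pre-aligned sequences of equal length (with '-' for gaps), is:

```python
# Percent identity over an existing alignment; gaps ('-') never count as matches.
def percent_identity(aln1, aln2):
    assert len(aln1) == len(aln2), "sequences must be pre-aligned"
    matches = sum(a == b and a != "-" for a, b in zip(aln1, aln2))
    return 100.0 * matches / len(aln1)

# Hypothetical 6-column alignment: 4 identical positions out of 6.
print(round(percent_identity("MKV-LA", "MKVILS"), 1))  # -> 66.7
```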
2. Protein Threading
When no suitable homologue of known structure is available, threading (fold recognition) is employed to model a protein. Threading predicts the structure of a protein by matching its sequence against each member of a library of known folds and checking whether there is a statistically significant match.
3. Ab initio method
Ab initio protein modeling is a database-independent approach based on exploring the physical properties of amino acids rather than previously solved structures. Ab initio modeling relies on the premise that a protein's native structure has the minimum global free energy.
4.9 SUMMARY
The National Center for Biotechnology Information (NCBI), established in 1988 has emerged as
one of the largest repositories of biological data and related literature from a diverse range of
submission of several forms of datasets. The various forms of data have been further categorized
as sequence based and non-sequence based such as microarray, phenotype, chemical structures
etc. The different forms of datasets can be submitted using the appropriate submission tools;
tools such as BankIt, Sequin and Barcode are meant for submission of nucleotide data. Whereas
Sequin is a stand-alone sequence submission tool, BankIt and Barcode are web-based; In
addition web based tools can also be used for deposition of Whole /Complete Genome, trace files
and Short Read archive nucleotide data; Gene expression data, RT-PCR data, mRNA sequence
using the GEO web-deposit gateway. Along with the datasets, the depositor must also submit
experimental details and designs. Upon acceptance of datasets (nucleotide, gene expression,
chemical structures etc) after strict quality control and verification, the team at NCBI assigns a
unique number or Identifier termed as Accession number. The datasets are organized and
catalogued in the most appropriate database and can be accessed using keywords of the accession
number.
Different methods have been proposed for studying phylogeny. The earliest methods were distance-based and assumed constant evolutionary rates. These were followed by more exhaustive and computationally demanding methods such as maximum parsimony, which are now being supplemented or replaced by more sophisticated statistical methods such as maximum likelihood and the Bayesian approach. The benefits and pitfalls of these methods are still debated, and their applicability may depend upon the situation; a basic understanding of them is a must for anyone undertaking phylogenetic analysis.
Metabolomics and systems biology find useful applications in agriculture, the health sector and environmental issues.
3. The three major thrust areas of research are genome, transcriptome and proteome analysis.
4. Many software tools are being developed and made freely available over the Internet to locate genes and analyse sequence data.
5. Bioinformatics and computational biology help reduce the cost and time of designing new drugs and are nowadays routinely used in pharmaceutical companies.
2. Prepare a list of databases dealing with literature and their characteristic features.
4. Compare the features of two sequence submission tools, BankIt and Sequin.
6. What is the difference between a distance-based method (NJ) and the maximum parsimony (MP) method?
8. Differentiate between maximum parsimony (MP) and maximum likelihood method (ML).
10. What is the difference between the gene organization in prokaryotes and eukaryotes?
5. Young DC (2009). Computational Drug Design. John Wiley & Sons. ISBN 978-0-470-12685-1.
7. Kitano H (2001). Foundations of Systems Biology. MIT Press. ISBN 0-262-11266-3.
8. Pevsner J (2009). Bioinformatics and Functional Genomics, 2nd edition. Wiley-Blackwell.
UNIT 5: INTRODUCTION TO BIOSTATISTICS
Contents
5.1 Objectives
5.2 Introduction
5.7 Summary
Statistical symbol
Scope & applications
Collection, organization and representation of data
Importance of statistics in biological research
5.2 INTRODUCTION
Statistics is the science of figures which deals with collection, analysis and interpretation of data.
Data is obtained by conducting a survey or an experiment study. The use of statistics in biology
is known as Biostatistics or biometry.
Purpose and scope of statistics: The purpose of statistics is not merely to collect numerical data but to provide a methodology for handling, analysing and drawing valid inferences from data. It has wide application in almost all sciences, social as well as physical, such as biology, psychology, education, economics, planning, business management and mathematics.
While studying various aspects of problems of statistics one has to come across several statistical
terms. Few important statistical terms are given below:
1. Population: In statistics the term population is quite different from the popular idea. A biometric study regards the population of some limited region as its universe. The population in a statistical investigation refers to any well-defined group of individuals or of observations of a particular type; in short, a group of study elements is called a population. For example, all the fish of one species present in a particular pond constitute a population, and all the patients of a hospital suffering from AIDS may be considered a population, of which a few patients are used as study elements.
2. Sample: For a large population it becomes practically impossible to collect data from all the members. To study the haemoglobin percentage (Hb %) of the patients of a hospital, it is more convenient and quicker to collect data from a few patients; the patients taken for study are the sample.
A sample may be defined as a fraction of a population, drawn using a suitable method so that it can be regarded as representative of the entire population.
3. Variable: In everyday life we come across living beings and phenomena that vary in a number of ways, even though they belong to the same general category or type. A measurable characteristic of this kind is called a variable. Animals of a species may differ in their length, weight, age, sex, Hb %, O2 intake, fecundity (rate of reproduction), RBC count, habits, personality traits etc. All the above-mentioned characteristics are variables, and they are of two kinds:
(i) A discrete or discontinuous variable is one whose values differ from one another by fixed amounts, typically whole numbers.
(ii) A continuous variable can assume any value within a certain interval and as such is divisible into ever smaller fractions.
5. Statistic: The properties of a population can be described in terms of its parameters with the help of statistical methods. The term statistic denotes a summary value of any quantity calculated from sample data. A statistic that serves as an estimate of the parameter population mean is the sample mean.
7. Data: A set of values recorded on one or more observational units is called data. The first step of any statistical study is the collection of data. In scientific research work, data are collected only from personal experimental study; data collected by personal investigation are called primary data.
x: Deviation
c: Correction
df: Degree of freedom
O: Observed number
E: expected number
P: Probability
%: Per cent
w: Assumed mean
Q: Quartile deviation
Applications of Biostatistics
To find the action of a drug on humans: a drug is given to subjects to check whether the changes produced are due to the drug or to chance.
To compare the action of two different drugs or two successive dosages of the same drug.
To find the relative potency of a new drug with respect to a standard drug.
In Medicine
To compare the efficacy of a particular drug, operation or line of treatment for this, the
percentage cured, relieved or died in the experiment and control groups, is compared and
difference due to chance or otherwise is found by applying statistical techniques.
To find correlation between two attributes such as cancer and smoking or filariasis and
social class.
To identify signs and symptoms of a disease or syndrome.
For example, cough in typhoid is found only by chance, whereas fever is found in almost every case.
To test usefulness of vaccines in the field- Percentage of attacks or deaths among the
vaccinated subjects is compared with that among the unvaccinated ones to find whether
the difference observed is statistically significant.
Clinical medicine
Preventive medicine:
Statistical methods are used in the fields of medicine, biology and public health for planning, conducting and analysing studies.
In carrying out a valid and reliable health situation analysis, including in proper
summarization and interpretation of data.
In proper evaluation of the achievements and failures of health programs.
In Biotechnology
In Genetics
Statistical and probabilistic methods are now central to many aspects of analysis in human genetics.
Statistical methods find extensive application in human genetics, for example in the Human Genome Project, linkage analysis and sequencing.
They are also used in the analysis of DNA, RNA, proteins and low-molecular-weight metabolites, as well as in access to bioinformatics databases.
In Dental Science
To find the statistical difference between the means of two groups, e.g. the mean plaque scores of two groups.
To assess the state of oral health in the community and to determine the availability and
utilization of dental care facilities.
To indicate the basic factors underlying the state of oral health by diagnosing the
community and find solutions to such problems.
To determine the success or failure of specific oral health care programs and to evaluate the programs' actions.
To promote oral health legislation and in creating administrative standards for oral health
care delivery.
In Environmental Science
Statistical data are facts expressed in quantitative form. Data can be obtained through primary or secondary sources. Data obtained by the investigator through personal experimental study are called primary data; data obtained from secondary sources such as journals, magazines and newspapers are known as secondary data. In scientific work only primary data are used.
Observation Method
The observation method is used when the study relates to behavioural science. This method is planned systematically and is subject to many controls and checks.
Interview Method
Questionnaire Method
In this method a set of questions is mailed to the respondents, who read, answer and return the questionnaire. The questions are printed in a definite order on the form.
A good survey should have the following features:
It should have a good physical appearance, such as colour and quality of paper, to attract the attention of the respondent.
Schedules
This method differs slightly from the questionnaire method: enumerators are specially appointed to fill in the schedules. The enumerator explains the aims and objects of the investigation and may remove any misunderstandings that come up. Enumerators should be trained to perform their job with diligence and patience.
Secondary data are collected by someone other than the actual user; the information is already available and has been collected and analysed by someone else. Sources of secondary data include magazines, newspapers, books, journals, etc.
Government publications
Public records
Business documents
Diaries
Letters
Unpublished biographies, etc.
Presentation of data:
Data obtained by the researcher can be displayed in tabular form, in diagrams and through charts. Display of data in tabular form is called tabulation of data, and display through charts is known as charting of data. The process of arranging and presenting primary data in a systematic way is called classification of data.
Data may be grouped or classified in following various ways:
(i) Geographical, i.e. according to area or region. If we take into account the production of fish, lac or silk state-wise, this is geographical classification.
(ii) Chronological, i.e. according to time. The egg production of a poultry farm over five years, given below, is an example of chronological classification:

Year        Eggs produced
1995-96     1590
1996-97     1672
1997-98     1882
1998-99     1961
1999-2000   2233
(iii) Qualitative, i.e. according to attributes or quality. For example, if the fish of one species in a pond are to be classified with respect to one attribute, say sex, we can classify them into two groups, one of males and the other of females.
When classification is done with respect to one attribute that is simple or dichotomous in nature, two classes are formed: one possessing the attribute and the other not possessing it. This type of qualitative classification is called simple or dichotomous classification. When we classify fishes simultaneously with respect to two attributes, e.g. sex and infection status, the classification is called manifold classification.
(iv) Quantitative, i.e. according to magnitudes. For example, plants may be classified according to their growth rate. Quantitative data may be of two types:
(a) Continuous data: these can take any value of the variable. The Hb % of a person can be any value, such as 13 mg/100 c.c., 13.1 mg/100 c.c. and so on; the water percentage in the body of a species may be 65 %, 65.1 %, 65.2 %, 65.3 % and so on.
(b) Discrete data: the term discrete data is limited to discontinuous numerical values of a variable, which come only in whole numbers. For example, the number of persons in a family or the number of books in a library: one cannot say that there are 4½ persons in a family or 500½ books in a library.
Quantitative data are grouped or classified and presented in the form of a frequency distribution table, which presents the data very concisely by indicating the number of repetitions of each observation; it records how frequently each value of a variable occurs in the group under study.
The following raw data were obtained in an investigation: in a garden of pea plants, 100 plants bore pods ranging in number from 15 to 41 (raw data table A).
33, 31, 28, 15, 17, 17, 16, 18, 16, 18, 20, 22, 24, 25, 31, 27, 30, 29, 33, 28, 20, 22, 23, 25, 41, 39,
30, 36, 37, 27, 33, 28, 31, 29, 32, 31, 29, 34, 19, 22, 25, 40, 19, 21, 24, 30, 26, 37, 27, 28, 32, 32,
31, 29, 34, 21, 23, 25, 40, 26, 38, 27, 26, 33, 28, 34, 29, 30, 30, 35, 29, 23, 29, 26, 38, 27, 32, 28,
34, 35, 29, 30, 33, 32, 35, 29, 24, 26, 38, 27, 36, 28, 34, 29, 35, 30, 33, 32, 36, 37.
15, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26,
26, 26, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30,
30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34,
34, 35, 35, 35, 35, 35, 36, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41.
Our first step in the preparation of a frequency distribution table is to arrange the values in ascending order of magnitude; the data are then said to be in an array. Raw data table A, arranged in ascending order of magnitude, gives raw data table B above.
The steps for preparing a discrete frequency distribution table are as follows. A table of two columns is prepared: the first column contains the values of the variable and the second the number of repetitions of each value, i.e. its frequency. In the above data the value 15 occurs only once, so frequency 1 is entered against 15; the value 16 occurs twice, so frequency 2 is entered against it. In the same fashion the frequencies of all values are entered, giving frequency distribution table 1.1.
Table 1.1
Variable Frequency   Variable Frequency
15 1                 29 9
16 2                 30 7
17 2                 31 5
18 2                 32 6
19 2                 33 6
20 2                 34 5
21 2                 35 4
22 3                 36 3
23 3                 37 3
24 3                 38 3
25 4                 39 2
26 5                 40 2
27 6                 41 1
28 7
For convenience, a discrete frequency table may be prepared with the help of tally marks, as follows. A table of three columns is prepared: the first column lists the values of the variable, the second records each repetition of a value with a tally mark, and the third gives the total of the tally marks for each value, which is of course its frequency. If a value appears once, the tally mark I is entered; for the second repetition II, for the third III, for the fourth IIII, and for the fifth a diagonal stroke is drawn across the first four, so that tallies are grouped in fives.
The following simple frequency table 1.2 has been prepared from raw data B with the help of tally marks.
Table 1.2
Variable Tally marks Frequency
15 I 1
16 II 2
17 II 2
18 II 2
19 II 2
20 II 2
21 II 2
22 III 3
23 III 3
24 III 3
25 IIII 4
26 IIII 5
27 IIII I 6
28 IIII II 7
29 IIII IIII 9
30 IIII II 7
31 IIII 5
32 IIII I 6
33 IIII I 6
34 IIII 5
35 IIII 4
36 III 3
37 III 3
38 III 3
39 II 2
40 II 2
41 I 1
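The tally-based table above can be reproduced in a few lines of Python; for brevity, only a short sample of the pod counts is used here:

```python
from collections import Counter

# Build a discrete frequency table (like Table 1.2) from raw observations.
pods = [33, 31, 28, 15, 17, 17, 16, 18, 16, 18]  # first ten values of raw data A
freq = Counter(pods)

# Print: value, tally marks, frequency (tallies shown as plain strokes).
for value in sorted(freq):
    print(value, "I" * freq[value], freq[value])
```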
To make the data comprehensible, identical values of the variable should be classified or grouped into ordered class intervals. To illustrate the construction of a frequency distribution table with class intervals, consider raw data B, which represent the pods per plant in a garden. We first decide the number of classes into which the data are to be grouped. Ordinarily the number of classes should be between 5 and 20, but this may be chosen arbitrarily; it depends on the number of observations, and with a larger number of observations one can have more classes.
The width or range of a class is usually called the class interval and is denoted by h. The width of the class interval must be uniform.
After deciding the class interval we calculate the range, i.e. the highest score H minus the lowest score L: R = H − L. From raw data B, the range R = 41 − 15 = 26.
The following formula gives the approximate number of classes into which the observations should be grouped:
Number of classes k = range of scores / class interval = R/h.
Mid-point of class interval: the class mid-point is the sum of the highest and lowest limits of the class interval divided by two; it therefore falls midway between the upper and lower limits of the class interval.
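The steps above (range, number of classes, mid-points) can be sketched for the pod data, taking H = 41, L = 15 and an arbitrarily chosen class interval h = 5:

```python
# Class intervals and mid-points for the pod data (h = 5 chosen arbitrarily).
H, L, h = 41, 15, 5
R = H - L                      # range = 26
k = -(-R // h)                 # number of classes k = R/h, rounded up -> 6

# Each interval runs from its lower limit to lower limit + h - 1.
intervals = [(L + i * h, L + (i + 1) * h - 1) for i in range(k)]
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
print(intervals)  # -> [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
print(midpoints)  # -> [17.0, 22.0, 27.0, 32.0, 37.0, 42.0]
```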
5.7 SUMMARY
Biostatistics is the application of statistical principles to questions and problems in medicine, public health and biology. In many circumstances it is important to make comparisons among groups of subjects in order to determine whether certain behaviours (exercise, etc.) are associated with a greater risk of certain health outcomes. It would, of course, be impossible to answer all such questions by collecting information (data) from all subjects in the populations of interest; a more realistic approach is to study samples or subsets of a population. The discipline of biostatistics provides tools and techniques for collecting data and then summarizing, analysing and interpreting them: in biostatistics one analyses samples in order to make inferences about the population. This unit introduces the fundamental concepts and definitions of biostatistics. Survey research can be objectivist or subjectivist in nature. The objectivist approach is more rigid and scientific: the hypothesis is tested using a publicly standardized procedure, and there is little or no latitude to deviate from the stated procedures or questions. Data analysis and data presentation have practical applications in every field. Transformed raw data yield useful information, and presentation is the key to success: once the information is obtained, the user transforms the data into a pictorial presentation so as to obtain a better response and outcome.
Question 2. What do you mean by data, population, sample, variable, parameter, class interval, frequency distribution, cumulative frequency distribution, primary data and secondary data?
Question 3. The lower and upper limits of a few class intervals are given below. State the length and mid-point of each class interval:
REFERENCES
Mahajan BK (2002). Methods in Biostatistics, 6th edition.
Zaman SM, Rahim HK and Howlader M (1982). Simple Lessons from Biometry. BRRI.
Kothari CR. Research Methodology: Methods and Techniques. New Age International Publishers.
UNIT 6: MEASURES OF CENTRAL TENDENCY AND
VARIABILITY
CONTENTS
6.1 Objectives
6.2 Introduction
6.9 Summary
6.2 INTRODUCTION
Central tendency may be considered a synonym of average. An average is a general value of a series, around which all the other observations are dispersed. Averages are of two types:
1. Mathematical average
2. Average of positions
Arithmetic mean
Geometric mean
Harmonic mean
2. Average of positions: Average exhibited by position is called average of positions. There are
two types of average positions.
Median
Mode
Median indicates the average position of a series. In a series all observations are arranged in
ascending or descending order and the middle observation is called the median.
Mode is that value which is repeated maximum times in a series. In other words we can say that
the mode is that value which has the maximum frequency.
Standard deviation is the most important and widely used measure of dispersion. It is denoted by
the Greek letter sigma (σ). The standard deviation is the square root of the mean of
the squared deviations of measurements from their mean. It has accordingly often been called the
root mean square deviation.
The chi-square distribution was first described by F. R. Helmert in the 1870s; Karl Pearson
developed the chi-square test in its modern form in 1900. Chi-square is derived from the Greek letter
χ (chi). The t-test is used not only to test the significance of difference between two means but also
to test the significance of product moment correlation, point-biserial correlation, rank difference
correlation etc.
Here, x̄ = Mean
X = observations
N = Number of observations
Example: In a class there are 20 students and they have secured a percentage of 88, 82, 88, 85,
84, 80, 81, 82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83.
Find the mean percentage obtained by the class.
Solution:
Mean = Total of percentage obtained by 20 students in class/Total number of students
= [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 +
82 + 83]/20
= 1660/20
= 83
Hence, the mean percentage of marks of the class is 83%.
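The arithmetic-mean calculation above can be reproduced in a few lines of Python (a sketch using the scores from the example; the variable names are ours):

```python
# Mean percentage for the 20 students from the example above
scores = [88, 82, 88, 85, 84, 80, 81, 82, 83, 85,
          84, 74, 75, 76, 89, 90, 89, 80, 82, 83]

# Arithmetic mean = sum of all observations / number of observations
mean = sum(scores) / len(scores)
print(mean)  # 83.0
```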
Grouped data: When data is presented in a frequency distribution, the mean can be obtained by
two different methods: the long (direct) method and the short-cut method.
Merits of mean: Arithmetic mean is the most important measure of central tendency because:
It covers all the observations.
It can be calculated easily and it expresses a simple relation between the whole and the
parts
It is least affected by the fluctuations of sampling.
Demerits of mean:
Geometric mean: Geometric mean is the nth root of the product of n observations:
GM = (a1 × a2 × ... × an)^(1/n)
Harmonic mean: Harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the given observations.
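All three mathematical averages can be computed with Python's standard `statistics` module; the data set here is a hypothetical example, not from the text:

```python
import statistics

data = [2, 4, 8]  # hypothetical observations

am = statistics.mean(data)            # (2 + 4 + 8) / 3
gm = statistics.geometric_mean(data)  # (2 * 4 * 8) ** (1/3)
hm = statistics.harmonic_mean(data)   # 3 / (1/2 + 1/4 + 1/8)

print(am, gm, hm)  # 4.67, 4.0, 3.43 (approx.)
```

Note that AM ≥ GM ≥ HM always holds for positive observations, which is a quick sanity check on any hand calculation.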
Average of position:
Median indicates the average position of a series. In a series all observations are arranged in
ascending or descending order and the middle observation is called the median. The median is
most suitable for expressing qualitative data such as colour, health, intelligence etc. Median is
calculated differently for ungrouped and grouped data. Ungrouped data: Median of ungrouped
data is calculated by two different methods: When scores are in odd number, formula to obtain
median is as follows:
Median = ((N + 1)/2)th item
When scores are in even number, the median is the mean of the (N/2)th and (N/2 + 1)th items.
When the data is continuous and in the form of a frequency distribution, the median is calculated
through the following sequence of steps.
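The odd- and even-N rules for ungrouped data can be sketched with the `statistics` module (the data sets here are hypothetical):

```python
import statistics

odd = [5, 7, 8, 10, 14]        # N = 5 (odd): median is the ((N+1)/2) = 3rd ordered item
even = [5, 7, 8, 10, 12, 14]   # N = 6 (even): median is the mean of the 3rd and 4th items

print(statistics.median(odd))   # 8
print(statistics.median(even))  # 9.0
```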
Merits of Median:
Median is a better indicator of the average than the mean when one or more of the lowest or the
highest observations are wide apart or not evenly distributed.
It can be calculated easily and can be exactly located.
The value of the median is not influenced by abnormally large or small values or the
change of any one value of the series.
It can also be used in qualitative measures.
Demerits of Median:
Mode is that value which is repeated maximum times in a series. In other words we can say that
the mode is that value which has the maximum frequency. Mode can be obtained by two
methods: Determination of mode at a glance: The value which is repeated maximum times in a
series is considered as mode. The value occurring most frequently in a set of observations is its
mode. In other words, the mode of data is the observation having the highest frequency in a set
of data. There is a possibility that more than one observation has the same frequency, i.e. a data
set could have more than one mode. In such a case, the set of data is said to be multimodal.
In the case of grouped frequency distribution, calculation of mode just by looking into the
frequency is not possible. To determine the mode of data in such cases we calculate the modal
class. Mode lies inside the modal class. The mode of data is given by the formula:
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h
Where,
l = lower limit of the modal class
h = size of the class interval
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
Let us take an example to understand this clearly.
Let us learn here how to find the mode of a given data with the help of examples.
Example 1: Find the mode of the given data set: 3, 3, 6, 9, 15, 15, 15, 27, 27, 37, and 48.
Solution: In the following list of numbers,
3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48
15 is the mode since it is appearing more number of times in the set compared to other numbers.
Example 2: Find the mode of 4, 4, 4, 9, 15, 15, 15, 27, 37, 48 data sets.
Solution: Given: 4, 4, 4, 9, 15, 15, 15, 27, 37, 48 is the data set.
As we know, a data set or set of values can have more than one mode if more than one value
occurs with equal frequency and number of time compared to the other values in the set.
Hence, here both the number 4 and 15 are modes of the set.
Example 3: Find the mode of 3, 6, 9, 16, 27, 37, and 48.
Solution: If no value or number in a data set appears more than once, then the set has no mode.
Hence, for set 3, 6, 9, 16, 27, 37, 48, there is no mode available.
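The first two examples can be checked with `statistics.multimode`, which returns every value tied for the highest frequency:

```python
import statistics

# Example 1: one value (15) occurs most often -> a single mode
print(statistics.multimode([3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]))  # [15]

# Example 2: 4 and 15 tie for the highest frequency -> two modes (multimodal)
print(statistics.multimode([4, 4, 4, 9, 15, 15, 15, 27, 37, 48]))      # [4, 15]
```

One caveat: for a set such as Example 3, where every value occurs exactly once, `multimode` returns every value, whereas the text treats such a set as having no mode.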
Example 4: In a class of 30 students marks obtained by students in mathematics out of 50 is
tabulated as below. Calculate the mode of data given.
Solution:
The maximum class frequency is 12 and the class interval corresponding to this frequency is
20–30. Thus, the modal class is 20–30.
Lower limit of the modal class (l) = 20
Size of the class interval (h) = 10
Frequency of the modal class (f1) = 12
Frequency of the class preceding the modal class (f0) = 5
Frequency of the class succeeding the modal class (f2) = 8
Substituting these values in the formula we get:
Mode = 20 + ((12 − 5) / (2 × 12 − 5 − 8)) × 10 = 20 + (7/11) × 10 = 26.36 (approx.)
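The grouped-data mode formula can be written as a small helper function; the values below are those of Example 4:

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode of grouped data: l + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    return l + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Example 4: modal class 20-30, so l = 20, h = 10, f1 = 12, f0 = 5, f2 = 8
print(round(grouped_mode(l=20, h=10, f1=12, f0=5, f2=8), 2))  # 26.36
```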
6.4 MEAN DEVIATION
Mean deviation may be defined as the mean of all the deviations of the observations in a given
set of data from an average. All the deviations are treated as positive.
Example: Hb% of 10 patients of a ward of a hospital were obtained as 5, 7, 8, 10, 14, 12, 13, 5,
8, 8. Compute the mean deviation.
Calculation: The following steps have to be taken to calculate the mean deviation of ungrouped data.
Step 1: Compute the mean.
x̄ = (5+7+8+10+14+12+13+5+8+8)/10
= 90/10 = 9
Step 2: Following table is prepared to obtain deviations between each score and mean.
Σ|x| = 4+2+1+1+5+3+4+4+1+1
= 26
Step 3: Divide the sum of the deviations by the number of observations.
MD = Σ|x|/N = 26/10 = 2.6
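The same mean-deviation calculation can be reproduced directly from the Hb% data of the worked example:

```python
# Mean deviation of the Hb% data from the worked example above
data = [5, 7, 8, 10, 14, 12, 13, 5, 8, 8]

mean = sum(data) / len(data)                       # 9.0
md = sum(abs(x - mean) for x in data) / len(data)  # 26 / 10
print(md)  # 2.6
```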
Grouped data: Following formula is used to obtain mean deviation from grouped data:
MD = Σf·|x| / Σf, where x is the deviation of each class mid-point from the mean.
Demerits: It is less reliable because positive and negative signs are ignored.
mean of the squared deviations of measurements from their mean. It has accordingly often been
called the root mean square deviation. Standard deviation is calculated differently in ungrouped
and grouped data.
Ungrouped data: The following formula is used, where deviation is obtained from the mean:
σ = √(Σx²/N)
Note: Standard deviation is computed using N − 1 in the denominator of the above formula in
place of N if the size of the sample is less than 30, i.e. σ = √(Σx²/(N − 1)). If the size of the
sample is more than 30 then the previous formula, i.e. σ = √(Σx²/N), is used.
The above formula calls for the following six steps of computation in fixed order:
Step 1. Compute the mean x̄ of the observations.
Step 2. Find the deviation of each score from the mean, x = X − x̄.
Step 3. Square each deviation to obtain x².
Step 4. Sum the squared deviations to obtain Σx².
Step 5. Divide Σx² by N (or by N − 1 if the sample is smaller than 30).
Step 6. Extract the square root of the result of step 5. This is the standard deviation.
Solved example: Haemoglobin percent (g/100 mL) of Heteropneustes fossilis was recorded as 23,
22, 20, 24, 16, 17, 18, 19 and 21. Find the standard deviation.
Calculation: The following table (4.7), having four columns, is prepared on the basis of the above
observations. Here N = 9 and x̄ = 180/9 = 20.
X    X − x̄    x    x²
16 16-20 -4 16
17 17-20 -3 9
18 18-20 -2 4
19 19-20 -1 1
20 20-20 0 0
21 21-20 +1 1
22 22-20 +2 4
23 23-20 +3 9
24 24-20 +4 16
Σx² = 60
Here the size of the sample is less than 30. Therefore the following formula is applicable:
σ = √(Σx²/(N − 1))
On putting the values in the above formula:
σ = √(60/(9 − 1)) = √7.5
= 2.74
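Since N = 9 < 30, the N − 1 denominator applies, which is exactly what `statistics.stdev` (the sample standard deviation) uses:

```python
import statistics

# Hb% data from the solved example (N = 9 < 30, so the N-1 formula applies)
data = [16, 17, 18, 19, 20, 21, 22, 23, 24]

# statistics.stdev uses the N-1 (sample) denominator: sqrt(60 / 8) = sqrt(7.5)
sd = statistics.stdev(data)
print(round(sd, 2))  # 2.74
```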
The following formula is used to obtain standard deviation by the long method from grouped data:
σ = √(Σf·x²/N), where N = Σf
The above formula calls for the following steps of computation in fixed order:
Step 1. Find the mid-point X of each class.
Step 2. Compute the mean x̄ from the mid-points and their frequencies.
Step 3. Find the deviation of each mid-point from the mean, x = X − x̄.
Step 4. Square each deviation, finding x².
Step 5. Multiply each squared deviation by the corresponding frequency, finding f·x².
Step 6. Sum these products to obtain Σf·x² and divide by N (or N − 1 for a small sample).
Step 7. Extract the square root. This is the standard deviation.
Solved example- Weight of testis of 50 frogs is given below with their frequency. Find the
standard deviation.
Deviation x of each score (mid-point) is obtained from the actual mean using the formula
x = X − x̄.
σ = √(Σf·x²/Σf)
= 1.02
Merits:
(i) It summarizes the deviation of a large distribution from mean in one figure used as a unit of
variation.
(ii) It indicates whether the variation of difference of an individual from the mean is real or by
chance.
(iii) It helps in finding the suitable size of sample for valid conclusions.
Demerits: It gives greater weightage to extreme values. The process of squaring deviations and
then taking the square root involves lengthy calculation.
Variance: The square of the standard deviation is called variance, σ² = V.
Suppose that we have a sample of one case with only one score. There is no possible basis for
individual difference in such a sample; therefore there is no variance and variability. Consider a
second individual with his score in the same test or experiment. We now have one difference.
Consider a third case and we then have two additional differences, three altogether. There are as
many differences as there are possible pairs of individuals. We could compute all these inter pair
differences and could average them to get a single, representative value. We could also square
them and then average them. It is most easy to find a mean of all scores and to use that value as a
common reference point.
Each difference then becomes a deviation from that reference point, and there are only as many
deviations as there are individuals. Either the variance or the S.D. is a single representative value
for all the individual differences when taken from a common reference point.
Example: Hb% of 10 patients in a ward was recorded as 7, 8, 9, 10, 11, 12, 13, 14.5, 15 and
15.5g/100mL. Find out the variance of the data.
x̄ = ΣX/N = (7+8+9+10+11+12+13+14.5+15+15.5)/10 = 115/10 = 11.5
Σx² = 81. Since N < 30, V = Σx²/(N − 1) = 81/9 = 9. Ans.
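The variance of the Hb% data above can be checked with `statistics.variance`, which, like the text's small-sample rule, uses the N − 1 denominator:

```python
import statistics

# Hb% data from the example above (N = 10 < 30, so the N-1 denominator applies)
data = [7, 8, 9, 10, 11, 12, 13, 14.5, 15, 15.5]

# statistics.variance uses the N-1 (sample) denominator: 81 / 9
v = statistics.variance(data)
print(v)  # 9.0
```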
Measurement of relative dispersion (coefficient of variation): Measures of dispersion give us an
idea about the extent to which observations are scattered around their central value. Therefore, two
distributions having the same central values can be compared directly with the help of measures
of dispersion. If, for example, in an analysis of seed number per fruit in two batches of 10
fruits in a garden, batch I has a mean score of 70 with a standard deviation of 1.25 and batch II
has a mean score of 80 with a standard deviation of 2.4, then it is clear that batch I, having the
lesser value of S.D., is more consistent in producing seed number than batch II.
On the other hand we have a situation when two or more distributions having unequal means or
different units of measurements are to be compared in respect of their variability. For making
such a comparison we use a statistic called relative dispersion or coefficient of variation (c.v.).
The formula for the coefficient of variation is as follows:
CV = (standard deviation / mean) × 100
Example: Mean values of Hb% of 20 males and 20 females were calculated as 13.5 and 14
g/100 mL, with standard deviations of 3 and 4 respectively. Find the coefficients of variation of both
males and females. Mention which sex is more variable and which more consistent.
CV (males) = (3/13.5) × 100 = 22.2%; CV (females) = (4/14) × 100 = 28.6%
Deductions: Females are more variable than males in respect of Hb%. In other words, contrary to
females, males are more consistent in Hb%.
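The CV comparison above is a one-line formula; a sketch using the values from the example:

```python
def coefficient_of_variation(sd, mean):
    """Relative dispersion: CV = (SD / mean) * 100, expressed as a percent."""
    return sd / mean * 100

cv_males = coefficient_of_variation(3, 13.5)
cv_females = coefficient_of_variation(4, 14)
print(round(cv_males, 1), round(cv_females, 1))  # 22.2 28.6
```

Because CV is dimensionless, it allows the two groups to be compared even though their means differ.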
Merits and demerits of variance:
1) It is easy to calculate.
Demerits:
a) The unit of expression of variance is not the same as that of the observations, because the
variance of observations obtained in cm will be in square cm.
b) Variance is usually a large number compared to the values of observations. Therefore variance
is now seldom used to express the variability.
CHI-SQUARE TEST
Standard error and Student's t tests are parametric tests and are applied only to quantitative data. In
biological experiments a non-parametric test called the chi-square test is very commonly used. It is
applied only to qualitative data such as colour, health, intelligence, cure response of drugs etc.,
which do not have numerical values.
The chi-square distribution was first described by F. R. Helmert in the 1870s; Karl Pearson
developed the chi-square test in its modern form in 1900. Chi-square is derived from the Greek letter
(χ) and pronounced "kye".
Definition: The chi-square test is a test of significance based on the sum of the squared deviations
between observed and expected frequencies, each divided by the expected frequency.
1. As an alternative test to find the significance of difference in two or more than two proportions.
The chi-square test is a very useful test which is applied to find significance in the same type of data
with two more advantages:
a) To compare the values of two binomial samples even if they are small, such as oxygen
consumption in 5 control and 5 thyroxine-injected fishes of the same species.
2. As a test of association between two events in binomial or multinomial samples. Chi- square
measures the probability of association between two discrete attributes. Two events can be
studied for their association such as iron intake and Hb%, season and fecundity, T4 injection and
oxygen consumption, nutrition and intelligence, weight and diabetes etc. There are two
possibilities, either they influence or they do not.
Association table: Table is prepared by enumeration of qualitative data. Since one wants to know
the association between two sets of events, the table is also called association table.
Fourfold or 2 × 2 contingency table: When there are only two samples, each divided into two
classes, it is called a fourfold, four-cell or 2 × 2 contingency table.
Multifold Association Table: The association of two sets of events having more than two classes
will be larger than a fourfold or four cell contingency table.
Goodness of fit reveals the closeness of the observed frequencies to those expected. Thus it
helps to answer whether something (a physical or chemical factor) did or did not have an effect. If
observed and expected frequencies are in complete agreement with each other, then the chi-square
will be zero. But this rarely happens in biological experiments; there is always some degree of
deviation.
Conditions for applying the chi-square test: (i) The sample should be drawn at random.
(ii) Data should be qualitative. (iii) Observed frequency should not be less than five.
Method to draw inferences: If the calculated χ² value is more than the table value at the given
degree of freedom, the difference is significant; if it is less, the value is insignificant.
The quantity which is one less than the number of independent observations
in a sample is called the degree of freedom. If there are 2 classes (for example control and T4
injected, or male and female) the degree of freedom would be 2 - 1 = 1. If there are three classes
then d.f. = 3 - 1 = 2; in case of 4 classes d.f. = 4 - 1 = 3, and so on.
χ² = Σ{(fo − fe)²/fe}.
Make a contingency table and note the observed frequency (fo or O) in each class of one event,
row wise i.e. horizontally and then the numbers in each group of the other event, column wise
i.e. vertically.
Determine the expected frequency (fe or E) in each group of the sample on the assumption of
null hypothesis (Ho) i.e. no difference in the proportion of the group from that of the population.
Find the difference between the observed and expected frequency in each cell (fo-fe) or (O-E).
Calculate χ² = Σ{(fo − fe)²/fe}.
Compare the calculated value of χ² with the table value at the appropriate degree of freedom under
different probabilities (0.5, 0.1, 0.05, 0.01, 0.001 etc.). If the calculated value is more than the
table value it is considered significant; if the calculated value is less than the table value then it is
considered insignificant.
Example: In a monohybrid cross between tall (TT) and dwarf (tt) 1574 tall and 554 dwarf were
obtained. Suggest if a ratio of 3 : 1 is suitable or not.
Total = 1574 + 554 = 2128. On a 3 : 1 ratio the expected frequencies are 1596 tall and 532 dwarf.
χ² = Σ{(fo − fe)²/fe} = (1574 − 1596)²/1596 + (554 − 532)²/532
= 0.303 + 0.909
= 1.212. Ans.
Here, d.f. = 2 - 1 = 1
Significance: The table value of χ² at d.f. = 1 and P = 0.05 is 3.84. The calculated value (1.212)
is less than the table value and is therefore insignificant. This shows that the two series of
frequencies, observed and expected, are in agreement with the theoretical ratio of 3 : 1.
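The monohybrid-cross calculation can be reproduced directly from the observed counts; note that computing without intermediate rounding gives 1.213 rather than the 1.212 obtained from the rounded addends:

```python
# Chi-square for the monohybrid cross above: observed 1574 tall and 554 dwarf,
# expected in a 3:1 split of the 2128 offspring (1596 and 532).
observed = [1574, 554]
total = sum(observed)                       # 2128
expected = [total * 3 / 4, total * 1 / 4]   # [1596.0, 532.0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))  # 1.213
```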
Example: In a cross between a black male and a gray female Drosophila, the offspring obtained
were 25 black and 35 gray. The expected number is calculated from the fact that gray body colour
is dominant and the expected ratio of this nature is 1 : 1 [total number of offspring is 60].
Table:
BLACK GRAY
Observed number 25 35
Expected number 30 30
(O-E) 25-30 = -5 35-30 = +5
(O-E)2 = (25-30)2 = 25 (35-30)2 = 25
χ² = Σ{(fo − fe)²/fe}
= 25/30 + 25/30
= 50/30 = 1.66. Ans.
Applications of t-test: The significance of difference between two means is obtained differently
in uncorrelated or unpaired t-test and paired t-test.
[Note: Standard error of the difference between two uncorrelated means is calculated differently
during t-test]
Following steps have to be taken to test the significance of difference between two uncorrelated
means:
SED = √(SE x̄1² + SE x̄2²)
Here, SED = Standard error of the difference between the two sample means.
SE x̄1 = Standard error of the first mean, σ/√N1.
SE x̄2 = Standard error of the second mean, σ/√N2.
t = (x̄1 − x̄2)/SED, with d.f. = N1 + N2 − 2.
To obtain SE one has to obtain the value of the combined standard deviation (σ). The following
formula is used to obtain σ:
σ = √((Σx1² + Σx2²)/(N1 + N2 − 2))
Example: Body length of 10 fishes of a species was obtained from two ponds
(populations) of Gaya town. They were measured as:
Pond A: 20 cm, 24 cm, 20 cm, 28 cm, 22 cm, 20 cm, 24 cm, 32 cm, 24 cm and 26 cm.
Pond B: 12 cm, 10 cm, 8 cm, 10 cm, 6 cm, 4 cm, 14 cm, 20 cm, 10 cm and 6 cm.
Calculate whether the mean difference in total body length between the two ponds of fish is
significant or not.
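A sketch of the unpaired t-test for this example, using the combined-sigma formula σ = √((Σx1² + Σx2²)/(N1 + N2 − 2)) described above (variable names are ours):

```python
import math

# Unpaired t-test for the two-pond example
pond_a = [20, 24, 20, 28, 22, 20, 24, 32, 24, 26]
pond_b = [12, 10, 8, 10, 6, 4, 14, 20, 10, 6]

mean_a = sum(pond_a) / len(pond_a)  # 24.0
mean_b = sum(pond_b) / len(pond_b)  # 10.0

ss_a = sum((x - mean_a) ** 2 for x in pond_a)  # sum of squared deviations = 136
ss_b = sum((x - mean_b) ** 2 for x in pond_b)  # sum of squared deviations = 192

# Combined sigma with N1 + N2 - 2 = 18 degrees of freedom
sigma = math.sqrt((ss_a + ss_b) / (len(pond_a) + len(pond_b) - 2))
sed = sigma * math.sqrt(1 / len(pond_a) + 1 / len(pond_b))
t = (mean_a - mean_b) / sed

print(round(t, 2))  # 7.33
```

A t of about 7.33 with 18 d.f. is far above the 0.05 table value (≈ 2.10), so the difference between the ponds would be judged significant.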
6.9 SUMMARY
Central tendency may be considered as a synonym of average. Average is a general term which
describes the general value of series, around which all other observations are dispersed.
The chi-square distribution was first described by F. R. Helmert in the 1870s; Karl Pearson
developed the chi-square test in its modern form in 1900. Chi-square is derived from the Greek letter χ (chi).
Question1. Three groups with 20 patients in each were administered analgesics A, B and C.
Relief was noted in 20, 10 and 6 cases respectively. Is this difference due to the drug or by
chance?
Question3. What do you understand by goodness of fit?
Question4. Define and explain mean, median and mode. Mention formula to find out mean,
median and mode both for ungrouped and grouped data.
Question5. Calculate mean and median from the following ungrouped data:
Question6. Compute N (No of observation) when mean = 5 and sum of means = 30.
Question7. Calculate mean, median and mode from the data given in following table
Table A
REFERENCES
Mahajan BK (2002). Methods in Biostatistics, 6th edition.
Zaman SM, Rahim HK and Howlader M (1982). Simple Lessons from Biometry. BRRI.
Kothari CR. Research Methodology: Methods and Techniques. New Age International
Publishers.
7.1 OBJECTIVES
To study
Types of correlation.
Regression analysis.
7.2 INTRODUCTION
Correlation means association of two or more facts. In statistics, correlation may be defined as
the relationship between two or more variables. A distribution involving two
variables is called a bivariate distribution and a distribution involving more than two variables is
called a multivariate distribution. In statistics we study the degree of correlation between two or
more variables. Sometimes two variables are measured in the same individual, such as length and
weight, oxygen consumption and body weight, body weight and Hb% etc. At other times the
same variable is measured in two or more related groups, such as tallness in parents and
offspring, intelligence quotient (IQ) in brothers and in corresponding sisters (siblings), and so on.
F. Galton coined the term regression in 1885 to explain the data obtained during his study of
heredity of height: the heights of the offspring of very tall and very short parents tended to
regress, or remain in the middle, towards the average height of the population.
We have studied that, in order to draw a relationship, observations of two variables are plotted in
the form of dots in a scatter diagram. A straight line is drawn which approaches as close as
possible to all these points on the graph. The statistical analysis employed to find out the exact
position of the straight line is known as linear regression analysis. The main objective of
regression analysis is to predict the value of one variable using the known value of the other. The
existence of a relationship between the independent variable X and the dependent variable Y can
be expressed in a mathematical form known as the regression equation. The equation expressed
by a straight line is called the linear regression equation.
In bivariate distribution, the correlation may be positive or negative, and linear or curvilinear.
Two variables co-varying in the same direction are positively correlated. For example, we expect
a positive correlation between height and weight of a group of individuals.
Correlation between the two variables in opposite directions is negatively correlated. The
increase in one variable results in a decrease in the other. For example, an increase in the number
of caterpillars results in a corresponding decrease in leaves of plants.
The correlation of two variables which can be expressed by a straight line is called linear
correlation. In perfect linear correlation the amount of change in one variable bears a constant
ratio to the amount of change in the other. For example, the length of five fishes of a species and
their snout length is measured in cm. The measurement is given below:
The above correlation indicates that each value of Y is 1 cm more than the corresponding value
of X. This means that the correlation between the above two variables is expressible in the form
Y = X + 1, which is an expression representing a straight line, i.e. a perfect positive linear
relationship, in which the correlation between X and Y will be +1 (though this rarely happens in
biological experiments).
The correlation of scores of two variables which is shown by a curved line on a graph is called
curvilinear correlation.
Correlation analysis is a statistical tool used to know to what extent two or more things are
related and to what extent variations in one go with variations in the other. There are two main
methods of studying correlation:
1. Scatter diagram method: Scatter diagram or dot diagram is a graphic device for drawing
certain conclusions about the correlation between two variables. In preparing a scatter diagram,
the observed pairs of observations are plotted by dots on a graph paper by taking the
measurements on variable X along the horizontal axis and that on variable Y along the vertical
axis. The placement of these dots on the graph reveals the change in the variable as to whether
they change in the same or in opposite directions. Scatter diagram showing various degrees of
correlation.
I. Perfect Positive Correlation: The two variables denoted by letter X (Body length) and
Y (Body weight) are directly proportional and fully correlated with each other. Both
variables rise or fall in the same proportion. Examples of perfect or total correlation is
very rare in nature but some approaching to that extent are there such as day length and
temperature; rain and humidity; body weight and height; age and height; age and weight
etc, upto certain age. The imaginary mean line rising from the lower ends of both X and
Y axes forms a straight line. When the scatter diagram is drawn all the points fall around
the mean line.
II. Moderately positive correlation: The two variables denoted by X (Age of husband) and
Y (Age of wife) are partially positively correlated. Values of correlation coefficient (r) lie
between 0 and +1, i.e., 0<r<1. Other examples of positive correlation may be infant
mortality rate and overcrowding, tallness of plants and the quantity of manure used,
nutrition and death rate in pregnancy etc. In such moderately positive correlation, the
scatter will be there around an imaginary mean line, rising from the lower ends of both X
and Y variables.
III. Perfect negative correlation: The variables denoted by letter X (Temperature) and Y
(Lipid content of body of a species of fish) are inversely proportional to each other, i.e.,
when one (X) rises the other (Y) falls in the same proportion. The correlation coefficient
(r) = −1. Examples of perfect negative correlation are also very rare in nature but
some approaching to that extent are there such as temperature and lipid content of the
body, RBCs number and Hb%, T4 injection and oxygen.
IV. Moderately negative correlation: The two variables denoted by X (Economic
condition of States) and Y (case of tuberculosis). In this case values of correlation
coefficient lie between -1 and 0 such as income and infant mortality rate, age and vitality
in adults etc. In such moderately negative correlation, the scatter will be there around an
imaginary mean line rising from the extreme values of the variable.
V. Absolutely no correlation. In this case the value of correlation coefficient (r) is zero,
indicating that no linear relationship exists between the two variables. There is no
imaginary mean line indicating a trend of correlation. X is completely independent of Y
such as Hb% and body weight; Body weight and IQ etc. In absolutely no correlation X
variable is completely independent of Y variable. In this case points are so scattered that
no imaginary line can be drawn.
2. Karl Pearson's coefficient of correlation: This is the most widely used of the algebraic
methods of finding correlation between two variables. The coefficient of correlation (r)
gives an idea about the degree of linear relationship between two variables. The formula to obtain
the coefficient of correlation (r) is as follows:
r = Σ(x·y) / √(Σx² × Σy²)
Where X is the independent variable normally represented by the abscissa and Y is the
dependent variable represented by the ordinate. x and y are the deviations from the respective
means (as used for other purposes determination of variance and standard deviation).
The coefficient of correlation is thus obtained by dividing the sum of the products of the
deviations of the two variables from their respective means by the square root of the product of
the sums of squares of deviations from the respective means of the two variables.
x = deviation of X variable
y = deviation of Y variable
Example: The length and weight of 7 groups of fishes of a species is given below. Find out the
correlation coefficient of the two variables.
Length of body 11.7 cm, 13.9 cm, 15.5 cm, 17.8 cm, 18.5 cm, 19.2 cm, 21 cm.
Weight of the body 7.10 g, 12.42 g, 15.35 g, 23.20 g, 28.45 g, 32.25 g and 39.84 g.
N = 7, ΣX = 117.6, ΣY = 158.61, Σx² = 66.64, Σy² = 821.19, Σ(x·y) = 223.56
Calculation: For calculation of correlation coefficient from above data (ungrouped series) a
table is prepared with help of following steps;
3) Find out the actual mean of X and Y with the help of a formula.
4) Find the deviation of all scores of X and Y. Formula to find out deviation from actual
mean (Score - Mean). Put all values of deviations in column 4 against their scores for
variable X and in column 5 for variable Y.
5) Find the square of x and y and put them in column 6 and 7 respectively.
6) Find the product of x and y for each score and put it in the last, i.e. 8th, column. All
values of x×y are summed and the total given at the foot of the column.
r = Σ(x·y) / √(Σx² × Σy²)
= 223.56 / 233.93
= 0.96.
Inference: The calculated value of the correlation coefficient (r) is 0.96. One has to see the
table value of r at d.f. = N − 2, i.e. 7 − 2 = 5. The calculated value is far above the table value,
showing that the two variables, i.e. length and weight of the body, are in almost complete +ve
correlation.
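Pearson's r can be computed straight from the raw fish data using the deviation formula above. Working from the raw scores without intermediate rounding may give a value slightly different from the 0.96 obtained with the hand-tabulated sums:

```python
import math

# Length (cm) and weight (g) of the 7 groups of fish from the example above
length = [11.7, 13.9, 15.5, 17.8, 18.5, 19.2, 21.0]
weight = [7.10, 12.42, 15.35, 23.20, 28.45, 32.25, 39.84]

mx = sum(length) / len(length)   # mean of X
my = sum(weight) / len(weight)   # mean of Y
x = [v - mx for v in length]     # deviations of X from its mean
y = [v - my for v in weight]     # deviations of Y from its mean

# r = sum(x*y) / sqrt(sum(x^2) * sum(y^2))
r = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y))
print(round(r, 2))
```

Either way the value is very close to +1, confirming the near-perfect positive correlation between body length and weight.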
2) Another objective of regression analysis is to find a measure of the error involved when a
regression line is used for prediction.
Linear regression: Linear regression between the values of two variables is possible only when one
unit change in the independent variable (X) produces a change of a definite quantity in the
dependent variable (Y). This change may be on the positive or negative side beyond the mean.
The lines of best fit passing through the middle of the points on the plotted graph are drawn.
These lines are called regression lines. Two regression lines are drawn, one of X on Y and the
other of Y on X, indicating conditions of moderately +ve and moderately −ve correlation
respectively. The two regression lines intersect at the point where perpendiculars drawn from the
means of the X and Y variables meet.
When there is perfect correlation (r = +1 or −1) the two regression lines will coincide and become
one straight line, though perfect correlation is hardly possible in biological experiments. When the
correlation is partial, the lines will be separate and diverge, forming an acute angle at the meeting
point of the perpendiculars drawn from the means of the two variables. The lesser the correlation,
the greater will be the divergence of the angle. Steepness of the lines indicates the extent of
correlation: the closer the correlation, the greater is the steepness of the regression lines of X on Y
and Y on X.
Construction of regression lines is based on the least-squares assumption: the general condition
for regression analysis is the line of best fit, which minimizes the squared error.
Regression equation: The existence of a relationship between the independent variable X and the
dependent variable Y can be expressed in a mathematical form known as the regression
equation. These equations represent the regression lines.
The regression equation of Y on X indicates the changes in the values of Y for given changes in X.
Likewise, the regression equation of X on Y indicates the changes in the values of X for given
changes in Y.
Regression equation of X on Y:
X=a+by
Regression equation of Y on X:
Y=a+bx
In both equations x and y are values of the variables whereas a and b are constants. Constant a is
the intercept, i.e. it is the point where the regression line touches the Y axis. In other words, the
distance between the touching point of the regression line on the Y axis and the point of origin
is a. If the correlation is +ve the regression line touches the Y axis above the point of origin, and
in case of −ve correlation the regression line touches the Y axis below the point of origin.
Since each regression line passes through the means of the two series, substituting the means
gives the value of a:
x̄ = a + bȳ, so a = x̄ − bȳ
Likewise, ȳ = a + bx̄, so a = ȳ − bx̄
Here x̄ is the mean of the x series and ȳ is the mean of the y series. Constant b exhibits the slope
of the line. It is the tangent of the angle made by the regression line with the horizontal line (X
axis); in other words, b is the gradient or slope. It means that for any distance measured on the
X axis:
b = change in the values of Y / corresponding distance on the X axis
In the given graph the positions of a and b have been made clear from the equation y = a + bx.
This makes clear that the determination of any particular straight line depends on the values of a
and b, and the best least squares line can be obtained only when the true values of a and b are
determined. The values of a and b can be obtained from the following two normal equations:
Σy = na + bΣx
Σxy = aΣx + bΣx²
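The two normal equations can be solved directly for a and b. A minimal Python sketch, using made-up data, applies Cramer's rule to the 2×2 system:

```python
# Sketch: solving the two normal equations for a and b in y = a + b*x:
#   sum(y)   = n*a + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x**2)
# The data are hypothetical, chosen to lie close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Cramer's rule on the 2x2 system gives the slope first:
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
# then the intercept (equivalently a = y_mean - b * x_mean):
a = (sy - b * sx) / n
# For this data: b = 1.99, a = 0.05, i.e. y ≈ 0.05 + 1.99x
```

The same two formulas are what any least squares routine computes internally for a simple linear regression.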
1. Regression analysis provides estimates of values of the dependent variable from values of the
independent variable. The device used to accomplish this estimation procedure is the regression
line. The regression line describes the average relationship existing between the X and Y
variables, i.e., it displays the mean values of Y for given values of X. The equation of this line,
known as the regression equation, provides estimates of the dependent variable when values of
the independent variable are inserted into the equation.
2. A second goal of regression analysis is to obtain a measure of the error involved in using the
regression line as a basis for estimation. For this purpose the standard error of estimate is
calculated. This is a measure of the deviation of the observed values from the corresponding
values estimated from the regression line. If the line fits the data closely, that is, if there is little
scatter of the observations around the regression line, good estimates can be made of the Y
variable. On the other hand, if there is a great deal of scatter of the observations around the fitted
regression line, the line will not produce accurate estimates of the dependent variable.
3. With the help of the regression coefficients we can calculate the correlation coefficient. The
square of the correlation coefficient (r), called the coefficient of determination (r²), measures the
degree of association or correlation that exists between the two variables. It represents the
proportion of variance in the dependent variable that has been accounted for by the regression
equation. In general, the greater the value of r², the better the fit and the more useful the
regression equation as a predictive device.
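The relationship between r and r² stated above can be checked numerically. A short sketch with hypothetical data computes r² both as the proportion of variance explained and as the square of the correlation coefficient:

```python
import math

# Sketch: coefficient of determination r^2 as the proportion of the
# variance in Y accounted for by the regression of Y on X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]         # hypothetical data
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

# Fit the regression of Y on X
b = sxy / sxx
a = my - b * mx

# Proportion of variance explained: 1 - (residual SS / total SS)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy

# Correlation coefficient computed directly
r = sxy / math.sqrt(sxx * syy)
# For a least squares fit, r_squared equals r ** 2 exactly.
```

The identity r² = 1 − SS_res/SS_total holds exactly for a simple least squares regression, which is why r² can be read either as squared correlation or as variance explained.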
7.5 SUMMARY
Correlation means association of two or more facts. In statistics, correlation may be defined as
the relationship between two or more variables. A distribution involving two variables is called a
bivariate distribution and a distribution involving more than two variables is called a
multivariate distribution. In statistics we study the degree of correlation between two or more
variables. Sometimes two variables are measured in the same individual, such as length and
weight, oxygen consumption and body weight, body weight and Hb%, etc.
Regression analysis is a branch of statistical theory that is widely used in almost all the scientific
disciplines. In economics it is the basic technique for measuring or estimating the relationship
among economic variables that constitute the essence of economic theory and economic life. The
uses of regression are not confined to economics and business fields only. Its applications are
extended to almost all the natural, physical, and social sciences.
Question 2: What is regression? Differentiate between correlation and regression. Explain the
method of least square to estimate the regression coefficient in a linear regression of Y on X.
Question 3: What is the purpose of regression analysis? What do you mean by linear regression?
Explain regression equation.
Question 4: The body length and girth of 7 groups of a species of fish (in cm) are as follows.
Find the regression equation.
Body length (cm)    Girth (cm)
13.9                4.2
15.7                4.7
15.8                4.7
17.5                5.2
18.1                5.4
19.9                6.0
22.0                6.5
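As a worked sketch for Question 4 (assuming girth is to be estimated from body length, i.e. the regression of Y on X), the normal equations give:

```python
# Sketch: regression of girth (Y) on body length (X) for the 7 fish
# groups of Question 4, solved via the normal equations.
lengths = [13.9, 15.7, 15.8, 17.5, 18.1, 19.9, 22.0]   # X, in cm
girths = [4.2, 4.7, 4.7, 5.2, 5.4, 6.0, 6.5]           # Y, in cm
n = len(lengths)
sx, sy = sum(lengths), sum(girths)
sxx = sum(x * x for x in lengths)
sxy = sum(x * y for x, y in zip(lengths, girths))

# Slope and intercept from the normal equations
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n
# Regression equation: girth ≈ a + b * length
# (roughly Y = 0.13 + 0.29 X for these data)
```

So each extra centimetre of body length corresponds to roughly 0.29 cm of additional girth for these groups.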
7.7 REFERENCES