
Introduction to

Bioinformatics and Systems Biology

August 13-20, 2012.

Organized by
Bioclues Organization
www.bioclues.org
An affiliate of International Society for Computational Biology (ISCB.org)
and Asia Pacific Bioinformatics Network (APBioNet.org)
Hyderabad, India

Introduction to Bioinformatics and Systems Biology Page# 1 of 53


Foreword

This manual is not an exhaustive account of the program taught during the workshop. Selected topics likely to
interest the participants are dealt with in a pragmatic way. The manual should therefore not be construed as a
full reference for the participants during their hands-on sessions. In addition, exercises and a summary will be
given to the participants separately at the end of each day.

Prashanth Suravajhala, PhD and Team Bioclues


Founder, Bioclues.org

Four avenues of Bioclues:

M – Mentoring
O – Outreach
R – Research
E – Entrepreneurship

Contributors for the manual

 Raghunath, Keshavachandran, Professor and Head, Bioinformatics Centre, KAU, Thrissur, India.
 Jayaraman Valadi, PhD. Scientist Emeritus, CSIR, India.
 Tiratha Raj Singh, PhD, Vice President, Bioclues.org. Sr. Lecturer, JUIT, Solan, HP, India
 Pritish Varadwaj, PhD. Co founder, Bioclues.org. Professor, IIIT Allahabad, India
 Arun Gupta, M.Tech. Visiting Faculty, DAV University, Indore, India.
 Sundararajan VS, Nanyang Technological University, Singapore.
 Mohana Lata Paul, CCMB Alumni and Kakatiya University
 Renuka Suravajhala (PhD), Roskilde University, Denmark
 Shidhi, PhD fellow, Kerala University, Trivandrum, India.
 Shrish Tiwari, Ph D. CCMB, Hyderabad.
 Prashanth Suravajhala, PhD.

Section 1: Introduction: The what and the how of Bioinformatics
Dr. Paulien Hogeweg founded the Theoretical Biology and Bioinformatics group at Utrecht University in 1977.
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1978 while studying informatic
processes in biotic systems. So how different is Computational Biology from Bioinformatics? Bioinformatics is
regarded as a set of tools, while Computational Biology is regarded as the broader discipline (science). Several
bioinformatics tools established in the recent past have paved the way for the branches of Computational
Biology known today: Systems Biology, Genome Informatics and Protein Informatics. With the advent of these
disciplines, however, all of them are now grouped under one broad label, Computational Systems Biology.

Figure 1: From hypotheses to dogmas: how Bioinformatics has evolved.

Thrust areas of Bioinformatics (not limited to the subject alone)
 Sequence analysis
 Genome annotation
 Computational evolutionary biology
 Gene and protein expression analysis

◦ Mutations
 Comparative genomics
 Structural Biology
◦ Predicting structures
◦ Docking
◦ Biological modeling and high-throughput
 High-throughput computing
 Interactions and Functional Genomics
◦ Molecular to Atomic (structural)
◦ Visualization

Today Bioinformatics research spans several areas, and Agricultural Bioinformatics is growing steadily. With
approximately 10 plant genomes sequenced and more than 100 plant/livestock genomes being sequenced, there
is a need for bioinformatics to be leveraged. Arabidopsis thaliana has been the reference genome for plants and
a widely used model for other higher eukaryotes. Genome Informatics in these areas has resulted in the
development of a host of tools. However, there is a lacuna of research in Protein-Protein Interaction (PPI)
studies in Agriculture, where work so far has been limited to identifying the function of proteins and genes
through Quantitative Trait Loci (QTL) and Qualitative Trait Loci (QuTL).

The three genebanks: A consortium of genomic repertoire


The National Center for Biotechnology Information (NCBI) at Bethesda, USA, the European Molecular
Biology Laboratory (EMBL) based in Heidelberg, Germany and the DNA Data Bank of Japan (DDBJ) are the
three repositories that form a consortium of genome databases and update their entries from time to time.
“GenBank” is the NCBI's trademark database.

Predicting genes in silico


Identifying genes in a long stretch of DNA sequence is a daunting task. The biggest challenge for
bioinformatics is to annotate the human genome. Many programs have been developed to predict the protein
coding regions of a DNA sequence. They all have in common, to varying degrees, the ability to differentiate
between gene features such as exons, introns, splice sites and regulatory sites. Gene prediction methods predict
the coding regions in the query sequence and then annotate the sequence based on gene structure and location.
The central dogma machinery differs between prokaryotes and eukaryotes. In prokaryotes, only simple
regulatory features need to be considered, whereas in eukaryotes prediction is complicated by the presence of
intervening sequences known as introns. However, eukaryotes also offer more regulatory features through
which a gene can be predicted: polyadenylation (poly A) sites, promoters, transcription factor binding sites,
splice sites, alternative splicing and CpG islands, all of which serve as landmarks for the presence of a gene.
Each of these regions has its own sequence features; for example, introns almost always start with GT (the
donor splice site) and end with AG (the acceptor splice site). Likewise, promoters can be identified by the
presence of certain signature elements, viz. the TATA box, CAAT box etc. Open reading frames (ORFs) are
revealed by the presence of a start codon (ATG) and a stop codon (TAG, TGA or TAA). Gene prediction
programs are therefore written to find these features in the given query sequence. The main objective of gene
prediction is to identify the protein coding regions in the given stretch of DNA.
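
The start/stop codon rule described above can be illustrated with a short Python sketch. It is a deliberately naive scan, not how GENSCAN, GRAIL or GeneMark work: it reads only the forward strand, ignores introns and splice sites, and the function name find_orfs, the min_codons parameter and the example sequence are hypothetical choices made for this illustration.

```python
# Naive ORF scan: forward strand only, three reading frames, no introns.
START = "ATG"
STOPS = {"TAG", "TGA", "TAA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs

example = "CCATGGCTTGATTATGAAATAG"   # made-up sequence
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
# prints:
# 1 ATGAAATAG
# 2 ATGGCTTGA
```

A real gene finder would also scan the reverse complement and weigh the other landmarks (promoters, splice sites, codon usage) rather than relying on ORFs alone.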
Gene prediction methods
 Laboratory based approach
 Feature based approach
 Homology based approach
 Statistical and HMM based approach

Laboratory based methods: Experimental procedures for locating genes in new DNA are based on the
following:
1. Identification via hybridization to mRNA or cDNA (Northern blotting: RNA is separated on a gel and
transferred onto a nitrocellulose membrane, where a specific RNA is detected with a labeled molecular
probe).
2. Exon trapping: a molecular biology technique to identify potential exons in a fragment of eukaryotic
DNA of unknown intron-exon structure.

Feature based approach: Typical features include splice sites, promoter regions (e.g. TATA box, CAAT box and
GC box), ORFs, and start and stop codons. The best gene prediction programs tend to be species
specific, trained on examples of known genes in the respective organisms. Other typical features include codon
bias (the tendency of an organism to use certain codons more than others to encode a particular
amino acid), donor/acceptor sites and coding frame length. The key to the analysis of an unknown DNA sequence
is the identification of ORFs. Web based gene recognition systems such as GRAIL, GeneID and GeneParser
work by searching for various features of genes and then reporting those regions which score high enough.
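
Codon bias, as defined above, can be made concrete by counting how often each codon occurs in a coding sequence. The sketch below is a minimal illustration only; the function name codon_usage and the example sequence are hypothetical, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter

def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Made-up sequence: alanine is encoded three times by GCT and once by GCC.
usage = codon_usage("ATGGCTGCTGCCGCTTAA")
print(usage["GCT"])   # 0.5
```

Comparing such frequencies between a candidate region and known genes of the same organism is one way a feature based program can score coding potential.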

Homology based approach


Searching for a known homolog is the most widely understood means of identifying new protein coding genes.
Such searches depend on evolutionary relatedness and are widely applicable. A major advantage of finding a
homologous product is that some of the biology of the gene may already have been elucidated. Homology
searches look for the following:
o Ancient Conserved Regions (ACR)
o Expressed Sequence Tags (ESTs): short sequences derived from mRNA that has been reverse transcribed
into DNA segments called cDNAs by the enzyme reverse transcriptase. These DNA copies are cloned and
maintained as cDNA libraries in bacteria. The cDNAs are sequenced and deposited in EST databases.
The best known EST database is dbEST of NCBI.
o Protein motifs
o Known proteins (based on sequence comparison methods, viz. BLAST and FASTA)
Homology based gene prediction systems find similarities to previously identified coding regions. A different
homology based approach, used to identify totally unknown genes, is to compare two whole genomes (one for
which the genes are predicted) and look for the conserved regions.

Statistical and HMM approach


All genes carry certain grammatical structures. Using a statistical approach based on probability, a
profile is created for prokaryotic and for eukaryotic genes. These profiles are able to detect gene features
in the query DNA sequence. Programs such as GCG (Genetics Computer Group) identify protein coding regions
using statistics of codon usage. The statistical basis for the occurrence of a pattern in DNA is:
All possible combinations = 4^x (x is the number of bases in the pattern).
Probability of finding a given x-mer at a given position = 1/4^x (assuming all four bases are equally likely).
Examples of statistical approaches are HMMs (Hidden Markov Models) and NNs (Neural Networks).
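
The two expressions above can be worked through numerically. The Python sketch below assumes, as a simplification, that the four bases are equally likely and independent; the function name pattern_statistics and the expected-count term are additions for illustration and do not appear in the manual.

```python
def pattern_statistics(x, sequence_length):
    """Combinations, per-position probability and expected hits for an x-mer.

    Assumes equiprobable, independent bases (a simplification).
    """
    combinations = 4 ** x                      # all possible patterns of length x
    probability = 1 / combinations             # chance of one given pattern at one position
    positions = max(sequence_length - x + 1, 0)
    expected_hits = positions * probability    # expected occurrences in the sequence
    return combinations, probability, expected_hits

print(pattern_statistics(3, 1000))   # (64, 0.015625, 15.59375)
```

A codon-sized pattern (x = 3) is therefore expected about 15 to 16 times per kilobase by chance alone, which is why signals like ATG cannot be used on their own to call a gene.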

Gene prediction Tools


 GENSCAN is a widely known and well regarded prediction program. Organism specific versions of
GENSCAN are available for invertebrates (Drosophila) and plants (Maize and Arabidopsis), which help to
predict the percentage of isochores (regions of genomic DNA sequence in which G+C composition is
relatively uniform). The main lacuna of GENSCAN is that it occasionally yields many false positives,
which lowers prediction accuracy, particularly in some non-vertebrates.
 GRAIL (Gene Recognition and Analysis Internet Link) is among the most widely used ORF identification
tools. It analyzes the protein coding potential of a DNA sequence and identifies each potential
exon candidate as an ORF bounded by a pair of acceptor/donor sites. It provides analysis of protein
coding regions, poly A sites and promoters, predicts encoded protein sequences, and provides database
searching facilities. Further versions of GRAIL also exist: GRAIL 1a, GRAIL II and GRAIL-EXP. GRAIL is
species specific and is used for human, Mus musculus, Arabidopsis thaliana and Drosophila melanogaster.
 GeneMark determines the protein-coding potential of a DNA sequence by using species specific
parameters of the HMMs of coding and non-coding regions.

Exercises and To-Dos


 Plant Genome Central: The NCBI's Plant Genome Central (PGC) is a key resource for crop
related information. Identify and analyze the genome of your interest, and consider how Bioinformatics can
be leveraged to handle such huge data resources.
 Explore and compare the three genebanks (HINT: Query a protein or gene of interest. Check the
identities of the queried gene/protein across the consortium databases).
 Identify at least three Plant/Agri-bioinformatics resources. Apply the tools mentioned above and
compare them.

Select References:
 The Plant Genome Central: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
 The EMBL: http://www.embl.de or http://www.ebi.ac.uk/embl/
 The DDBJ: http://www.ddbj.nig.ac.jp/

Section 2: Databases in Bioinformatics
A database is a collection of entries holding useful information. Relational databases store entries as rows and
columns in tables that are linked to one another, so that different entries can be compared. In the recent past,
several bioinformatics databases have been created and used. The interdisciplinary nature of bioinformatics has
enabled the use of a variety of discipline-specific databases apart from the databases housing genomics and
proteomics data. The databases, whether public access or commercial, share certain characteristic features.
The three gene banks (discussed in the earlier section) hold not only nucleotide and protein sequences but also
a gamut of other sequence types, ranging from DNA to ESTs, RNAs etc.
However, there are many other category specific databases, which are as follows:

Protein Sequence Databases


 SwissProt, maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European
Bioinformatics Institute (EBI), is a database of protein sequences that can be queried with SRS (Sequence
Retrieval System) through the ExPASy server.
 Protein Information Resource (PIR) works closely with the Munich Information Center for Protein
Sequences (MIPS) and the Japanese International Protein Information Database (JIPID) on the International
Protein Sequence Database (PSD).

Protein Structural Databases


 Protein Data Bank (PDB) is a repository that stores three-dimensional atomic
coordinates of proteins and nucleic acids, obtained by experimental methods such as
NMR and X-ray crystallography. In the recent past, many modeling studies added theoretical models to the
database. However, structural models built by homology or ab initio methods
are no longer accepted.

 Molecular Modeling Database (MMDB) is an NCBI Entrez database that adds
structure data to Entrez so that the information is easily accessible to biologists, thereby facilitating
comparative analysis involving 3-D structure.

Apart from the aforementioned databases, there are specialized databases meant for various purposes.
For example, organismal databases hold a wealth of information about the genes and proteins of
the organism of interest. A few of the specialized databases are listed below:
 Gramene, a comparative genome mapping database for grasses/cereals
 Rice Genome Research Program (RGP)
 Plant Satellite Repeat Database (PlantSat)
 Mouse Genome Informatics (MGI)
 The Institute for Genomic Research (TIGR)
 PlantQTL-GE, a database system for identifying Quantitative Trait Loci (QTL) candidate genes in rice
and Arabidopsis. The database is being expanded further to a host of other databases.

All these databases store sequences, which are usually presented in standard formats, including the following
(shown along with examples):

 FASTA
>WheatSSR1
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV

 GDE
%5HIB_CAVPO008892|WheatSSR1
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV
 NBRF/PIR (National Biomedical Research Foundation/Protein Information Resource).
>P1;WheatSSR1
Wheat SSR1 QTL integrated.
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV*
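
Because FASTA is the format most tools accept, it helps to see how little is needed to read it. The Python sketch below is a minimal reader written for this manual's example, not a replacement for an established library; the function name parse_fasta is hypothetical, and the record identifier is assumed to be the first word after the ">" sign.

```python
def parse_fasta(text):
    """Map each record identifier to its sequence (wrapped lines are joined)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = []                    # start a new record
        elif name is not None:
            records[name].append(line)            # sequence line of the current record
    return {n: "".join(parts) for n, parts in records.items()}

# Shortened excerpt of the WheatSSR1 example, split over two lines.
fasta_text = ">WheatSSR1\nMAVTQTAQACDLVIFGAKGD\nLARRKLLPSLYQLEKAGQLN\n"
sequences = parse_fasta(fasta_text)
print(len(sequences["WheatSSR1"]))   # 40
```

The GDE and NBRF/PIR layouts differ mainly in the header line (and, for NBRF/PIR, a description line and a terminating asterisk), so the same loop can be adapted to them.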
Pointers

 Two terms are important in supporting databases: annotation and curation.
 Annotation is the expansion of data entries based on context and prototype, whereas curation is the
editing of entries in the context of that annotation. Ideal databases harbor curated entries and do not contain
repeated entries (read: non-redundancy).
◦ Manual annotation is entered into the database by hand, whereas
◦ Automated annotation is derived from context-based or wiki-based information.
 A biologist can build a database from simple Excel or Access tables and query the entries using
Structured Query Language (SQL). In the recent past, however, PHP (PHP: Hypertext Preprocessor) has been
used to handle huge lists of data entries.
 The NCBI LinkOut links an NCBI item to various external databases or repositories. Owing to the huge
size of repositories, cloud computing architecture is being adopted so that huge datasets can be shared in real
time.
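
As a starting point for building and querying a small relational database with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and rows are all invented for illustration; any RDBMS that understands SQL could be used instead.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genes ("
    " id INTEGER PRIMARY KEY,"
    " symbol TEXT NOT NULL,"
    " organism TEXT NOT NULL,"
    " length_bp INTEGER)"
)

# Hypothetical entries; real rows would come from curated annotation.
rows = [
    ("geneA", "Oryza sativa", 1200),
    ("geneB", "Arabidopsis thaliana", 900),
    ("geneC", "Oryza sativa", 2100),
]
conn.executemany(
    "INSERT INTO genes (symbol, organism, length_bp) VALUES (?, ?, ?)", rows
)

# Query: rice genes, longest first.
result = conn.execute(
    "SELECT symbol FROM genes WHERE organism = ? ORDER BY length_bp DESC",
    ("Oryza sativa",),
).fetchall()
print(result)   # [('geneC',), ('geneA',)]
```

The "?" placeholders keep the query text separate from the values, which matters once the values come from user input such as a web form.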

Exercises
 Analyze various bioinformatics databases and understand their intricacies. Make a short list of
fundamental features that an ideal bioinformatics database should have.
 Make a first-hand study of a Relational Database Management System (RDBMS): use SQL/MS Excel
to develop a small database of at least 10 rows and 5 columns. Query the contents using SQL.
 Annotate your database and learn how to link the database you developed through NCBI LinkOut.

Select References
 The NCBI Link Out : http://www.ncbi.nlm.nih.gov/projects/linkout/
 http://bioinformatics.oxfordjournals.org/content/25/12/1475.full

Section 3: Homology and similarity searches
Homology implies similarity, but similarity does NOT imply homology. Similarity is an observed measure
and does not by itself indicate common origin, whereas homology does. In other words, all homologous sequences
are similar, whereas not all similar sequences are homologous. For example, the proteins or genes falling under
the broad class of DNA repair proteins are similar to each other, yet proteins such as MSH1, MSH2 and MLH1
are not all homologous to each other; the isoforms of each of these proteins, however, are homologous to
each other. Several tools are used to establish homology: FASTA (FAST-All) and the Basic Local Alignment
Search Tool (BLAST) are the two best known tools for inferring homology. Figure 2 illustrates homology (Local
Alignment (LA) and Global Alignment (GA)) based on two loci.

Figure 2a: The alignment of locus 2 with locus 1 is used as a standard to find coding regions and
alignments

1. Global Alignment

2. Local alignment

Figure 2b: Global Alignment vs. Local Alignment

Through sequence alignment, one can align two sequences and score the similarities and differences at
each nucleotide or amino acid position. This is known as pairwise alignment. One sequence of the pair could be
a new or unknown sequence, whereas the other would be a sequence whose structure and function are known. An
example is shown below:
Sequence1: EKIUHWTGFRGHCVNMLCIPEIUYTF
Sequence2: EKIUHSTGFRGHCV-MLCIPEIUYTF

The summation of the scores (for similarity, dissimilarity and gap penalties) gives the overall score for a
particular alignment.
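
The summation can be illustrated on the example pair above (with the stray spaces removed). The scoring values used here, +1 for a match, -1 for a mismatch and -2 for a gap, are arbitrary assumptions chosen for the illustration; real tools use substitution matrices and separate gap-open and gap-extend penalties. The function name score_alignment is hypothetical.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Sum column scores of two already-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap          # gap penalty
        elif a == b:
            score += match        # similarity
        else:
            score += mismatch     # dissimilarity
    return score

seq1 = "EKIUHWTGFRGHCVNMLCIPEIUYTF"
seq2 = "EKIUHSTGFRGHCV-MLCIPEIUYTF"
print(score_alignment(seq1, seq2))   # 24 matches, 1 mismatch, 1 gap -> 21
```

Changing the penalties changes which alignment scores best, which is why the scoring scheme should always be reported alongside the score.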

Pointers:

 Homology can be inferred for DNA or protein or occasionally RNA sequences. The sequences are
presented in several formats, viz. raw sequence format, the NCBI format, EMBL format and the most well
known FASTA format (which starts with “>” followed by sequences containing residues or bases).
 The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic
programming and word methods; however, multiple sequence alignment techniques can also align pairs
of sequences.
 The expect value or e-value estimates how many hits with an equal or better score would be expected by
chance in a database of the given size; the lower the e-value, the more significant the match. An e-value less
than 1 is generally considered a good indicator. It is left to the discretion of
the user whether to evaluate the results on the e-value alone.
 Score (read: total score) is based on the number of residues or bases that match in the alignment.
High Score and Total Score are used to interpret the score within and across the target database
sequences.
 Positives (positive scores, indicated as “+” in the alignment) are positions where a base or residue
is replaced by a similar one.
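
The relation between score and e-value can be sketched with the standard bit-score formula E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. The Python snippet below is a simplified illustration: BLAST itself uses effective (edge-corrected) lengths, so its reported values differ somewhat, and the function name and the numbers used are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate e-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)

# A 250-residue query against a database of one million residues.
weak = expect_value(20, 250, 1_000_000)     # low score: many chance hits expected
strong = expect_value(40, 250, 1_000_000)   # higher score: far fewer chance hits
print(weak > 1, strong < 1)   # True True
```

The same score therefore yields a larger e-value against a larger database, which is one reason an e-value threshold alone should not decide whether a hit is meaningful.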
Exercises

Figure 3

You may use Figure 3 as a prototype to answer a few of the exercises below:
 Use BLAST and FASTA to find homologous sequences of interest.
 Can we compare a sequence against multiple sequences (target entries)? HINT: Use BLAT (BLAST-Like
Alignment Tool) at Proweb.org
 What are the applications of homology tools in comparison with sequence alignment? Make a pilot
project with a set of your favorite genes or proteins.
 Does an e-value <1 always mean that all the hits you obtained from the alignment are good?
 Can we infer and identify Single Nucleotide Polymorphisms (SNPs) from an alignment? What other tools
may be used?
 Also familiarize yourself with the various other BLAST options (HINT: Use your favorite genes
against the PGC).
 Use Global Alignment (GA) to compare genomes as a whole. What problems do you face when you
perform GA rather than LA?
 The higher the conservation, the greater the similarity of the sequences. Is this true, and is the converse true?
 After you explore and establish homology for your target sequences, why not compare them
using multiple sequence alignment (MSA)? Use the following:
◦ ClustalW/T-Coffee/ClustalX/Clustal packages
◦ COBALT
◦ Phylip
◦ Compare your results and observe how many positions are fully conserved (“*”), strongly similar (“:”),
semi conserved (“.”) or not conserved (blank). Discuss with your peers and formulate a problem.
◦ From Table 1 below, do you think all the tools that you have used so far would help you to
annotate proteins better?

Table 1: Methods to annotate proteins

Substitution Matrices (SM):


 A substitution matrix (SM) reflects the rate at which one amino acid is replaced by another. Substitutions
are generally evolutionarily significant, especially those arising from mutations.
 Some of the evolutionary models with respect to sequence and structural contexts:
 Alpha helices
 Beta pleated sheets
 Transmembrane structures
 PAM: Percent Accepted Mutations or Point Accepted Mutations
 BLOSUM: BLOcks SUbstitution Matrix
 BLOSUM62
 BLOSUM90

The PAM vs. BLOSUM


PAM matrices: Percent Accepted Mutations or Point Accepted Mutations
 Explicit, i.e. replacements are counted on the branches of a phylogenetic tree
 Based on mutations observed throughout a global alignment
 All mutations are counted the same
 Higher numbers indicate greater evolutionary distance and a higher substitution rate
 Example: PAM150
BLOSUM matrices: BLOcks SUbstitution Matrices
(BLOSUM62 is the most commonly used)
 Implicit
 Based only on highly conserved regions in series of alignments
 Different counts for mutations
 Higher numbers indicate lower evolutionary distance

Section 4: Ontologies to Biowikis
An ontology, a term borrowed from metaphysics, is a formal description of entities and the relations between
them. There have been many studies on ontologies that relate terms and thereby describe a function: the Gene
Ontology is among the best reference ontologies and helps researchers to annotate genes and proteins.
Wikipedia is a well known online encyclopedia. Recently, there has been an enormous development of wikis in
Biology, termed Biowikis, to which terms and references are added from time to time. Wikis with a biological
subject matter are customized for the analysis, presentation and collection of specific biological data types.
An example is wiki.bioinformatics.org

WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. It
presents a new model for pathway databases that enhances and complements ongoing efforts such as KEGG,
Reactome and Pathway Commons, and it invites broader participation through community annotation, with
contributors ranging from students to senior experts in each field adding entries.

Figure 4: Moore's law and the why of Biowikis (Courtesy: Dan Bolser/Broad Institute)

Exercises
 How do Biowikis help the community annotation drive?
 Explore the Protein Data Bank Wiki (PDBWiki), Proteopedia and a host of other tools. Tabulate them and
list the pros and cons of each wiki.
 Use Internet Relay Chat to explore and discuss wikis: irc://irc.freenode.net/#bioinformatics
 Why not start your own Wiki project? (For example: http://wiki.bioclues.org)
Select References

 Brohée S, Barriot R, Moreau Y. (2010) Biological Wikis: combining wikis with databases.
Bioinformatics. 26(17):2210
 The Sequence Wikis: http://www.seqwiki.org

Section 5: Designing primers in silico
Primers are short DNA sequences that initiate the copying of a desired sequence. A PCR cycle runs through
Denaturation-Annealing-Extension, for which a pair of forward and reverse primers is needed for strand
synthesis.
Traits for an ideal primer:
 Melting Temperature (Tm)
 GC content
 Length of the primer
 Specificity
 The intra-primer and inter-primer homology
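
Two of the traits listed above, GC content and melting temperature, are easy to estimate by hand or in code. The Python sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is a rough approximation usually applied only to short oligonucleotides; tools such as Primer3 use more refined thermodynamic models. The function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "ATGCGTACGTTAGCCTAGGA"   # made-up 20-mer
print(len(primer), gc_content(primer), wallace_tm(primer))   # 20 50.0 60
```

Forward and reverse primers are usually chosen so that their estimated Tm values are close to each other, since both must anneal at the same temperature.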
Questions to ponder:
 What are Sequence-Tagged Sites (STS)? How are they helpful in setting up PCRs and designing primers?
 Use Primer3 to design primers in silico
 What is Primer-BLAST?
 What is e-PCR? Discuss
 Are there any predefined primers listed in databases?
 Design a primer for your favorite gene in paper-and-pen mode.
 How do you check for mispriming in a template?
 What are the challenges in designing RT-PCR primers?

Select References:
 Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist
programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in
Molecular Biology. Humana Press, Totowa, NJ, pp 365-386.
 Schuler GD. (1997). Sequence mapping by electronic PCR.
 Rotmistrovsky K, Jang W, Schuler GD. (2004). A web server for performing electronic PCR.
 Thornton B, Basu C. Real-time PCR (qPCR) primer design using free online software. Biochem Mol
Biol Educ. 2011 Mar; 39(2):145-54. doi: 10.1002/bmb.20461.

Section 6: Bioinformatics for evolution
Evolution proceeds through the inheritance of genetic material by progeny from their parents. During the
process, genes may be retained or transferred from parent to offspring. The transfer of genes from parent to
offspring is termed Vertical Gene Transfer (VGT), whereas the transfer of genes from one organism to another
in the course of evolution is called Horizontal Gene Transfer (HGT). HGT may or may not involve
substitutions/deletions/insertions; substitutions are described as synonymous or non-synonymous
(see Figure 5). Phylogeny is used to infer gene transfer and to trace sequences that are “similar” or
“non-similar”.

Figure 5: Flow chart of synonymous and non-synonymous substitutions
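
The distinction between synonymous and non-synonymous substitutions can be checked codon by codon: a substitution is synonymous when the encoded amino acid stays the same. The Python sketch below builds the standard genetic code and classifies a pair of codons. It is only an illustration of the definition and is not how PAML/codeml estimates substitution rates; the function name classify_substitution is hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_substitution(codon_before, codon_after):
    """Return 'identical', 'synonymous' or 'non-synonymous' for two DNA codons."""
    before, after = codon_before.upper(), codon_after.upper()
    if before == after:
        return "identical"
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"        # same amino acid, different codon
    return "non-synonymous"        # amino acid (or stop signal) changes

print(classify_substitution("CTT", "CTC"))   # synonymous (Leu -> Leu)
print(classify_substitution("GAA", "GTA"))   # non-synonymous (Glu -> Val)
```

Counting the two classes over an alignment of coding sequences is the first step towards the rate comparisons that programs such as codeml perform.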

Questions to ponder
 What is Maximum Likelihood (ML)? How does phylogenetic analysis help us to describe the ML for
sequences that have undergone HGT?
 Use PAML/Codeml software to explore and find novel genes from a set of your favorite genes.
 From Figure 6 below, identify and explore the overlapping domains and proteins. Use PAML and Clustal
analyses to work out which sequences are similar. (Also use pen-and-paper mode)

Figure 6
Section 7: Bioinformatics for Microarrays
DNA microarrays are minute spots of DNA immobilized on a solid surface, which make it possible to analyze the
expression levels of many genes simultaneously. Since the advent of robotic spotting technology, very high
density array plates can be produced, so that in principle it is now possible to analyze the expression of all
genes simultaneously. Microarrays are generally made by immobilizing precisely measured quantities of
unlabeled ESTs on a slide of glass or another inert transparent material. In use, these slides are exposed to
labeled (color coded), single-stranded mRNA or cDNA mixtures from the cells/tissue of interest. The array
thus provides a surface on which complementary DNA or RNA molecules hybridize, attaching the
fluorescent molecules to particular spots on the array. Depending on the brightness and color (red/green) of
each spot, the slide is analyzed in a spot acquisition device and represented as a matrix of expression
values. These values are quantitative measurements of the mRNA species which hybridize to each array spot.
Statistical analysis then correlates similarities in expression level and gives an idea of up-regulation and
down-regulation profiles.
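
One common way to turn red/green spot intensities into an up- or down-regulation call is the log2 ratio of the two channels. The Python sketch below assumes that red is the test sample and green the reference, and uses a two-fold change (log2 ratio of 1) as the cut-off; both are assumptions made for this illustration, as are the function names and intensity values. Real analyses normalize the data first and use replicates and statistical tests.

```python
import math

def log2_ratio(red, green):
    """log2 of test (red) over reference (green) intensity for one spot."""
    return math.log2(red / green)

def call_regulation(red, green, threshold=1.0):
    """Label a spot using a log2 fold-change cut-off (default: two-fold)."""
    ratio = log2_ratio(red, green)
    if ratio >= threshold:
        return "up-regulated"
    if ratio <= -threshold:
        return "down-regulated"
    return "unchanged"

# Made-up background-corrected intensities for three spots.
for red, green in [(800, 200), (100, 400), (300, 290)]:
    print(call_regulation(red, green))
# prints: up-regulated, down-regulated, unchanged (one per line)
```

Working on the log scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric, which is why expression matrices are usually stored as log ratios.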

Discussion:
 Bioinformatics for Microarrays
 Introduction and the use of Microarrays for expression analyses
 Gene Expression Data Analysis
 Serial Analysis of Gene Expression (SAGE)
 Image analysis: Statistics
 Normalization and clustering
 Variability and replication
 Gene expression analyses with R and Bioconductor

Questions to ponder

 What is a microarray, and what is the chip all about?
 What do the spots tell us? How does the intensity of a spot relate to the expression level?
 How do we analyze large scale array data?
 Why is normalization necessary?
 What is the difference between clustering and classification?
 When should one choose which clustering method for a dataset?
 How is array data used for annotation?
 What are the classification techniques in machine learning?
 Case study using R-Bioconductor.

Section 8: IPR issues in Biotechnology: Implications and Applications

“IPR allows people to assert ownership rights on the outcomes of their creativity and innovative activity in the
same way that they own physical property. The four main types of intellectual property rights are: patents,
trademarks, design and copyrights.” The protection of IPR may take several forms depending mainly on the
type of intellectual property and the type of protection sought; each form of protection has its own advantages
and pitfalls.

Forms of IPR protection


(1) Trade secrets
(2) Patents
(3) Plant breeder's rights (PBR)
(4) Copyright
(5) Trademarks
(6) Layout Designs of Integrated Circuits
(7) Geographical Indications
(8) Designs for product shape and appearance
(9) Biological Diversity and traditional knowledge

Figure 7. Ideas and IPR inter-relationships

Patents for Inventions:


These are granted for both new and improved products or processes that are capable of industrial application.
A patent is a right given by the national government to the patent holder to use the innovation and to exclude
others from making, using or selling the invention. Patent law centers on the concepts of novelty and
inventiveness. It relates to new products or processes of manufacturing a product.

Trademarks for Product Differentiation:


A trademark identifies a product's origin, its quality and its manufacturer. It prevents others from using the
trademark within the designated territory. The counterfeiting or misuse of a trademark by others without the
permission of the registered trademark holder constitutes infringement of rights and is liable to prosecution.
Trademark law is based on the concepts of distinctiveness and similarity of marks and similarity of goods. A
trademark consists of the word, name, device or get-up used in relation to particular goods to indicate the
source of manufacture or trade origin of the goods.

Copyright for Creative Material:


Copyright is the protection given to the creator of an original work, i.e. literary and artistic material, music,
films, sound recordings and broadcasts, including software, multimedia and computer programs. The owner of
the copyright has the exclusive right to make multiple copies of the work, and others are prevented from making
copies of the copyrighted material. Copyright is based on the concepts of originality and reproduction of the
work in any material form. It relates to original literary, dramatic, musical and artistic works,
cinematographic films and sound recordings.

Designs for Product Shape and Appearance:

Design protection covers the whole or part of a product's appearance to the eye, resulting from the features,
lines, contours, colors, shape, texture and/or materials of the product itself and/or its ornamentation. This
protection does not cover the working or operation of the product. Design law is based on novelty or originality
of a design that has not previously been published in India or any other country. It relates to the non-functional
appearance of a product, which appeals solely to the eye.

Geographical Indications for Place of Origin:


This protection is for goods manufactured or produced in a particular geographical area, where the
characteristics of the goods are due to the conditions (for example, the climate) of that region. A geographical
indication is a sign used on goods that have a specific geographical origin and possess qualities or a reputation
that are due to the place of origin.

Plant Varieties and Farmers' Rights:


This form of protection covers intellectual property rights in plant varieties by granting rights to breeders,
farmers and researchers. The corresponding Act (in India, the Protection of Plant Varieties and Farmers' Rights
Act, 2001) grants rights to both breeders and farmers for new plant and farm varieties.

Miscellany:
 Some well-known legal case studies: turmeric, neem, basmati etc.
 Discussion on the Traditional Knowledge Digital Library (TKDL).
 IP protection in Bioinformatics.

Questions to ponder
 What to patent and what not to patent?
 What makes Open Access more respectable compared to the IPR?
 Does IPR mean that you are void of Open Access?
 Can you patent a (synthetic) gene or protein that is being studied in the laboratory?
 What is semi open access?

Select References
 Practical approach to IPR, Rachna Singh Puri and Arvind Viswanathan, I.K. Int. Pub. House, New Delhi.
 IPR: A primer, Rao and Roa, Eastern Book Company.
 Intellectual property rights and the third world, R.A. Mashelkar, CSIR.
 ftp://ftp.cordis.europa.eu/pub/life/docs/ipr_bioinf.pdf



Section 9: Structural Biology
Three-dimensional (3D) protein structures are of enormous interest for the rational design of different types of
biological experiments. Examples include discovery mapping/structure-based predictions and site-directed
mutagenesis. However, the number of structurally characterized plant proteins is very small; not more than a
couple of thousand are known. Predicting structures from raw sequences is therefore an important step towards
ascertaining function. Homology models help researchers when no experimental 3D structure is available, but
building these models requires specialized programs as well as up-to-date sequence and structure databases.
Efforts to integrate all the required tools into a single web-based workspace have only just begun. For example,
the SWISS-MODEL workspace is used to build protein homology models at different levels of complexity.
Sequence analysis, on the other hand, can provide valuable information about protein structure, function and
evolution, and the evolutionary events and selection processes involved can be examined further. Predicting
these processes with servers such as ConSeq, ConSurf and Selecton has yielded results in the recent past.

A. Secondary structure
1. Chou Fasman method
2. GOR method
3. Neural Network
4. Nearest neighbor

B. Tertiary structure
1. Homology modeling / Comparative modeling
2. Profiling
3. Threading / Fold recognition

Figure 8: Evolutionary conservation on 3D structure of protein using ConSurf and the legend showing how
different methods are employed to predict secondary structures.

Energy minimisation is the key factor for predicting secondary (and also 3°) structures.
1. Chou Fasman method
The Chou-Fasman method (Chou and Fasman 1978) is based on the frequency with which each of the 20 amino
acids occurs in alpha helices, beta sheets and turns. The amino acids Ala, Glu, Leu and Met are strong predictors
of α helices, whereas Proline and Glycine are predictors of a break in a helix. A table of predictive values (Pij)
is made for each amino acid in each class of secondary structure (alpha helices, beta strands and turns). To
produce these values, the frequency with which a given amino acid occurs in a structure class is divided by the
frequency with which all residues occur in that class. The value assigned to each of the 20 amino acids indicates
how likely that amino acid is to be present in each class (helix, sheet or turn). The classes are represented as:
H – Helix, E – Sheet, C – Turn. This method is 50 – 60 % accurate.
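The propensity calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `propensities` and the short annotated sequence are assumptions made for this example (the annotation is hand-made, not taken from a real structure), and the original method derives its table from a set of solved protein structures.

```python
from collections import Counter

def propensities(sequence, states, state):
    """Propensity of each residue type for one structure class (e.g. 'H').

    P = (fraction of this residue type found in the class)
        / (fraction of all residues found in the class)
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = Counter(sequence)
    in_state = Counter(aa for aa, s in zip(sequence, states) if s == state)
    overall = sum(in_state.values()) / len(sequence)
    if overall == 0:
        raise ValueError("the requested state never occurs")
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}

# Toy, hand-made annotation (H = helix, E = sheet, C = turn); not real data.
seq    = "AELMAGPVIVAEGP"
states = "HHHHHCCEEEHHCC"
helix = propensities(seq, states, "H")
```

A value above 1 means the residue is found in the class more often than average, and a value below 1 means it is found there less often. In this toy input Ala and Leu score high for helix while Gly and Pro score zero.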

2. Garnier-Osguthorpe-Robson (GOR)
Garnier et al. (1978) developed this more sophisticated method based on the assumption that the amino acids
flanking a central residue influence the secondary structure that the central residue is likely to adopt. The
Chou-Fasman method, in contrast, is based on the assumption that each amino acid individually influences the
secondary structure within a given range of sequence. The method is known to be 50 – 60 % accurate. It uses a
parameter called the sliding window. If you choose an amino acid X with a sliding window of 8, the method
examines the 8 amino acids on the carboxy-terminal side and the 8 amino acids on the amino-terminal side of X
(a total of 17 amino acids, with X as the central residue).
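The sliding window can be illustrated as follows. This sketch only extracts the 17-residue window around a chosen position; it does not implement the GOR statistics. The function name `window`, the padding character and the example sequence are assumptions made for this illustration.

```python
def window(sequence, center, half_width=8, pad="-"):
    """Return the residues from center - half_width to center + half_width.

    Positions that fall off either end of the sequence are filled with `pad`,
    so the result always has 2 * half_width + 1 characters.
    """
    if not 0 <= center < len(sequence):
        raise IndexError("center is outside the sequence")
    out = []
    for i in range(center - half_width, center + half_width + 1):
        out.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(out)

# Arbitrary example sequence, used only to show the indexing.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
w = window(seq, center=10)   # 8 residues on each side of position 10
edge = window(seq, center=0) # window at the N-terminus is padded on the left
```

Padding at the ends is one possible design choice; it keeps every window the same length so that the terminal residues can be treated like any other position.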

3. Neural Networks
A neural network is a type of artificial intelligence that attempts to imitate the way a human brain works. Rather
than using a digital model, in which all computations manipulate zeros and ones, a neural network works by
creating connections between processing elements, the computer equivalent of neurons. The organization of the
processing elements determines the output. In the neural network approach, computer programs are trained to
recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns
from other patterns not located in these structures. Accuracy is approximately 70 – 75%.

4. Nearest Neighbor method (NN)


This method is a combination of the GOR and neural network methods. A database of 100-400 training sequences
with known protein structure is built. The frequency of the known secondary structure of the middle amino acid
in each matching fragment of the database is used to predict the secondary structure of the middle amino acid in
the query window. It combines a machine learning approach with the sliding window approach. This method is
known to be 75% accurate.
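The idea can be sketched as a simple majority vote among the most similar database fragments. This is a toy illustration under stated assumptions: the similarity measure (number of identical positions), the value of `k` and the six fragments are all hypothetical choices, and published nearest neighbor methods use more refined scoring.

```python
from collections import Counter

def identity(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))

def predict_middle(query, database, k=3):
    """Predict the state of the middle residue of `query`.

    `database` is a list of (fragment, state_of_middle_residue) pairs.
    The k most similar fragments vote and the most frequent state wins.
    """
    ranked = sorted(database, key=lambda item: identity(query, item[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 5-residue fragments; a real database would be far larger.
db = [("AELLK", "H"), ("AEMLK", "H"), ("VIVTV", "E"),
      ("GPGSG", "C"), ("AELMK", "H"), ("VTVIV", "E")]

helix_like = predict_middle("AELLR", db)
sheet_like = predict_middle("VIVSV", db)
```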
Tertiary structure prediction methods
Comparative / Homology modeling
Homology modeling exploits the fact that evolutionarily related proteins with similar sequences have similar
structures. It is based on the notion that new proteins evolve gradually from existing ones by amino acid
substitution, addition and/or deletion (mutation), though 3D structures and functions are often strongly
conserved during this process. For example, two sequences that have just 25-30% sequence identity usually have
the same overall fold. Many proteins share common functions and structures, and there is usually strong sequence
similarity among structurally similar proteins. There are three steps in homology modeling:
(1) Select the “target sequence” of the protein with unknown 3D structure. Identify suitable structural templates
from the known protein structure databases.
(2) Choose a 3D template by virtue of its having the highest sequence identity with the target sequence. The 3D
structure of the template must have been determined by reliable methods such as crystallography or NMR and is
typically an atomic coordinate “PDB” file from the Protein Data Bank.
(3) Align the target sequence to the structural template. This step includes building the backbone from the
alignment, including the regions that are significantly different from the template.

Proteins whose sequences align with more than 25-30% identity typically have homologous structures. Model
accuracy depends on the level of similarity between the unknown protein and the known structure. If the residues
of the newly modeled protein fall in the allowed regions of the Ramachandran plot, the model is said to be an
acceptable one.

Tool for homology modeling: SWISS-MODEL is a ‘biologist friendly’ program. When a sequence is submitted,
it first compares the sequence to the crystallographic database (ExPdb). If it finds any homology between the
query sequence and database structures, it sends back the matching proteins. The matching structure is
superimposed on the carbon backbone of the sequence. The RMSD value must be low for a good match. (RMSD,
the root-mean-square deviation, is the square root of the mean of the squared distances between the
corresponding alpha carbon atoms of the two structures.) The request is then resubmitted to SWISS-MODEL for
modeling. First the backbone is built and then the side chains. The newly modeled protein is then sent by
e-mail. The model is evaluated by drawing a Ramachandran plot. If all the amino acids lie in the allowed
regions, the structure is an acceptable one.
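For N paired alpha carbon atoms with distances d_1 ... d_N between corresponding atoms, RMSD = sqrt((d_1^2 + ... + d_N^2) / N). A minimal sketch of this calculation follows. It assumes that the two structures have already been superimposed and that atoms are paired by index; the function name `rmsd` and the coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superimposed and atoms are paired
    by position in the lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Made-up coordinates: every atom of b is shifted by 1 unit along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shifted = rmsd(a, b)
```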

Fold recognition / Threading:


Threading is a method for the computational prediction of protein structure from protein sequence. The basic
idea is that the target sequence (the protein sequence for which the structure is being predicted) is threaded
through the backbone structures of a collection of template proteins (known as the fold library) and a “goodness
of fit” score calculated for each sequence-structure alignment. This goodness of fit is often derived in terms of an
empirical energy function, based on statistics derived from known protein structures.
Fold recognition methods can be broadly divided into two types:
1. Methods that derive a 1-D profile for each structure in the fold library and align the target sequence to
these profiles.
2. Methods that consider the full 3-D structure of the protein template.
Fold recognition methods are widely used and effective because it is believed that there is a strictly limited
number of different protein folds in nature, mostly as a result of evolution but also due to constraints imposed by
the basic physics and chemistry of polypeptide chains.
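The first type of method can be illustrated with a deliberately simplified scoring scheme. Everything in this sketch is an assumption made for illustration: the 1-D profile is reduced to a string of buried (b) and exposed (e) positions, the score is a plain count of residues that suit their environment, no gaps are allowed, and the fold names are hypothetical. Real fold recognition programs use alignment with gaps and empirical energy functions.

```python
# Residues treated as hydrophobic in this toy example.
HYDROPHOBIC = set("AVLIMFWC")

def fit_score(sequence, profile):
    """Toy goodness-of-fit of a sequence against a 1-D environment profile.

    +1 when a hydrophobic residue sits in a buried position ('b') or a
    non-hydrophobic residue in an exposed position ('e'), otherwise -1.
    """
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have equal length")
    score = 0
    for aa, env in zip(sequence, profile):
        buried_ok = aa in HYDROPHOBIC and env == "b"
        exposed_ok = aa not in HYDROPHOBIC and env == "e"
        score += 1 if (buried_ok or exposed_ok) else -1
    return score

def best_fold(sequence, fold_library):
    """Name of the library fold whose profile fits the sequence best."""
    return max(fold_library,
               key=lambda name: fit_score(sequence, fold_library[name]))

# Hypothetical two-entry fold library.
library = {"fold_A": "bbeebbee", "fold_B": "ebebebeb"}
choice = best_fold("LVKDILSE", library)
```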

Ab initio prediction
Ab initio prediction is carried out when no suitable homologue is found in the database. Prediction is done
completely from the sequence. It is based on Anfinsen’s hypothesis that the native state of the protein represents
the global free energy minimum. The ab initio method tries to find this global minimum for the protein. Finding
the correct native-like protein conformation requires
 An efficient search method for exploring the conformational space to find the energy minima.
 An accurate potential function that calculates the free energy of a given structure.
In order to reduce the complexity, local structure biases are used, but the strength and multiplicity of the local
structure prediction are highly sequence dependent. There are two types of scoring functions, namely
knowledge-based scoring functions and force-field-based functions. Currently there does not exist a reliable
scoring function or search method. However, some of the methods used in CASP4 and CASP5 were the segment
insertion Monte Carlo method in Rosetta, the threading and Monte Carlo method by Friesner, and the lattice
Monte Carlo method by Jeff Skolnick and Andrzej Kolinski, where side chains were used for the lattice model.
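The two requirements listed above, a search method and a potential function, can be illustrated with a generic Metropolis Monte Carlo search. This is a sketch under loud assumptions: `toy_energy` is a made-up quadratic function with its minimum at -60 for every angle, standing in for a real potential, and the search moves one angle at a time. It is not the Rosetta procedure or any of the other CASP methods named above.

```python
import math
import random

def toy_energy(angles):
    """Hypothetical stand-in for a potential function (minimum 0 at -60)."""
    return sum((a + 60.0) ** 2 for a in angles) / 100.0

def metropolis_search(energy, start, steps=5000, step_size=10.0,
                      temperature=1.0, seed=0):
    """Metropolis Monte Carlo: accept downhill moves always, and uphill
    moves with probability exp(-delta / temperature)."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        delta = e_trial - e_current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best

start = [120.0, -150.0, 30.0]
best, e_best = metropolis_search(toy_energy, start)
```

Accepting some uphill moves is what lets the search escape local minima. On a real energy landscape with many minima, the choice of temperature and move set matters far more than it does for this single-minimum toy function.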

Widely known software for structure prediction and visualization


o Secondary structure prediction: NNPredict and PredictProtein
o Tertiary structure prediction: Swiss-PdbViewer (homology modeling), Modeller and WHAT IF
o Visualizers: MAGE, RasMol, Cn3D, ChemDraw & Chem3D and Jmol

Exercises
 Use NCBI Cn3D and SWISS-MODEL to explore predicting structures for your proteins.
 Use NCBI BLAST and analyze your sequences using the structure (PDB) database as the target.
Discuss the intricacies and problems with the instructor.
 Use ConSeq, ConSurf and Selecton.
Introduction to cheminformatics
What is cheminformatics?
The term cheminformatics, also known as chemical informatics, was coined by F.K. Brown in 1998 (Brown F,
2005). It can be defined as the in silico study of chemistry, which has vast applications in drug discovery in the
pharmaceutical industry.

How is cheminformatics different?


It addresses four major problems:
 storing a molecule
 finding an exact molecule
 substructure search
 similarity search
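The four problems can be illustrated with a toy molecule store. This sketch makes strong simplifying assumptions: each molecule is represented only by a hand-written set of fragment labels, the function names are hypothetical, and the subset test is merely a crude screen. Real cheminformatics toolkits work on molecular graphs and computed fingerprints. The Tanimoto coefficient used for similarity is the number of shared features divided by the number of features present in either molecule.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Store: molecule name -> hand-written, hypothetical fragment labels.
library = {
    "ethanol":     {"C-C", "C-O", "O-H"},
    "acetic acid": {"C-C", "C-O", "C=O", "O-H"},
    "ethane":      {"C-C"},
}

def find_exact(features):
    """Exact search: molecules whose feature set equals the query."""
    return sorted(n for n, f in library.items() if f == features)

def substructure_screen(query):
    """Crude substructure screen: molecules containing all query features."""
    return sorted(n for n, f in library.items() if query <= f)

def most_similar(query):
    """Similarity search: the stored molecule with the highest Tanimoto score."""
    return max(library, key=lambda n: tanimoto(query, library[n]))

score = tanimoto(library["ethanol"], library["acetic acid"])
```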

Molecular Modelling (MM)


Molecular modelling can be defined as the set of all theoretical and computational techniques used to model the
behavior of molecules. Modelling can reduce the complexity of the system, allowing many atoms to be considered
during simulations.

Applications of MM
Molecular modelling methods are used widely to investigate the structure, dynamics, surface properties and
thermodynamics of inorganic, biological and polymeric systems and biological activities such as protein folding,
enzyme catalysis, protein stability, conformational changes associated with biomolecular function, molecular
recognition of proteins, DNA, and membrane complexes (Leach A. R., 2001).

Docking
Docking in molecular modelling is a method that predicts the preferred orientation of one molecule relative to a
second when the two are bound to each other to form a stable complex.

Applications of Docking
A binding interaction between a small-molecule ligand and an enzyme may result in activation or inhibition of
the enzyme. Docking is most commonly used in the field of drug design, since most drugs are small organic
molecules: docking methods are used to identify potential drug molecules that are likely to bind to a protein
target of interest. Docking is also used in bioremediation, where protein-ligand docking can predict pollutants
that can be degraded by enzymes (Suresh PS et al., 2008).



Table : Various tools used for molecular design and modeling. Courtesy: Wiki

Select References
 Arnold K., Bordoli L., Kopp J. and Schwede T. 2006. The SWISS-MODEL workspace: a web-based
environment for protein structure homology modelling. Bioinformatics. 22(2): 195-201.
 Berezin C., Glaser F., Rosenberg Y., Paz I., Pupko T., Fariselli P., Casadio R. and Ben-Tal N. 2004.
ConSeq: the identification of functionally and structurally important residues in protein sequences.
Bioinformatics. 20: 1322-1324.
 Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. 2010. ConSurf 2010: calculating evolutionary
conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res.
 Doron-Faigenboim A., Stern A., Mayrose I., Bacharach E. and Pupko T. 2005. Selecton: a server for
detecting evolutionary forces at a single amino-acid site. Bioinformatics. 21(9): 2101-2103.
 Brown F. 2005. Editorial opinion: Chemoinformatics – a ten year update. Current Opinion in Drug
Discovery & Development. 8(3): 296-302.
 Leach A. R. 2001. Molecular Modelling: Principles and Applications.
 Suresh PS, Kumar A, Kumar R and Singh VP. 2008. An in silico approach to bioremediation: laccase as a
case study. J. Mol. Graph. Model. 26(5): 845-9.



Section 10: Using Omics data integration for Plant research
Please refer to the presentations for detailed notes.
The Plant Mitochondriomics
Mitochondria in plants, as in other eukaryotes, play an essential role in the cell as the major producers of ATP
via oxidative phosphorylation. Mitochondria also play crucial roles in many other aspects of plant development
and performance. They possess an array of unique properties that allow them to interact with the specialized
features of plant cellular metabolism. In the recent past, plant mitochondriomics has attracted interest around
several themes. Two of them are major topics for discussion: how gene function and protein function regulate
each other, and how mitochondria are integrated with the other components of the plant cell.

Things to ponder
 Overview of the dynamics of mitochondrial structure, morphology and inheritance.
 Biogenesis of mitochondria
 Regulation of gene expression
 The mitochondrial genome and its interaction with the nucleus, and the targeting of proteins to the
organelle: Any specific targeting signals?
 What if the signal peptides are present in the proteins?
 What if the N-terminal mitochondrial targeting peptide is truncated? Can the protein still localize to
mitochondria?
 How do mitochondria contribute to mutations?
 Could we understand in silico the way the organelle interacts with the rest of the plant cell? Is there any
visualizer meant for this?
 How does the field of proteomics help in the discovery of new functions? What is known about the pathways
of electron transport bypass, metabolite transport and specialized mitochondrial metabolism?

Evolution of Mitochondria and their Gene Rearrangements in Plants


With the advancements in sequencing technologies, a deluge of biological sequence data is being generated.
Complete genome sequence information and comparative genomics allow us to study how gene locations
evolve. Adaptive evolution of genes and genomes is ultimately responsible for adaptation in morphology,
behavior and physiology, and for species divergence and evolutionary innovations. Genes and genomes are the
product of complex processes of evolution, influenced by mutation, random drift and natural selection. The
inference of genome rearrangement events such as duplication, inversion and translocation is crucial in multiple
genome comparisons. Gene rearrangements are considered to be rare evolutionary events. The existence of a
shared derived gene order between taxa is often indicative of common ancestry. The success of mitochondrial
DNA in molecular systematics has led to interest in characteristics such as maternal inheritance, rapid rate of
evolution and haploid nature. Different parts of the mitochondrial genome with different functional constraints
are expected to evolve at different rates. Thus, comparative mitochondrial genomics promises to offer a
comprehensive study of distinct patterns and processes of molecular evolution.
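One simple way to compare gene orders is to count breakpoints, i.e. pairs of genes that are adjacent in one genome but not in the other. The sketch below is an illustration under stated assumptions: the genome is treated as linear and gene orientation is ignored (mitochondrial genomes are usually circular and genes are stranded), and the two gene orders are hypothetical arrangements of real mitochondrial gene names, not data from any organism.

```python
def adjacencies(order):
    """Set of unordered neighbouring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def breakpoints(order_a, order_b):
    """Number of adjacent gene pairs in order_a that are not adjacent in order_b."""
    if set(order_a) != set(order_b):
        raise ValueError("both genomes must contain the same genes")
    return len(adjacencies(order_a) - adjacencies(order_b))

# Hypothetical gene orders: genome_b differs from genome_a by one inversion
# of the segment cox2-atp6-nad1.
genome_a = ["cox1", "cox2", "atp6", "nad1", "cob"]
genome_b = ["cox1", "nad1", "atp6", "cox2", "cob"]
n_breaks = breakpoints(genome_a, genome_b)
```

A single inversion creates two breakpoints, one at each end of the inverted segment, which is why shared breakpoints between taxa can serve as markers of common ancestry.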

Figure 9. Origin of the mitochondrial genome: The endosymbiosis theory
(Figure adapted from Molecular Cell Biology textbook, Courtesy: Google)
Exercises
 Consider a eukaryote of your interest.
o Try to find the mitochondrial protein repertoire of that organism (HINT: Use the University of
Montreal, Canada Mito database).
o Find some important proteins that interest you and check for bacterial proteins similar to these
mitochondrial proteins (HINT: They might have undergone HGT, as proposed by the endosymbiotic theory ~
the mitochondrial proteome has an origin that can be traced back to the bacterial endosymbiont).
Select References:
 http://bioenergy.asu.edu/ (Constitutes all repositories of Plant mitochondria)
 Ian Moeller. Plant mitochondria and oxidative stress: electron transport, NADPH turnover, and
metabolism of reactive oxygen species. Annu. Rev. Plant Physiol. Plant Mol. Biol. 2001. 52:561–91



Section 11: Protein-Protein Interactions (PPI)
Please refer to the presentations for detailed notes.
Protein interactions using Predictome and Interolog mapping
a. Predictome
Predictome is a database of predicted protein interactions that combines three computational methods
(chromosomal proximity, phylogenetic profiling and domain fusion) with data from large-scale experimental
screens of protein-protein interactions. The need for Predictome arose because putative links that predict gene
function across all organisms were not documented; if available, such links would be most useful for linking
orthologous sets of proteins. Besides providing functional relationships among proteins supported by wet-lab
experiments such as Y2H and CoIP, the database can be visualized on the web through VisANT (Visual Analysis
Tool). However, Predictome has the disadvantage that it does not host interactions for all sequences, including
those from many ongoing sequencing projects, e.g. Macaca mulatta (Rhesus monkey).
b. Interolog mapping
Two pairs of proteins are considered interologs if the orthologs in the target organism interact in the same way
as the corresponding proteins in the source organism. Through interolog mapping, developed by Yu H et al. in
2002, interactions can be transferred from one organism to another. Yu H et al. used the Best-Match Mapping
method (Matthews LR et al. 2001) as well as Reciprocal Best-Match Mapping, which is considered a more
stringent method for mapping interologs.
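The transfer step of interolog mapping can be sketched as a lookup through an ortholog table. The function name `map_interologs` and all protein identifiers below are hypothetical, and the ortholog table is assumed to be one-to-one (for example, the output of a reciprocal best-match search). This is an illustration of the idea rather than the published procedure.

```python
def map_interologs(source_interactions, orthologs):
    """Transfer interactions from a source organism to a target organism.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs: dict mapping a source protein to its target ortholog.
    Returns the set of predicted target pairs, each stored as a sorted tuple.
    """
    predicted = set()
    for a, b in source_interactions:
        # An interaction is transferred only if both partners have an ortholog.
        if a in orthologs and b in orthologs:
            predicted.add(tuple(sorted((orthologs[a], orthologs[b]))))
    return predicted

# Hypothetical identifiers: "y" = source organism, "h" = target organism.
interactions = [("yA", "yB"), ("yB", "yC"), ("yC", "yD")]
ortholog_table = {"yA": "hA", "yB": "hB", "yC": "hC"}
predicted = map_interologs(interactions, ortholog_table)
```

The pair involving yD is dropped because yD has no ortholog in the table. Predictions of this kind remain hypotheses until they are validated experimentally.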
Things to ponder
 Whither systems biology? Why PPIs? What else does systems biology involve?
 What has made systems biology distinct from bioinformatics?
 Would PPIs bring out a function for umpteen orphan genes?
 What are the types of interaction data and their layouts?
 What are the online tools for analyzing networks?
 How good are the high-throughput methods employed to measure interactions?
 Webwatch: How to build a biological model?
o Ekat Kritikou : http://www.nature.com/nrm/journal/v8/n6/full/nrm2186.html
 How are interactions validated?

Exercises
 A major problem in managing numerous proteins is not the amount of data but the way we organize it
(~complexity). Do you agree? Answer with your comments and suggestions on how to tackle this, keeping PPIs
in view.

 If you have different cell cycle products, would you be able to accommodate them in your network? If so,
what would your network look like?
 Your protein A is known to interact with another protein B. Which series of steps from the following
would you use to confirm its candidature?
o Data validation from integrated sources
o Protein-protein interaction assay (pull down), in vivo
o Protein localization studies in silico
o Cleavage sites, if any
o A simple query from STRING, EMBL/Gene cards/iHOP
 Exercises covering all web interfaces, tools and Osprey as a visualization tool.



Section 12: Functional genomics and proteomics
Subcellular localization determines the protein function
To elucidate the role of a protein in a cell, determining its subcellular localization is an important step, as
proteins are organized according to their function (Dreger M, 2003). However, determining the exact location of
proteins in cells is beset by several difficulties. From preparing pure organelles to understanding the role of the
proteins in them, a vast amount of information, knowledge and effort is needed, because confident localization
of a protein requires contaminant-free organelle preparations, which are difficult to obtain for endomembrane
systems. The problem arises because these organelles have similar masses and the proteins harbored in their
membranes continuously cycle between the compartments. One solution to this problem is the use of analytical
rather than preparative centrifugation. Among the centrifugation/fractionation techniques, analytical
centrifugation is a well established method for assigning proteins to subcellular compartments that have eluded
purification; preparative centrifugation, in contrast, is based on the analysis of single organelle-enriched
fractions (de Duve C, 1971; Dunphy WG and Rothman JE, 1983). Recently, Dunkley et al. (2004) described a
proteomics method for determining the subcellular localization of membrane proteins in which the organelles are
partially separated by centrifugation in self-generating density gradients. Proteins from the same organelle
co-fractionate and so show similar distributions in the gradient. The localization of novel proteins is then
determined using multivariate data analysis techniques to match their distributions to those of proteins that are
known to reside in specific organelles. Dunkley et al. were able to demonstrate localization to both the ER and
the Golgi apparatus in Arabidopsis thaliana. This method, abbreviated LOPIT (Localization of Organelle
Proteins by Isotope Tagging), is a new tool for high-throughput protein localization. Such high-throughput
localization opens up the possibility of studying a wide range of research areas, including organelle function and
protein trafficking.
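The matching step can be illustrated by assigning a protein to the organelle whose marker proteins have the most similar gradient distribution. Note that this sketch swaps in a plain nearest-centroid rule with Euclidean distance as a simple stand-in for the multivariate techniques mentioned above. The function names and all profile values (relative abundance in four gradient fractions) are invented for illustration.

```python
import math

def centroid(profiles):
    """Mean profile of a list of equal-length abundance profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def assign_organelle(profile, markers):
    """Assign a profile to the organelle with the closest marker centroid.

    markers: dict mapping organelle name -> list of profiles of proteins
    known to reside in that organelle.
    """
    centroids = {org: centroid(p) for org, p in markers.items()}
    return min(centroids, key=lambda org: math.dist(profile, centroids[org]))

# Invented distributions over four gradient fractions (not measured data).
markers = {
    "ER":    [[0.1, 0.2, 0.5, 0.2], [0.1, 0.3, 0.4, 0.2]],
    "Golgi": [[0.5, 0.3, 0.1, 0.1], [0.4, 0.4, 0.1, 0.1]],
}
novel_1 = assign_organelle([0.45, 0.35, 0.10, 0.10], markers)
novel_2 = assign_organelle([0.10, 0.25, 0.45, 0.20], markers)
```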

Protein localization and functions have been known to be predicted in silico


The birth of bioinformatics has entailed the creation and advancement of predictors, besides umpteen databases,
algorithms, and computational and statistical techniques, in many areas of biology. Several predictors of
subcellular location (discussed in the subsequent section) have been developed, but the prediction tools are not
always reliable, which makes prediction difficult. SignalP, a predictor based on cleavage sites (Nielsen H et al.
1997), finds the signal peptides that allow the protein containing them to localise to the organelle. Furthermore,
if a hypothetical protein is predicted to be localized to the mitochondria, it is likely that the corresponding
expressed protein would be localized to this organelle, even though it may still be the product of a pseudogene.
Although several methods provide identification of proteins, in silico functional analyses using Gene Ontology,
InterPro motifs, SMART, KEGG pathways, BioCarta pathways, Swiss-Prot etc. annotate genes and proteins in
varying ways, which makes functional annotation complex and rigorous. The analysis should always be carried
out with a series of integrated methods, and back-to-back cross-checking through reciprocal/one-to-one BLAST
hits is always recommended. Furthermore, the functional annotation of a protein is attributed to the experimental
evidence or high-throughput methods to which the protein is linked. Although several in silico approaches (such
as Predictome, discussed in previous sections: Mellor JC et al. 2002) have addressed the problems of annotation,
there is a lack of information on hypothetical proteins targeted to various organelles by these approaches. The
information from the various methods discussed in the subsequent sections allows comparison of the proteins
targeted to various organelles and a better understanding of pathway information.
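The reciprocal cross-check mentioned above can be sketched as follows. The sketch assumes that the best hit of every protein has already been obtained in both directions (for example, from two BLAST searches) and stored in two dictionaries; the function name and the protein identifiers are hypothetical, and no BLAST output is parsed here.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b.

    best_a_to_b: dict mapping each protein of genome A to its best hit in B.
    best_b_to_a: dict mapping each protein of genome B to its best hit in A.
    """
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)

# Hypothetical best-hit tables for two genomes.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a2", "b3": "a3"}
pairs = reciprocal_best_hits(a_to_b, b_to_a)
```

Here a3 is excluded because its best hit b2 points back to a2, not to a3. Requiring agreement in both directions is stricter than a one-way best hit and reduces the chance of transferring an annotation between proteins that are merely similar.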

Most proteins are transported to various organelles in a cell. The eleven main organelles in eukaryotic cells,
viz. cytoplasm, nucleus, ER, ribosome, Golgi, mitochondria, chloroplast, centriole, vacuole, vesicles and
lysosomes, are localization sites for proteins, which are imported and exported to yield different modes of
function. The majority of proteins, though compartmentalised in the cytosol, are localized across cytosolic
compartments, viz. mitochondria, endoplasmic reticulum, lysosomes and Golgi complex. The proteins encoded by
the mitochondrial genome and those targeted to mitochondria are of particular interest, as they help researchers
understand the mitochondrial proteome better (Calvo S et al. 2006). Protein localization is facilitated by specific
targeting peptides. There are two types of targeting peptides: internal targeting signals and presequences. While
presequences are usually located at the N-terminal end, internal targeting signals can be distributed throughout
the protein. There are also precursor proteins that possess either an N-terminal presequence or internal targeting
signals, or simply mitochondrial/matrix targeting sequences (MTS). These sequences are specific to
mitochondria, hence the name. The N-terminal presequences are enriched in Arg, Ser and Ala residues and are
recognised by different import receptors. They generally have a length of 6-85 amino acid residues and rarely
contain negatively charged amino acids. After import into mitochondria, presequences are removed by
proteolysis (Bolender N et al. 2008).

The last decade has seen several tools and predictors developed to find the proteins localized to mitochondria.
Different tools rely on different methods. Notable among them are TargetP, based on N-terminal sequences
(Emanuelsson O et al. 2000); Mitopred, based on Pfam domains (Guda C et al. 2004); Mitoprot, which calculates
the N-terminal protein region that can support a mitochondrial/matrix targeting sequence (MTS) and the cleavage
site (Claros MG et al., 1996); and Predotar, which predicts N-terminal mitochondrial, plastid and ER targeting
sequences (Small I et al. 2004). Another tool, pTarget (Guda C, 2006), uses heuristics, i.e. a problem-solving
method based on plausible hypotheses, to screen putative Pfam domains. The screening relates to a specific
cellular localization but not necessarily to complete targeting signals (Guda C, 2006). The occurrence patterns of
protein functional domains and the amino acid compositional differences in proteins are checked. TargetP, on
the other hand, is a less heuristic method: its mitochondrial subcellular localization is based solely on
mitochondria-specific presequences. The presequences do not necessarily require cis- or trans-acting domains in
order to be fully functional mitochondrial targeting signals. All the above-mentioned predictors essentially
predict whether a protein is mitochondrial.

With the recent systematic identification of human mitochondrial disease genes (Pagliarini DJ et al. 2008), there
is potential scope for some of them to be candidate genes for rare disorders and diseases such as
cardiomyopathy. One such database, Mitocarta, includes experimental data obtained from highly purified
mitochondria from human heart tissue, together with the predictions performed by Mitopred (Guda C et al.
2004), a genome-scale method for the prediction of nuclear-encoded mitochondrial proteins. Mitochondrial
protein sequences from different sources have been clustered to generate a non-redundant dataset. For this
dataset, annotations related to protein function, structure, disease association and pathways are collected from a
number of publicly available databases.
S.No | Predictor | Method (brief description) and URL | Reference
1 | BPROMPT | Bayesian Prediction Of Membrane Protein Topology: a consensus server for membrane protein prediction: http://www.darrenflower.info/bprompt/ | Taylor et al. 2003
2 | ChloroP | Predicts the presence of chloroplast transit peptides: http://www.cbs.dtu.dk/services/ChloroP/ | Emanuelsson et al. 1999
3 | CoupleLoc | Combines residue-couple model and SVM: http://www.bioinfo.tsinghua.edu.cn/CoupleLoc/ | Guo J et al. 2006
4 | HMMTOP | Prediction of transmembrane helices and topology of proteins: http://www.enzim.hu/hmmtop/ | Tusnady GE and Simon I, 1998
5 | Mitoprot | Predictor specific to mitochondrial sequences based on N-terminal regions: http://ihg2.helmholtz-muenchen.de/ihg/mitoprot.html | Claros MG and Vincens P, 1996
6 | pTARGET | Based on the occurrence patterns of protein functional domains and the amino acid compositional differences: http://bioapps.rit.albany.edu/pTARGET/ | Guda C, 2006
7 | pSLIP | SVM based using multiple physicochemical properties: http://pslip.bii.a-star.edu.sg/ | Sarda D et al. 2005
8 | P2SL | Implicit motif distribution based hybrid computational kernel: http://www.i-cancer.org/p2sl/ | Atalay V and Cetin-Atalay R, 2005
9 | PSLpred | An SVM-based method for prokaryotic proteins: http://www.imtech.res.in/raghava/pslpred/ | Bhasin M et al. 2005
10 | PPROWLER | Detecting residues in targeting peptides: http://pprowler.imb.uq.edu.au/references.jsp | Boden M and Hawkins J, 2005
11 | PrediSi | Prediction of signal peptides and their cleavage positions: http://www.predisi.de/ | Hiller K et al. 2004
12 | PA-Sub | Proteome Analyst Specialized Subcellular Localization Server: http://webdocs.cs.ualberta.ca/~bioinfo/PA/Sub/ | Lu Z et al., 2004
13 | Predotar | A predictor used to identify N-terminal targeting sequences: http://urgi.versailles.inra.fr/predotar/predotar.html | Small I et al. 2004
14 | SubLoc | SVM prediction system based on amino acid composition alone: http://www.bioinfo.tsinghua.edu.cn/SubLoc/eu_predict.htm | Hua S and Sun Z, 2002
15 | TargetP | Prediction for eukaryotes based on N-terminal signal sequences: http://www.cbs.dtu.dk/services/TargetP/ | Emanuelsson et al. 2000
16 | TMHMM | Prediction of transmembrane helices in proteins: http://www.cbs.dtu.dk/services/TMHMM/ | Krogh A et al. 2001

Table 2. Overview of some of the important and highly cited subcellular localization prediction programs. Most
of the programs listed above work for eukaryotic organisms.

Databases that house proteins localized to organelles


Proteins whose subcellular location is known, whether from experiment or by prediction, are stored in databases.
A few databases in use are:
• eSLDB hosts experimental annotations derived from primary protein databases, homology-based annotations
and computational predictions (Pierleoni A et al. 2006).
• DBSubLoc contains proteins from the primary protein database SWISS-PROT and the Protein Information
Resource (PIR) (Guo T et al. 2004).
• OrganelleDB is a compilation of protein localization data from eukaryotes, especially yeast. The
catalog includes more than 50 organelles, subcellular structures and protein complexes across 138
organisms, with emphasis on the major model systems (Wiwatwattana N and Kumar A, 2005).

Advances in protein subcellular localization have allowed researchers to develop predictors based on N-terminal
regions, on Pfam, or on characteristic matrix targeting signals. Efforts have also been made to develop a
visualizer, Organelle View, a tool intended for visualizing the subcellular location of proteins
(http://organelleview.lsi.umich.edu/). However, subcellular location alone does not suffice to annotate a
protein; other bioinformatics approaches can help reveal protein function. One of the other aims
of this thesis is to show whether or not hypothetical proteins can be reliably identified in silico using the
presence of subcellular targeting signals. Does the presence of characterized protein domains help in
understanding protein function?

Section 13: Molecular Markers: Potential Tools for Improvement
Please refer to the presentations for detailed notes.

The advent of molecular markers has revolutionized plant biotechnology. New developments in
DNA marker technologies have made it possible to detect large numbers of genetic polymorphisms at the
DNA level. These can be used as markers for evaluating the genetic basis of the observed phenotypic
variability. Molecular markers, viz. RFLP (restriction fragment length polymorphism), AFLP (amplified
fragment length polymorphism), RAPD (random amplified polymorphic DNA), ISSR (inter simple sequence
repeat), SSCP (single-strand conformation polymorphism), mini- or microsatellites and SNPs (single
nucleotide polymorphisms), have been used intensively in crop improvement. These DNA-based markers can
be used for sex identification, DNA fingerprinting, cultivar identification, and the study of genome variability,
genetic diversity and relatedness, gene mapping, phylogenetic relationships and marker-assisted selection of
desirable genotypes. The marker types covered are:

• Amplified Fragment Length Polymorphism (AFLP)
• Restriction Fragment Length Polymorphism (RFLP)
• Random Amplified Polymorphic DNA (RAPD)
• Single Nucleotide Polymorphism (SNP, pronounced "snip")

Section 14: Applications of Support Vector Machines (SVM) in Chemo- and Bioinformatics
Recent developments in genomic and post-genomic research have generated a large amount of biological data.
This data is growing exponentially with the advancement of research technologies. In order to handle such a
large amount of data, there is an increasing need for computational methods that can efficiently store, organize
and interpret the data. Bioinformatics is an interdisciplinary science which uses information technology to solve
biological problems. The field of bioinformatics mainly deals with:
• The development of databases that store, manage and provide easy access to vast amounts of biological
data.
• The development of novel algorithms to solve biological problems such as protein
structure and function identification, locating a gene within a sequence and grouping protein sequences
into families.

A particularly active area of research in bioinformatics is the application of machine learning tools to extract
important and useful information from a large pool of biological data. Machine learning algorithms are built
to recognize complex patterns and then make decisions based on the data. For solving classification
problems, machine learning techniques first obtain information from a set of
already classified samples (training set) and then use this information to classify unknown samples (test set).

Support Vector Machine (SVM)


Support Vector Machine (SVM) is one such machine learning technique, used to carry out classification
and regression of data. SVM is rigorously based on statistical learning theory. For binary classification problems,
SVM employs a maximum margin hyperplane to separate examples belonging to two different classes. For
problems which cannot be separated by a linear hyperplane, SVM first transforms the data into a higher
dimensional feature space and subsequently employs a maximum margin linear hyperplane there. To avoid the
intractability associated with the higher dimensions, appropriate kernel functions are
used, which enable all calculations to be carried out in the input space. As SVM training can be formulated as a
convex quadratic optimization problem, which guarantees a unique global solution, SVM usage has grown
explosively in different fields of science and engineering.
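
As a small illustration of the kernel idea described above, the sketch below checks numerically that a degree-2 polynomial kernel, evaluated directly on two-dimensional inputs, equals an ordinary dot product in an explicitly constructed three-dimensional feature space. The function names (poly2_kernel, phi, rbf_kernel) and the sample vectors are illustrative assumptions and do not come from any SVM package.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, computed in the input space."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel (gamma is an assumed example value)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = (1.0, 2.0), (3.0, -1.0)      # hypothetical input vectors
print(poly2_kernel(x, y))           # kernel value from the 2-D inputs
print(dot(phi(x), phi(y)))          # same value (up to rounding) from the 3-D feature space
```

The kernel never builds phi(x) explicitly, which is why the calculation stays tractable even when the feature space is very large.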

SUPPORT VECTOR MACHINES FOR CLASSIFICATION


Support Vector Machine (SVM) classifiers are a set of universal feed-forward network based classification
algorithms that have been formulated from statistical learning theory and the structural risk minimization
principle developed by Vapnik (1995). This principle is based on the fact that the error rate of a learning
machine on test data (i.e., the generalization error rate) is bounded by the sum of the training error rate and a
term that depends on the Vapnik-Chervonenkis (VC) dimension; in the case of separable patterns, a support vector
machine produces a value of zero for the first term and minimizes the second term. It provides a quantitative
means of comparing the capacities of different classifiers. The theoretical framework provides a rational link
between the empirical performance of a learning algorithm when trained from a finite data sample, and the ‘true’
performance when used in practice. For non-linear and non-separable problems, support vector methodology
provides a decision space with a minimal VC dimension and training error so that the classifier has a low
probability of generalization errors.
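
For reference, the bound described above is commonly written in the following form (after Vapnik, 1995). The symbols are notation introduced here and do not appear in the text: n is the number of training samples, h is the VC dimension of the classifier family, R_emp is the training error rate, R is the generalization error rate, and the inequality holds with probability at least 1 - η.

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

The second term grows with h and shrinks with n, which is why a classifier with a smaller VC dimension is preferred when two classifiers fit the training data equally well.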

IMPORTANT PROBLEMS OF SVM IN BIOINFORMATICS


This section explains a few of the problems in bioinformatics. Many important machine learning algorithms,
including SVM, have already been applied to these problems. It is not possible to cover every
problem in bioinformatics in a single tutorial; instead, we focus on the problems most relevant to the current
context.

1. Protein Localization
One of the main tasks of proteomics is the assignment of functionalities to sequenced proteins. The assignment
of a function for a given protein has proved to be especially difficult where no clear homology to proteins of
known function exists. One field of proteomics that has recently received a lot of attention is protein localization.
Protein expression analysis can indicate whether proteins are expressed, but it is also important to know where
proteins are expressed, and where they go over time. Knowing the sub-cellular location that a protein resides in
may give important insights as to its possible function. Even when the basic function of a protein is known,
knowing its location in the cell may give insights as to which pathway an enzyme is part of. There is an
increasing shift away from general protein expression analysis and toward mapping protein distribution, relative
abundance, tissue specificity, and movement. By tracking these parameters (in healthy versus diseased tissue and
in control versus treated tissue), researchers can gain a greater understanding of these proteins' functions and
determine which are likely to be the best drug targets.

1.1 Prediction of protein localization


Sub-cellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction
system for protein sub-cellular localization would be very useful.
Two types of prediction methods have been developed, viz., based on the recognition of protein N-terminal
sorting signals and based on amino-acid composition. In both cases protein sub-cellular localization is seen as a
multi-class classification problem.

Prediction of localization by Signal


Many newly synthesized proteins carry a tag (signal peptide) that tells the cell whether the protein is to be sent
out of the cell or to a particular compartment within it. By comparing tags from known proteins, we can predict
where an unknown protein will be located. Supervised learning methods (neural networks, support vector machines)
have been used for eukaryotic species to discriminate between proteins destined for the mitochondrion (mTP), the
chloroplast (cTP), the secretory pathway (SP), and other localizations on the basis of N-terminal sorting sequence
information. The reliability of these predictive methods is strongly dependent on the quality of the protein
N-terminal assignment. The methods are inaccurate when the signals are missing or only partially included.

Prediction of localization by Composition


It has been shown that intracellular and extracellular proteins differ significantly in their amino acid
composition. As a consequence, alternative prediction approaches, again using neural networks and support vector
machines, focus on the correlation of amino acid composition with different sub-cellular localizations. Twenty
input features, one for the fraction of each amino acid, are generally used. A method based on amino acid
composition is expected to be comparatively robust to errors in the N-terminal sequence assignment. Note that the
output to be predicted (i.e. the localization) remains the same but the set of input features is changed.
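
The twenty composition features mentioned above can be computed as in the following minimal sketch. The function name and the example peptide are assumptions made for illustration; ignoring non-standard residue letters is one possible convention, not a prescribed one.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def composition(sequence):
    """Return 20 features: the fraction of each standard amino acid."""
    seq = sequence.upper()
    counted = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counted)  # non-standard letters (e.g. X) are ignored here
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counted]

features = composition("MKTAYIAKQR")  # hypothetical peptide
print(len(features))                  # one feature per amino acid
print(round(sum(features), 6))        # the fractions add up to one
```

Because the features are fractions over the whole chain, a few missing or wrong residues at the N-terminus change them only slightly.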

2. Recognition of Translation Initiation Sites


Translation of mRNA to protein does not begin with the first nucleotide triplet of an mRNA molecule, but rather
begins somewhere downstream. Translation is usually initiated at the start codon (AUG), which encodes the
amino acid methionine. But translation in eukaryotes does not always start at the first start codon, implying that
context information also plays a role. Also, the start codon is often not detected because of errors in mRNA
annotation. This makes prediction of translation initiation sites a non-trivial task. The task of finding translation
initiation sites has been modeled as a classification problem. Supervised classification techniques, including
support vector machines, have been used to predict translation initiation sites (TIS) from a fixed-length sequence
window around a potential start codon. The input consists of a binary encoding of the sequence: no higher level
features are supplied. Each nucleotide (A, C, G, T, and N for unknown) is encoded by five bits, exactly one of
which is set.
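
The five-bit encoding described above can be sketched as follows. The window size (three flanking bases on each side), the function name and the example sequence are assumptions chosen to keep the example small; published predictors use much longer windows.

```python
ALPHABET = "ACGTN"  # N stands for an unknown nucleotide

def encode_window(sequence, atg_pos, flank=3):
    """One-hot encode a fixed-length window around a candidate ATG.

    Each nucleotide becomes five bits with exactly one bit set.
    Positions that fall outside the sequence are treated as N.
    """
    bits = []
    for i in range(atg_pos - flank, atg_pos + 3 + flank):
        base = sequence[i].upper() if 0 <= i < len(sequence) else "N"
        if base not in ALPHABET:
            base = "N"
        bits.extend(1 if base == letter else 0 for letter in ALPHABET)
    return bits

mrna = "GCCACCATGGCGT"           # hypothetical sequence, written with T instead of U
start = mrna.find("ATG")         # index of the first candidate start codon
vector = encode_window(mrna, start)
print(len(vector))               # 9 positions x 5 bits
```

Every candidate ATG in a sequence yields one such vector, which is then labeled as a true or false initiation site for training.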

3. The Protein Folding Prediction


The protein folding problem is how to predict a protein's three-dimensional structure from its one-dimensional
amino acid sequence. The number of protein sequences is growing much faster than our ability to solve their
structures experimentally. This is the so-called sequence-structure gap. It is often considered one of the
most significant problems in structural molecular biology. On a practical level, solving the protein folding
problem is a key to rapid progress in the fields of protein engineering and drug design. Proteins and peptides
are biopolymers composed of amino acid residues interlinked by amide bonds. Their structure can be discussed
in terms of four levels of complexity defined as follows:
Primary structure: the sequence of amino acids.
Secondary structure: local folding maintained by short distance interactions.
Tertiary structure: additional folding maintained by more distant interactions.
Quaternary structure: structure maintained by interchain interactions.

3.1 The Secondary Structure Prediction Problem


Protein secondary structure is the first folding level of protein conformation. Predicting it is an essential
intermediate step on the way to predicting the full three-dimensional structure of a protein. It arises from the
formation of regular local arrangements within a single protein chain, and consists of alpha helices (short,
spiral-shaped sections), beta sheets (pleated sections) and coils (the remaining, less regular regions). If the
secondary structure of a protein is known, it is possible to derive a comparatively small number of possible
tertiary (three-dimensional) structures. The secondary structure prediction problem is generally framed as a
classification problem in which a fixed-size segment of the protein chain, centered on the residue to be predicted,
is given to a classifier and the output is one of the three target classes: helix, strand or coil. Both the
residues and the target classes are encoded in unary format.
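
The window-based framing above can be sketched as follows. The window half-width, the padding symbol used beyond the chain ends, and the example chain with its labels are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"                       # marks window positions beyond the chain ends
SYMBOLS = AMINO_ACIDS + PAD     # 21 symbols per window position
CLASSES = "HEC"                 # helix, strand, coil

def unary(symbol, alphabet):
    """Unary (one-hot) code: a single 1 at the symbol's index."""
    return [1 if symbol == s else 0 for s in alphabet]

def windows(sequence, half_width=2):
    """Yield one encoded window per residue, centred on that residue."""
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    for i in range(len(sequence)):
        segment = padded[i:i + size]
        yield [bit for symbol in segment for bit in unary(symbol, SYMBOLS)]

protein = "MKVLA"               # hypothetical chain
labels = "CHHHC"                # hypothetical per-residue classes
inputs = list(windows(protein))
targets = [unary(c, CLASSES) for c in labels]
print(len(inputs), len(inputs[0]))   # one window per residue, 5 x 21 bits each
```

Each (input, target) pair is one training example, so a single protein of length L contributes L examples.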

3.2 Protein Fold Class Prediction and Fold Recognition


Predicting the three-dimensional structure of a protein from its sequence alone has been an elusive problem for a
long time. The use of ab initio methods for the prediction of native structures of proteins has several
limitations. Therefore, computational biologists have been trying to find a reverse solution to the protein folding
problem. The problem can be formulated as a multi-class classification problem: recognizing structural
classes in a database of known folds such as SCOP or CATH. This greatly reduces the search space for the native
fold of the protein structure.

4. Predicting Protein-Protein Interactions


A goal of proteomics is to elucidate the structure, interactions and functions of all proteins within cells and
organisms. The interaction between proteins is fundamental to a broad spectrum of biological functions (e.g.
regulation of metabolic pathways, immunologic recognition, DNA replication, protein synthesis).
Whether or not two proteins will bind to form a stable complex that is a prerequisite to biological function
depends on the three-dimensional conformations of the proteins. At the same time, the sequence specifies the
conformation. Computational techniques could represent an alternative to conventional proteomics methods, which
are known to be tedious, labor intensive and potentially inaccurate.
In many cases, knowledge of the amino acid sequence alone might be sufficient to estimate the propensity
for two proteins to interact. Given a database of known protein-protein interaction pairs, a machine learning
system is trained to recognize interactions on the basis of the primary structure and associated physicochemical
properties. Protein interaction data can be obtained from the Database of Interacting Proteins. For each amino
acid sequence, feature vectors are assembled from encoded representations of tabulated residue properties (e.g.
charge, hydrophobicity, surface tension). Input patterns are obtained by concatenating the feature vectors of
the interacting proteins. Negative examples can be obtained by randomizing amino acid sequences.
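
The construction of input patterns described above can be sketched as follows. The hydropathy values are the Kyte-Doolittle scale; the simplified charge table, the fixed encoding length with zero padding, and the example sequences are assumptions made for illustration and are not taken from any particular published predictor.

```python
import random

# Kyte-Doolittle hydropathy values. The charge table is a simplification
# (K, R = +1; D, E = -1; everything else 0) assumed for illustration.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def encode(sequence, length=8):
    """Per-residue [hydropathy, charge] values, cut or zero-padded to a fixed length."""
    values = []
    for aa in sequence[:length]:
        values += [HYDROPATHY[aa], CHARGE.get(aa, 0)]
    values += [0.0] * (2 * length - len(values))
    return values

def pair_pattern(seq_a, seq_b):
    """Input pattern for a protein pair: the two feature vectors concatenated."""
    return encode(seq_a) + encode(seq_b)

def shuffled(sequence, rng):
    """Randomised copy of a sequence, used to build a negative example."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)                        # fixed seed for repeatability
positive = pair_pattern("MKKLDE", "AVILFG")   # hypothetical interacting pair
negative = pair_pattern(shuffled("MKKLDE", rng), shuffled("AVILFG", rng))
print(len(positive), len(negative))
```

Shuffling keeps the amino acid composition of each protein while destroying its order, so the classifier cannot separate the two classes on composition alone.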

5. Gene recognition
A major problem in molecular biology is to identify genes in uncharacterized DNA sequences. There are two
broad classes of computational approaches to finding genes in nucleotide sequences.
Search by signal: it locates genes by finding particular signals that are associated with gene expression. A signal
is a localized region of DNA that performs a specific function, such as binding an enzyme.
Search by content: it recognizes genes by identifying segments of DNA sequences that possess the general
properties of coding regions.

5.1 Search by signal


The search is represented as a classification task which takes a fixed-length window on a DNA sequence and
determines if the signal of interest occupies a particular position in the window. Once the classifier is trained, it
can be used to locate the signals of interest by scanning its window along the length of the sequence.
Applications are the detection of transcription initiation sites and the detection of splice junctions.
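
The scanning procedure described above can be sketched as follows. The classifier here is only a placeholder that looks for the dinucleotide GT at the window centre; in practice it would be a trained model such as an SVM applied to the encoded window. All names and the example sequence are hypothetical.

```python
def scan(sequence, classifier, width):
    """Slide a fixed-length window along the sequence; return hit positions."""
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if classifier(window):
            hits.append(start)
    return hits

def toy_donor_classifier(window):
    """Placeholder for a trained model: fires when GT sits at the window centre.

    A real classifier would score the whole encoded window instead.
    """
    mid = len(window) // 2
    return window[mid:mid + 2] == "GT"

dna = "ACGGTAAGTCCAGGTT"   # hypothetical sequence
positions = scan(dna, toy_donor_classifier, width=6)
print(positions)           # window start positions where the signal was reported
```

Separating the scanning loop from the classifier means the same loop can be reused for any signal once a suitable classifier has been trained.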

5.1.1 Splice junctions localization


Introns are sequences within eukaryotic genes that are transcribed but spliced out of the pre-mRNA before it is
translated. They range in length from less than 100 to more than 1000 nucleotides. Since eukaryotic genes may
contain introns,
the problem of determining the coding region in eukaryotic DNA involves more than simply finding initiation
sites. Splice junctions are the boundary points where splicing occurs. Identifying splice junctions is important
because, in order to determine the proteins produced by a gene, it is necessary to precisely demarcate the
segments of the DNA sequence that are eventually translated.

5.2 Search by content


Unlike search-by-signal approaches, which look for specific functional sites in DNA, search-by-content methods
identify genes by recognizing general patterns that occur in their nucleotide sequences. The objective is to
identify the regions of DNA sequences that are translated to protein.

6. Gene classification
Genome researchers are shifting their focus from structural genomics to functional genomics. Structural
genomics is the initial phase of genome analysis, whose goal is to construct high resolution genetic and physical
maps as well as complete sequence information of the chromosomes. Functional genomics is the second phase,
aiming at studying the functionality of genes of a single organism as well as studying and correlating the
functionality of genes across many different organisms. The traditional approach to functional genomics consists
in using sequence data to determine the function of genes and/or the corresponding proteins. The idea is that
genes with sufficiently similar sequences also perform similar functions. However, sequence comparisons can
sometimes be uninformative or misleading, and there are many species for which we do not have
complete sequence information. Recently, methods have been developed for monitoring genome-wide mRNA
expression: oligonucleotide chips, SAGE (serial analysis of gene expression) and microarrays. These tools
allow expression levels of the entire genome to be observed under many different induced conditions. Knowing when
and under what conditions a gene or a set of genes is expressed often provides strong clues as to their biological
role and function. One way of using the data produced by microarray experiments to determine the function of
unknown genes is to use clustering algorithms to group together genes that have similar expression profiles.
Based on the distribution of known and unknown genes in such clusters, some information about the function of
previously unknown genes can be inferred. An alternative is provided by supervised learning methods. The key
advantage of supervised over unsupervised methods is that the predictive precision of these methods can be
quantified. Many authors have used several classification algorithms to predict if a gene has a particular function
based on expression profiles.
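
Grouping genes by the similarity of their expression profiles requires a similarity measure. The sketch below uses the Pearson correlation, which is one common choice and an assumption here, since the text does not name a measure. The gene names and expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

profiles = {   # hypothetical expression values across four conditions
    "geneA": [0.1, 1.2, 2.3, 3.1],
    "geneB": [0.2, 1.0, 2.5, 2.9],
    "geneC": [3.0, 2.1, 0.9, 0.2],
}

def most_similar(name, profiles):
    """Return the other gene whose profile correlates best with the named gene."""
    others = (g for g in profiles if g != name)
    return max(others, key=lambda g: pearson(profiles[name], profiles[g]))

print(most_similar("geneA", profiles))
```

If geneB had a known function and geneA did not, their similar profiles would suggest a related role for geneA, which is the inference described above.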

7. Microarray Data Analysis


A microarray experiment measures the abundance of messenger RNA (mRNA) in a given sample of
cells. The technique involves affixing known DNA strands (probes) to a substrate (a glass slide or a silicon chip).
A fluorescently labeled sample of mRNA is then washed over the substrate, and mRNA molecules that are complementary
to the probes bind there. The dye is then made to fluoresce under a microscope, and the intensity at each spot is
measured. Since each spot on the substrate corresponds to a known gene, each spot intensity indicates how many
copies of the corresponding mRNA exist in the sample. The overall signal for a given gene is computed by combining
the measurements from the corresponding spots. The result is a collection of on the order of 10,000 measurements of
gene activity per experiment. However, the data are quite noisy. Consequently, many research groups have
resorted to clustering and pattern recognition techniques to interpret their microarray data.

7.1 Classification of Genes


The first application of SVMs in the analysis of microarray data involves the classification of genes into distinct
classes. Data from many separate microarray experiments are collected into a single matrix, indexed by gene
(row) and experiment (column). Classification can be performed along either dimension of this matrix: gene
functional classification along the row dimension or diagnostic or prognostic patient classification along the
column dimension.

7.2 Classification of Tissues


A more popular application of SVMs to the analysis of microarray data involves transposing the matrix of
expression values. Rather than classifying each gene according to its profile across multiple experiments, the
SVM learns to classify experiments. In this type of study, one experiment typically corresponds to one patient,
and the classification label corresponds to a diagnosis. As such, the dimensionality of the problem is unusual:
typically, a dataset contains tens of experiments and thousands of genes.
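
The transposition described above can be sketched in a few lines. The gene names, experiment names, expression values and diagnosis labels are hypothetical.

```python
# Rows are genes, columns are experiments (hypothetical expression values).
genes = ["g1", "g2", "g3"]
experiments = ["patient1", "patient2"]
by_gene = [
    [2.1, 0.3],
    [1.7, 0.2],
    [0.4, 1.9],
]

# Transposing gives one row per experiment, with one feature per gene.
by_experiment = [list(column) for column in zip(*by_gene)]
labels = ["tumour", "normal"]   # hypothetical diagnosis per experiment

for name, row, label in zip(experiments, by_experiment, labels):
    print(name, row, label)
```

In a real dataset the transposed matrix has only tens of rows but thousands of columns, which is the unusual dimensionality noted above.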

8. Detection of Sequence Homology


As the number of protein sequences in biochemical databases keeps increasing much faster than our ability to
characterize their functions through experimentation, accurate protein annotation from the amino
acid sequence alone is a central problem in computational biology. A core tool in the annotation process is the
detection of sequence similarities, because homology often implies functional similarity. While satisfactory
methods exist to detect homologs with a high level of similarity, remote homologs are often difficult to separate
from pairs of proteins that share similarities due to chance. Detecting homologs in the so-called ‘twilight zone’
remains challenging.

Specialized kernels that account for sequence similarity have been developed for the purpose of classifying
sequences based on homology, for example string kernels, mismatch string kernels and bag of words (BOW) kernels.
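
One of the simplest string kernels is the k-mer spectrum kernel, which counts the substrings of length k that two sequences share. The sketch below is a minimal illustration with hypothetical sequences and k = 3; it does not reproduce any particular published implementation.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Dot product of the k-mer count vectors of two sequences."""
    counts_s, counts_t = kmer_counts(s, k), kmer_counts(t, k)
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())

s, t = "MKVMKV", "KVMKA"            # hypothetical sequences
print(spectrum_kernel(s, t))        # similarity between s and t
print(spectrum_kernel(s, s))        # similarity of s with itself
```

Because the kernel is a dot product of count vectors, it can be supplied to an SVM directly, without aligning the sequences.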

Use of SVM in Agri Bioinformatics:


Agriculture is the cultivation of animals, plants, fungi and other life forms for food, fiber, and other products
used to sustain life. Plants play a key role in maintaining life on earth. The oxygen we breathe comes from
plants. They form the basis of food for all the living forms on earth, provide nutrition and provide many useful
drugs. Agri Bioinformatics uses computational tools to analyse plant genomes and thus contribute to the
improvement in their productivity. There is a growing number of applications of machine learning techniques in
agriculture and a growing amount of data that are currently available from many resources. Following are some
applications of SVM in the field of agriculture:

Detection of weed and nitrogen stress in corn. (Karimi, 2005)


Crop growth is affected by different stresses (e.g., water, pest, and weed) which can decrease crop
production. The study used SVM as a tool to classify hyperspectral images taken over a corn (Zea mays L.)
field. A hyperspectral imaging system obtains information in more than 100 very narrow, defined, continuous
spectral bands. In this system, reflected radiation from any specified target is obtained continuously, which gives
detailed information on the materials at the target. These narrow wavebands make hyperspectral remote sensing
systems powerful tools that have the potential to avoid time-consuming and labor-intensive ground data
collection methods. Nitrogen application rates and weed management practices defined the classes used for
classification. The field experiment consisted of three nitrogen application rates and four weed management
strategies. In order to obtain a hyperspectral image, a 72-waveband Compact Airborne Spectrographic Imager
was used at an initial phase of growth during the year 2000 growing season. Nitrogen application rates
considered were 60, 120, and 250 kg N/ha. Weed controls taken into account were: none, control of grasses,
control of broadleaf weeds, and full weed control. Classification accuracy was evaluated for three cases: nitrogen
application rates alone, weed controls alone, and combinations of nitrogen application rates and weed infestation
levels. The classification accuracies obtained using SVM were compared with those obtained by an artificial
neural network (ANN) model on the same data. The SVM method gave lower misclassification rates than the ANN
approach in all three cases. Use of the SVM technique for
detecting stresses in the initial phase of crop growth could enable early application of site-specific remedies and
timely in-season interventions.

Separation of mixed plant-pathogen EST collections based on codon usage (Friedel et al., 2005)
The efficient characterization of the plant-pathogen interaction plays a key role in plant disease control. The
construction of mixed libraries that contain sequences from both genomes helps in the discovery of host and
pathogen genes expressed at the plant-pathogen interface. Sequence identification requires high-throughput and
reliable classification of genome origin. A dataset of 3974 unigene sequences of various lengths from barley
(Hordeum vulgare) and Blumeria (Blumeria graminis) was used as training sequences. The short length of single-pass
cDNA sequences and the lack of relevant data in public databases often cause difficulties. To overcome these
difficulties, a novel method was introduced that takes into account subtle differences in codon usage between
plant and fungal genes. For this, SVM was used to identify the probable origin of sequences. A support vector
model is calculated to distinguish between correct and wrong frames. SVMs were compared to several other
machine learning techniques and to a probabilistic algorithm (PF-IND) for Expressed Sequence Tag (EST)
classification that is also based on codon bias differences. The proposed Eclat software, which consists of a web
frontend and several Java packages and is used to calculate the support vector models, achieved a classification
accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis. It was
found that the Eclat software can efficiently classify EST sequences containing at least 50 nt of coding
sequence. Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient
classified training sequences. The methodology has also been tested on EST sequences obtained from cotton
(Gossypium arboreum) and cotton root knot nematode (Meloidogyne incognita). A
model trained on this dataset has a 10-fold cross-validation accuracy of ~87.3%. This shows that the
methodology can be successfully applied to other systems, such as plant/nematode.
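
To illustrate what a codon usage feature vector looks like, the sketch below computes the relative frequency of each of the 64 codons in one reading frame. This is not the Eclat feature set, which is not specified here; the function name and the example fragment are assumptions.

```python
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]   # all 64 codons

def codon_usage(sequence, frame=0):
    """Relative frequency of each of the 64 codons in one reading frame."""
    seq = sequence.upper()
    counts = dict.fromkeys(CODONS, 0)
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:        # skip codons containing N or other symbols
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]

est = "ATGGCCGCCAAGTAA"            # hypothetical EST fragment
features = codon_usage(est)
print(len(features))               # one feature per codon
print(features[CODONS.index("GCC")])
```

Computing the vector for each of the three frames gives a classifier the material to distinguish the correct frame from the wrong ones, as described above.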

Software for Support Vector Machines


• SVMlight
SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. It has a fast
optimization algorithm, can be applied to very large datasets, and has a very efficient implementation of
leave-one-out cross-validation. It is distributed as C++ source and binaries for Linux, Windows, Cygwin, and
Solaris. Kernels available include polynomial, radial basis function, and neural (tanh).
Availability: http://svmlight.joachims.org/

• SVMstruct
SVMstruct, by Joachims, is an SVM implementation that can model complex (multivariate) output data y, such
as trees, sequences, or sets. These complex output SVM models can be applied to natural language parsing,
sequence alignment in protein homology detection, and Markov models for part-of-speech tagging. Several
implementations exist: SVMmulticlass, for multiclass classification; SVMcfg, which learns a weighted context
free grammar from examples; SVMalign, which learns to align protein sequences from training alignments; and
SVMhmm, which learns a Markov model from examples. These modules have straightforward applications in
bioinformatics, but one can imagine significant implementations for cheminformatics, especially when the
chemical structure is represented as trees or sequences.
Availability: http://svmlight.joachims.org/svm_struct.html

• mySVM
mySVM, by Rüping, is a C++ implementation of SVM classification and regression. It is available as C++
source code and Windows binaries. Kernels available include linear, polynomial, radial basis function, neural
(tanh), and anova. All SVM models presented in this chapter were computed with mySVM.
Availability: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/index.html

• mySVM/db
mySVM/db is an efficient extension of mySVM, which is designed to run directly inside a relational database
using an internal Java engine. It was tested with an Oracle database, but with small modifications, it should also
run on any database offering a JDBC interface. It is especially useful for large datasets available as relational
databases.
Availability: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVMDB/index.html

 LIBSVM
LIBSVM (Library for Support Vector Machines) was developed by Chang and Lin and contains C-classification,
ν-classification, ε-regression, and ν-regression. Developed in C++ and Java, it also supports multiclass
classification, weighted SVMs for unbalanced data, cross-validation, and automatic model selection. It has
interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels available include linear,
polynomial, radial basis function, and neural (tanh).
Availability: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
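
The kernels listed above (linear, polynomial, radial basis function, and neural/tanh) can be written out as plain functions. The sketch below is illustrative only and does not call LIBSVM. The parameter names gamma, coef0 and degree, their default values, and the helper gram_matrix are assumptions chosen for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian radial basis function: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def tanh_kernel(u, v, gamma=1.0, coef0=0.0):
    """Neural (sigmoid) kernel: tanh(gamma * <u, v> + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)


def gram_matrix(X, kernel):
    """Kernel value for every pair of samples in X (hypothetical helper)."""
    return [[kernel(a, b) for b in X] for a in X]


# Hypothetical toy samples.
X = [(1.0, 2.0), (3.0, 4.0)]
print(gram_matrix(X, linear_kernel))
print(gram_matrix(X, rbf_kernel))
```

An SVM uses the data only through such pairwise kernel values, so changing the kernel function changes the shape of the decision boundary without changing the training algorithm.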

 SVMTorch
SVMTorch, by Collobert and Bengio, is part of the Torch machine learning library (http://www.torch.ch/)
and implements SVM classification and regression. It is distributed as C++ source code or binaries for Linux and
Solaris.
Availability: http://bengio.abracadoudou.com/SVMTorch.html

 Weka
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either
be applied directly to a dataset or called from your own Java code. It contains an SVM implementation.
Availability: http://www.cs.waikato.ac.nz/ml/weka/

 BioWeka
BioWeka is an extension library to the data mining framework Weka for knowledge discovery and data analysis
tasks in biology, biochemistry and bioinformatics. It includes integration of the Weka LibSVM project.
Availability: http://sourceforge.net/projects/bioweka/

 Gist
Gist is a C implementation of support vector machine classification and kernel principal components analysis.
The SVM part of Gist is available as an interactive Web server at http://svm.sdsc.edu. It is a very convenient
server for users who want to experiment with small datasets (hundreds of patterns). Kernels available include
linear, polynomial, and radial.
Availability: http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi

 MATLAB SVM Toolbox
This SVM toolbox, by Gunn, implements SVM classification and regression with various kernels, including
linear, polynomial, Gaussian radial basis function, exponential radial basis function, neural (tanh), Fourier series,
spline, and B-spline. All figures from this chapter presenting SVM models for various datasets were prepared
with a slightly modified version of this MATLAB toolbox.
Availability: http://www.isis.ecs.soton.ac.uk/resources/svminfo/

Exercises
1. Which problems in chemoinformatics and bioinformatics require machine learning tools? Give
illustrative examples.
2. What is the principle behind linear SVM?
3. What is the principle behind nonlinear SVM?
4. What are kernel functions? Give examples of important kernel functions.
5. Give examples of multi-class classification problems in chemoinformatics and bioinformatics.
6. Explain how domain information can be used to choose sequence and structural features.
7. What is the principle of Ant Colony Optimization?
8. What is feature selection, and how is it relevant in bioinformatics?
9. Discuss the importance of SVM in agri-bioinformatics.
10. Give examples from your own domain where SVM would be very useful.

Select References
 Baneyx, F. (1999). Recombinant protein expression in Escherichia coli. Curr. Opin. Biotechnol. 10, pp.
411–421.
 Bertone, P. et al. (2001) SPINE: an integrated tracking database and data mining approach for identifying
feasible targets in high-throughput structural proteomics. Nucleic Acids Res., 29, pp. 2884–2898.
 Burden, F.R.; Ford, M.G.; Whitley, D.C.; Winkler, D.A. (2000), J. Chem. Inf. Comput. Sci., 40, pp.
1423-1430.
 Davis, G.D., Elisee, C., Newham, D.M. and Harrison, R.G. (1999). New Fusion Protein Systems Designed
to Give Soluble Expression in Escherichia coli. Biotechnol. Bioeng. 65, pp. 382-388.
 Duan, K.; Keerthi, S.; Poo, A.N. (2002). Evaluation of simple performance measures for tuning SVM
hyperparameters, Neurocomputing, 51, pp. 41-59.
 Goh, C.S. et al. (2004) Mining the structural genomics pipeline: identification of protein properties that
affect high-throughput experimental analysis. J. Mol. Biol., 336, pp. 115–130
 Golub TR, Slonim DK, Tamayo P, Gaasenbeek CHM, Mesirov JP, Coller H, Loh ML, Downing JR,
Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and
class prediction by gene expression monitoring. Science, 286, pp. 531–537
 Gunn, S. (1997). Support Vector Machines for Classification and Regression. ISIS Technical Report
 Harrison, P.W.; Barlin, G.B.; Davies, L.P.; Ireland, S.J.; Matyus, P.; Wong, M.G. (1996) Syntheses,
pharmacological evaluation and molecular modelling of substituted 6-alkoxyimidazo[1,2-b]pyridazines
as new ligands for the benzodiazepine receptor, Eur. J. Med. Chem., 31, pp. 651-662.
 Iakoucheva L. M., Brown C. J., Lawson J. D., Obradović Z., Dunker A. K. (2002) Intrinsic disorder
and protein function, J Mol Biol 323, pp. 573-584
 Idicula-Thomas S., Kulkarni A.J., Kulkarni B.D., Jayaraman V.K., and Balaji P.V. (2006) A support
vector machine-based method for predicting the propensity of a protein to be soluble or to form
inclusion body on overexpression in Escherichia coli. Bioinformatics 22, pp. 278-284.
 Idicula-Thomas, S. and Balaji, P.V. (2005) Understanding the relationship between the primary structure
of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci., 14, pp.
582–592.
 Keerthi, S.S. (2002). Efficient tuning of SVM hyper parameters using radius/margin bound and iterative
algorithms. IEEE Transactions on Neural Networks 13, pp.1225-1229.
 Kulkarni Abhijit, Jayaraman V.K., Kulkarni B.D. (2004). Support vector classification with parameter
tuning assisted by agent based technique, Computers and Chemical Engineering, Vol. 28/3, pp. 311-318
 Kumar R., Kulkarni A.K., Jayaraman V.K., Kulkarni B.D. (2004). Structure-Activity Relationships using
Locally Linear Embedding Assisted by Support Vector and Lazy Learning Regressors, Internet
Electronic Journal of Molecular Design, Vol. 3/3, pp. 118-123
 Luan, C.H. et al. (2004) High-throughput expression of C. elegans proteins. Genome Res., 14, pp. 2102–
2110.
 L’Heureux, P.J.; Carreau, J.; Bengio, Y.; Delalleau, O.; Yue, S.Y. (2004) J. Comput. Aided Mol. Des.,
18, pp. 475-482.
 Paul TK, Iba H (2005) Gene selection for classification of cancers using probabilistic model building
genetic algorithm. Biosystems, 82(3), pp. 208–225
 Schölkopf, B., Platt, J.C., Shawe-Taylor, J., & Smola, A.J. (1999). Estimating the support of a high
dimensional distribution. Technical Report MSR-TR-99-87, Microsoft Research, Redmond, WA, USA.
Online version: http://www.kernel-machines.org/papers/oneclass-tr.ps.gz
 Selwood, D. L.; Livingstone, D. J. ; Comley, J. C. W.; O’Dowd, A. B.; Hudson, A. T.; Jackson, P.; Jandu,
K. S.; Rose, V. S.; Stables, J. N. (1990) Structure-activity relationships of antifilarial antimycin analogs,
J. Med. Chem., 33, pp. 136–142.
 Sickmeier M., Hamilton J. A., LeGall T., Vacic V., Cortese M. S., Tantos A., Szabo B., Tompa P., Chen
J., Uversky V. N., Obradovic Z., Dunker A. K. (2007). DisProt: a database of protein disorder, Nucleic
Acids Research 35(Database issue), pp. D786-D793.
 Smola, A.J., & Schölkopf, B. (1998). From regularization operators to support vector kernels. In: M.I.
Jordan, M.J. Kearns, S.A. Solla (Eds.), Advances in Neural Information Processing Systems, vol. 10.
Cambridge, MA: MIT Press.
 Vapnik V., Chapelle O., (2000), Bounds on error expectation for support vector machines. Neural
Computation 12(9), pp. 2013–2036.
 Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
 Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
 Dorigo M. (1996). The Ant System: Optimization by a colony of cooperating agents, IEEE Transactions
on Systems, Man, and Cybernetics - Part B, 26(1), pp. 1-13.
 Vucetic S., Obradovic Z., Vacic V., Radivojac P., Peng K., Iakoucheva L. M., Cortese M. S., Lawson J.
D., Brown C. J., Sikes J. G., Newton C. D., Dunker A. K. (2005). DisProt: a database of protein
disorder, Bioinformatics 21, pp. 137-140.

 Wagener, M.; Sadowski, J.; Gasteiger, J. (1995), Autocorrelation Of Molecular Surface Properties For
Modeling Corticosteroid Binding Globulin And Cytosolic Ah Receptor Activity By Neural Networks, J.
Am. Chem. Soc., 117, pp. 7769–7775.
 West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr., Marks JR,
Nevins JR (2001) Predicting the clinical status of human breast cancer by using gene expression
profiles. Proc. Natl. Acad. Sci., 98(20), pp. 11462-7
 Wilkinson, D.L. and Harrison, R.G. (1991). Predicting the solubility of recombinant proteins in
Escherichia coli. Biotechnology 9, pp. 443–448.
 Zbilut P., Giuliani A., Colosimo A., Mitchell J. C., Colafranceschi M., Marwan N., Uversky V. N.,
Webber Jr. C. L. (2004) Charge and hydrophobicity patterning along the sequence predicts the folding
mechanism and aggregation of proteins: a computational approach, Journal of Proteome Research 3, pp.
1243-1253
 Karimi Y., (2005) Application of hyperspectral remote sensing in stress detection and crop growth
modeling in corn fields, Research Thesis, Department of Bioresource Engineering, McGill University,
Canada
 Friedel C., Jahn H. V., Sommer, Rudd, Mewes H.W., Tetko I.V. (2005) Support vector machines for
separation of mixed plant–pathogen EST collections based on codon usage, Bioinformatics 21, pp. 1383-
1388

Section 15: Technical aspects of Bioinformatics


Agricultural bioinformatics touches many other areas and is closely linked with other fields. A few
important terms are briefly discussed below.

 Biodiversity Informatics is the application of computer informatics techniques to biodiversity
information for improved management, presentation, exploration and analysis. It typically builds on a
foundation of taxonomic, biogeographic, or ecological information stored in digital form, which, with
the application of modern computer techniques, can yield new ways to view and analyze existing
information, as well as predictive models for information that does not yet exist. Biodiversity
informatics is a relatively young discipline (the term was coined around 1992) but has hundreds of
practitioners worldwide, including the many people involved in the design and construction of
taxonomic databases. The term "Biodiversity Informatics" is generally used in the broad sense to apply
to computerized handling of any biodiversity information; the apparently broader term "bioinformatics"
is often used synonymously with the computerized handling of data in the specialized area of molecular
biology.

o Biodiversity describes the variety found in biology. Explore some of the
biodiversity informatics databases. Discuss the applications and implications of these tools.
o Explore the various plant genome resources specific to different crops.
o Discuss the combinatorial complexity of simple sequence patterns.
o Explore various e-portals and e-resources in agriculture.
o Discuss how the HapMap project is helpful for agricultural scientists. Make a pilot analysis of the
following:
o Domains, motifs, sub-motifs, patterns, RNA folding patterns, and SNPs
o Comparative genomics of regulatory regions
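
As a starting point for the exercise above on the combinatorial complexity of simple sequence patterns, the following minimal sketch counts and enumerates all DNA patterns of length k. It assumes the four-letter alphabet A, C, G, T and exact patterns without wildcards. The function names are hypothetical.

```python
from itertools import product

DNA = "ACGT"


def count_patterns(k, alphabet=DNA):
    """Number of distinct sequences of length k over the alphabet."""
    return len(alphabet) ** k


def enumerate_patterns(k, alphabet=DNA):
    """All sequences of length k, in lexicographic order of the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


for k in (1, 2, 3, 10):
    print(k, count_patterns(k))

print(enumerate_patterns(2))
```

The count grows as 4 to the power k, so exhaustive enumeration is practical only for short patterns.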

 Metagenomics is the study of ‘meta’genomes, in which genetic material is assayed directly from environmental
samples. It has attracted the interest of researchers because organisms that are not cultured in the
laboratory can be assayed and studied in their natural environment.

 Cloud computing for Bioinformatics: With the advent of ultra-high-throughput sequencing, genotyping
and other functional genomics methods in every laboratory, data need to be shared and accessed
by many users, perhaps in real time. Cloud computing addresses this need, allowing several
terabytes of data to be accessed and shared together.

 BioSLAX is a live USB distribution comprising more than 30 bioinformatics tools and application suites.
Released by the Bioinformatics Resource Unit of the Life Sciences Institute (LSI), National University of
Singapore (NUS), it is bootable from any PC that allows a CD/DVD or USB boot option and runs the
compressed Slackware flavour of the Linux operating system (OS). More at www.bioslax.com
