Manual PDF
Organized by
Bioclues Organization
www.bioclues.org
An affiliate of International Society for Computational Biology (ISCB.org)
and Asia Pacific Bioinformatics Network (APBioNet.org)
Hyderabad, India
This manual is not a condensed version of the program taught during the workshop. Selected topics that are
likely to interest the participants have been dealt with in a pragmatic way. The manual should therefore not be
construed as a full reference for participants during their hands-on sessions. In addition, exercises and a
summary will be given to the participants separately at the end of each day.
M – Mentoring
O – Outreach
R – Research
E – Entrepreneurship
Raghunath, Keshavachandran, Professor and Head, Bioinformatics Centre, KAU, Thrissur, India.
Jayaraman Valadi, PhD. Scientist Emeritus, CSIR, India.
Tiratha Raj Singh, PhD, Vice President, Bioclues.org. Sr. Lecturer, JUIT, Solan, HP, India
Pritish Varadwaj, PhD. Co founder, Bioclues.org. Professor, IIIT Allahabad, India
Arun Gupta, M.Tech. Visiting Faculty, DAV University, Indore, India.
Sundararajan VS, Nanyang Technological University, Singapore.
Mohana Lata Paul, CCMB Alumni and Kakatiya University
Renuka Suravajhala (PhD), Roskilde University, Denmark
Shidhi, PhD fellow, Kerala University, Trivandrum, India.
Shrish Tiwari, Ph D. CCMB, Hyderabad.
Prashanth Suravajhala, PhD.
Figure 1: From hypotheses to dogmas: how Bioinformatics has evolved.
Thrust areas of Bioinformatics (not limited to these alone):
Sequence analysis
Genome annotation
Computational evolutionary biology
Gene and protein expression analysis
Today, Bioinformatics research spans several areas, and Agricultural Bioinformatics is growing steadily.
With approximately 10 plant genomes and more than 100 plant/livestock genomes sequenced or being
sequenced, there is a need for bioinformatics to be leveraged. Arabidopsis thaliana has served as the
reference genome for plants and as a widely used model for higher eukaryotes. Genome informatics
in these areas has resulted in the development of a host of tools. However, there is a lacuna in research on
Protein-Protein Interaction (PPI) studies in agriculture, where work has so far been limited to identifying the
function of proteins and genes through Quantitative Trait Loci (QTL) and Qualitative Trait Loci (QuTL).
Laboratory-based methods: Experimental procedures for locating genes in new DNA are based on the
following:
1. Identification via hybridization to mRNA or cDNA (Northern blotting: gel separation of RNA followed
by transfer onto a nitrocellulose membrane, where a specific RNA is detected with a labeled molecular
probe).
2. Exon trapping: a molecular biology technique to identify potential exons in a fragment of eukaryotic
DNA of unknown intron-exon structure.
Feature-based approach: Typical features include splice sites, promoter regions (e.g. TATA box, CAAT box and
GC box), ORF start and stop codons, etc. The best gene prediction programs tend to be species
specific, trained on examples of known genes in different organisms. Other typical features include codon bias
(the tendency of an organism to use certain codons more than others to encode a particular
amino acid), donor/acceptor splice sites and coding frame length. The key to the analysis of an unknown DNA
sequence is the identification of ORFs. Web-based gene recognition systems such as GRAIL, GeneID and
GeneParser work by searching for various features of genes and then identifying the regions that score high enough.
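The role of ORF identification can be illustrated with a minimal sketch. The function below scans the three forward reading frames of a DNA string for stretches that begin with ATG and end at the first in-frame stop codon. The name find_orfs, the min_codons threshold and the toy sequence are assumptions made for illustration only; real gene finders combine many more features (splice sites, codon bias, promoters) and also scan the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start/end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                j = i + 3
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(dna):  # a stop codon was found
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3, frame))
                    i = j + 3  # resume scanning after the stop codon
                    continue
            i += 3
    return orfs

# Toy sequence (hypothetical, not taken from any database).
toy = "CCATGGCTTGGAAATAAGG"
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])
```

On the toy sequence this reports a single ORF in frame 2, ATGGCTTGGAAATAA. Scoring such candidates by codon bias or flanking signals would be the next step in a feature-based predictor.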
Select References:
The Plant Genome Central: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
The EMBL: http://www.embl.de or http://www.ebi.ac.uk/embl/
The DDBJ: http://www.ddbj.nig.ac.jp/
Molecular Modeling Data Base (MMDB) is an NCBI Entrez database that adds structure data to Entrez
so that the information is easily accessible to biologists, thereby facilitating
comparative analysis involving 3-D structure.
All the databases embed sequences and are usually presented in standard formats which include the following
(shown along with examples):
FASTA
>WheatSSR1
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV
GDE
%5HIB_CAVPO008892|WheatSSR1
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV
NBRF/PIR (National Biomedical Research Foundation/Protein Information Resource).
>P1; Wheat SSR1QTL integrated.
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
LETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMPEECQVYRIDHY
LGPARVVMEKPLGTSLATSQKEFANDQVGEYFTVLNLLALRPSTFGAICKGLGEAKLNAKNSLFVN
NWDNRTIDHVEITV
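Because these formats are plain text, they are easy to read programmatically. As a minimal sketch, the function below parses FASTA text into a dictionary of header to sequence. The name parse_fasta is hypothetical, and the example record is a shortened copy of the WheatSSR1 entry above; established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the header.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are collected and joined once parsing is done.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Shortened version of the example record, for brevity.
example = """>WheatSSR1
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTKVVREA
NWDNRTIDHVEITV
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

The GDE and NBRF/PIR examples differ mainly in the header line (a leading % or >P1;), so the same approach applies with a different header test.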
Pointers
Exercises
Analyze various bioinformatics databases and understand their intricacies. Make a short list of
fundamental concepts that an ideal bioinformatics database should have.
Make a first-hand study of a Relational DataBase Management System (RDBMS): use SQL/MS Excel
to develop a small database of at least 10 rows and 5 columns. Query the contents using SQL.
Annotate your database and learn how to link the database you developed through NCBI LinkOut.
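As a starting point for the RDBMS exercise, the sketch below builds a small five-column table in an in-memory SQLite database and queries it with SQL. The table name, column names and all row values are placeholders invented for illustration; they are not real database entries, and the exercise still asks for at least 10 rows of your own data.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT, organism TEXT, seq_type TEXT, length INTEGER)"
)

# Placeholder rows (hypothetical values, for illustration only).
rows = [
    (1, "entry_1", "Triticum aestivum", "protein", 210),
    (2, "entry_2", "Oryza sativa", "protein", 154),
    (3, "entry_3", "Triticum aestivum", "DNA", 980),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?, ?)", rows)

# Parameterized query: all entries for one organism, longest first.
result = conn.execute(
    "SELECT name, length FROM sequences"
    " WHERE organism = ? ORDER BY length DESC",
    ("Triticum aestivum",),
).fetchall()
print(result)
conn.close()
```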
Select References
The NCBI Link Out : http://www.ncbi.nlm.nih.gov/projects/linkout/
http://bioinformatics.oxfordjournals.org/content/25/12/1475.full
Figure 2a: The locus 2’s alignment with locus 1 is used as a standard to find coding regions and
alignment
2. Local alignment
Through sequence alignment, one can align two sequences and score the similarities and differences at
each nucleotide or amino acid position. This is known as pairwise alignment. One member of the pair could be a
new or unknown sequence, whereas the other would be a sequence whose structure and function are known. An
example from the matrix is shown below:
Sequence1: EKIUHWTGFRGHCVNMLCIPEIUYTF
Sequence2: EKIUHSTGFRGHCV-MLCIPEIUYTF
The summation of scores (for similarity, dissimilarity and gap penalties) gives the overall score for a particular
alignment.
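The summation described above can be sketched in a few lines. The scoring values used here (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; in practice the match and mismatch scores come from a substitution matrix and the gap penalty from the alignment program's settings.

```python
def alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of two already aligned sequences ('-' marks a gap)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            score += gap        # gap penalty
        elif a == b:
            score += match      # similarity
        else:
            score += mismatch   # dissimilarity
    return score

# The example pair from the text, written without spacing.
s1 = "EKIUHWTGFRGHCVNMLCIPEIUYTF"
s2 = "EKIUHSTGFRGHCV-MLCIPEIUYTF"
print(alignment_score(s1, s2))  # 24 matches, 1 mismatch, 1 gap -> 21
```

Changing the gap penalty or the mismatch score changes which of several candidate alignments receives the highest overall score.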
Pointers:
Figure 3
You may use Figure 3 as a prototype to answer some of the exercises below:
Use BLAST and FASTA to find homologous sequences of interest.
Can we compare a sequence against multiple sequences (target entries)? HINT: Use BLAT (BLAST-Like Alignment Tool).
WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. It
presents a new model for pathway databases that enhances and complements ongoing efforts such as KEGG,
Reactome and Pathway Commons, and it invites broader participation through community annotation, with
contributors ranging from students to senior experts in each field adding entries.
Figure 4: Moore's law and the why of Biowikis (Courtesy: Dan Bolser/Broad Institute)
Exercises
How do Biowikis help the community annotation drive?
Explore the Protein Data Bank Wiki (PDBWiki), Proteopedia and a host of other tools. Tabulate them and
list the pros and cons of each wiki.
Use Internet Relay Chat to explore and discuss wikis: irc://irc.freenode.net/#bioinformatics
Why not start your own Wiki project? (For example: http://wiki.bioclues.org)
Select References:
Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist
programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in
Molecular Biology. Humana Press, Totowa, NJ, pp 365-386.
Schuler,GD. (1997). Sequence mapping by electronic PCR.
Rotmistrovsky K, Jang W, Schuler GD. (2004). A web server for performing electronic PCR.
Thornton B, Basu C. Real-time PCR (qPCR) primer design using free online software. Biochem Mol
Biol Educ. 2011 Mar; 39(2):145-54. doi: 10.1002/bmb.20461.
Questions to ponder
What is Maximum Likelihood (ML)? How does phylogenetic analysis help us to describe the ML for
sequences that show evidence of HGT?
Use PAML/Codeml software to explore and find novel genes from a set of your favorite genes.
From Figure 6 below, identify and explore the overlapping domains and proteins. Use PAML and Clustal
analyses to determine which sequences are similar. (Also use pen-and-paper mode.)
Discussion:
Bioinformatics for Microarrays
Introduction and the use of Microarrays for expression analyses
Gene Expression Data Analysis
Serial Analysis of Gene Expression (SAGE)
Image analysis: Statistic
Normalization and clustering
Variability and replication
Gene expression analyses with R and Bioconductor
Questions to ponder
“IPR allows people to assert ownership rights on the outcomes of their creativity and innovative activity in the
same way that they own physical property. The four main types of intellectual property rights are: patents,
trademarks, design and copyrights.” The protection of IPR may take several forms depending mainly on the
type of intellectual property and the type of protection sought; each form of protection has its own advantages
and pitfalls.
Miscellany:
Some popular and justifiable legal case studies on Turmeric, Neem, Basmati, etc.
Discussion on the Traditional Knowledge Digital Library (TKDL).
IP protection in Bioinformatics.
Questions to ponder
What to patent and what not to patent?
What makes Open Access more respectable compared to the IPR?
Does IPR mean that you are void of Open Access?
Can you patent a (synthetic) gene or protein that is being studied in the laboratory?
What is semi open access?
Select References
Practical approach to IPR, Rachna Singh Puri and Arvind Viswanathan, I.K. Int. Pub. House, New Delhi.
IPR: A primer, Rao and Roa, Eastern Book Company.
Intellectual property rights and the third world, R.A. Mashelkar, CSIR.
ftp://ftp.cordis.europa.eu/pub/life/docs/ipr_bioinf.pdf
Figure 8: Evolutionary conservation on 3D structure of protein using ConSurf and the legend showing how
different methods are employed to predict secondary structures.
Energy minimisation is the key factor for predicting secondary structures (and also for 3° structures).
1. Chou Fasman method
The Chou-Fasman method (Chou and Fasman 1978) is based on the frequency of each of 20 amino acids in
alpha helices, beta sheet and turns. Amino acids Ala, Glu, Leu and Met are strong predictors of α helices, but
2. Garnier-Osguthorpe-Robson (GOR)
Garnier et al. (1978) developed this more sophisticated method on the assumption that the amino acids
flanking a central residue influence the secondary structure that the central residue is likely to
adopt. Whereas the Chou-Fasman method assumes that each amino acid individually
influences the secondary structure within a given range of sequence, the GOR method takes the neighbouring
residues into account. The method is known to be 50-60% accurate. It has a parameter called the sliding
window. If you choose an amino acid X with a sliding window of 8, the method considers the 8 amino acids on
the carboxy-terminal side and the 8 amino acids on the amino-terminal side of X (a total of 17 amino acids, with
X as the central residue).
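The sliding window can be sketched as a simple slice around the central residue. The function name residue_window and the example sequence are hypothetical; this only shows how the 17-residue neighbourhood is selected, not how GOR scores it. Near the termini the window is truncated, which is one common way (assumed here) of handling the sequence ends.

```python
def residue_window(sequence, center, half_width=8):
    """Return the residues within half_width positions of index `center`."""
    if not 0 <= center < len(sequence):
        raise IndexError("center is outside the sequence")
    start = max(0, center - half_width)                # amino-terminal side
    end = min(len(sequence), center + half_width + 1)  # carboxy-terminal side
    return sequence[start:end]

# Hypothetical 32-residue peptide used only to demonstrate the slicing.
seq = "MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQ"
window = residue_window(seq, center=12)
print(window, len(window))  # 8 + 1 + 8 = 17 residues, central residue at index 8
```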
3. Neural Networks
A neural network is a type of artificial intelligence that attempts to imitate the way a human brain works. Rather
than using a digital model, in which all computations manipulate zeros and ones, a neural network works by
creating connections between processing elements, the computer equivalent of neurons. The organization of the
processing elements determines the output. In the neural network approach, computer programs are trained to
recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns
from other patterns not located in these structures. Accuracy is approximately 70-75%.
Proteins with sequence alignment of >25-30% identity typically have homologous structures. Model accuracy
depends on the level of similarity between the unknown protein and the known structure. If the newly modeled
protein obeys Ramachandran plot, then it is said to be an acceptable one.
Tool for homology modeling: SWISS-MODEL is a ‘biologist friendly’ program. When a sequence is submitted,
it is first compared with the crystallographic database (ExPdb). If any homology is found between the query
sequence and the database structures, the matching target proteins are returned. The target structure is
superimposed on the carbon backbone of the sequence. The RMSD value must be low for a good fit. (RMSD, the
root mean square deviation, is the square root of the mean of the squared distances between corresponding alpha
carbon atoms of the two structures.) The sequence is then resubmitted to the SWISS-MODEL database for
modeling. The backbone is built first, followed by the side chains. The newly modeled
protein is then sent by e-mail. The newly modeled protein is evaluated by drawing a Ramachandran plot.
If all the amino acids lie in the allowed regions, the structure is an acceptable one.
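RMSD is conventionally computed as sqrt((1/N) * sum of d_i^2), where d_i is the distance between the i-th pair of corresponding alpha carbon atoms. The sketch below implements that formula for two coordinate lists. It assumes the two structures are already superimposed and that atoms are listed in matching order; the coordinates are made-up values for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of corresponding atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy alpha carbon coordinates (hypothetical): every atom is displaced by 1 unit.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(model, template))  # every pair is 1 unit apart, so RMSD = 1.0
```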
Ab initio prediction
Ab initio prediction is carried out when no suitable homologue is found in the database. Prediction is done
completely from the sequence. It is based on Anfinsen’s hypothesis that the native state of the protein represents
the global free energy minimum. Ab initio methods try to find this global minimum of the protein. Finding the
correct native-like protein conformation requires:
An efficient search method for exploring the conformational space to find the energy minima.
An accurate potential function that calculates the free energy of a given structure.
In order to reduce the complexity, local structure biases are used. But the strength and multiplicity of the local
structure prediction is highly sequence dependent. There are two types of scoring functions, namely
knowledge-based scoring functions and force-field-based functions. Currently there does not exist a reliable
scoring function or search method. However, some of the methods used in CASP4 and CASP5 were the segment
insertion Monte Carlo method in Rosetta, the threading and Monte Carlo method by Friesner, and the lattice Monte
Carlo method by Jeff Skolnick and Andrew Kolinski, in which side chains were used for the lattice model.
Exercises
Use the NCBI-Cn3d and Swiss-MODEL to explore predicting structures for your proteins
Use the NCBI Blast and analyze your sequences using the structure (PDB) databases as the target.
Discuss the intricacies and problems with the instructor
Use ConSeq, ConSurf, and Selecton
Introduction to cheminformatics
What is cheminformatics?
Cheminformatics, also known as chemical informatics, is a term coined by F.K. Brown in 1998 (Brown F., 2005). It
Applications of MM
Molecular modelling methods are used widely to investigate the structure, dynamics, surface properties and
thermodynamics of inorganic, biological and polymeric systems and biological activities such as protein folding,
enzyme catalysis, protein stability, conformational changes associated with biomolecular function, molecular
recognition of proteins, DNA, and membrane complexes (Leach A. R,2001).
Docking
Docking in molecular modelling is a method which predicts the preferred orientation of one molecule to a
second when bound to each other to form a stable complex.
Applications of Docking
A binding interaction between a small-molecule ligand and an enzyme protein may result in activation or
inhibition of the enzyme. Docking is most commonly used in the field of drug design, since most drugs are small
organic molecules. Docking methods are used to identify potential drug molecules that are likely to bind to a
protein target of interest. Docking is also used in bioremediation: protein-ligand docking can predict pollutants
that can be degraded by enzymes (Suresh PS et al., 2008).
Select References
Konstantin Arnold, Lorenza Bordoli, Jürgen Kopp and Torsten Schwede. The SWISS-MODEL
workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 2006,
22(2): 195–201.
Berezin C., Glaser F., Rosenberg Y., Paz I., Pupko T., Fariselli P., Casadio R. and Ben-Tal N. 2004.
ConSeq: the identification of functionally and structurally important residues in protein sequences.
Bioinformatics. 20: 1322-1324.
Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. 2010.
ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic
acids. Nucl. Acids Res.
Doron-Faigenboim A., Stern A., Mayrose I., Bacharach E. and Pupko T. 2005. Selecton: a server for
detecting evolutionary forces at a single amino-acid site. Bioinformatics. 21(9): 2101-2103.
Brown F. Editorial Opinion: Chemoinformatics – a ten year update. Current Opinion in Drug Discovery
& Development, 2005, 8(3): 296–302.
Leach A. R. Molecular Modelling: Principles and Applications, 2001.
Suresh PS, Kumar A, Kumar R, Singh VP. An in silico approach to bioremediation: laccase as a case
study. J. Mol. Graph. Model., 2008, 26(5): 845–9.
Things to ponder
Overview of the dynamics of mitochondrial structure, morphology and inheritance.
Biogenesis of mitochondria
Regulation of gene expression
The mitochondrial genome and its interaction with the nucleus, and the targeting of proteins to the
organelle: Any specific targeting signals?
What if the signal peptides are present in the proteins?
What if the N-terminal mitochondrial targeting peptide is truncated? Can the protein still localize to
mitochondria?
How do mitochondria contribute to mutations?
Could we understand the way the organelle interacts with the rest of the plant cell in silico? Is there any
visualizer meant for this?
How does the field of proteomics help in the discovery of new functions? What are the pathways of
electron transport bypass, metabolite transport, and specialized mitochondrial metabolism?
Exercises
A major problem in managing numerous proteins is not the amount of data but the way we organize it
(~complexity). Do you agree? Answer with your comments and suggestions on how to tackle this, keeping
PPI in view.
Most proteins are transported to various organelles in a cell. The eleven main organelles in eukaryotic
cells, viz. cytoplasm, nucleus, ER, ribosome, Golgi, mitochondria, chloroplast, centriole, vacuole, vesicles and
lysosomes, are localization sites for proteins as they are imported and exported, yielding different modes of
function. The majority of proteins, though compartmentalised in the cytosol, are localized across cytosolic
compartments, viz. mitochondria, endoplasmic reticulum, lysosomes and the Golgi complex. It was felt that the
proteins encoded by the mitochondrial genome and those targeted to mitochondria would be of interest in helping
researchers to understand the mitochondrial proteome better (Calvo S et al. 2006). Protein localization is
facilitated by specific targeting peptides. There are two types of targeting peptides: internal targeting signals and
presequences. While presequences are usually located at the N-terminal end, internal targeting signals can be
distributed throughout the protein. There are also precursor proteins that possess either an N-terminal presequence
or internal targeting signals or simply mitochondrial/matrix targeting sequences (MTS). These proteins are
specific to mitochondria, hence the name. The N-terminal sequences are enriched in positively charged,
hydroxylated and hydrophobic residues such as Arg, Ser and Ala, and are recognised by different import receptors.
The N-terminal presequences generally have a length of 6-85 amino acid residues and rarely contain negatively
charged amino acids. After import into mitochondria, presequences are removed by proteolysis (Bolender N et al.
2008). The last decade has seen several tools and predictors developed to find proteins localized to mitochondria.
Different tools rely on different methods. Notable among them are TargetP, based on N-terminal sequences
(Emanuelsson O et al. 2000); Mitopred, based on Pfam domains (Guda C et al. 2004); Mitoprot, which calculates
the N-terminal protein region that can support a mitochondrial/matrix targeting sequence (MTS) and the cleavage
site (Claros MG et al., 1996); and Predotar, which predicts N-terminal mitochondrial, plastid and ER
targeting sequences (Small I et al. 2004). Another tool, pTarget (Guda C, 2006), uses heuristics, that is, a
problem-solving method based on a plausible hypothesis, to screen putative Pfam domains. The screening is
related to a specific cellular localization but not necessarily to complete targeting signals (Guda C, 2006). The
occurrence patterns of protein functional domains and the amino acid compositional differences in proteins are
Table 2 Overview of some of the important and highly cited sub cellular localization prediction programs. Most
of the programs aforementioned work for eukaryotic organisms.
The advent of molecular markers has revolutionized plant biotechnology. New developments in
DNA marker technologies have made it possible to detect a large number of genetic polymorphisms at the
DNA level. These can be used as markers for evaluating the genetic basis of the observed phenotypic
variability. Molecular markers, viz. RFLP (restriction fragment length polymorphism), AFLP (amplified
fragment length polymorphism), RAPD (random amplified polymorphic DNA), ISSR (inter simple sequence
repeat), SSCP (single-strand conformation polymorphism), mini- or microsatellites and SNPs (single
nucleotide polymorphisms), have been used intensively in crop improvement. These DNA-based markers can
be used for sex identification, DNA fingerprinting and cultivar identification, and to study genome variability,
genetic diversity and relatedness, gene mapping, phylogenetic relationships and marker-assisted selection of
desirable genotypes:
Section 14: Applications of Support Vector Machines (SVM) in chemo and bioinformatics
Recent developments in genomic and post-genomic research have generated a large amount of biological data.
This data is growing exponentially with the advancement of research technologies. In order to handle such a
large amount of data, there is an increasing need for computational methods that can efficiently store, organize
A particularly active area of research in bioinformatics is the application of machine learning tools to extract
important and useful information from a large pool of biological data. Machine learning algorithms are built
to recognize complex patterns and to make intelligent decisions based on the
data. To solve classification problems, machine learning techniques first obtain information from a set of
already classified samples (training set) and then use this information to classify unknown samples (test set).
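The training-set and test-set workflow can be shown with a deliberately simple classifier. The sketch below is a nearest-centroid classifier, not an SVM; it is used here only because it fits in a few lines. The feature vectors and class labels are made-up values for illustration.

```python
def train_centroids(samples, labels):
    """Learn one mean vector (centroid) per class from the training set."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Training set: already classified samples (hypothetical two-feature data).
train_x = [(1.0, 1.2), (0.8, 1.0), (3.0, 3.1), (3.2, 2.9)]
train_y = ["class_A", "class_A", "class_B", "class_B"]
model = train_centroids(train_x, train_y)

# Test set: unknown samples classified with what was learned.
test_x = [(0.9, 1.1), (3.1, 3.0)]
predictions = [classify(model, x) for x in test_x]
print(predictions)
```

An SVM follows the same train-then-predict pattern but learns a maximum-margin decision boundary instead of class centroids.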
1. Protein Localization
One of the main tasks of proteomics is the assignment of functionalities to sequenced proteins. The assignment
of a function for a given protein has proved to be especially difficult where no clear homology to proteins of
known function exists. One field of proteomics that has recently received a lot of attention is protein localization.
Protein expression analysis can indicate whether proteins are expressed, but it is also important to know where
proteins are expressed, and where they go over time. Knowing the sub-cellular location that a protein resides in
may give important insights as to its possible function. Even when the basic function of a protein is known,
knowing its location in the cell may give insights as to which pathway an enzyme is part of. There is an
increasing shift away from general protein expression analysis and toward mapping protein distribution, relative
abundance, tissue specificity, and movement. By tracking these parameters (in healthy versus diseased tissue and
in control versus treated tissue), researchers can gain a greater understanding of these proteins' functions and
determine which are likely to be the best drug targets.
5. Gene recognition
A major problem in molecular biology is to identify genes in uncharacterized DNA sequences. There are two
broad classes of computational approaches to finding genes in nucleotide sequences.
Search by signal: it locates genes by finding particular signals that are associated with gene expression. A signal
is a localized region of DNA that performs a specific function, such as binding an enzyme.
Search by content: it recognizes genes by identifying segments of DNA sequences that possess the general
properties of coding regions.
6. Gene classification
Genome researchers are shifting their focus from structural genomics to functional genomics. Structural
genomics is the initial phase of genome analysis, whose goal is to construct high resolution genetic and physical
Specialized kernels that account for sequence similarity have been developed for classifying
sequences based on homology, e.g. string kernels, mismatch string kernels and bag-of-words (BOW) kernels.
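One of the simplest string kernels is the k-mer spectrum kernel, which measures similarity as the inner product of the k-mer count vectors of two sequences. The sketch below is a minimal illustration under that definition; the function names and the toy DNA strings are assumptions, and mismatch kernels extend the idea by also counting k-mers that differ at a limited number of positions.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq1, seq2, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    # A Counter returns 0 for k-mers that do not occur in seq2.
    return sum(count * c2[kmer] for kmer, count in c1.items())

# "ATGATG" contains ATG twice; "ATGCAT" contains ATG once; no other 3-mer is shared.
print(spectrum_kernel("ATGATG", "ATGCAT", k=3))  # 2 * 1 = 2
```

A kernel value computed this way can be supplied to an SVM in place of the linear or radial basis function kernels used for numeric feature vectors.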
Separation of mixed plant-pathogen EST collections based on codon usage (Friedel et al., 2005)
The efficient characterization of the plant-pathogen interaction plays a key role in plant disease control. The
construction of mixed libraries that contain sequences from both genomes helps in the discovery of host and
pathogen genes expressed at the plant-pathogen interface. Sequence identification requires high-throughput and
reliable classification of genome origin. A dataset of 3974 unigene sequences of various lengths from barley
(H. vulgare) and blumeria (B. graminis) was used as training sequences. The short length of single-pass cDNA
sequences and the lack of relevant data in public databases often cause difficulties. To overcome these
difficulties, a novel method was introduced that takes into account subtle differences in codon usage between
plant and fungal genes. For this, SVM was used to identify the probable origin of sequences. A support vector
model is calculated to distinguish between correct and wrong frames. SVMs were compared to several other
machine learning techniques and to a probabilistic algorithm (PF-IND) for Expressed Sequence Tag (EST)
classification also based on codon bias differences. The proposed Eclat software, which consists of a web
frontend and several Java packages and is used to calculate the support vector models, achieved a classification
accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis. It was
found that the Eclat software can be used to efficiently classify EST sequences containing at least 50 nt of coding
sequence. Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient
classified training sequences. The methodology has also been tested on the EST sequences obtained from cotton
(Gossypium arboreum) and cotton root knot nematode (Meloidogyne incognita). The prediction accuracy of a
SVMstruct
SVMstruct, by Joachims, is an SVM implementation that can model complex (multivariate) output data y, such
as trees, sequences, or sets. These complex output SVM models can be applied to natural language parsing,
sequence alignment in protein homology detection, and Markov models for part-of-speech tagging. Several
implementations exist: SVMmulticlass, for multiclass classification; SVMcfg, which learns a weighted context
free grammar from examples; SVMalign, which learns to align protein sequences from training alignments; and
SVMhmm, which learns a Markov model from examples. These modules have straightforward applications in
bioinformatics, but one can imagine significant implementations for cheminformatics, especially when the
chemical structure is represented as trees or sequences.
Availability: http://svmlight.joachims.org/svm_struct.html
mySVM
mySVM, by Rüping, is a C++ implementation of SVM classification and regression. It is available as C++
source code and Windows binaries. Kernels available include linear, polynomial, radial basis function, neural
(tanh), and anova. All SVM models presented in this chapter were computed with mySVM.
Availability: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/index.html
mySVM/db
mySVM/db is an efficient extension of mySVM, which is designed to run directly inside a relational database
using an internal JAVA engine. It was tested with an Oracle database, but with small modifications, it should also
run on any database offering a JDBC interface. It is especially useful for large datasets available as relational
databases.
LIBSVM
LIBSVM (Library for Support Vector Machines) was developed by Chang and Lin and contains C-classification,
ν-classification, ε-regression, and ν-regression. Developed in C++ and Java, it also supports multiclass
classification, weighted SVMs for unbalanced data, cross-validation, and automatic model selection. It has
interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels available include linear,
polynomial, radial basis function, and neural (tanh).
Availability: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMTorch
SVMTorch, by Collobert and Bengio, is part of the Torch machine learning library (http://www.torch.ch/)
and implements SVM classification and regression. It is distributed as C++ source code or binaries for Linux and
Solaris.
Availability: http://bengio.abracadoudou.com/SVMTorch.html
Weka
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either
be applied directly to a dataset or called from Java
code. It contains an SVM implementation.
Availability: http://www.cs.waikato.ac.nz/ml/weka/
BioWeka
BioWeka is an extension library to the data mining framework Weka for knowledge discovery and data analysis
tasks in biology, biochemistry and bioinformatics. Includes integration of the Weka LibSVM project.
Availability: http://sourceforge.net/projects/bioweka/
Gist
Gist is a C implementation of support vector machine classification and kernel principal components analysis.
The SVM part of Gist is available as an interactive Web server at http://svm.sdsc.edu. It is a very convenient
server for users who want to experiment with small datasets (hundreds of patterns). Kernels available include
linear, polynomial, and radial.
Availability: http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi
Exercises
1. What are the problems in chemo- and bioinformatics that require machine learning tools? Give
illustrative examples.
2. What is the principle behind linear SVM?
3. What is the principle behind nonlinear SVM?
4. What are kernel functions? Provide examples of important kernel functions.
5. Give examples of multi-class classification problems in chemo- and bioinformatics.
6. Explain how domain information can be employed to choose sequence and structural features.
7. What is the principle of Ant Colony Optimization?
8. What is feature selection, and how is it relevant in bioinformatics?
9. Describe the importance of SVM in Agri-Bioinformatics.
10. Give examples from your own domain where SVM will be very useful.
Select References
Baneyx, F. (1999) Recombinant protein expression in Escherichia coli. Curr. Opin. Biotechnol., 10, pp.
411–421.
Bertone, P. et al. (2001) SPINE: an integrated tracking database and data mining approach for identifying
feasible targets in high-throughput structural proteomics. Nucleic Acids Res., 29, pp. 2884–2898.
Burden, F.R.; Ford, M.G.; Whitley, D.C.; Winkler, D.A. (2000), J. Chem. Inf. Comput. Sci., 40, pp.
1423-1430.
Davis, G.D., Elisee, C., Newham, D.M. and Harrison, R.G. (1999) New Fusion Protein Systems Designed
to Give Soluble Expression in Escherichia coli. Biotechnol. Bioeng., 65, pp. 382-388.
Duan, K.; Keerthi, S.; Poo, A.N. (2002). Evaluation of simple performance measures for tuning SVM
hyperparameters, Neurocomputing, 51, pp. 41-59.
Goh,C.S. et al. (2004) Mining the structural genomics pipeline: identification of protein properties that
affect high-throughput experimental analysis. J. Mol. Biol., 336, pp. 115–130
Golub TR, Slonim DK, Tamayo P, Gaasenbeek CHM, Mesirov JP, Coller H, Loh ML, Downing JR,
Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and
class prediction by gene expression monitoring. Science, 286, pp. 531–537
Gunn, S. (1997). Support Vector Machines for Classification and Regression. ISIS Technical Report
Harrison, P.W.; Barlin, G.B.; Davies, L.P.; Ireland, S.J.; Matyus, P.; Wong, M.G. (1996) Syntheses,
Wagener, M.; Sadowski, J.; Gasteiger, J. (1995), Autocorrelation Of Molecular Surface Properties For
Modeling Corticosteroid Binding Globulin And Cytosolic Ah Receptor Activity By Neural Networks, J.
Am. Chem. Soc., 117, pp. 7769–7775.
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr., Marks JR,
Cloud computing for Bioinformatics: With the advent of ultra-high-throughput sequencing, genotyping
and other functional genomics in every laboratory, there is a need for the data to be shared and accessed
by many users, perhaps in real time. Cloud computing addresses this need, as several
terabytes of data can be accessed and shared together.
BioSLAX is a Live USB comprising more than 30 bioinformatics tools and application suites.
Released by the Bioinformatics Resource Unit of the Life Sciences Institute (LSI), National University of
Singapore (NUS), it is bootable from any PC that allows a CD/DVD or USB boot option and runs the
compressed Slackware flavour of the Linux Operating System (OS). More at www.bioslax.com