QSAR and Drug Design
Abhik Seal
OSDD Cheminformatics
Aim of Cheminformatics Project
To screen molecules interacting with the
potential TB targets using classifiers.
Dock the selected molecules
with the targets to further screen the
molecules for leads.
Use cheminformatics techniques such as
QSAR, 3D-QSAR and ADMET to look for
potential leads, and design drugs from the
leads by building combinatorial libraries.
Tuberculosis
Obstacles For Drug Design
The HIV epidemic, which has dramatically increased the risk of developing
active TB.
The increasing emergence of multi-drug resistant TB (MDR-TB).
The emergence of extensively drug-resistant (XDR) TB strains.
XDR-TB is characterized by resistance to at least the two first-line
drugs rifampicin and isoniazid, and additionally to a fluoroquinolone
and an injectable drug such as kanamycin.
Existing TB drugs are only able to target actively growing
bacteria, through the inhibition of cell processes such as cell wall
biogenesis and DNA replication.
TB chemotherapy is characterized by an efficient bactericidal activity
but an extremely weak sterilizing activity, i.e. an inability to kill slowly
growing and slowly metabolizing strains.
Drugs Currently in Development
QSAR
Bioavailability
It has also been effectively used to characterize drug-likeness during virtual
screening and combinatorial library design.
\[ P = \frac{[X]_{\text{octanol}}}{[X]_{\text{aqueous}}} \]
P is a measure of the relative affinity of a molecule for the lipid and aqueous phases in
the absence of ionisation.
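As a minimal sketch of the ratio above, the snippet below computes P from two measured concentrations. The function names and the example concentrations are hypothetical, and reporting the value as log10(P) is a common convention assumed here rather than something stated on the slide.

```python
import math


def partition_coefficient(conc_octanol: float, conc_aqueous: float) -> float:
    """Return P = [X]octanol / [X]aqueous for the un-ionised species."""
    if conc_octanol <= 0 or conc_aqueous <= 0:
        raise ValueError("concentrations must be positive")
    return conc_octanol / conc_aqueous


def log_p(conc_octanol: float, conc_aqueous: float) -> float:
    """Return log10(P), the form in which P is usually tabulated."""
    return math.log10(partition_coefficient(conc_octanol, conc_aqueous))


# Hypothetical concentrations, both in the same units (e.g. mol/L).
p_value = partition_coefficient(2.0, 0.02)
log_p_value = log_p(2.0, 0.02)
print(p_value, log_p_value)
```

A positive log P indicates a preference for the lipid (octanol) phase, a negative one a preference for the aqueous phase.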
Pharmacophore Modelling Workflow
[Figure: workflow diagram; surviving labels: compare, validation, application.]
Continued.......
b) QSAR: The goal of QSAR studies is to predict the activity of
new compounds based solely on their chemical structure. The
underlying assumption is that the biological activity can be
attributed to incremental contributions of the molecular fragments
determining the biological activity. This assumption is called the
linear free energy principle. Information about the strength of
interactions is captured for each compound by, for example,
steric, electronic, and hydrophobic descriptors.
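To make the idea of predicting activity from descriptors concrete, here is a minimal sketch that fits a straight line between one descriptor and activity by ordinary least squares. Using a single hydrophobic descriptor is a simplifying assumption, and the descriptor values and activities are hypothetical illustration data, not results from this project.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: one hydrophobic descriptor per compound
# and a measured activity for each compound.
descriptor_values = [0.5, 1.0, 1.5, 2.0, 2.5]
activities = [4.1, 4.9, 6.1, 6.9, 8.0]

slope, intercept = fit_line(descriptor_values, activities)

# Predict the activity of a new, untested compound from its descriptor.
predicted_activity = slope * 1.8 + intercept
print(slope, intercept, predicted_activity)
```

A real QSAR model would normally combine several steric, electronic and hydrophobic descriptors and be validated on compounds held out from the fit.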
Molecular similarity and searching
What is it?
Two compounds are similar when their chemical, pharmacological or
biological properties match.
The more features they have in common, the higher the similarity between the two
molecules.
Chemical
The two structures on top are chemically similar to each other. This is reflected in their
common sub-graph, or scaffold: they share 14 atoms
Pharmacophore
The two structures above are less similar chemically (topologically) yet have the same
pharmacological activity, namely they both are Angiotensin-Converting Enzyme (ACE)
inhibitors
Molecular similarity
How to calculate it?
Quantitative assessment of similarity/dissimilarity of structures
need a numerically tractable form
molecular descriptors, fingerprints, structural keys
Tanimoto coefficient:
\[ T(x, y) = \frac{B(x \,\&\, y)}{B(x) + B(y) - B(x \,\&\, y)} \]
Euclidean distance:
\[ E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
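The two measures can be sketched for fingerprints stored as strings of 0s and 1s. This assumes that B(x) counts the bits set to 1 in x and that B(x & y) counts the bits set in both; the function names and the four-bit example fingerprints are illustrative only.

```python
import math


def tanimoto(x: str, y: str) -> float:
    """T(x, y) = B(x & y) / (B(x) + B(y) - B(x & y)) for bit strings."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    common = sum(1 for a, b in zip(x, y) if a == b == "1")
    union = x.count("1") + y.count("1") - common
    # Two empty fingerprints have no defined similarity; 0.0 is a convention.
    return common / union if union else 0.0


def euclidean(x: str, y: str) -> float:
    """E(x, y) = sqrt(sum_i (x_i - y_i)^2) over the fingerprint positions."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((int(a) - int(b)) ** 2 for a, b in zip(x, y)))


print(tanimoto("1100", "1010"))   # one shared bit out of three set overall
print(euclidean("1100", "1010"))  # two positions differ
```

Tanimoto is a similarity (1 means identical bit patterns), whereas the Euclidean value is a distance (0 means identical).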
Molecular descriptors
a) chemical fingerprint
Construction
0100010100010100010000000001101010011010100000010100000000100000
0100010100010100010000000001101010011010100000000100000000100000
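One common way to construct such a bit string is to hash each structural fragment of a molecule to a few bit positions. The sketch below assumes the fragments have already been enumerated and are given as text labels; the fragment labels, the use of MD5 and the parameter names are all assumptions made for illustration, not the scheme used to produce the fingerprints shown above.

```python
import hashlib


def hashed_fingerprint(fragments, n_bits=64, bits_per_fragment=2):
    """Set bits_per_fragment hashed positions for every fragment label."""
    bits = [0] * n_bits
    for fragment in fragments:
        for seed in range(bits_per_fragment):
            # A stable hash is used so the same fragment always maps
            # to the same positions across runs.
            digest = hashlib.md5(f"{seed}:{fragment}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:4], "big") % n_bits
            bits[position] = 1
    return "".join(str(bit) for bit in bits)


# Hypothetical fragment lists for two closely related molecules.
fragments_a = ["C-C", "C-N", "C=O", "C-C-N", "C-C=O"]
fragments_b = ["C-C", "C-N", "C=O", "C-C-N"]

fp_a = hashed_fingerprint(fragments_a)
fp_b = hashed_fingerprint(fragments_b)
print(fp_a)
print(fp_b)
```

Because molecule B's fragments are a subset of molecule A's, every bit set in `fp_b` is also set in `fp_a`, which is the property that makes such fingerprints useful for substructure pre-screening.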
Molecular descriptors
Example 2: pharmacophore fingerprint
Construction
[Figure: pharmacophore fingerprints of two molecules shown as histograms; y-axis ticks 0 to 12; x-axis: pairs of feature types A, D and H (AA, DA, DD, HA, HD, HH), each at distances 1 to 6.]
Virtual screening using fingerprints
Individual query structure
query fingerprint: 0101010100010100010100100000000000010010000010010100100100010000
[Figure: the query fingerprint is compared by proximity with the fingerprint of each molecule in the screened set; the closest molecules are returned as hits.]
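Fingerprint-based screening of this kind can be sketched as: score every molecule in the set against the query, keep those above a cutoff, and rank them. The Tanimoto coefficient is assumed as the proximity measure, and the molecule names, eight-bit fingerprints and the 0.7 threshold are hypothetical.

```python
def tanimoto(x: str, y: str) -> float:
    """Tanimoto similarity between two equal-length bit strings."""
    common = sum(1 for a, b in zip(x, y) if a == b == "1")
    union = x.count("1") + y.count("1") - common
    return common / union if union else 0.0


def screen(query: str, database: dict, threshold: float = 0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical screening set with short fingerprints for readability.
database = {
    "mol_1": "11001100",
    "mol_2": "11001000",
    "mol_3": "00110011",
    "mol_4": "11101100",
}
query = "11001100"

hits = screen(query, database, threshold=0.7)
for name, score in hits:
    print(name, round(score, 2))
```

Raising the threshold returns fewer, closer analogues; lowering it trades precision for a broader set of candidate hits.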
Here act(i) is the number of active molecules that contain the i-th fragment, and
inact(i) is the number of inactive molecules that contain the i-th fragment.
Discriminant algorithms
The aim of discriminant analysis is to separate the
molecules into their constituent classes.
The simplest case is a linear discriminant with two
activity classes and two descriptors, which aims to find a straight
line that separates the data such that the maximum number of
compounds is correctly classified.
With more than two variables, the line becomes a hyperplane.
The idea is to express a class as a linear combination of
attributes.
X = w0 + w1a1 + w2a2 + w3a3 + ...
X = class; a1, a2, ... = attributes; w0, w1, w2, ... = weights
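A minimal sketch of the linear combination above: the discriminant X is evaluated for a compound and the sign of X decides the class. To find the weights, the sketch uses the simple perceptron update rule, which is swapped in here because it is short; it is not Fisher's linear discriminant analysis. The two-descriptor data set is hypothetical.

```python
def discriminant(weights, attributes):
    """X = w0 + w1*a1 + w2*a2 + ... for one compound."""
    w0, *ws = weights
    return w0 + sum(w * a for w, a in zip(ws, attributes))


def classify(weights, attributes):
    """Class 1 if the compound lies on the positive side of the line."""
    return 1 if discriminant(weights, attributes) > 0 else 0


def train_perceptron(samples, labels, epochs=100, rate=0.1):
    """Adjust the weights until every training compound is classified."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        errors = 0
        for attributes, label in zip(samples, labels):
            delta = label - classify(weights, attributes)
            if delta:
                errors += 1
                weights[0] += rate * delta
                for i, a in enumerate(attributes):
                    weights[i + 1] += rate * delta * a
        if errors == 0:
            break
    return weights


# Hypothetical compounds described by two descriptors (a1, a2);
# label 0 = inactive, label 1 = active.
samples = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.0), (3.0, 3.2), (3.5, 2.9), (2.8, 3.1)]
labels = [0, 0, 0, 1, 1, 1]

weights = train_perceptron(samples, labels)
predictions = [classify(weights, s) for s in samples]
print(weights, predictions)
```

With two descriptors the learned weights define a straight line; with more descriptors the same code defines a hyperplane.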
Neural Networks(NN)
The two neural network architectures most commonly used
in chemistry are feed-forward networks and Kohonen
networks.
The feed-forward NN is a supervised learning method, as it uses
the values of the dependent variables to derive the model. The
Kohonen network, or self-organizing map (SOM), is an unsupervised
method.
The feed-forward NN contains layers of nodes, with connections
between all pairs of nodes in adjacent layers. A key feature is
the presence of hidden nodes, which together with the back-propagation
algorithm makes the network applicable to many fields.
The neural network must first be trained with a set of inputs. Once
it has been trained, it can be used to predict values for new
and unseen molecules.
Neural Networks Continued...
The figure below shows a feed-forward network with 3 hidden nodes
and one output.
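The forward pass of such a network, with 3 hidden nodes and one output, can be sketched as follows. The sigmoid activation, the two input descriptors and all weight values are assumptions for illustration; in practice the weights would be learned by back-propagation rather than set by hand.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate one input vector through a single hidden layer."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights: 2 input descriptors -> 3 hidden nodes -> 1 output.
hidden_weights = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
hidden_biases = [0.1, -0.2, 0.05]
output_weights = [1.2, -0.7, 0.4]
output_bias = -0.1

prediction = forward([0.9, 0.3], hidden_weights, hidden_biases,
                     output_weights, output_bias)
print(prediction)
```

Every hidden node is connected to every input and the output node to every hidden node, matching the fully connected adjacent layers described above.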
where K is the kernel function, the subscript k indexes
the support vectors, and m is the number of support
vectors.
Gaussian and polynomial kernel functions are commonly used.
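The two kernels named above can be sketched as plain functions. The decision function shown with them assumes the standard SVM form, a sum over the m support vectors of a coefficient times K(x_k, x) plus a bias, since the slide's own formula did not survive; the parameter names (`gamma`, `degree`, `coef0`), support vectors and coefficients are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def decision_value(x, support_vectors, coefficients, bias, kernel):
    """Assumed standard form: sum_k c_k * K(x_k, x) + b over m support vectors."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias


# Hypothetical trained model with m = 2 support vectors; each coefficient
# already includes the class sign of its support vector.
support_vectors = [(1.0, 1.0), (3.0, 3.0)]
coefficients = [-1.0, 1.0]
bias = 0.0

score = decision_value((2.8, 3.1), support_vectors, coefficients, bias,
                       gaussian_kernel)
print("active" if score > 0 else "inactive", score)
```

Swapping `gaussian_kernel` for `polynomial_kernel` changes the shape of the decision boundary without touching the rest of the code, which is why the choice of kernel matters so much in practice.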
Strengths and Weaknesses of SVM
Strengths
Training is relatively easy
No local optima
It scales relatively well to high dimensional data
Tradeoff between classifier complexity and error can be controlled
explicitly
Non-traditional data like strings and trees can be used as input to
SVM, instead of feature vectors
Weaknesses
Need to choose a "good" kernel function.
Measuring Classifier Performance
N= total number of instances in the dataset
TPj= Number of True Positives for class j
FPj = Number of False positives for class j
TNj= Number of True Negatives for class j
FNj= Number of False Negatives for class j
Accuracy = (TPj + TNj) / N
Sensitivity/recall = TPj / (TPj + FNj)
Specificity = TNj / (TNj + FPj)
Precision = TPj / (TPj + FPj)
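These measures can be computed directly from the four counts for one class. The sketch below uses the standard definitions, which is an assumption because the slide's formulas did not survive: accuracy = (TP + TN) / N, sensitivity (recall) = TP / (TP + FN), specificity = TN / (TN + FP) and precision = TP / (TP + FP). Note that specificity and precision are distinct quantities. The counts in the example are hypothetical.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classifier_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Per-class performance measures from confusion-matrix counts."""
    n = tp + fp + tn + fn  # total number of instances
    return {
        "accuracy": _ratio(tp + tn, n),
        "sensitivity": _ratio(tp, tp + fn),   # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


# Hypothetical counts for the "active" class on a 100-compound test set.
metrics = classifier_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced screening sets, where inactives far outnumber actives, accuracy alone can look high while sensitivity is poor, so the measures are best read together.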
Types of Data Mining Learning
Process in Weka
Classification learning: the learning scheme is presented
with a set of classified examples from which it is expected to learn
a way of classifying unseen examples.
Association learning: any association among features
is sought, not just ones that predict a particular class value.
Clustering: groups of examples that belong together are
sought.
Numeric prediction: the outcome to be predicted
is not a discrete class but a numeric quantity.
Classifier Algorithms in WEKA
a) Bayes classifiers
AODE, BAYES NET, NAÏVE BAYES, NAÏVE BAYES MULTINOMIAL, NAÏVE BAYES UPDATEABLE
b) Trees
ADTREE, ID3, J48, LMT, NBTREE, RANDOM FOREST, RANDOM TREE, REP TREE
c) Functions
LINEAR REGRESSION, LOGISTIC, MULTILAYER PERCEPTRON, RBF NETWORK, SIMPLE LINEAR REGRESSION, SIMPLE LOGISTIC, SMO, SMO REG
d) Rules
CONJUNCTIVE RULE, DECISION TABLE, JRIP, M5RULES, NNGE, ONE R, PRISM, ZERO R
Summary
Machine learning is mainly applied to ligand-based drug
screening, where it is used to calculate the
optimal distance between the feature vectors of active
and inactive compounds.
A kernel is essentially a similarity function with certain
mathematical properties, and it is possible to define
kernel functions over all sorts of structures, for
example sets, strings, trees, and probability
distributions.
Interest in neural networks appears to have declined
since the arrival of support vector machines, perhaps
because the latter generally require fewer parameters
to be tuned to achieve the same (or greater) accuracy.
THANK YOU