Molecular Docking Methods
The molecular docking approach can be used to model the interaction between a small molecule and a
protein at the atomic level, which allows us to characterize the behaviour of small molecules in the binding
site of target proteins as well as to elucidate fundamental biochemical processes.
The docking process involves two basic steps: prediction of the ligand conformation together with its position
and orientation within the binding site (usually referred to as the pose), and assessment of the binding
affinity. These two steps are related to sampling methods and scoring schemes, respectively.
Knowing the location of the binding site before docking significantly increases docking
efficiency.
Theory of docking:
Essentially, the aim of molecular docking is to predict the structure of the ligand-receptor complex
using computational methods. Docking can be achieved through two interrelated steps: first, by sampling
conformations of the ligand in the active site of the protein; second, by ranking these conformations via a
scoring function. Ideally, sampling algorithms should be able to reproduce the experimental binding mode
and the scoring function should rank it highest among all generated conformations.
SAMPLING ALGORITHMS
With six degrees of translational and rotational freedom as well as the conformational degrees of freedom of
both the ligand and protein, there are a huge number of possible binding modes between two molecules.
Unfortunately, it would be too expensive to computationally generate all the possible conformations.
Various sampling algorithms have been developed and widely used in molecular docking software (Table
1).
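To give a sense of scale, the following minimal Python sketch counts the poses in a brute-force grid search.
The 30 degree torsion step and the numbers of translational and rotational grid points are arbitrary
assumptions chosen only for illustration; they are not taken from any docking program.

```python
def count_poses(n_rotatable_bonds, torsion_step_deg=30.0,
                n_translations=1000, n_orientations=1000):
    """Count the poses visited by an exhaustive grid search (illustrative numbers only)."""
    # Number of discrete states per rotatable bond (12 for a 30 degree step).
    torsion_states = int(360 / torsion_step_deg)
    # Every rotatable bond multiplies the number of ligand conformers.
    conformers = torsion_states ** n_rotatable_bonds
    # Each conformer is tried at every translation and every orientation.
    return n_translations * n_orientations * conformers


if __name__ == "__main__":
    for n_bonds in (0, 5, 10):
        print(n_bonds, count_poses(n_bonds))
```

Even with a rigid receptor, the count grows exponentially with the number of rotatable bonds, which is why
the heuristic sampling algorithms described below are needed.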
Matching algorithms (MA), which are based on molecular shape, map a ligand into the active site of a protein in
terms of shape features and chemical information. The protein and the ligand are represented as
pharmacophores. Each distance of the pharmacophore within the protein and ligand is calculated for a
match; new ligand conformations are governed by the distance matrix between the pharmacophore and the
corresponding ligand atoms. Chemical properties, like hydrogen-bond donors and acceptors, can be taken
into account during the match. Matching algorithms have the advantage of speed; thus they may be used for
the enrichment of active compounds from large libraries. Matching algorithms for ligand docking are
available in DOCK, FLOG, LibDock and SANDOCK programs.
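The distance-matching idea can be sketched as follows. This is a simplified brute-force illustration, not
the clique-detection procedure used by programs such as DOCK: it searches for triplets of site points and
ligand points whose pairwise distances agree within a tolerance. The function name, the tolerance and the
omission of chemical typing are assumptions made for brevity.

```python
from itertools import combinations, permutations
from math import dist


def match_triplets(site_points, ligand_points, tol=0.5):
    """Return (site_triplet, ligand_triplet) index pairs whose three
    pairwise distances agree within `tol` (same units as the coordinates)."""
    pairs = ((0, 1), (0, 2), (1, 2))
    matches = []
    for site in combinations(range(len(site_points)), 3):
        # Ligand triplets are ordered, so each assignment of ligand points
        # to site points is tested separately.
        for lig in permutations(range(len(ligand_points)), 3):
            if all(
                abs(dist(site_points[site[i]], site_points[site[j]])
                    - dist(ligand_points[lig[i]], ligand_points[lig[j]])) <= tol
                for i, j in pairs
            ):
                matches.append((site, lig))
    return matches
```

Each match defines a correspondence from which a ligand placement could be derived; chemical properties such
as donor or acceptor labels could be added as an extra compatibility test inside the loop.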
Incremental construction (IC) methods put the ligand into an active site in a fragmental and incremental
fashion. The ligand is divided into several fragments by breaking its rotatable bonds and then one of these
fragments is selected to dock into the active site first. This anchor is usually the largest fragment or the piece
that has a significant functional role or interaction with the protein. The remaining fragments are then
added incrementally. Different orientations are generated to fit in the active site, which accounts for the
flexibility of the ligand. The incremental construction method has been used in DOCK 4.0, FlexX,
Hammerhead, SLIDE and eHiTS.
Multiple Copy Simultaneous Search (MCSS) and LUDI are fragment-based methods for the de novo
design of ligands and modifications of known ligands that may enhance their binding to the target protein.
MCSS makes 1,000 to 5,000 copies of a functional group, which are randomly placed in the binding site of
interest and subjected to simultaneous energy minimization and/or quenched molecular dynamics in the
force field of the protein. Copies interact only with the protein, and any interactions among the copies are
omitted. Consequently a set of energetically favorable binding sites and orientations for the functional group
is identified based on the interaction energies. The binding site is mapped by using different functional
groups. New molecules which perfectly match the binding site can be designed through the linkage of those
different functional groups.
LUDI focuses on the hydrogen bonds and hydrophobic contacts which could be formed between the ligand
and protein. Its central concept is interaction sites, which are discrete positions in space suitable for forming
hydrogen bonds or for filling a hydrophobic pocket. A set of interaction sites is generated either by
searching the database or using the rules. The fragment is then fitted onto the interaction sites and evaluated
by distance criteria.
The final step is the connection of some or all of the fitted fragments to a single molecule.
Stochastic methods search the conformational space by randomly modifying a ligand conformation or a
population of ligands.
Monte Carlo (MC) and genetic algorithms (GA) are two typical algorithms that belong to the class of
stochastic methods. Monte Carlo (MC) methods generate poses of the ligand through bond rotation,
rigid-body translation or rotation. The conformation obtained by this transformation is tested with an
energy-based selection criterion. If it passes the criterion, it will be saved and further modified to generate
the next conformation. The iterations proceed until the predefined number of conformations is collected. The
main advantage of MC is that the change can be quite large, allowing the ligand to cross the energy barriers
on the potential energy surface, something that is not easily achieved by molecular dynamics based simulation
methods. Examples of applying the Monte Carlo methods include an earlier version of AutoDock, ICM,
QXP and Affinity.
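A minimal sketch of such a search is given below. It assumes the Metropolis rule as the energy-based
selection criterion (the text does not name a specific criterion) and replaces the ligand pose by a single
coordinate x on a hypothetical one-dimensional double-well energy surface, so that barrier crossing is easy
to see. All names and parameter values are illustrative.

```python
import math
import random


def toy_energy(x):
    """Hypothetical 1D double-well surface: minima near x = -1 (lower) and x = +1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis_search(n_steps=5000, max_move=1.0, kT=0.3, seed=0):
    """Random-walk search; returns the lowest-energy x found and its energy."""
    rng = random.Random(seed)
    x = 1.0                      # start in the higher-energy well
    e = toy_energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        # Propose a random, possibly large, change of the current state.
        x_new = x + rng.uniform(-max_move, max_move)
        e_new = toy_energy(x_new)
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-dE / kT).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

Because uphill moves are occasionally accepted and the step size is large, the walk can leave the starting
well and reach the deeper one on the other side of the barrier.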
Genetic algorithms (GA) form another class of well-known stochastic methods. The idea of the GA stems
from Darwin’s theory of evolution. Degrees of freedom of the ligand are encoded as binary strings called
genes. These genes make up the ‘chromosome’ which actually represents the pose of the ligand. Mutation
and crossover are two kinds of genetic operators in GA. Mutation makes random changes to the genes;
crossover exchanges genes between two chromosomes. When the genetic operators affect the genes, the
result is a new ligand structure. New structures are assessed by the scoring function, and the ones that
survive (i.e., exceed a threshold) can be used for the next generation. Genetic algorithms have been used
in AutoDock, GOLD, DIVALI and DARWIN.
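The encoding and the two genetic operators can be illustrated with the toy sketch below. It assumes two
torsion angles, each encoded as an 8-bit gene, a hypothetical score that is lowest at a chosen target
geometry, and rank-based survival of the better half of the population instead of a fixed threshold. None of
these choices is taken from AutoDock, GOLD or the other programs named above.

```python
import random

BITS = 8  # bits per gene (assumption)


def decode(chromosome):
    """Decode each 8-bit gene into a torsion angle in [0, 360) degrees."""
    angles = []
    for i in range(0, len(chromosome), BITS):
        value = int("".join(map(str, chromosome[i:i + BITS])), 2)
        angles.append(360.0 * value / 2 ** BITS)
    return angles


def toy_score(angles, target=(60.0, 180.0)):
    """Hypothetical score: lower is better, zero when the angles hit the target."""
    total = 0.0
    for angle, goal in zip(angles, target):
        d = abs(angle - goal) % 360.0
        total += min(d, 360.0 - d) ** 2   # shortest angular difference
    return total


def mutate(chromosome, rng, rate=0.02):
    """Flip each bit with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(pop_size=40, generations=60, seed=1):
    """Return the best chromosome found after the given number of generations."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(2 * BITS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: toy_score(decode(c)))
        survivors = pop[: pop_size // 2]   # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return min(pop, key=lambda c: toy_score(decode(c)))
```

Since the survivors are carried over unchanged, the best score in the population can never get worse from
one generation to the next.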
Molecular dynamics (MD) is widely used as a powerful simulation method in many fields of molecular
modeling. In the context of docking, by moving each atom separately in the field of the remaining atoms, MD
simulation represents the flexibility of both the ligand and protein more effectively than other algorithms.
However, the disadvantage of MD simulations is that they progress in very small steps and thus have
difficulties in stepping over high energy conformational barriers, which may lead to inadequate sampling.
On the other hand, MD simulations are often efficient at local optimization. Thus a common strategy is to use
a random search to identify candidate conformations of the ligand, followed by MD simulations for finer
refinement.
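The small-step character of MD can be seen in the following sketch of the velocity Verlet integrator, a
widely used MD integration scheme, applied to a one-dimensional harmonic oscillator. The oscillator, the unit
mass and the time step are assumptions chosen for illustration; a real docking refinement would integrate
all atoms under a molecular force field.

```python
def harmonic_force(x, k=1.0):
    """Force of a hypothetical 1D harmonic well, F = -k x."""
    return -k * x


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, n_steps=1000):
    """Propagate position x and velocity v for n_steps steps of length dt."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v
```

Each step moves the system only slightly, which keeps the total energy nearly constant but also means that
many steps are needed before a high barrier can be crossed.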
SCORING FUNCTIONS
The purpose of the scoring function is to distinguish correct poses from incorrect poses, or binders from
inactive compounds, in a reasonable computation time. However, scoring functions estimate, rather than
rigorously calculate, the binding affinity between the protein and ligand, and they adopt various
assumptions and simplifications to do so.
Scoring functions can be divided into force-field-based, empirical and knowledge-based scoring functions.
Classical force-field-based scoring functions assess the binding energy by calculating the sum of the non-
bonded (electrostatics and van der Waals) interactions. The electrostatic terms are calculated by a
Coulombic formulation. Since such point charge calculations have problems in modeling the protein’s real
environment, a distance-dependent dielectric function is generally used to modulate the contribution of
charge–charge interactions. The van der Waals terms are described by a Lennard-Jones potential function.
Adopting different parameter sets for the Lennard-Jones potential can vary the “hardness” of the potential,
which controls how close a contact between protein and ligand atoms is considered acceptable. Force-field-based
scoring functions also have the problem of slow computational speed. Thus a cut-off distance is used to
handle the non-bonded interactions, which in turn decreases the accuracy of long-range effects
involved in binding. Extensions of force-field-based scoring functions consider hydrogen bonds,
solvation and entropy contributions. Software programs, such as DOCK, GOLD and AutoDock, offer users
such functions. They have some differences in the treatment of hydrogen bonds, the form of the energy
function etc. Furthermore, the results of docking with force-field-based functions can be further refined with
other techniques, such as linear interaction energy and free-energy perturbation methods (FEP) to improve
the accuracy in predicting binding energies.
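The classical non-bonded terms described above can be sketched as follows. The sketch assumes a dielectric
of the form eps(r) = 4r, a 12-6 Lennard-Jones potential with a single well depth and contact distance for
all atom pairs, and an 8 Å cut-off. These parameter values and the data layout are illustrative assumptions,
not the settings of DOCK, GOLD or AutoDock.

```python
import math

COULOMB_CONST = 332.06  # kcal*Angstrom/(mol*e^2), converts q*q/r to kcal/mol


def pair_energy(r, q_i, q_j, well_depth, r_min, dielectric_scale=4.0):
    """Non-bonded energy (kcal/mol) of one atom pair at distance r (Angstrom)."""
    # Coulomb term with a distance-dependent dielectric eps(r) = dielectric_scale * r.
    coulomb = COULOMB_CONST * q_i * q_j / (dielectric_scale * r * r)
    # 12-6 Lennard-Jones term with its minimum of -well_depth at r = r_min.
    ratio6 = (r_min / r) ** 6
    lennard_jones = well_depth * (ratio6 * ratio6 - 2.0 * ratio6)
    return coulomb + lennard_jones


def force_field_score(ligand, protein, cutoff=8.0, well_depth=0.15, r_min=3.8):
    """Sum pair energies over ligand-protein atom pairs within the cut-off.

    Each atom is given as ((x, y, z), partial_charge).
    """
    total = 0.0
    for xyz_l, q_l in ligand:
        for xyz_p, q_p in protein:
            r = math.dist(xyz_l, xyz_p)
            if r <= cutoff:            # pairs beyond the cut-off are ignored
                total += pair_energy(r, q_l, q_p, well_depth, r_min)
    return total
```

The cut-off test makes the trade-off explicit: distant pairs cost nothing to evaluate, but their long-range
electrostatic contribution is lost.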
In empirical scoring functions, the binding energy is decomposed into several energy components, such as
hydrogen bond, ionic interaction, hydrophobic effect and binding entropy. Each component is multiplied by
a coefficient and then summed up to give a final score. Coefficients are obtained from regression analysis
fitted to a training set of ligand-protein complexes with known binding affinities.
Empirical scoring functions have relatively simple energy terms to evaluate. However, it is unclear as to
how well they are suited for ligand-protein complexes beyond the training set. Additionally, each term in
empirical scoring functions may be treated in a different manner by different software, and the numbers of
the terms included are also different. LUDI, PLP and ChemScore are examples of empirical scoring
functions.
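As a minimal illustration of how such coefficients can be obtained, the sketch below fits a two-term score
by ordinary least squares and then applies it to a new complex. The two descriptors (a hydrogen-bond count
and a lipophilic contact area), the absence of an intercept and the function names are assumptions; real
empirical functions use more terms and larger training sets.

```python
def fit_two_term_weights(features, affinities):
    """Least-squares weights (w1, w2) for score = w1*f1 + w2*f2 (no intercept)."""
    # Entries of the 2x2 normal equations.
    s11 = sum(f[0] * f[0] for f in features)
    s12 = sum(f[0] * f[1] for f in features)
    s22 = sum(f[1] * f[1] for f in features)
    b1 = sum(f[0] * y for f, y in zip(features, affinities))
    b2 = sum(f[1] * y for f, y in zip(features, affinities))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("features are collinear; weights are not unique")
    # Solve the 2x2 system by Cramer's rule.
    return ((b1 * s22 - b2 * s12) / det, (s11 * b2 - s12 * b1) / det)


def empirical_score(feature_values, weights):
    """Weighted sum of the energy components."""
    return sum(w * x for w, x in zip(weights, feature_values))
```

With weights fitted on a training set, scoring a new complex is only a weighted sum, which explains the speed
of empirical functions as well as their dependence on the training data.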
Knowledge-based scoring functions use statistical analysis of crystal structures of ligand-protein complexes to
obtain the interatomic contact frequencies and/or distances between the ligand and protein. They are based
on the assumption that the more favorable an interaction is, the greater its frequency of occurrence will be.
These frequency distributions are further converted into pairwise atom-type potentials. The score is
calculated by favouring preferred contacts and penalizing repulsive interactions between each atom in the
ligand and protein within a given cut-off. The appeal of knowledge-based functions is computational
simplicity, which can be exploited to screen large compound databases. They can also model some
uncommon interactions like sulphur-aromatic or cation-π, which are often poorly handled in empirical
approaches. However, they are still faced with the problem that some interactions are underrepresented in
the limited training sets of crystal structures, as well as with the bias inherent in the selection of proteins
for successful structure determination. Thus the obtained parameters may not be suitable for widespread use,
especially with interactions involving metals or halogens. PMF, DrugScore, SMoG and Bleep are examples
of knowledge-based functions which differ mainly in the size of training sets, the form of the energy
function, the definition of atom types, distance cut-off or other parameters.
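The conversion of contact frequencies into a pairwise potential can be sketched with the inverse Boltzmann
relation, w(r) = -kT ln[p_obs(r) / p_ref(r)]. This relation is a common choice for potentials of mean force
and is assumed here; the individual programs named above differ in their reference states and corrections.
The pseudocount and the bin layout are likewise illustrative assumptions.

```python
import math


def pair_potential(observed_counts, reference_counts, kT=0.593, pseudocount=1.0):
    """Convert per-distance-bin contact counts for one atom-type pair into
    energies (kcal/mol at about 298 K) via the inverse Boltzmann relation."""
    n_bins = len(observed_counts)
    # The pseudocount avoids log(0) for bins that were never observed.
    n_obs = sum(observed_counts) + pseudocount * n_bins
    n_ref = sum(reference_counts) + pseudocount * n_bins
    potential = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / n_obs
        p_ref = (ref + pseudocount) / n_ref
        # Contacts seen more often than in the reference get a negative energy.
        potential.append(-kT * math.log(p_obs / p_ref))
    return potential
```

A pose would then be scored by summing these bin energies over all ligand-protein atom pairs within the
cut-off.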
Consensus scoring is a recent strategy that combines several different scores to assess the docking
conformation. A ligand pose or a potential binder can be accepted when it scores well under a number
of different scoring schemes. Consensus scoring usually substantially improves enrichments (i.e., the
percentage of strong binders among the high-scoring ligands) in virtual screening, and improves the
prediction of bound conformations and poses. However, the prediction of binding energies might still be
inaccurate. Also, the usefulness of consensus scoring diminishes when terms in different scoring functions
are significantly correlated. CScore is an example that combines the DOCK, ChemScore, PMF, GOLD
and FlexX scoring functions.
Typical scoring functions face the problem of affinity prediction partly because of the limited treatment of
solvation effects. One of the ways to address this problem is physics-based scoring, e.g. MM-PB/SA and
MM-GB/SA (MM stands for molecular mechanics, PB and GB for Poisson-Boltzmann and Generalized Born,
respectively, SA for solvent-accessible surface area), which is used in
rescoring or lead optimization to improve the accuracy of binding affinity prediction. Promising results were
obtained using MM-PB/SA or MM-GB/SA in some studies. However, recently Guimarães and Mathiowetz
reported that the GB/SA model poorly estimated protein desolvation on certain systems, while incorporating
WaterMap into the MM-GB/SA method instead of GB/SA protein desolvation gave the best ranking result.
Singh and Warshel compared several methods for evaluating the affinity of protein-ligand complexes and
suggested that PDLD/S-LRA/β (protein dipoles Langevin dipoles linear response approximation) appears to
offer an appealing option for the final stages of large-scale virtual screening (VS); in contrast, PB/SA appears to provide
erroneous estimates of the absolute binding energies because of its incorrect estimation of entropies and the
problematic treatment of electrostatic energies.
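The consensus scoring idea described earlier in this section can be sketched as a simple rank-by-vote
scheme: a compound is kept if it falls in the top fraction of the ranking for at least a minimum number of
scoring functions. The voting rule, the thresholds and the convention that lower scores are better are
assumptions for illustration; this is not the CScore algorithm.

```python
def consensus_hits(score_table, top_fraction=0.4, min_votes=2):
    """Return compounds ranked in the top fraction by at least `min_votes` functions.

    score_table maps scoring-function name -> {compound: score}, lower = better.
    """
    votes = {}
    for scores in score_table.values():
        ranked = sorted(scores, key=scores.get)          # best score first
        n_top = max(1, int(len(ranked) * top_fraction))  # size of the top slice
        for compound in ranked[:n_top]:
            votes[compound] = votes.get(compound, 0) + 1
    return sorted(c for c, v in votes.items() if v >= min_votes)
```

Voting on ranks rather than raw scores sidesteps the different scales of the individual functions, although
it does not help when the functions are strongly correlated.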
DOCKING METHODOLOGIES
When the ligand and receptor are both treated as rigid bodies, the search space is very limited, considering
only three translational and three rotational degrees of freedom. In this case, ligand flexibility could be
addressed by using a pre-computed set of ligand conformations, or by allowing for a degree of atom–atom
overlap between the protein and ligand. The early versions of DOCK, FLOG and some protein-protein
docking programs, such as FTDOCK, adopted such a method that kept the ligand and receptor rigid during
the process of the docking.
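In rigid-body docking a pose is fully described by a rotation and a translation applied to unchanged ligand
coordinates. The sketch below shows this with a rotation about the z axis only; a full treatment would
combine three rotational parameters (for example Euler angles or a quaternion). The function names are
hypothetical.

```python
import math


def rotation_about_z(angle_deg):
    """3x3 rotation matrix for a counter-clockwise rotation about the z axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def rigid_transform(coords, rotation, translation):
    """Rotate, then translate, every atom; internal geometry is unchanged."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + shift
            for row, shift in zip(rotation, translation)
        ))
    return moved
```

Because every atom undergoes the same transformation, all interatomic distances within the ligand are
preserved, which is exactly what restricts the search to six degrees of freedom.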
DOCK is the first automated procedure for docking a molecule into a receptor site and is being continuously
developed. It characterizes the ligand and receptor as sets of spheres which could be overlaid by means of a
clique detection procedure. Geometrical and chemical matching algorithms are used, and the ligand-receptor
complexes can be scored by accounting for steric fit, chemical complementarity or pharmacophore
similarity. In its improved versions, an incremental construction method and an exhaustive search were added
to account for ligand flexibility. The exhaustive search randomly generates a user-defined number of
conformers as a multiple of the number of rotatable bonds in the ligand.
With respect to scoring, the latest version, DOCK 6.4, includes both AMBER-derived force-field
scoring with implicit solvent and GB/SA and PB/SA solvation scoring.
FLOG generates ligand conformations
on the basis of distance geometry and uses a clique-finding algorithm to calculate the sets of distances. Up to
25 explicit conformations of the ligand can be docked to provide some flexibility. FLOG allows users to
define essential points which must be paired with a ligand atom. This approach is useful if an important
interaction is already known before docking. Conformations are scored with a function considering van der
Waals, electrostatics, hydrogen bonding and hydrophobic interactions.
For systems whose behaviour follows the induced fit paradigm, it is of vital importance to consider the
flexibilities of both the ligand and receptor since in that case both the ligand and receptor change their
conformations to form a minimum energy perfect-fit complex. However, the cost is very high when the
receptor is also flexible. Thus the common approach, also a trade-off between accuracy and computational
time, is treating the ligand as flexible while the receptor is kept rigid during docking. Almost all docking
programs, such as AutoDock and FlexX, have adopted this methodology.
AutoDock 3.0 incorporates Monte Carlo simulated annealing, evolutionary, genetic and Lamarckian genetic
algorithm methods to model the ligand flexibility while keeping the receptor rigid. The scoring function is
based on the AMBER force field, including van der Waals, hydrogen bonding, electrostatic interactions,
conformational entropy and desolvation terms. Each term is weighted using an empirical scaling factor
obtained from experimental data. AutoDock 4.0 is able to model receptor flexibility by allowing side-chains
to move. Additionally, protein-protein docking interactions can be evaluated in this version of AutoDock.
AutoDock Vina was recently released as the latest version for molecular docking and virtual screening. By
redocking the 190 receptor-ligand complexes that had been used as a training set for AutoDock 4,
AutoDock Vina showed an improvement in speed of approximately two orders of magnitude
together with significantly better accuracy of binding mode prediction.
FlexX uses an incremental construction algorithm to sample ligand conformations. The base fragment is
first docked into the active site by matching hydrogen bond pairs and metal and aromatic ring interactions
between the ligand and protein. Then the remaining components are incrementally built up in accordance
with a set of predefined rotatable torsion angles to account for ligand flexibility. The FlexX scoring function
is based on Böhm’s work. Its current version includes terms for electrostatic interactions, directional hydrogen
bonds, rotational entropy, and aromatic and lipophilic interactions. The interactions between functional groups
are also taken into account by assigning a type and geometry to each group.
The intrinsic mobility of proteins has been shown to be closely related to ligand binding behaviour, and it
has been reviewed by Teague. Incorporating receptor flexibility is a significant challenge in the field of
docking. Ideally, MD simulations could model all the degrees of freedom in the ligand-receptor
complex. But MD has the problem of inadequate sampling that we mentioned earlier. Another hurdle is its
high computational expense, which prevents this method from being used in the screening of large chemical
databases.
In addition to the historic induced fit theoretical models, conformer selection and conformational induction
have been proposed to illustrate the flexible ligand-protein binding process. According to the definition
given by Teague, conformer selection refers to a process in which a ligand selectively binds to a favorable
conformation from a number of protein conformations; conformational induction describes a process in
which the ligand converts the protein into a conformation that it would not spontaneously adopt in its
unbound state. In some cases, this conformational conversion can be likened to a partial refolding of the
protein.
Various methods are currently available to implement the receptor flexibility (Table 3). The simplest one,
so-called “soft docking”, decreases the van der Waals repulsion energy term in the scoring function to allow
for a degree of atom-atom overlap between the receptor and ligand. For example, the LJ 8-4 potential in
GOLD and the smooth potential in AutoDock 3.0 belong to this class. This method may not include adequate
flexibility. Nevertheless, it has the advantage of computational efficiency, since the receptor coordinates are
kept fixed and only the van der Waals parameters are adjusted.
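The effect of softening the repulsion can be seen by comparing a 12-6 and an 8-4 potential with the same
well depth and the same minimum position. The sketch below uses a generic n-m Lennard-Jones form; the
functional form and the parameter values are assumptions for illustration and are not necessarily the exact
expressions implemented in GOLD or AutoDock.

```python
def lennard_jones(r, well_depth, r_min, repulsive_exp=12, attractive_exp=6):
    """Generic n-m Lennard-Jones energy with minimum -well_depth at r = r_min."""
    n, m = repulsive_exp, attractive_exp
    ratio = r_min / r
    # The prefactors are chosen so that the minimum stays at r_min for any n > m.
    return well_depth * (m / (n - m) * ratio ** n - n / (n - m) * ratio ** m)


if __name__ == "__main__":
    # Compare the penalty for a contact 20 % shorter than the optimum.
    hard = lennard_jones(3.2, 0.2, 4.0)          # 12-6 potential
    soft = lennard_jones(3.2, 0.2, 4.0, 8, 4)    # 8-4 potential
    print(f"12-6: {hard:.3f}  8-4: {soft:.3f}")
```

At short distances the 8-4 form rises far more slowly, so a small overlap between ligand and receptor atoms
is tolerated instead of being rejected outright.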
Utilizing rotamer libraries is another approach to modeling receptor flexibility. Rotamer libraries include a
set of side-chain conformations which are usually determined from statistical analysis of structural
experimental data. The advantage of using rotamers is the relative speed of sampling and the avoidance of
minimization barriers. ICM (Internal Coordinates Mechanics) is a program using rotamer libraries with the
biased probability methodology, coupled with Monte Carlo search of the ligand conformation.
AutoDock 4 adopts a simultaneous sampling method to deal with side chain flexibility. Several side chains of
the receptor can be selected by users and simultaneously sampled with a ligand using the same methods.
Other portions of the receptor are treated rigidly with a grid energy map during sampling. The grid energy map
introduced by Goodford is used to store energy information of the receptor and simplify interaction energy
calculation between ligand and receptor.
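The grid idea can be sketched as follows: the interaction energy of a probe atom with the whole rigid
receptor is computed once at every lattice point, and during docking the energy of a ligand atom is simply
looked up. The probe potential, the nearest-point lookup (real programs typically interpolate between grid
points) and all names are simplifying assumptions. Receptor atoms are assumed not to coincide with a grid
point, and queries must lie inside the grid.

```python
import math


def probe_energy(r, r_min=3.0):
    """Hypothetical 12-6 probe-atom energy with unit well depth."""
    ratio6 = (r_min / r) ** 6
    return ratio6 * ratio6 - 2.0 * ratio6


def build_grid(receptor_atoms, origin, spacing, shape):
    """Precompute the probe energy at every grid point (done once per receptor)."""
    grid = {}
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                point = (origin[0] + i * spacing,
                         origin[1] + j * spacing,
                         origin[2] + k * spacing)
                grid[(i, j, k)] = sum(probe_energy(math.dist(point, atom))
                                      for atom in receptor_atoms)
    return grid


def grid_energy(grid, origin, spacing, xyz):
    """Look up the stored energy at the grid point nearest to xyz."""
    index = tuple(round((c - o) / spacing) for c, o in zip(xyz, origin))
    return grid[index]
```

After the grid is built, the cost of scoring a ligand atom no longer depends on the number of receptor
atoms, which is what makes repeated pose evaluation affordable.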
Still another way to deal with the protein flexibility is to use an ensemble of protein conformations, which
corresponds to the theory of conformer selection. A ligand is separately docked into a set of rigid protein
conformations rather than a single one, and the results are merged depending on the method of choice. This
method was originally implemented in DOCK, which generates an average potential energy grid of the
ensemble, and it has been extended in many programs in different ways. For example, FlexE collects multiple
crystal structures of a certain protein, merging the similar parts while marking the dissimilar areas as different
alternatives. During the incremental construction of a ligand, discrete protein conformations are sampled in a
combinatorial fashion. The highest scoring protein structure is selected based on a comparison between the
ligand and each alternative.
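The simplest form of ensemble docking, in which the ligand is docked separately into each rigid receptor
conformation and the best result is kept, can be sketched as below. The `dock` callable stands in for any
docking program, and `toy_dock` is a purely hypothetical placeholder that scores by volume mismatch; neither
reflects the DOCK or FlexE implementations.

```python
def ensemble_dock(ligand, receptor_conformations, dock):
    """Dock one ligand into every receptor conformation.

    receptor_conformations maps a label to a receptor object; `dock` is any
    callable returning a score where lower means better.
    Returns the label of the best conformation and all scores.
    """
    scores = {name: dock(ligand, receptor)
              for name, receptor in receptor_conformations.items()}
    best_name = min(scores, key=scores.get)
    return best_name, scores


def toy_dock(ligand_volume, pocket_volume):
    """Hypothetical stand-in score: penalise ligand/pocket volume mismatch."""
    return abs(pocket_volume - ligand_volume) - 10.0
```

Taking the minimum over the ensemble is only one way of merging the results; averaging the scores or the
energy grids are alternatives mentioned in the text.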
A hybrid method is another practical strategy to model receptor flexibility. One example is Glide, a very
popular program in the field of docking. Glide designs a series of hierarchical filters to search the possible
poses and orientations of the ligand within the binding site of the receptor. Ligand flexibility is handled by
an exhaustive search of the ligand torsion angle space. Initial ligand conformations are selected based on
torsion energies and docked into receptor binding sites with soft potentials. Then a rotamer exploration is
used to further model receptor flexibility. IFREDA utilizes a hybrid method that combines soft potential and
multiple receptor conformations, accounting for receptor flexibility. Other programs, like QXP and Affinity,
perform a Monte Carlo search of ligand conformations followed by a minimization step. During
minimization, the user-defined parts of the protein are allowed to move in order to avoid atom clashes
between the ligand and receptor. SLIDE is designed to incorporate flexibility with the ability to remove
clashes by directed, single bond rotation of either the ligand or the side chains of the protein. An
optimization approach based on the mean-field theory is applied to model induced-fit complementarities
between the ligand and protein.
The methods mentioned above include either only side chain flexibility or full flexibility of the receptor. It is
known that loops forming active sites play an important role in ligand binding. In some cases the loop
may undergo dramatic conformational change whereas in other portions of the receptor there is little change
upon ligand binding. For this situation, side chain flexibility methods fail to sample the correct protein
conformation and full flexibility seems to be a computational waste. Figure 1 shows superimposed crystal
structures of triosephosphate isomerase as an example. The active site of triosephosphate isomerase has an
11-residue loop which moves 7 Å upon ligand binding. However, the rest of the enzyme shows no movement
between its apo and holo structures. Several protein families also involve loop rearrangement
within the active site responsible for ligand binding, such as the bromodomain, an extensive family related to
acetyl-lysine binding, or dihydrofolate reductase, responsible for the maintenance of the cellular pools of
tetrahydrofolate, as well as various kinases. In the next section, we present the Local Move Monte
Carlo (LMMC) loop sampling method, a new approach which focuses on sampling ligand conformation
within loop-containing active sites.
STEPS TO FOLLOW IN DOCKING:
1. Ligand Preparation: 2D structures are drawn in ChemSketch or ChemDraw. The structures are
imported into the docking software tool and converted to 3D structures. Ligands are prepared by minimizing
their energy to a threshold, adding hydrogens and calculating partial charges. There may be several output
ligands for one input ligand, as different possible stereoisomeric or tautomeric forms of the ligand at a
definite pH range (6.5-8.5) may be generated.
2. Protein Preparation: The desired protein target or biological target may be imported from the Protein
Data Bank (PDB) or from other macromolecular databases in the prescribed format. The co-crystallized ligand
and water molecules are removed. The protein is checked for any missing atoms or residues, and valencies are
checked.
3. Defining and editing the binding site: After the protein is prepared, the active site is defined according
to PDB data, or the receptor cavities are detected and edited if required. The site coordinates are fixed as X, Y
and Z, and the fourth parameter R, i.e. the diagonal distance, is also edited.
4. Docking: The docking protocol is then run with user-defined or default parameters. The results are
displayed in 3D format.
5. In situ Ligand minimization: After docking, all the poses of the ligands are energy minimized to fine-tune
the ligand-receptor interaction. This step is very important and takes more time.
6. Scoring: All the conformations of the ligands are scored using different algorithms, which gives
an idea of the binding affinity of the ligands for the active site of the protein. The score thus helps us to pick
out the most plausible poses.
7. Calculating binding energy: Finally, the binding energy of the poses is calculated in kJ/mol and is correlated
with the data.
8. Analyzing the results: The docking results are analyzed by mapping the interactions of different parts of
the ligand with the amino acid residues of the protein. The interactions may be favourable or unfavourable and
are generally non-covalent, e.g. H-bond, π-alkyl, π-π, hydrophobic interaction and many others.
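When the binding energy from step 7 is compared with experimental data, it is often convenient to express it
as a dissociation constant. The sketch below uses the standard thermodynamic relation ΔG = RT ln(Kd) with a
1 M standard state. It assumes that the computed value can be treated as a binding free energy, which is only
an approximation for docking scores; the function names are hypothetical.

```python
import math

GAS_CONSTANT_KJ = 8.314462618e-3  # kJ/(mol*K)


def dissociation_constant(delta_g_kj_mol, temperature_k=298.15):
    """Estimate Kd (mol/L) from a binding free energy in kJ/mol via dG = RT ln(Kd)."""
    return math.exp(delta_g_kj_mol / (GAS_CONSTANT_KJ * temperature_k))


def kj_to_kcal(energy_kj_mol):
    """Convert kJ/mol to kcal/mol (1 kcal = 4.184 kJ)."""
    return energy_kj_mol / 4.184


if __name__ == "__main__":
    for dg in (-20.0, -30.0, -40.0):
        print(f"dG = {dg:6.1f} kJ/mol ({kj_to_kcal(dg):6.2f} kcal/mol) "
              f"-> Kd ~ {dissociation_constant(dg):.2e} M")
```

At room temperature a change of about 5.7 kJ/mol in the binding energy corresponds to a tenfold change in
Kd, which gives a feel for how sensitive affinity estimates are to scoring errors.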