Abstract
Molecular docking is the study of how two or more molecular structures, for
example a drug and an enzyme or a protein receptor, fit together. In other words,
the problem is like solving a three-dimensional puzzle. For example, the action of
a harmful protein in the human body may be inhibited by finding an inhibitor
that binds to that particular protein. Molecular docking software is mainly
used in the drug research industry.
The most important application of docking software is virtual screening. In
virtual screening the most interesting and promising molecules are selected from
an existing database for further research. This places demands on the
computational method used: it must be fast and reliable. Another application is the
study of molecular complexes.
Chapter 1
Introduction
1.1 Biological background
Molecular docking is used to predict the structure of the intermolecular complex
formed between two or more molecules. The most interesting case is the
protein-ligand interaction, because of its applications in medicine. A ligand is a small
molecule that interacts with a protein's binding sites. Binding sites are areas of
the protein known to be active in forming compounds. There are several possible
mutual conformations in which binding may occur. These are commonly called
binding modes.
1.2 Physical background
1.2.1 General
When studying the structure of matter, the most thorough approach
is to apply quantum mechanics. In this case, the interaction
between two molecules, the ligand and the protein, could in principle be found
by solving the combined Schrödinger equation of both systems. The possible
states of the combined system could be obtained this way. However,
the quantum mechanical approach reaches a dead end almost immediately, since it
is impossible to find an explicit solution for a problem this difficult. A
numerical solution is possible in principle, but even the
numerically solvable quantum models are computationally too heavy to
produce practically useful results.
It is far more productive to apply a more primitive, mechanical model. This
means that we need to study the nature and magnitude of the forces between the
interacting particles. Depending on the computational method, we may assign
different weights to different kinds of forces. It is quite common to resort to
certain simplifications, in which some of the interacting forces are left out of
the model.
1.2.2 Forces
It is very common to define the interactions between particles as the
consequence of forces between the molecules contained in the particles. Forces
are often divided into four categories:
• Forces with electrostatic origin
• Forces with electrodynamic origin
• Steric forces
• Solvent-related forces
Forces with electrostatic origin are due to the charges residing in the matter. The
most common interactions are charge-charge, charge-dipole and dipole-dipole.
These forces can be computed with Coulomb's law. The interaction energies
depend on the distance r as follows:
• charge-charge: 1/r
• charge-dipole: 1/r^2
• dipole-dipole: 1/r^3
In addition to purely electrostatic forces, there are also forces with an
electrodynamic origin. The most widely known is probably the van der Waals
interaction. Atoms that are normally electrically neutral may develop an
induced dipole moment when an external electric field is applied. The van der Waals
interaction is the force between two induced dipoles, and it has a very
short range. There are also forces between existing charges and induced dipoles.
The distance dependencies are the following:
• charge-induced dipole: 1/r^4
• van der Waals: 1/r^6
Steric forces are entropic in origin. For example, when the entropy of a system is
restricted, forces may arise that act to minimize the free energy of the system.
Solvent-related forces are due to structural changes in the solvent. These
structural changes are generated when ions, colloids, proteins, etc. are added
to the solvent. For example, when water acts as the solvent, one
must take the polar nature of water molecules into account. Water molecules
form hydrogen bonds, and the water around the studied protein may, for example,
turn into a highly organized structure. It is very hard to determine
the solvent-related interactions, and their modeling depends very much on the
way the actual solvent is modeled.
What all these forces have in common is their electromagnetic origin.
The quantum mechanical background must be kept in mind while developing the
computational model, since some quantum phenomena must be taken into
account in an otherwise classical evaluation. For example, a covalent bond between
atoms (two atoms sharing electrons) is a purely quantum mechanical
phenomenon. Another quantum mechanical phenomenon that needs to be
addressed is the Pauli exclusion principle. Stated as simply as possible, the Pauli
exclusion principle says that two electrons may not be in exactly the
same quantum state. The exclusion principle manifests itself in such a way that
if the distance between two particles is very small, they experience a strong
repulsive force.
1.2.3 Other physical factors
Generic protein-protein interactions differ from protein-ligand interactions due
to the small size of the ligand. Because of their large size, proteins are usually
treated as rigid bodies. However, conformational changes in the protein and
the ligand are often necessary for a successful docking process. It must therefore
be clearly understood how drastic a simplification the rigid-body approach
is. One of the goals of current research is to be able to use flexible protein
structure models.
Chapter 2
Molecular docking
Molecular docking can be divided into two separate problems. The search
algorithm should create an optimum number of configurations that include the
experimentally determined binding modes. These configurations are evaluated
using scoring functions to distinguish the experimental binding modes from all
other modes explored by the search algorithm.
A rigorous search algorithm would go through all possible binding modes
between the two molecules. However, this is impractical due to the size of
the search space. Consider a simple system comprising a ligand with four
rotatable bonds and six rigid-body alignment parameters, and a cubic active site
measuring 10^3 Å^3. The translational and rotational parameters add up to six
degrees of freedom. If the angles are considered in 10 degree increments and
the translational parameters on a 0.5 Å grid, there are approximately 4 × 10^8
rigid-body configurations to sample, corresponding to 6 × 10^14 configurations
in total once the four rotatable bonds are included. Searching these would
require approximately 2 000 000 years of computational time at a rate of 10
configurations per second. As a consequence, only a small part of the total
conformational space can be sampled, and so a balance must be reached between
the computational expense and the amount of the search space examined.
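The arithmetic behind these estimates can be checked with a short script. This
is only a sketch under stated assumptions: every angle (three rigid-body
rotations and four torsions) is taken to span a full 360 degrees, and the 10 Å
box edge divided by the 0.5 Å spacing is taken to give 20 positions per axis.
All variable names are illustrative.

```python
# Back-of-the-envelope size of the docking search space described above.
ANGLE_STEP_DEG = 10      # angular increment in degrees
GRID_STEP = 0.5          # translational grid spacing in angstroms
BOX_EDGE = 10.0          # edge of the cubic active site in angstroms
ROTATABLE_BONDS = 4      # torsional degrees of freedom of the ligand
RATE_PER_SECOND = 10     # configurations evaluated per second

# Assumption: each angle covers the full 360 degrees.
angles_per_axis = 360 // ANGLE_STEP_DEG            # 36 values
# Assumption: box edge / spacing positions per axis (endpoints not doubled).
positions_per_axis = int(BOX_EDGE / GRID_STEP)     # 20 values

# Three rotations and three translations of the rigid ligand.
rigid_body = angles_per_axis ** 3 * positions_per_axis ** 3
# Each rotatable bond multiplies the count by the number of torsion values.
total = rigid_body * angles_per_axis ** ROTATABLE_BONDS

seconds_per_year = 365.25 * 24 * 3600
years = total / RATE_PER_SECOND / seconds_per_year

print(f"rigid-body configurations: {rigid_body:.1e}")
print(f"total configurations:      {total:.1e}")
print(f"years at {RATE_PER_SECOND} per second:   {years:.1e}")
```

Under these assumptions the counts come out at roughly 3.7 × 10^8 and
6.3 × 10^14, and the run time at about two million years, in line with the
rounded figures quoted in the text.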
Some common searching algorithms include
• Molecular dynamics
• Monte Carlo methods
• Genetic algorithms
• Fragment-based methods
• Point complementary methods
• Distance geometry methods
• Tabu searches
• Systematic searches
Current docking methods utilize scoring functions in one of two ways. The
first approach uses the full scoring function to rank a protein-ligand
conformation. The system is then modified by the search algorithm, and the same
scoring function is applied again to rank the new structure. The alternative
approach uses a two-stage scoring function: a reduced function directs the
search, and a more rigorous one is then used to rank the resulting structures.
Some common scoring functions are
• Force-field methods
• Empirical free energy scoring functions
• Knowledge-based potential of mean force
Only force-field based methods are considered in this article.
2.1 Docking methods
2.1.1 Molecular dynamics
These methods involve solving Newton's equations of motion. Finding the
global energy minimum of a docked complex is difficult, since the energy
hypersurface of a biological system is rugged and hard to traverse.
The problem is approached using standard optimization algorithms, including
• direct searches, which use only the potential function; impractical for large
molecules and suitable only for crude optimization of small molecules far away
from the minimum, e.g. simplex
• gradient methods, which use the first derivative of the potential function;
slow convergence near the minimum, recommended for initial optimization,
e.g. steepest descent
• conjugate-gradient methods, in which the history of the search influences the
search direction; higher computational effort but better convergence,
e.g. Fletcher-Reeves
• second derivative methods, with very efficient convergence, e.g. Newton-Raphson
• least squares methods, with good convergence but often computationally too
expensive, e.g. Marquardt
Often a combination of the methods mentioned above is used, for example a
gradient method for initial optimization followed by a conjugate-gradient
method when nearing the minimum.
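The idea of combining a crude method with a faster-converging one can be
illustrated on a one-dimensional toy problem. The sketch below is not taken
from any docking program: it minimizes a Lennard-Jones 12-6 pair energy in
reduced units (an assumed test function), using fixed-step steepest descent for
the initial phase and, for brevity, Newton-Raphson instead of a
conjugate-gradient method for the refinement. The step size and iteration
counts are arbitrary choices.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 pair energy (reduced units, toy test function)."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """First derivative dE/dr of the pair energy."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * s6 * s6 + 6.0 * s6) / r


def lj_curvature(r, epsilon=1.0, sigma=1.0):
    """Second derivative d2E/dr2 of the pair energy."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (156.0 * s6 * s6 - 42.0 * s6) / (r * r)


def steepest_descent(r, steps, step_size=0.01):
    """Gradient method: move against the first derivative with a fixed step."""
    for _ in range(steps):
        r -= step_size * lj_gradient(r)
    return r


def newton_raphson(r, steps):
    """Second derivative method: very fast once close to the minimum."""
    for _ in range(steps):
        r -= lj_gradient(r) / lj_curvature(r)
    return r


r_start = 1.5                                   # far from the minimum
r_rough = steepest_descent(r_start, steps=25)   # crude initial optimization
r_final = newton_raphson(r_rough, steps=10)     # refinement near the minimum

print(f"start {r_start:.4f}, after steepest descent {r_rough:.4f}, "
      f"refined {r_final:.6f}, analytic minimum {2 ** (1 / 6):.6f}")
```

Newton-Raphson is only reliable where the curvature is positive, which is one
reason a robust first-derivative method is used to approach the minimum first.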
2.1.2 Monte Carlo methods
The Monte Carlo simulation method occupies a special place in the history
of molecular modeling, as it was the technique used to perform the first
computer simulation of a molecular system. The expression Monte Carlo simulation
is extremely general, and many algorithms are given that name whenever
they contain a stochastic process or some kind of random sampling. In
molecular docking the expression Monte Carlo usually means importance
sampling or the Metropolis method. The Metropolis method, which is
actually a Markov chain Monte Carlo method, generates random moves of the
system and then accepts or rejects each move based on a Boltzmann probability.
Monte Carlo methods play an important role in molecular docking, but the
variety of algorithms is too large to be considered here in detail.
Programs using MC methods include AutoDock, ProDock, ICM, MCDOCK,
DockVision, QXP and Affinity.
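The Metropolis acceptance rule can be sketched in a few lines. This is a
minimal illustration, not the implementation used by any of the programs
listed: the one-dimensional double-well `energy` function, the temperature
`kT`, the move size and the seed are all assumptions chosen for the example.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def energy(x):
    # Hypothetical 1D double-well potential standing in for a docking score.
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis(n_steps, kT=0.2, max_move=0.5, seed=0):
    """Random walk over x, keeping track of the best state visited."""
    rng = random.Random(seed)
    x = 2.0
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        trial = x + rng.uniform(-max_move, max_move)   # random move
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, kT, rng):    # accept or reject
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = metropolis(5000)
print(f"best x = {best_x:.3f}, energy = {best_e:.3f}")
```

Because uphill moves are sometimes accepted, the walk can leave a local
minimum, which a pure downhill search cannot do. A higher `kT` makes barrier
crossing more likely at the cost of less focused sampling.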
2.1.3 Genetic algorithms
Genetic algorithms and evolutionary programming are well suited to
docking problems because of their usefulness in solving complex optimization
problems. The essential idea of genetic algorithms is the evolution of a
population of possible solutions via genetic operators (mutations, crossovers and
migrations) to a final population, optimizing a predefined fitness function.
The process of applying a genetic algorithm starts with encoding the variables,
in this case the degrees of freedom, into a "genetic code", e.g. binary strings.
Then a random initial population of solutions is created. Genetic operators
are applied to this population, leading to a new population. This new
population is then scored and ranked, and following "the survival of the fittest",
the probability of each solution reaching the next iteration round depends on
its score. If the size of the population is kept constant, good solutions will
come to dominate the population. It should be noted that genetic algorithms are
well suited to parallel computing. Some programs using GAs are GOLD, AutoDock,
DIVALI and DARWIN.
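The loop described above can be sketched as follows. This is a toy example
under several assumptions: the fitness function simply counts bits that agree
with a hidden target string, selection is by truncation (the better half of the
population become parents) rather than score-proportional, the best individual
is copied unchanged into the next generation (elitism), and migration is left
out. All names and parameter values are illustrative.

```python
import random

GENOME_BITS = 24   # hypothetical encoding, e.g. 6 degrees of freedom x 4 bits


def fitness(genome, target):
    # Toy score: number of bits that agree with a hidden target string.
    return sum(g == t for g, t in zip(genome, target))


def crossover(a, b, rng):
    """Single-point crossover of two parent bit strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]


def evolve(pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(GENOME_BITS)]
    population = [[rng.randint(0, 1) for _ in range(GENOME_BITS)]
                  for _ in range(pop_size)]
    history = []                       # best score of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(g, target),
                        reverse=True)
        history.append(fitness(ranked[0], target))
        parents = ranked[: pop_size // 2]   # survival of the fittest
        children = [ranked[0]]              # elitism: keep the best as is
        while len(children) < pop_size:     # population size stays constant
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), mutation_rate, rng))
        population = children
    best = max(population, key=lambda g: fitness(g, target))
    return fitness(best, target), history


best_score, history = evolve()
print(f"best score {best_score} of {GENOME_BITS}; first generation {history[0]}")
```

Since each individual is scored independently, the fitness evaluations within
a generation could be distributed over several processors, which is what makes
the approach attractive for parallel computing.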
2.1.4 Fragment-based methods
Fragment-based methods divide the ligand into separate
portions or fragments, dock the fragments, and finally link these fragments
together. These methods require subjective decisions on the importance of the
various functional groups in the ligand, because a good choice of base fragment
is essential. A poor choice can significantly affect the quality
of the results. The base fragment must contain the predominant interactions
with the receptor. Early algorithms required manual selection of the base
fragment, but this has been automated in newer implementations.
Some well-known programs using fragment-based methods are FlexX and DOCK.
These programs are covered in more detail in a later chapter.
2.1.5 Point complementary methods
These methods are based on evaluating the shape and/or chemical
complementarity between interacting molecules. The interacting molecules are usually
modeled in a simple way, for example using spheres or cubes as atoms. The
ligand description is then rotated and translated to obtain the maximum number
of matches between the ligand and protein surfaces, minus the number of volume
overlaps. Additional constraints may be present, for example a requirement that
interacting surface normals point in approximately opposite directions.
Some algorithms use a 3D grid, which is placed over the protein and over the
ligand. Each grid point is then labeled as either open space or inside the ligand
or protein. A correlation function is then constructed and optimized
using rigid-body translation and rotation. This often involves traditional
shape recognition techniques such as the Fast Fourier Transform (FFT) with Fourier
correlation theory. A high correlation score denotes good surface
complementarity between the molecules. Because many of the methods were originally
created for protein-protein docking, the rigid-body assumption is usually made.
This is a limitation in ligand-protein docking. However, some algorithms are
aimed at ligand-protein docking, and these allow some flexibility. Examples
of programs using point complementary methods are FTDOCK, SANDOCK,
FLOG and the Soft Docking algorithm.
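A grid correlation score of this kind can be sketched on a small
two-dimensional grid. The example below is a toy under stated assumptions, not
the scheme of any listed program: empty cells next to the receptor score +1
(surface contact), cells inside the receptor score -15 (volume overlap), only
translations are searched, and the correlation is evaluated by brute force
rather than with an FFT. The weights and shapes are arbitrary.

```python
CONTACT, CLASH = 1, -15   # assumed weights for surface contact and overlap


def label_receptor(cells):
    """Build a score grid: penalty inside the receptor, reward next to it."""
    grid = {cell: CLASH for cell in cells}
    for (x, y) in cells:
        for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if n not in cells:
                grid[n] = CONTACT      # empty cell touching the surface
    return grid


def correlation(grid, ligand, shift):
    """Sum of grid values under the ligand cells after translating by shift."""
    dx, dy = shift
    return sum(grid.get((x + dx, y + dy), 0) for (x, y) in ligand)


def best_translation(grid, ligand, search_range):
    """Brute-force search over all translations in the given range."""
    shifts = [(dx, dy) for dx in search_range for dy in search_range]
    return max(shifts, key=lambda s: correlation(grid, ligand, s))


# Hypothetical receptor: a 6 x 6 block with a 2 x 2 pocket cut into one edge.
pocket = {(2, 4), (3, 4), (2, 5), (3, 5)}
receptor = {(x, y) for x in range(6) for y in range(6)} - pocket
ligand = {(0, 0), (1, 0), (0, 1), (1, 1)}     # 2 x 2 rigid ligand

grid = label_receptor(receptor)
shift = best_translation(grid, ligand, range(-3, 9))
print(shift, correlation(grid, ligand, shift))
```

In this toy case the highest score is obtained when the ligand sits in the
pocket, where all four of its cells touch the receptor surface without
overlapping it. An FFT evaluates the same correlation for all translations at
once, which is why it is preferred on realistic three-dimensional grids.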
2.1.6 Distance geometry methods
Many types of structural information can be expressed as intra- or
intermolecular distances. The distance geometry formalism allows these distances to be
assembled and three-dimensional structures consistent with them to be
calculated. The crucial feature is that it is not possible to assign arbitrary values
to the inter-atomic distances in a molecule and always obtain a low-energy
conformation. Rather, the inter-atomic distances are closely interrelated and many
combinations of distances are geometrically impossible. This enables fast
sampling of the conformational space, although the results are not always good.
An example of a program using distance geometry for the docking problem is DockIt.
2.1.7 Tabu searches
These methods are based on stochastic processes, in which new states are
randomly generated from an initial state (referred to as the current solution). These
new solutions are then scored and ranked in ascending order. The best new
solution is chosen as the new current solution, and the process is
repeated. To avoid loops and ensure diversity of the current solution, a
tabu list is used. This list acts as a memory: it contains information about
previous current solutions, and a new solution is rejected if it resembles a
previous solution too closely. An example of a docking algorithm using tabu
search is PRO_LEADS.
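The steps above can be sketched on a one-dimensional toy problem. This is an
illustration only: the integer state, the rugged `score` function, the
similarity test based on `min_distance` and all parameter values are
assumptions made for the example.

```python
import random
from collections import deque


def score(x):
    # Hypothetical rugged 1D score (lower is better) with many local minima.
    return abs(x - 12) + 6 * (x % 4 != 0)


def tabu_search(score, start, n_iter=50, n_candidates=8, tabu_size=10,
                min_distance=1, seed=0):
    rng = random.Random(seed)
    current = start
    best = current
    tabu = deque([current], maxlen=tabu_size)   # memory of recent solutions
    for _ in range(n_iter):
        # Randomly generate new states from the current solution.
        candidates = [current + rng.randint(-5, 5) for _ in range(n_candidates)]
        # Reject candidates that are too similar to a remembered solution.
        allowed = [c for c in candidates
                   if all(abs(c - t) >= min_distance for t in tabu)]
        if not allowed:
            continue
        # The best allowed candidate becomes current, even if it is worse.
        current = min(allowed, key=score)
        tabu.append(current)
        if score(current) < score(best):
            best = current
    return best


best = tabu_search(score, start=40)
print(best, score(best))
```

Accepting the best non-tabu candidate even when it is worse than the current
solution is what lets the search climb out of local minima, while the tabu
list keeps it from immediately falling back. In docking, the similarity test
would typically compare conformations rather than integers.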
2.1.8 Systematic searches
These methods systematically go through all possible conformations and
represent the brute force solution to the docking problem. All molecules are usually
assumed to be rigid, and the interaction energy is evaluated from a force field model.
Some constraints and restraints can be used to reduce the dimensionality of the
problem.
2.2 Force field models
Molecular mechanics stems from the idea that the electrons of an atom can
be treated as fixed. The geometry of a molecule can be approximated effectively
by taking all the interacting forces into account. Bonded interactions are
described by spring forces, and non-bonded interactions are usually approximated
by potentials resembling the van der Waals interaction. The required parameters are
determined from experimental observations. The geometry is then optimized by
finding the energy minimum.
The total energy is represented by a set of potential energy functions. In addition
to these functions, a set of parameters is needed to compute the total
energy. It is worth noting that force field parameters have no meaning
unless they are considered together with the potential energy functions. Thus
a comparison between force field models is very difficult. In addition to these
two parts, information about atom types and atom charges is required. We
also usually need a set of rules to type atoms, to generate parameters not explicitly
defined, and to assign functional forms and parameters. Together these components
form a force field:
• Potential energy functions
• Parameters for function terms
• List of atoms and atom charges
• Rules for atom-typing, parameter generation and functional form assigning
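How potential energy functions, parameters and charges combine into a total
energy can be sketched as follows. This is a generic, simplified form and not
the functional form of any specific force field discussed here: it uses a
harmonic spring for bonds and a Lennard-Jones 12-6 term plus a Coulomb term for
all non-bonded pairs, and it omits angle and torsion terms. The parameter
values and the three-atom system are hypothetical, and the Coulomb constant is
an approximate conversion factor for kcal/mol, ångströms and elementary charges.

```python
import math

COULOMB_CONSTANT = 332.06   # approx. kcal*angstrom/(mol*e^2)


def bond_energy(r, r0, k):
    """Bonded term: harmonic spring around the reference length r0."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """Non-bonded term: short-range repulsion plus 1/r^6 attraction."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)


def coulomb(r, qi, qj):
    """Non-bonded term: charge-charge interaction, proportional to 1/r."""
    return COULOMB_CONSTANT * qi * qj / r


def total_energy(coords, charges, bonds, epsilon, sigma):
    """Sum bonded terms over `bonds` and non-bonded terms over other pairs."""
    energy = 0.0
    bonded = set()
    for i, j, r0, k in bonds:
        energy += bond_energy(math.dist(coords[i], coords[j]), r0, k)
        bonded.add((min(i, j), max(i, j)))
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in bonded:
                continue            # bonded pairs get no non-bonded term
            r = math.dist(coords[i], coords[j])
            energy += lennard_jones(r, epsilon, sigma)
            energy += coulomb(r, charges[i], charges[j])
    return energy


# Hypothetical system: atoms 0 and 1 are bonded, atom 2 is a neutral probe.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 4.0, 0.0)]
charges = [0.4, -0.4, 0.0]
bonds = [(0, 1, 1.0, 300.0)]      # (atom i, atom j, r0, force constant)

print(total_energy(coords, charges, bonds, epsilon=0.1, sigma=3.4))
```

The example also shows why parameters cannot be compared across force fields
on their own: the value of `k`, for instance, only has meaning together with
the exact form of `bond_energy`, including its factor of one half.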
Force fields are usually employed to generate accurate predictions for complex
problems by interpolating and extrapolating from a relatively simple experimental
set of molecules. There are generally two approaches to force fields. Either they
are very accurate for a small set of molecules and compounds, or they are
more general, in which case accuracy is often compromised.
Let us take a more detailed look at some of the existing force field models.
2.2.1 Classical force field models
Examples of classical force field models include AMBER, CHARMM and CVFF.
They are used mainly in biochemistry.
AMBER (Assisted Model Building with Energy Refinement)
AMBER refers to two things: a set of molecular mechanics force
fields used for the simulation of biomolecules, and a package of
molecular simulation programs. AMBER's set of parameters is experimentally
derived. AMBER force fields are probably the most widespread ones. AMBER
is designed especially for biological macromolecules.
CHARMM (Chemistry at HARvard Macromolecular Mechanics)
CHARMM is a program for macromolecular dynamics. In addition to
performing MD using algorithms for time-stepping, long-range force calculation and
periodic images, it can be used for energy minimization, normal modes and
crystal optimizations. There are several potential energy functions
parameterized for protein, lipid and nucleic acid simulations. CHARMM also incorporates
free energy methods for chemical and conformational free energy calculations.
CVFF (Consistent Valence Force Field)
CVFF has parameters obtained by fitting to crystal and gas-phase structures of
small organic molecules. CVFF is designed mainly for organic materials, and it
is commonly used to predict structures and compute binding energies.
2.2.2 Second generation force field models
Examples of second generation force fields include CFF and COMPASS.
CFF (Consistent Force Field)
CFF is somewhat more complex than AMBER. The potential energy
functions in CFF are expanded in order to avoid problems arising from the
complexity of potential energy surfaces. CFF also uses quantum calculations to
determine the parameters for the energy functions. This approach gives a great
advantage over classical models, since parameters can be determined much more
accurately. Other advantages include the possibility of covering a larger number
of compounds in the force field model, and the fact that all parameters are
determined in the same way (which makes the model more consistent).
COMPASS (Condensed-phase Optimized Molecular Potentials for
Atomistic Simulation Studies)
COMPASS is another ab initio (from the beginning) force field model. Like
CFF, it also has parameters defined by quantum mechanical calculations and
validated by empirical data.
2.2.3 Generalized force field models
Generalized force fields are not as accurate as the ones presented above, but they
have their uses. They can be applied to systems that are not covered by more
accurate force field models. Generalized force field models are based on atomic
parameters and rules to determine the explicit form of parameters. Examples
include ESFF and UFF.
ESFF (Extensible Systematic Force Field)
ESFF covers all elements up to Rn. ESFF can be used for both organic and
inorganic systems.
UFF (Universal Force Field)
UFF covers the whole periodic table. However, it is not very accurate, and thus
its main application is systems that are not covered by other force fields.
Chapter 3
Software
In addition to the large number of existing docking programs, there are also
many molecular mechanics programs applicable to these problems. Despite the
huge variety of available programs, no single program has become
recognized as a standard. Of course, some programs are very
widely used. Nevertheless, it seems that the programs are not easy to use
and require some understanding of the underlying computational principles.
This leads to situations where people keep using the program they have
used before, even though better options may be available. It also seems that
some of the existing programs are reaching a more mature state, since there
seems to be an increasing number of commercial solutions available. Docking
programs are usually sold in a package with other molecular design software.
It should also be noted that the division made earlier is not very strict, and many
programs would fit into more than one category of methods. Tests have shown
that there is no significant difference in hit rates between different programs
and that they all produce false alarms. Because of this, combining different search
algorithms and scoring functions produces more reliable results. This has led to
the most successful docking programs usually being a collection of the methods
described. It is also worth remembering that molecular docking software is only
as good as its scoring function. It does not help if we are able to create the right
conformation but not able to recognize it.
Probably the best known example of rational drug design is the HIV-1
protease inhibitor. Starting with X-ray structures of HIV-1 protease, a group
of scientists at DuPont Merck used docking and molecular design software to
successfully design an inhibitor.
3.1 AutoDock
AutoDock uses Monte Carlo simulated annealing and a Lamarckian genetic
algorithm (LGA) to create a set of possible conformations. The LGA is used as a
global optimizer and energy minimization as a local search method. Possible
orientations are evaluated with the AMBER force field model in conjunction with
free energy scoring functions and a large set of protein-ligand complexes with known
protein-ligand constants. The newest, as yet unreleased, version 4 should contain
side chain flexibility. AutoDock has more informative web pages than its
competitors, and because of its free academic license it is a good starting point when
wandering into the world of molecular docking software.
3.2 DOCK
DOCK is one of the oldest and best known ligand-protein docking programs.
The initial version used rigid ligands; flexibility was later incorporated via
incremental construction of the ligand in the binding pocket. As noted above, DOCK
is a fragment-based method that uses shape and chemical complementarity to
create possible orientations for the ligand. These orientations can be scored
using three different scoring functions. However, none of them contains explicit
hydrogen-bonding terms, solvation/desolvation terms, or hydrophobicity terms,
which limits serious use. DOCK seems to handle apolar binding sites well and
is useful for fast docking, but it is not the most accurate software available.
3.3 FlexX
FlexX is another fragment-based method using flexible ligands and rigid
proteins. It uses the MIMUMBA torsion angle database for the creation of conformers.
MIMUMBA is an interaction geometry database used to describe
intermolecular interaction patterns exactly. For scoring, the Boehm function (with
minor adaptations necessary for docking) is applied. FlexX is introduced here to
emphasize the importance of scoring functions. Although FlexX and DOCK
are both fragment-based methods, they produce quite different results. In
contrast to DOCK, which performs well with apolar binding sites, FlexX shows
the opposite behavior. It has a somewhat lower hit rate than DOCK but provides
better root mean square deviation (RMSD) estimates for compounds with a correctly
predicted binding mode. There is an extension of FlexX called FlexE, with
flexible receptors, which has been shown to produce better results with
significantly lower running times.
3.4 GOLD
GOLD has won many new users during the last few years because of its good
results in impartial tests. It has a good overall hit rate, although it
suffers somewhat when dealing with hydrophobic binding pockets. GOLD uses a
genetic algorithm to dock a flexible ligand to a protein with flexible hydroxyl
groups. Otherwise the protein is considered to be rigid. This makes it a good
choice when the binding pocket contains amino acids that form hydrogen bonds
with the ligand. GOLD uses a scoring function that is based on favorable
conformations found in the Cambridge Structural Database and on empirical results
on weak chemical interactions. The development of GOLD is currently focused
on improving the computational algorithm and adding support for parallel
processing. GOLD has one of the most comprehensive validation test sets and
is also available for use at CSC.
3.5 Summary
URL:
[1]: http://www.accelrys.com/insight/affinity.html
[2]: http://www.scripps.edu/pub/olson-web/doc/autodock/
[3]: http://www.cmpharm.ucsf.edu/kuntz/
[4]: http://www.metaphorics.com/products/dockit.html
[5]: http://www.dockvision.com
[6]: http://www.sdsc.edu/CCMS/DOT
[7]: http://www.sdsc.edu/CCMS/FP/index.html
[8]: http://www.tripos.com/sciTech/inSilicoDisc/virtualScreening/fdock.html
[9]: http://cartan.gmd.de/flexx/
[10]: http://www.bmm.icnet.uk/docking/
[11]: http://www.schrodinger.com/Products/glide.html
[12]: http://www.ccdc.cam.ac.uk/prods/gold/
[13]: http://reco3.ams.sunysb.edu/gramm/
[14]: http://www.edusoft-lc.com/hint/
[15]: http://www.molsoft.com/index.html
[16]: http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/leapfrog.html
[17]: http://www.schrodinger.com/Products/liaison.html
[18]: http://www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html
[19]: http://www.schrodinger.com/Products/qsite.html
[20]: http://www.sdsc.edu/CCMS/Packages/shape.html
[21]: http://situs.scripps.edu/
#   Name                      License terms               Platforms                            Keywords
1   Affinity                  Commercial                  SGI                                  Monte Carlo methods
2   AutoDock                  Free for non-profit use     Unix, MacOS, Linux, SGI              GA/LGA, MC
3   DOCK                      Free for academic use       Unix, Linux                          GA, FB
4   DockIt                    Commercial                  SGI, Sun, Linux                      Distance geometry
5   DockVision                Commercial                  IRIX, Linux                          MC, GA
6   DOT (Daughter of turnip)  Free                        Supercomputers and clusters, Unix
7   FADE and PADRE            Free for academic use       Unix, Macintosh                      Point complementary
8   FlexiDock                 Commercial                  SGI, IRIX 6.5                        GA
9   FlexX                     Commercial                  Unix                                 Fragment based
10  FTDOCK                    Free, registration          Unix                                 Point complementary, Fourier correlation
11  GLIDE                     Commercial                  Unix, Linux, SGI, Sun                MC
12  GOLD                      Free evaluation             Unix                                 GA
13  GRAMM                     Free, registration          SGI, Sun, Alpha, Windows, Linux      Global minimums of intermolecular energies
14  HINT                      Commercial                  SGI, Linux, Sun, Win2000, Macintosh  Hydropathic interactions
15  ICM                       ICM Lite free for academic  SGI, Alpha, Sun, Linux, WinNT
                              use, otherwise commercial
16  LEAPFROG                  Commercial                  SGI                                  De novo ligand design tool
17  Liaison                   Commercial                  Unix, Linux, SGI, Sun                Fast calculations of free energy of binding
18  LIGPLOT                   Free for academic use       Unix                                 Schematic diagrams of protein-ligand complexes
19  QSite                     Commercial                  Unix, Linux, SGI, Sun                Mixed quantum and molecular mechanics,
                                                                                               hydrogen bonds, hydrophobic interactions
20  Shape                     E-mail request              Unix                                 Structure and chemistry of molecular surfaces
21  Situs                     E-mail request              Unix                                 Both rigid and flexible proteins
Chapter 4
References
C. Bissantz, G. Folkers, D. Rognan: Protein-based virtual screening of chemical
databases. 1. Evaluation of different docking/scoring combinations, Journal of
Medicinal Chemistry, 43: 4759-4767, 2000
http://pubs.acs.org/journals/jmcmar/article.cgi/jmcmar/2000/43/i25/pdf/jm001044l.pdf
A. Leach: Molecular modeling, principles and applications 2e, Prentice Hall,
2001
P. Lehtovuori, T. Nyrönen: Nykyaikaisen telakkatyöläisen vasara: molekyylien
telakointiohjelma GOLD Cedarissa, @CSC, 2/2002
http://www.csc.fi/lehdet/atcsc/atcsc2-2002/atcsc2 02.pdf
Roberto Millini: Introduction to molecular mechanics, 2002
http://www.ics.trieste.it/documents/chemistry/combinatorial/activities/tc-Jul2002/Millini1.pdf
R.D. Taylor, P.J. Jewsbury & J.W. Essex: A review of protein-small molecule
docking methods, Journal of Computer-Aided Molecular Design, 16: 151-166,
2002
ESF Training course on Molecular Interactions: New Frontiers for Computa-
tional Methods
http://cassandra.bio.uniroma1.it/ESFcourse/download.htm