2. Computer Aided Drug Design:
QSAR Related Methods
Jahan B Ghasemi
DDSLab K N Toosi Univ of Tech.
Tehran, Iran
3. 5/27/2014 Importance of PROCESS is not less than PRODUCT
Topics in this Talk
are:
General
Introduction
Some of These
QSAR Steps:
Data Pre-Processing
Normalization
Standardization
Variable Selection
Subset Selection
Outlier Detection
Multivariate
Analysis
MLR
PCA
PLS
SVM
ANN
CART
Molecular
Descriptors
Constitutional
Electronic
Geometrical
Hydrophobic
Lipophilicity
Solubility
Steric
Quantum
Chemical
Topological
Molecular Structures
OC1=CC=CC=C1
1D
2D
3D
Statistical
Evaluation
R
R2
Q2
MSE
RMSE
PRESS
4. Importance of PROCESS is not less than PRODUCT
"Well begun is half done“ Aristotle
René Descartes in 1619: Quantitative
Measurement in Science
Research
Types
Inductive
Approach
Deductive
Approach
Abductive
Approach
General
Introduction
5. Importance of PROCESS is not less than
PRODUCT
Deductive: Theory → Hypothesis → Observation → Confirmation
Inductive: Observation → Pattern → Hypothesis → Theory
Induction is usually described as moving from the specific to the general, while deduction begins
with the general and ends with the specific.
Arguments based on laws, rules and accepted principles are generally used for Deductive
Reasoning. Observations tend to be used for Inductive Arguments.
The -metrics disciplines, as soft-computing or soft-modeling methods, are inductive research approaches, so they carry uncertainty.
Are humans natural logic reasoners? No!!!
6. What Do We Need to Know for Successful QSAR Modeling as a Drug Design Tool?
7. I- Math-Science or Informatique or Informatics
Aspect
Linear Algebra
Vectors, Matrices,
Tensors…
Homogeneous and regular linear and
nonlinear simultaneous equations
Graph Theory
Maximal Subgraph
Clique Detection
Multivariate Statistical
Analysis
Column Space, Row Space
Pattern Recognition
(Dis)Similarity
Distance Metrics, Euclidean,
Manhattan, Mahalanobis
Fingerprints, Tanimoto,
Jaccard
Supervised and Unsupervised Pattern Recognition
Clustering, Agglomerative(bottom up), Divisive(top down)
MLR, PCA, PLS
Optimization
Selection of the most
informative variables,
GA
Selection of the most representative
objects, KS
Function minimization: Newton,
Gauss-Newton, Levenberg-Marquardt
Computer
Computer Graphics
HPC
8. II-Bio-Science Aspect
Chemistry
Organic Chemistry
Quantum/Molecular Mechanics
Forcefield, Conformer, Bioactive
Conformer
Medicinal Chemistry
Biology
Molecular Biology
Systems Biology
Pharmacology
Pharmacokinetics
Pharmacodynamics
Toxicity
ADMET
9. Combination
of I and II
OMICS
Bioinformatics
Proteomics
Metabolomics
Genomics
Metrics
Biometrics
Chemometrics
Technometrics
Chem(o)informatics
QSAR is related to most of the -OMICS and -METRICS routines
11. Chemical Space
(Gathering Information from All Involved Species)
Aggregation
Host-Guest
Complex
Receptor-
Inhibitor
Complex
Macromolecules
Protein
Receptor
Host
Small
Molecules
Guest
Ligand
Inhibitor
12. Chemical Space
Chemical Information
Information
due to
Macromolecule
Structure
Information
due to
Aggregation Structure
Information
Due to
Small Molecule
Structure
13. To have
and use
Chemical
Space:
Extract and Convert
Chemical
Information
to
Numerical Values
We Are Calling
These Numerical
Values:
Molecular
Descriptors
14. Descriptors should
be associated with
the following
desirable features:
Easy Interpretation
Show Correlation with a Property
Discrimination of Isomers
Independence
Simplicity
Not to be based on experimental properties
Not to be trivially related to other descriptors
Allow for efficient construction
Use familiar structural concepts
Show gradual change with gradual change in structures
15. End Points to
Be Modeled
Chemical
properties
Boiling point
Retention time
Dielectric constant
Diffusion coefficient
Dissociation constant
Melting point
Reactivity
Solubility
Stability
Thermodynamic properties
Viscosity
16. End Points to
Be Modeled
Biological
Properties
Bioconcentration
Biodegradation
Carcinogenicity
Drug metabolism and clearance
Inhibition constant
Mutagenicity
Permeability
Blood brain barrier
Skin
Pharmacokinetics
Receptor binding
17. There are more than 5500 Molecular Descriptors. BUT!
Why do we need more
and more Molecular
Descriptors?
Each molecular descriptor takes into account a small part of the whole chemical information contained in the real molecule. As a consequence, the number of descriptors keeps increasing with the growing demand for deeper investigation of chemical and biological systems.
Different descriptors view a molecule through independent methods or perspectives, each taking into account different features of the chemical structure. Molecular descriptors have now become some of the most important variables used in molecular modeling and are consequently handled with statistics, chemometrics, and chemoinformatics.
19. Molecular
Descriptors
How to Calculate Molecular
Descriptors?
By Hand!
By Software: Dragon, SYBYL, PaDEL-Descriptor, AdrianaCode
20. Molecular Descriptors
Classes!
Different
Classes?
Yes
How many?
Many classes
What are the bases of
Classification?
Based on
Dimensionality
0D-4D
Geometric Constitutional Topological
Quantum
Chemical
etc….
Based on Origin
Theoretical Experimental
Both!
21. Molecular
Descriptors
Do they have equal importance?
0D<1D<2D<2.5D<3D<4D…<nD
Low Information Content High Information Content
22. Now We Have Molecular Descriptors and Chemical,
Molecular or Information Space
But first define and introduce:
Objects=
Molecules
Variables=
Descriptors
Object to Variable ratio ≥ 4
Why? Least squares needs it!
23. Math-Science Part
Start Here: Using a Very Efficient Way to Show Chemical Information: Matrix-Vector
24. [Figure: data matrix with objects 1…n as rows and variables 1…m as columns]
25. Preprocessing
On End Point
Vector y
nM unit
log Transformation
To Linearize the
Variation
To Have LFER
Interpretation
Mean Centering
Autoscaling
On Molecular
Descriptors Matrix
X
Mean Centering-
Has its general purpose
Autoscaling
Has its general purpose
Outlier Detection AD
Dimensionality
Reduction
PCA
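The preprocessing steps named on this slide can be sketched as follows. This is a minimal illustration, not the speaker's code: the function names are hypothetical, and converting nM activities to a negative log10 molar scale (as in pIC50) is an assumption about what "log Transformation" means here.

```python
import numpy as np

def log_transform_nM(activity_nM):
    """Convert activities in nM to -log10(molar); assumed meaning of the log transformation."""
    molar = np.asarray(activity_nM, dtype=float) * 1e-9
    return -np.log10(molar)

def mean_center(X):
    """Subtract the column mean from every descriptor (column) of X."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean-center each column and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - X.mean(axis=0)) / std
```

After autoscaling, every descriptor has zero mean and unit variance, so descriptors measured in different units contribute on a comparable scale.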
26. Geometrical Interpretation of Information Matrix
Spaces
Row
Space
Column Space:
Object Map
Metrics
Distances
Euclidean
and….
Classes Clusters Groups
27. Row Space!
Is it informative? How? What does it mean? How can we use it?
On
O1
O2
Each Point is a Vector!
m-dimensional space Sm
n- points pattern Pn
28. Column Space
Objects Map: what scientists (chemists, biologists, ...) are interested in!!!
Is it informative? How? What does it mean? How can we use it?
Vn
V1
V2
Class I or Group I
Class II or Group II
Each Point is a Vector!
n-dimensional space Sn
m- points pattern Pm
29. QSAR Model Building
Based on Molecular Geometry
2D-QSAR 2.5D-QSAR 3D-QSAR
30. QSAR Model
Building
Type of Mapping Function
A Crucial Decision
Linear
MLR kNN PLS
Nonlinear
ANN SVM
Linear+Non-
Linear
DT + other Tree
and Ensemble
Methods
31. QSAR Model Building
Object Selection-Data Splitting-Train-Test Sets
To Have Good 1- Representativeness and 2- Diversity
y-Based Method
Randomly Evenly
X-Based Methods
Random
Selection
kNN
Selection
Similarity Principle
KS,SOM, LMD,
Duplex, MDC
32. QSAR Model Building
Variable Selection
Filters
(Subjective)
Uninformative Variable Elimination (UVE)
Correlation Ranking (CR)
Wrappers
(Objective)
GA-PLS
Embedded
(Selection+Mapping Integrated)
Stepwise Selection
RM, ERM, FFD
33. QSAR Model Building
Model Validation: There Are Different Criteria in the Literature
Residual
Analysis
Analysis of
Variance
Applicability
Domain
Residual Leverage
Good
Leverage
Bad
Leverage
Q Residual
Hotelling T2
Model Precision (Confidence
Intervals of Model Parameters)
Bootstrap
Resampling
Jackknife
Resampling
Model
Accuracy (Prediction
Error)
Internal
Validation
Cross
Validation
Leave One
Out
Leave
Many Out
Scrambling
X-randomization
y-randomization
External
Validation
External and
Fully Unseen or
Independent Data
Set
Final word on validation: an external, independent, unseen data set is mandatory for a successful QSAR model. Do you know why?
Going from local to global, as in any inductive research, carries uncertainty.
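Two of the internal checks listed on this slide, leave-one-out cross-validation and y-randomization, can be sketched as below. The sketch assumes an MLR model fitted by ordinary least squares and computes Q2 as 1 - PRESS/SS with SS taken around the mean of all y values (other conventions exist). All function names are hypothetical.

```python
import numpy as np

def _fit(X, y):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def _predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def q2_loo(X, y):
    """Leave-one-out Q2 = 1 - PRESS / SS (SS around the overall mean of y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i          # leave object i out
        coef = _fit(X[keep], y[keep])
        press += (y[i] - _predict(coef, X[i:i + 1])[0]) ** 2
    ss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss

def y_randomization(X, y, n_rounds=20, seed=0):
    """Q2 values obtained after scrambling y; they should be far below the real Q2."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    return [q2_loo(X, rng.permutation(y)) for _ in range(n_rounds)]
```

Neither check replaces the external set: both reuse the same objects that were available while the model was built.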
34. Purposes of
QSAR:
Rational
Identification of
New Leads with:
Pharmacological,
Biocidal or
Pesticidal
Activity.
Optimization of
New Leads with:
Pharmacological,
Biocidal or
Pesticidal
Activity.
The Rational
Design of:
Surface-active
agents, Perfumes,
Dyes, and Fine
Chemicals.
35. Purposes of
QSAR:
The Selection of
Compounds with
Optimal
Pharmacokinetic
Properties.
The Prediction of a
variety of Physico-
chemical Properties
of Molecules.
The Prediction of
the Fate of
Molecules.
The Rationalization
and Prediction of
the Combined
Effects of
Molecules.
36. Purposes of
QSAR:
The Identification
of Hazardous
Compounds at
Early Stages.
The Designing out
of Toxicity and
Side-Effects in
New Compounds.
The Prediction of
Toxicity of
Compounds to
Humans.
The Prediction of
Toxicity to
Environmental
Species.
37. Original
Data Set
Curated
Dataset
Split into
training, test
and external
validation set
Multiple
Training
Sets
Y-Randomization
Combi-QSAR modeling
Multiple
Test Sets
Activity
Prediction
Only Retain
Models that
pass both
internal and
external
accuracy
filters
Validated
Predictive
models with
High Internal
and External
Accuracy
External
Validation using
Applicability
Domain
Virtual Screening
Using Applicability
Domain
Experimental
Validation
The Most Rigorous and Currently Accepted QSAR Methodology
38. A Small Question!!!
Why is QSAR alive in spite of the existence of very
strong rivals like Docking, MDs, Pharmacophore, SB
and LB methods?
Modeling and taking into account all pharmacological
phenomena is:
Nearly or totally impossible even in high level and
advanced research laboratories.
40. Which one would be the third point: a, b, c or d?
Points 1 and 2 have the largest distance, so they are selected first. Then the distances between all unselected points and all selected points are calculated.
Calculate distances 1a and 2a, then min(1a,2a) = 2a.
Calculate distances 1b and 2b, then min(1b,2b) = 2b.
Calculate distances 1c and 2c, then min(1c,2c) = 1c.
Calculate distances 1d and 2d, then min(1d,2d) = 1d.
max(min(1a,2a), min(1b,2b), min(1c,2c), min(1d,2d)) = 1d
Then point d is selected as the third point, and so on…
[Figure: points 1, 2, a, b, c, d with the distances 1a, 2a, 1b, 2b, 1c, 2c, 1d, 2d drawn between them]
KSA Graphical Algorithm
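The max-min selection rule of this slide can be written compactly. The following is a minimal sketch of Kennard-Stone selection with Euclidean distances; the function name and the example coordinates are hypothetical, chosen only so that the distances behave as on the slide.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select objects chosen by the max-min distance rule."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two objects that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # For each unselected object: distance to its closest selected object.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the object whose closest selected neighbour is farthest away.
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

# Hypothetical coordinates for points 1, 2, a, b, c, d.
points = [[0, 0], [10, 0], [8, 1], [7, -1], [2, 1], [4, 3]]
third = kennard_stone(points, 3)[2]  # index 5, i.e. point d
```

The selected objects usually form the training set and the remaining ones the test set, which keeps the training set spread over the whole descriptor space.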
44. [Figure: Original Data and log Values plotted for samples 1 to 11; one axis runs from 0 to 12000, the other from -1 to 4.5]
45. Activity Descr 1 Descr 2 … Descr m
Y1 X11 X12 … X1m
Y2 X21 X22 … X2m
… … … … …
Yn Xn1 Xn2 … Xnm
Yi = a0 + a1 Xi1 + a2 Xi2 +…+ am Xim
Does not consider nonlinear effects
Multiple Linear Regression (MLR)
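The MLR equation above can be fitted by ordinary least squares. This is a minimal sketch with NumPy; `fit_mlr` and `predict_mlr` are hypothetical helper names, and the coefficient vector is ordered as [a0, a1, ..., am].

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares coefficients [a0, a1, ..., am] of Yi = a0 + a1 Xi1 + ... + am Xim."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(X.shape[0]), X])  # column of ones carries a0
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply the fitted equation to a descriptor matrix X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]
```

With many or collinear descriptors the least-squares solution becomes unstable, which is one reason the object to variable ratio matters for MLR.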
46. Partial Least Squares (PLS)
$Y = t_1 q_1 + t_2 q_2 + \dots + t_n q_n + F$
• t: latent variables or scores
• q: loading vectors
Robust with respect to collinear descriptors
Only one model optimization parameter (number of LVs)
Fast computation
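A single-response PLS model (PLS1) can be sketched with the NIPALS-style deflation below. This is an illustrative implementation under stated assumptions, not the software used in the talk: `pls1_fit` is a hypothetical name, X and y are only mean-centered, and the number of latent variables `n_lv` is the one parameter to optimize.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Return (b, b0) so that y is approximated by b0 + X @ b using n_lv latent variables."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centered X and y residuals
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)             # weight vector
        t = E @ w                          # scores (latent variable)
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        qa = f @ t / tt                    # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - qa * t                     # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # regression vector in descriptor space
    return b, y_mean - x_mean @ b
```

When `n_lv` equals the number of linearly independent descriptors the result coincides with MLR; fewer latent variables give a more stable model for collinear descriptors.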
47. The k-Nearest Neighbor Method (kNN)
Works on the Similarity Principle
The method locates a compound's k nearest neighbors among the training-set compounds in descriptor space and predicts the activity class that is most highly represented among these neighbors.
The k-NN scheme is sensitive to: 1- the distance metric; 2- the number of training compounds; 3- k, which can be optimized to yield the best results.
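The majority-vote rule can be sketched in a few lines. This minimal example assumes Euclidean distance in descriptor space; `knn_predict` is a hypothetical name, and both the metric and k are choices the user has to tune.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the class most represented among the k nearest training compounds."""
    X_train = np.asarray(X_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                    # indices of the k closest compounds
    votes = Counter(y_train[i] for i in nearest)   # count class labels among them
    return votes.most_common(1)[0][0]
```

Because distances depend on the scale of each descriptor, autoscaling the descriptor matrix before applying kNN is usually advisable.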
48. Artificial Neural Network (ANN)
Descriptors or Original Space
Nonlinear or Hidden Space
Properties Being Predicted
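49. ε-Insensitive Loss Function
$L_\varepsilon(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases}$
Only the points outside the ε-tube are penalized, in a linear fashion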
Support Vector Regression (SVR)
Support Vector Classification (SVC)
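The ε-insensitive loss used in support vector regression can be evaluated directly. This is a minimal sketch; the function name and the default ε are illustrative assumptions.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the eps-tube, growing linearly with the residual outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - eps)
```

Objects whose residual is smaller than ε contribute nothing to the loss, so only the objects on or outside the tube (the support vectors) determine the regression model.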
50. Non-linear SVMs
Datasets that are linearly separable with some noise work out great:
But what are we going to do if the dataset is just too hard?
How about… mapping data to a higher-dimensional space:
[Figure: data points along the x axis, then the same points mapped to the (x, x²) plane]
51. Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-
dimensional feature space where the training set is separable:
Φ: x → φ(x)
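A one-dimensional toy case illustrates the idea. In this sketch the mapping φ(x) = (x, x²) and the helper names are illustrative assumptions; real SVMs apply such mappings implicitly through kernel functions.

```python
import numpy as np

def phi(x):
    """Map 1D inputs to the 2D feature space (x, x^2)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2])

def separable_by_threshold(values, labels):
    """True if a single threshold on one feature separates class +1 from class -1."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    pos, neg = values[labels == 1], values[labels == -1]
    return bool(pos.min() > neg.max() or neg.min() > pos.max())

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
labels = np.where(np.abs(x) > 1.5, 1, -1)   # positives on both sides of the negatives
in_input_space = separable_by_threshold(x, labels)               # False
in_feature_space = separable_by_threshold(phi(x)[:, 1], labels)  # True
```

In the original space no single cut separates the classes, while in the feature space a horizontal line on the x² coordinate does.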
52. Decision Trees as a Greedy Algorithm:
CART: Classification and Regression Tree
Binary recursive partitioning tree
Best First
Left Right
Up down
Here the variables classify the audience! The first variable is “Biologist or not?”
Why? We are in the Bio-Dept.
53. 3D-QSAR
Notes
No experimental constants or measurements are involved
Properties are known as ‘Fields’
Steric field - defines the size and shape of the molecule
Electrostatic field - defines electron rich/poor regions of molecule
Advantages over 2D-QSAR
No reliance on experimental values
Can be applied to molecules with unusual substituents
Not restricted to molecules of the same structural class (in the pharmacophore 3D-QSAR case)
Predictive capability
54. 3D-QSAR
Method
Comparative molecular field analysis (CoMFA) - Tripos
Build each molecule using modelling software
Identify the active conformation for each molecule
Identify the pharmacophore
[Figure: structure with NHCH3, OH, HO, HO groups; build 3D model, find active conformation, define pharmacophore]
56. 3D-QSAR
Method
•Place the pharmacophore into a lattice of grid points
•Each grid point defines a point in space
[Figure: lattice of grid points]
57. 3D-QSAR
Method
•Each grid point defines a point in space
•Position molecule to match the pharmacophore
[Figure: molecule positioned in the lattice of grid points]
58. 3D-QSAR
Method
•A probe atom is placed at each grid point in turn
•Probe atom = a proton or sp3 hybridised carbocation
[Figure: probe atom on the lattice of grid points]
59. 3D-QSAR
Method
•A probe atom is placed at each grid point in turn
•Measure the steric or electrostatic interaction of the probe atom with the molecule at each grid point
[Figure: probe atom on the lattice of grid points]
60. 3D-QSAR
Method
Tabulate fields for each compound at each grid point

Compound  Biological  Steric fields (S)              Electrostatic fields (E)
          activity    at grid points (001-998)       at grid points (001-998)
                      S001 S002 S003 S004 S005 etc   E001 E002 E003 E004 E005 etc
1         5.1
2         6.8
3         5.3
4         6.4
5         6.1

Partial least squares analysis (PLS)
QSAR equation: Activity = aS001 + bS002 + … + mS998 + nE001 + … + yE998 + z
61. 3D-QSAR
Method
•Define fields using contour maps around a representative molecule
62. 2.5D-QSAR or GRIND methodology
A procedure based on the information included in the molecular interaction fields (MIF), generating a handful of informative variables that are independent of the location of the molecules within the grid.
Two main steps of the transformation procedure:
Field filtering
Maximum auto- and cross-correlation (MACC2) encoding.
The 2 means the distance between two points in space.
63. MACC2 transform
The MACC2 transform keeps the maximum value of the product of two field values, i and j, found at each different distance rij.
Here the colors represent the activity of the compounds (blue inactive, red active).
33 means the energy products produced by two N1 probes.
8 means the 8th variable of auto-correlogram 33.
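The encoding step can be sketched as follows. This is only an illustration of the "maximum product per distance" idea, not the GRIND implementation: the function name, the fixed-width distance bins and the array layout are assumptions.

```python
import numpy as np

def macc2(coords_i, energies_i, coords_j, energies_j, bin_width=1.0, n_bins=10):
    """For each distance bin keep the largest product of two node energies."""
    coords_i = np.asarray(coords_i, dtype=float)
    coords_j = np.asarray(coords_j, dtype=float)
    # Pairwise distances r_ij between nodes of field i and nodes of field j.
    d = np.linalg.norm(coords_i[:, None, :] - coords_j[None, :, :], axis=2)
    # Pairwise energy products (two favorable, negative energies give a positive product).
    prod = np.outer(np.asarray(energies_i, dtype=float), np.asarray(energies_j, dtype=float))
    bins = np.floor(d / bin_width).astype(int)
    correlogram = np.zeros(n_bins)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            correlogram[b] = prod[mask].max()  # most intense node pair at this distance
    return correlogram
```

Passing the same field twice gives an auto-correlogram, in which every node paired with itself falls into the first bin; passing two different fields gives a cross-correlogram. Because only distances between nodes enter, the variables do not depend on where the molecule sits in the grid.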
64. GRID interaction fields calculated using the N1 probe: positive (yellow) values describe unfavorable interactions and negative (blue) values describe favorable interactions.
The nodes kept by field filtering should have low energy values (representing highly favorable interactions) and should be as far as possible from each other.
71. One of the unique features of the MACC
transform is that it is possible to trace back the
variables that generated this "most intense"
interaction.
VRS