Smashing Molecules
1. Smashing Molecules
How Molecular Fragments Allow us to Explore Large Chemical Spaces
Rajarshi Guha & Trung Nguyen
NIH Center for Translational Therapeutics
Chemaxon UGM
September 2011
2. Outline
• Fragments as the building blocks of chemistry
• Fragments and SAR
• Fragments and activity profiles
3. Big Data for Some Problems
• Halevy et al discuss the effectiveness of extremely large datasets
• Their application focuses on machine translation – see the Google n-gram corpus
• They suggest that such extremely large datasets are useful because they effectively encompass all n-grams (phrases) commonly used
• Domain is relatively constrained
Halevy et al, IEEE Intelligent Systems, 2009, 24, 8-12
4. Google Scale in Chemistry?
• What would be the equivalent of an n-gram corpus in chemistry?
– Fragments
– A more direct analogy can be made by using LINGOs
• It is possible to generate arbitrarily large (virtual) compound and fragment collections
• But would such a collection span all of “commonly used” chemistry?
– Depending on the initial compound set, yes
– But we’re also interested in going beyond such a “commonly used” set
Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
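The n-gram analogy can be made concrete with LINGOs, which are overlapping substrings of a SMILES string. The following is a minimal sketch, assuming a substring length of q = 4 and no SMILES normalization before splitting; the function name `lingos` is hypothetical and not taken from the talk.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping length-q substrings of a SMILES string.

    Assumption: the SMILES is used as given, with no canonicalization
    or other preprocessing.
    """
    if q < 1:
        raise ValueError("q must be positive")
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


# Phenol written as SMILES; the profile plays the role of an n-gram count table.
profile = lingos("c1ccccc1O")
```

A collection of such profiles can be pooled into a single counter, much like an n-gram corpus for text.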
5. Fragment Diversity
• Consider a set of bioactives such as the LOPAC collection, 1280 compounds
• Using exhaustive fragmentation we get 2,460 unique fragments
• On the MLSMR (~372K compounds), we get 164,583 fragments
[Figure: histogram of log Fragment Frequency (x-axis) vs. Percent of Total (y-axis)]
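The fragment frequency plotted here can be read as the number of molecules in which a fragment occurs. A minimal sketch of that count, assuming fragments are already available as SMILES strings per molecule (the fragmentation step itself is not shown, and the toy data and the name `fragment_frequencies` are hypothetical):

```python
import math
from collections import Counter


def fragment_frequencies(mol_fragments):
    """Map each unique fragment to the number of molecules containing it."""
    counts = Counter()
    for frags in mol_fragments.values():
        counts.update(set(frags))  # count a fragment once per molecule
    return counts


# Toy input: molecule id -> fragments obtained from it (made-up example).
toy = {
    "mol1": ["c1ccccc1", "C(=O)O"],
    "mol2": ["c1ccccc1"],
    "mol3": ["c1ccccc1", "c1ccncc1"],
}
freq = fragment_frequencies(toy)
log_freq = {frag: math.log10(n) for frag, n in freq.items()}
```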
6. Fragment Diversity
[Figure: two scatter plots of PC 1 vs. PC 2; panels: "All fragments" and "Fragments occurring in 5 to 50 molecules"]
• Distribution of MLSMR fragments in BCUT space
7. What Do We Do with Fragments?
• Assuming we obtain fragments from a large enough collection, what do we do?
– Learning from fragments – QSARs, generative models
– Use fragments as filters, alternative to clustering
– Explore chemotypes and activity
– Scaffold level promiscuity
White, D and Wilson, RC, J Chem Inf Model, 2010, 50, 1257-1274
8. Scaffold Activity Diagrams
• Network oriented view of fragment (scaffold) collections
– Similar in idea to Scaffold Hunter etc
– Not purely hierarchical
• Color by arbitrary properties
• Quickly assess utility of a scaffold
• Try it online
9. What Makes a Good Scaffold?
• What makes a good scaffold?
– Size, complexity, …
– Do the members represent an SAR or not?
– Intuition and experience also play a role
10. Scaffold QSAR
Fit PLS or ridge regression model
Evaluate topological and physicochemical descriptors for the R-groups
Characterize the SAR landscape
[Figure: scatter plot of Observed (x-axis) vs. Predicted (y-axis) values]
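To illustrate the ridge regression option, here is a minimal pure-Python sketch that solves the ridge normal equations for a small feature matrix. It is not the authors' implementation: the one-hot R-group encoding, the activity values, the absence of an intercept, and the name `ridge_fit` are all assumptions made for the example.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y by Gauss-Jordan elimination.

    X is a list of feature rows, y a list of activities. With lam > 0 the
    system stays solvable even when there are more features than rows.
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam * I | X^T y]
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
         for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical one-hot features for two R-groups across three scaffold members.
X = [[1, 0], [0, 1], [1, 1]]
y = [1.0, 2.0, 3.0]
weights = ridge_fit(X, y, lam=1e-9)  # tiny penalty: close to least squares
```

A larger `lam` shrinks the weights toward zero, which is one way to cope with scaffolds that have more R-group features than member molecules.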
11. Scaffold QSAR - Drawbacks
• Many scaffolds have few (5 to 10) members
• Invariably, more features than observations
• If the number of R-groups is large, the feature matrix can be very sparse
– Less of a problem for combinatorial libraries
• A linear fit may not be the best approach to correlating R-groups to the activities
– Difficult to choose a model type a priori
12. Fragment Activity Profiles
• Using scaffolds in HTS triage usually leads to two questions
– What is known about the chemical series with respect to the intended target?
– What compound classes are known to modulate the intended target & how similar are they to the series in question?
• We’re interested in exploring summaries of activity, grouped by scaffolds and targets
13. Fragment Activity Profiles
• We use ChEMBL (08) as the source of bioactivity across multiple targets
• Preprocess the database
– Generate scaffolds (exhaustive enumeration of combinations of SSSRs)
– Normalize activity data so that we can compare the activity of a molecule across different assays
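The combinatorial part of scaffold generation can be sketched as enumerating every non-empty subset of a molecule's SSSR rings. This is only an illustration of the enumeration: it assumes rings are given as sets of atom indices, does not check that the chosen rings are connected, and does not build structures; the name `ring_combinations` is hypothetical.

```python
from itertools import combinations


def ring_combinations(rings):
    """Yield (ring indices, union of atom indices) for every non-empty ring subset."""
    for k in range(1, len(rings) + 1):
        for combo in combinations(range(len(rings)), k):
            atoms = frozenset().union(*(rings[i] for i in combo))
            yield combo, atoms


# Made-up molecule: two fused six-membered rings plus a separate five-membered ring.
rings = [{0, 1, 2, 3, 4, 5}, {4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14}]
combos = list(ring_combinations(rings))
```

With n rings there are 2^n - 1 subsets, which is why exhaustive enumeration yields many fragments per molecule.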
14. Database Setup
• Preprocessing steps available as a Java servlet
– http://tripod.nih.gov/files/chembl-servlets.zip
• Need ChEMBL installed in Oracle; we add some extra tables
– Fragment structures and computed properties
– Aggregated assay activity summary
• Only consider assays with IC50s in nM and uncensored data, more than 5 observations and a MAD > 0
– (Robust) z-scored activities
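The assay filter and robust z-scoring can be sketched as follows. This is a minimal illustration, assuming the robust z-score is (value - median) / MAD with an unscaled MAD; the talk does not state whether a scaling constant or a log transform is applied, and the function name `robust_z_scores` is hypothetical.

```python
from statistics import median


def robust_z_scores(values, min_obs=5):
    """Return robust z-scores for one assay, or None if the assay is filtered out.

    An assay is kept only if it has more than `min_obs` observations
    and a median absolute deviation (MAD) greater than zero.
    """
    if len(values) <= min_obs:
        return None
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return None
    return [(v - med) / mad for v in values]


# Made-up IC50 values (nM) for a single assay.
z = robust_z_scores([10, 20, 30, 40, 50, 1000])
```

Using the median and MAD keeps a single extreme measurement from distorting the scores of the other molecules in the assay.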
15. Some Fragment Statistics
• Considered Z-score range of -40 to 15
• There were 12,887 molecules lying outside this range
[Figure: two histograms; axis labels: log(Number of molecules), Z-score, Percentage of assays, Number of compounds]
16. Some Fragment Statistics
• Next, identify fragments with 8 to 20 atoms and occurring in 100 to 900 molecules
• Gives us 1,746 fragments
[Figure: histogram of Num Molecules (x-axis) vs. Percentage of Fragments (y-axis)]
17. Some Fragment Statistics
• We can query the fragment tables to get activity summaries for individual fragments
• For these examples we consider the full range of Z-scores
[Figure: grid of Z-Score histograms (y-axis: Percent of Total), one panel per fragment: 40169 (N = 1457), 64473 (N = 1595), 115654 (N = 1515), 5390 (N = 1489), 5486 (N = 1578), 13485 (N = 1455), 778 (N = 1280), 2723 (N = 1918), 4058 (N = 2641)]
18. Exploring Activity Profiles
[Figure: annotated screenshot with the following labels]
Fragments from ChEMBL
Activity distributions of parent molecules across all targets
Z-scores for individual molecules against a specific target
19. Exploring Activity Profiles
• User can draw a molecule and fragment on the fly
• Use generated fragments to create activity histograms
20. Target Selection
• Employs the ChEMBL target hierarchy
• Can select target families or individual targets
21. Similar Fragments with Similar Profiles?
• Consider 658 fragments with > 10 atoms and occurring in 500 to 1200 molecules
• Overall, the fragments tend to be dissimilar
– 95th percentile is just 0.50
• 1,873 pairs do exhibit Tc > 0.8
[Figure: histogram of Tanimoto Similarity (x-axis) vs. Percentage of pairs (y-axis)]
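The pairwise comparison can be sketched with the Tanimoto coefficient on fingerprints represented as sets of on-bit indices. The fingerprint type used in the talk is not stated, so the toy bit sets below, the 0.8 threshold applied to them, and the name `tanimoto` are assumptions for illustration.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Made-up fingerprints, keyed by a hypothetical fragment label.
fps = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 4, 5, 6},
    "C": {7, 8},
}

# All unordered fragment pairs whose similarity exceeds the cutoff.
similar_pairs = [
    (i, j) for i, j in combinations(sorted(fps), 2)
    if tanimoto(fps[i], fps[j]) > 0.8
]
```

For 658 fragments this loop visits 658 × 657 / 2 = 216,153 pairs, which is small enough to enumerate directly.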
22. Comparing Activity Profiles
• Compare activity profiles with the K-S statistic
• Color corresponds to p-value of the K-S test
• No obvious correlation between fragment similarity & activity profile similarity
• Probably not rigorous when a scaffold has few parent molecules
[Figure: scatter plot of Tanimoto Similarity (x-axis, 0.80 to 1.00) vs. K-S statistic (y-axis), colored by p-value]
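The two-sample K-S statistic is the largest gap between the empirical cumulative distributions of two samples. A minimal pure-Python sketch, assuming each activity profile is a plain list of Z-scores; the p-value is not computed here, and the name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


# Made-up Z-score profiles for two fragments.
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 0 means the two profiles are nearly indistinguishable, while a value of 1 means they do not overlap at all.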
23. Exploring Profiles for Fragment Pairs
• Compare activity distributions across all targets in a pairwise fashion
• Can also generate comparison for a single target, but requires data for all the fragments
24. Looking for Selective Fragments
• Interesting to visually explore fragment pairs
• Can become tedious, especially in a database as big as ChEMBL
• Can we automate this type of analysis?
– Identify fragment pairs with very different activity distributions?
– Identify fragments with a preference for a certain target (class)?
25. Targetwise Activity Profiles
• Count number of parent molecules tested against the target class
• Evaluate mean activity of parent molecules within a target
• Selectivity of 1-phenylimidazole for CYP450 has been noted
Wilkinson et al, Biochem Pharmacol, 1983, 32, 997-1003
[Figure: bar chart of Mean Z-Score per target class for panel 4056459, each bar annotated with the number of parent molecules; target classes include CYP isoforms, ion channels and receptor families]
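The targetwise summary amounts to grouping a fragment's parent-molecule Z-scores by target class, then reporting the count and the mean for each class. A minimal sketch with made-up records; the (target class, Z-score) tuple layout and the name `targetwise_profile` are assumptions, not the schema used in the talk.

```python
from collections import defaultdict
from statistics import mean


def targetwise_profile(records):
    """Map target class -> (number of parent molecules, mean Z-score)."""
    by_class = defaultdict(list)
    for target_class, z in records:
        by_class[target_class].append(z)
    return {cls: (len(zs), mean(zs)) for cls, zs in by_class.items()}


# Made-up Z-scores for the parent molecules of a single fragment.
records = [("CYP_2D6", -8.0), ("CYP_2D6", -6.0), ("Tk", 0.5)]
profile = targetwise_profile(records)
```

A fragment whose mean Z-score is strongly negative for one class and near zero elsewhere would be a candidate for target-class preference.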
26. Targetwise Activity Profiles
• Identified benzylpyrrolidine as a fragment with preference for a specific target class
• But reported as dopamine agonists
[Figure: bar chart of Mean Z-Score per target class for panel 4055899, each bar annotated with the number of parent molecules]
27. Fragment or Scaffold?
• I’ve been using fragment & scaffold interchangeably – not always true
• Chemists have an intuitive idea of what a scaffold is
• Can we encode the idea of scaffold-like or fragment-like?
• We use the concept of Signal-to-Noise Ratio
$\mathrm{SNR} = \mu / \sigma$
– µ: size of fragment
– σ: SD of number of atoms not in the fragment, considered over the parent molecules
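The ratio can be sketched directly from its definition: fragment size divided by the SD of the number of non-fragment atoms across the parent molecules. This is a minimal illustration, assuming atom counts are already known and using the population SD (the talk does not say whether population or sample SD is meant); the name `fragment_snr` is hypothetical.

```python
from statistics import pstdev


def fragment_snr(fragment_size, parent_sizes):
    """Fragment size over the SD of atoms lying outside the fragment."""
    extra = [n - fragment_size for n in parent_sizes]
    sd = pstdev(extra)  # assumption: population standard deviation
    return float("inf") if sd == 0 else fragment_size / sd


# Made-up example: a 10-atom fragment found in parents of 12, 14 and 16 atoms.
snr = fragment_snr(10, [12, 14, 16])
```

A large value means the parent molecules add a similar number of atoms to the fragment, which is consistent with the fragment acting as a common core.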
28. Fragment or Scaffold
• Partial distribution of SNR values for fragments with atom count > 8 & < 20
[Figure: histogram of SNR (x-axis, 0 to 6) vs. Percentage of Fragments (y-axis)]
29. Fragment or Scaffold
• Large SNRs associated with Murcko-like fragments
• A useful SNR cutoff is an open question
[Figure: example fragment structures labeled SNR = 8.50, SNR = 9.10, SNR = 12.09, SNR = 0.83, SNR = 0.43, SNR = 0.36]
30. Activity Profiles & SNR
• Given a fragment, evaluate SD of the number of atoms in the parent molecules that are not part of the fragment
• Label the parent molecules as follows
– If number of atoms not in the fragment > SD, non core-like
– Otherwise core-like
• Visualize the activity distributions of the parent molecules, grouped by the label
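The labeling rule can be sketched as a split of the parent molecules' Z-scores into two groups. This is a minimal illustration under the same assumptions as the SNR definition (known atom counts, population SD); the (atom count, Z-score) tuple layout and the name `split_activities` are hypothetical.

```python
from statistics import pstdev


def split_activities(fragment_size, parents):
    """Group parent Z-scores into 'core-like' and 'non core-like'.

    parents is a list of (atom_count, z_score) tuples for one fragment.
    """
    extra = [n - fragment_size for n, _ in parents]
    sd = pstdev(extra)  # assumption: population standard deviation
    groups = {"core-like": [], "non core-like": []}
    for e, (_, z) in zip(extra, parents):
        label = "non core-like" if e > sd else "core-like"
        groups[label].append(z)
    return groups


# Made-up parents of a 10-atom fragment: two close analogs and one much larger molecule.
groups = split_activities(10, [(10, -5.0), (11, -4.0), (25, 1.0)])
```

The two lists can then be plotted as separate histograms to compare activity between core-like and non core-like parents.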
31. Activity Profiles & SNR
[Figure: histograms of Z-Score (x-axis) vs. Percentage of Total (y-axis), each split into Core-like and Not core-like parent molecules; High SNR panels: 20967 and 44591; Low SNR panels: 801 and 68604]