Creating novel drugs is an extraordinarily hard and complex problem.
One of the many challenges in drug design is the sheer size of the search space for novel chemical compounds. Scientists need to find molecules that are active toward a biological target or pathway and at the same time have acceptable ADMET properties.
There is now considerable research going on using various AI and ML approaches to tackle these challenges.
Our distinguished speakers, Drs. Alex Tropsha and Ola Engkvist, will discuss their recent work in Drug Design involving Deep Reinforcement Learning and Neural Networks, and will answer questions from the audience on the current state of the research in the field.
Speakers:
Prof Alex Tropsha, Professor at University of North Carolina at Chapel Hill, USA
Dr. Ola Engkvist, Associate Director at AstraZeneca R&D, Gothenburg, Sweden
3. Poll Question 1:
Are you or your organisation using AI / ML in Drug Design?
A. Yes, already
B. Plan to do in next 12 months
C. Plan in next 12-24 months
D. No plans
4. ©PistoiaAlliance
Introduction to Today's Speakers
Prof Alex Tropsha
Associate Dean for
Pharmacoinformatics and data
science
K.H. Lee distinguished professor
Dr Ola Engkvist
Associate Director
Discovery Sciences
AstraZeneca
5. Alexander Tropsha
UNC Eshelman School of
Pharmacy
Machine learning, text mining, and
AI approaches for drug discovery
and repurposing
10. QSAR Modeling Workflow: the
importance of rigorous validation
[Figure: Combi-QSAR modeling workflow, courtesy of L. Zhang]
Datasets: split into modeling set and external set (5-fold external validation)
Descriptors: Dragon, MOE
Modeling methods: K-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM)
Internal validation
Model selection
An ensemble of QSAR models
Evaluation of external performance
Tropsha, A. Best Practices for QSAR Model Development, Validation,
and Exploitation. Mol. Inf., 2010, 29, 476-488
Fully implemented on CHEMBENCH.MML.UNC.EDU
Virtual screening
(with AD threshold)
Experimental
confirmation
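The modeling-set / external-set split in the workflow above can be sketched as follows. This is a minimal illustration, not the Chembench implementation: the function name `external_folds`, the seed, and the placeholder compound IDs are all assumptions made for the example.

```python
import random

def external_folds(items, k=5, seed=0):
    """Yield (modeling_set, external_set) pairs for k-fold external validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        external = folds[i]
        # Everything outside the held-out fold forms the modeling set.
        modeling = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield modeling, external

# Hypothetical compound identifiers, used only to exercise the split.
compounds = [f"cmpd_{i}" for i in range(20)]
splits = list(external_folds(compounds, k=5))
```

Each external set is never seen while models are built and selected on the matching modeling set, which is the point of the validation scheme on this slide.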
13. ReLeaSE* design principles: learning
and exploiting structural linguistics of
SMILES notation
• SMILES notations reflect rules of Chemistry
• SMILES notation may embed linguistic rules
• Neural nets could learn both of the above types of rules
• This knowledge can be transformed into the generation of new SMILES corresponding to novel chemically feasible molecules (generative model)
• One can build QSAR models based solely on SMILES notation (predictive model)
• QSAR models can be used as a reward function for reinforcement learning to bias the design of novel libraries
*Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design."
Science Advances, 2018 Jul 25;4(7):eaap7885.
14. NLP/Text mining: directly learn
low-dimensional word vectors
- In deep learning models, a word is represented as a dense vector
- Word vectors form the basis for deep learning methods
- Objective: predict a word based on the context
Mikolov T. et al. Distributed representations of words and phrases and their compositionality
// Advances in Neural Information Processing Systems. 2013. pp. 3111-3119.
15. Design of the ReLeaSE* method
(Reinforcement Learning for Structural Evolution)
Elements of the thought cycle (molecules -> models -> molecules):
• Generate chemically feasible SMILES
• Develop SMILES-based QSAR model
• Employ QSAR model to bias library generation
• Produce new SMILES
*Popova, Mariya, Olexandr Isayev, and Alexander Tropsha. "Deep reinforcement learning for de-novo drug design."
arXiv preprint arXiv:1711.10907 (2017).
16. ReLeaSE:* Disruptive Innovation of
Conventional Computational Drug
Discovery Pipeline
Traditional Workflow:
Learn from target-specific data (300-500 molecules) -> Target-specific models -> Virtual screening of internal/public databases -> Selection and testing of known molecules
ReLeaSE Workflow:
Learn from all data (2M molecules) -> Target-specific and property models / Reinforcement learning -> Generation of novel molecules -> Selection and testing of novel molecules
Hits with desired properties
*Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design."
Science Advances, 2018 Jul 25;4(7):eaap7885.
17. Disruptive innovation in QSAR: Can we avoid
descriptor generation altogether and besides,
predict new structures?
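[Figure: training the generative model on 1.5M molecules from ChEMBL. Each SMILES string, e.g. <START>c1ccc(O)cc1<END>, is fed to the network one character at a time; a softmax loss is computed on each predicted next character, and the cycle repeats while the answer to "Did the training converge?" is NO and stops when it is YES.]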
*Popova, M., Isayev, O., and Tropsha, A. "Deep reinforcement learning for de-novo drug design."
Science Advances, 2018 Jul 25;4(7):eaap7885.
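The softmax loss named on this slide can be illustrated for a single next-character prediction. The sketch below is a stand-alone calculation, not the ReLeaSE code: the toy vocabulary, the logit values, and the function names are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_char_loss(logits, vocab, target):
    """Cross-entropy loss for predicting `target` as the next character."""
    probs = softmax(logits)
    return -math.log(probs[vocab.index(target)])

# Toy vocabulary; a real model would cover every SMILES token in the training set.
vocab = ["c", "1", "(", ")", "O", "<END>"]

# An untrained model spreads probability evenly; a trained one favours the target.
uniform_loss = next_char_loss([0.0] * len(vocab), vocab, "c")
confident_loss = next_char_loss([5.0, 0.0, 0.0, 0.0, 0.0, 0.0], vocab, "c")
```

Summing this loss over every position of every training string gives the quantity that training drives down until convergence.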
18. Are we making legitimate SMILES?
AI learning system
95% valid, chemically feasible molecules
SMILES strings
21. QSAR modeling using SMILES strings only*
CN2C(=O)N(C)C(=O)C1=C2N=CN1C -> Neural Network -> Property prediction
[Plot: Predicted LogP vs. Observed LogP]
            5CV RF model with DRAGON7 descriptors   5CV NN model with SMILES directly
RMSE:       0.57                                    0.53
MAE:        0.37                                    0.35
R2 ext:     0.90                                    0.91
*LogP data for ~16K molecules from PHYSPROP (srcinc.com), Toxcast Dashboard
(https://comptox.epa.gov/dashboard), and others.
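For reference, the three statistics quoted on this slide (RMSE, MAE, R2) can be computed as below. This is a generic sketch using the standard definitions; the observed and predicted values are made-up numbers, and the slide's "R2 ext" may have been calculated with a different convention.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Illustrative LogP values only, not data from the talk.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
stats = (rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```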
32. Results: Synthetic accessibility
score* of the designed libraries
*Ertl, Peter, and Ansgar Schuffenhauer. "Estimation of synthetic accessibility score of drug-like molecules based on molecular
complexity and fragment contributions." Journal of cheminformatics 1.1 (2009): 8.
34. Predicted pIC50 for JAK2 kinase
CAS 236-084-2
(buffer reagent)
ZINC37859566
New molecule
SIMILAR SCAFFOLDS
NEW CHEMOTYPE
JAK2 Kinase inhibition
Untrained data distribution
Maximized property distribution
Minimized property distribution
35. Target predictions for generated
compounds using SEA*
*Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand
chemistry. Nat Biotech 25 (2), 197-206 (2007).
36. Target predictions for generated
compounds using SEA*
*Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand
chemistry. Nat Biotech 25 (2), 197-206 (2007).
37. Practical implementation workflow
• Select a target
• Train ReLeaSE to generate new target-specific molecules; collect computational hits
• Identify a fraction of hits available in commercial libraries; purchase and test selected hits
• Following successful validation, order NCE synthesis and testing in vitro and in vivo and, if successful, file for IP protection
38. Summary
• We propose an innovative de novo drug discovery technology termed Reinforcement Learning for Structural Evolution (ReLeaSE)*
• ReLeaSE is a product of convergence of fields as disparate as cheminformatics and text mining united by AI
• Unlike most of the current technologies, ReLeaSE enables the discovery of new chemical entities with the desired bioactivity and drug-like properties
Patent application filed (application # 62/535069, filed by UNC, 07/2018)
39. General Summary
• Accumulation of Big Data in all areas of research creates previously unachievable opportunities for using ML and AI approaches
  - However, primary data must be handled with extreme care (curation, reproducibility)
• Exciting developments in computational chemistry
  - Critical shift from discovery to design and AI-driven robotics
• Rapid progression from the use of computational modeling for decision support to using models to guide experimental research
  - Critical importance of rigorous and comprehensive model validation using truly external data
• Natural progression toward automated chemical labs driven by AI
40. Principal Investigator
Alexander Tropsha
Research Professors
Alexander Golbraikh
Olexandr Isayev
Eugene Muratov
Graduate students
Sherif Farag
Kyle Bowers
Maria Popova
Andrew Thieme
Dan Korn
Phil Gusev
Postdoctoral Fellows
Vinicius Alves
Joyce Borba
MAJOR FUNDING
NIH
- 1U01CA207160
- R01-GM114015
- 5U54CA198999
- 1OT3TR002020
ONR
- N00014-16-1-2311
Acknowledgements
41. Poll Question 2:
What are the biggest barriers to machine learning adoption in Drug Design? (multi select)
A. Lack of access to AI/ML Skills
B. Access to Data
C. Quality of Data
D. Access to ML & AI Tools
E. Other
42. Artificial Intelligence in Drug Design
Ola Engkvist, Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Sweden
February 26 2019, PISTOIA Webinar
43. Drug Design
What to make next? How to make it?
De novo design
Multi-parameter scoring function
Retrosynthesis
44. What is different now?
Augmented
design
Autonomous
design
Automatic
design
de novo molecular
design
Synthesis prediction
Automation
Data generation
45. It takes two to tango
Artificial Intelligence Chemistry Automation
47. Neural Networks & Deep Learning
• Neural Networks known for decades
• Inputs, Hidden Layers, Outputs
• Single layer NNs have been used in QSAR modelling for years
• Recent Applications use more complex networks such as
  • Multi-layer Feed-Forward NNs
  • Convolutional NNs
    • biological image processing
  • Auto-encoder NNs
  • Adversarial NNs
  • Recurrent NNs
48. Why? Generation of Novel Compounds in the 10^60 Chemical Space!
Where's the impact?
• Use for de novo Molecular Design
• Scaffold Hopping
• Novelty
• Virtual Screening
• Library Design
10^60    10^10-10^12
49. Natural language generation and molecular structure generation
• Can we borrow concepts from natural language processing and apply them to the SMILES description of molecular structures to generate molecules?
• Conditional probability distributions given context
• P(green | is, grass, The)
• P(O | =, C, C)
The grass is ?
C C = ?
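The idea of a conditional probability given context can be made concrete with a simple count-based model over SMILES characters. This is only a stand-in for the recurrent network discussed in the talk: the four-string corpus, the fixed context length of three, and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def fit_next_token_model(sequences, context_size=3):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(context_size, len(seq)):
            context = tuple(seq[i - context_size:i])
            counts[context][seq[i]] += 1
    return counts

def prob(counts, context, token):
    """Estimate P(token | context) from the counts; 0.0 for unseen contexts."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[token] / sum(seen.values())

# Tiny hypothetical corpus of SMILES strings, split into single characters.
corpus = [list("CC=O"), list("CC=C"), list("CC=O"), list("NC=O")]
model = fit_next_token_model(corpus)

# Probability that "O" follows the context C, C, = in this corpus.
p_o = prob(model, ["C", "C", "="], "O")
```

An RNN replaces the lookup table with a learned function of the whole preceding string, so it can generalise to contexts it has never seen.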
50. Tokenization of SMILES
• Tokenize combinations of characters like "Cl" or "[nH]"
• Represent the characters as one-hot vectors
52. Reinforcement learning
Learning from doing
Action Reward Update behaviour
Design molecule
Active?
Good DMPK?
Synthetically accessible?
Make more like this?
Make something else instead?
Agent
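The action, reward, update loop on this slide can be sketched with a toy policy-gradient agent. This is a deliberately simplified illustration, not the method used by either speaker: the agent picks one of four hypothetical fragments, and the reward function is an invented stand-in for questions such as "Active?" or "Good DMPK?".

```python
import math
import random

def softmax(prefs):
    """Turn preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train(reward_fn, actions, steps=2000, lr=0.1, seed=0):
    """Gradient-bandit agent: act, observe reward, update behaviour."""
    rng = random.Random(seed)
    prefs = [0.0] * len(actions)
    baseline = 0.0
    for step in range(1, steps + 1):
        probs = softmax(prefs)
        i = rng.choices(range(len(actions)), weights=probs)[0]  # action
        reward = reward_fn(actions[i])                          # reward
        baseline += (reward - baseline) / step                  # running mean reward
        for j in range(len(actions)):                           # update behaviour
            grad = (1.0 if j == i else 0.0) - probs[j]
            prefs[j] += lr * (reward - baseline) * grad
    return softmax(prefs)

# Hypothetical fragments and a made-up reward that favours "O".
actions = ["F", "Cl", "O", "N"]
final_probs = train(lambda a: 1.0 if a == "O" else 0.0, actions)
```

After training, the agent proposes the rewarded fragment far more often than the others, which is the "make more like this" behaviour in miniature.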
53. AI live: Create Structures Similar to Celecoxib
• Key Message
• RNN generates structures similar to Celecoxib
• Rapid sampling!
• Average score describes how many learning steps are required to reach similar compounds
54. Some misconceptions about de novo RNN generated molecules
"The molecules are not diverse"
"The molecules are not synthetically feasible"
Answer: The generated molecules follow the properties of the dataset used as prior
Segler et al ACS Central Sci. 2018, 4, 120-131 Ertl et al arXiv:1712.07449
Diversity Synthetic feasibility
55. "Cambrian explosion" of different DL-based molecular de novo generation methods
PyTorch + RDKit + ChEMBL => anyone with a computer can contribute =>
Benchmarking is urgently needed
56. Which benchmarks? What are the relevant questions?
Does the same algorithm work best for both scaffold hopping and lead series optimization?
Which algorithm samples the underlying chemical space most completely?
1
2
3
Which algorithm zooms most efficiently to the most interesting regions of chemical space?
4
Which is the best way to describe molecules, strings or graphs?
57. Benchmark published by the scientific community
• MOSES, Polykovskiy et al.
  • https://arxiv.org/abs/1811.12823
  • Diversity and quality of generated molecules
1
2
3
• Arus-Pous et al.
  • https://chemrxiv.org/articles/Exploring_the_GDB13_Chemical_Space_Using_Deep_Generative_Models/7172849
  • Complete sampling of the relevant chemical space
4
• Klambauer et al.
  • J. Chem. Inf. Mod. 2018, 58, 1736
  • Distribution between generated and real molecules
• GuacaMol, Brown et al.
  • https://arxiv.org/abs/1811.09621
  • Efficient optimisation of a specific property
58. Artificial Intelligence Guided Drug Design Platform
Generation of Novel Chemical
Space
Reaction & Synthesis
Prediction
iLAB
DMTA
Make
Test
Analyse
Design
Desirability function
Σ IC50, LogP, Novelty etc.
Iterations
Profiling
AI Design
Platform
Fully Automated
DMTA Cycle
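A desirability function that sums over several properties, as sketched on this slide, might look like the following. The weighted-average form, the property names, the weights, and the candidate scores are all assumptions for illustration; the actual AstraZeneca scoring function is not described in the slides.

```python
def desirability(scores, weights):
    """Weighted average of per-property scores already scaled to [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight

# Hypothetical weights: potency counts twice as much as the other properties.
weights = {"potency": 2.0, "logp": 1.0, "novelty": 1.0}

# Hypothetical candidates with made-up, pre-scaled property scores.
candidates = {
    "mol_a": {"potency": 0.9, "logp": 0.5, "novelty": 0.2},
    "mol_b": {"potency": 0.4, "logp": 0.8, "novelty": 1.0},
}

# Rank candidates from most to least desirable.
ranked = sorted(candidates, key=lambda m: desirability(candidates[m], weights), reverse=True)
```

In a DMTA loop, a score of this kind would decide which designed molecules go forward to the Make step in each iteration.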
59. 2018 Proof-of-Principle Pilot Study
1st iteration
Novelty
3rd iteration
Expansion
2nd iteration
Novelty
4th iteration
Chemistry Automation
library
~2 months ~2 months ~2 months
Constant re-learning and training
1
• Novelty key goal
• Crowded IP space
• Lots of available data
• Selectivity
• New promising series identified
2
• Selectivity key goal
• Novelty
• Several promising series identified
3
• Optimising HI series
• Tool compound
• Optimization successful
60. Lessons from pilot study
• It works!
• Novel scaffolds were identified in crowded chemical space
• Compound series could be efficiently optimised
• Affinity and ADME predictions are still bottlenecks
• Too many ideas might make prioritization for synthesis challenging
• Chemistry resources need to be frontloaded
• Optimisation under constraints might lead to molecules that are difficult to synthesize
61. How can we improve affinity prediction?
• Synergize with automation
• Better Machine Learning Models
  • Access to more data (for instance IMI2 Call 14 Topic 3)
  • Experimental descriptors
  • Graph convolution, include protein based information
  • Multi-task modelling
  • Matrix factorization with side information
• Free energy calculations
  • Progress in speed
  • Combine with machine learning
• Confidence estimation
  • Conformal prediction
  • Bayesian methods
• Benchmarking
  • Public Chemogenomics sets available (ChEMBL, Excape-DB, Pidgin)
  • Blind competitions (SAMPL, D3R)
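Conformal prediction, listed above under confidence estimation, can be sketched for a regression model as follows. This is a minimal split-conformal example that uses the absolute residual as the nonconformity score; the calibration values are invented, and the talk does not say which variant is used in practice.

```python
import math

def conformal_interval(calib_observed, calib_predicted, new_prediction, confidence=0.8):
    """Prediction interval from absolute residuals on a held-out calibration set."""
    scores = sorted(abs(o - p) for o, p in zip(calib_observed, calib_predicted))
    n = len(scores)
    # Rank of the calibration residual that bounds the requested confidence level.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        raise ValueError("too few calibration points for this confidence level")
    half_width = scores[rank - 1]
    return new_prediction - half_width, new_prediction + half_width

# Made-up calibration data: predictions of 0.0 against observed values 0.1 .. 0.9.
calib_observed = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
calib_predicted = [0.0] * 9

low, high = conformal_interval(calib_observed, calib_predicted, 2.0, confidence=0.5)
```

The interval width comes entirely from how wrong the model was on compounds it did not train on, so an over-confident affinity model yields visibly wide intervals.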
62. Will ML/AI revolutionize drug design?
My personal opinion(s)
• Only time will tell…
• The last commonly agreed revolution was the introduction of DMPK departments in the 90s, so the bar is high
• ML/AI, like other promising technologies (for instance PROTACS), warrants further investments
• More data, automation and ability to learn makes ML/AI bound to have larger impact on drug design in the future
• During my 19 years in industry it has never been as exciting to work with in silico drug design
63. Acknowledgements
Discovery Sciences CompChem ML/AI Team
Thierry Kogej
Hongming Chen
Isabella Feierberg
Atanas Patronov
Esben Jannik Bjerrum
Preeti Iyer
Jiangming Sun (Postdoc 2015-2017)
Noe Sturm (Postdoc 2017-2018)
Philipp Buerger (Postdoc 2017-2020)
Jiazhen He (Postdoc 2019-2022)
Rocio Mercado (Postdoc 2018-2021)
Thomas Blaschke (PhD student 2017-2018)
Josep Arus Pous (PhD student 2018-2019)
Michael Withnall (PhD student 2018-2019)
Oliver Laufkötter (PhD student 2018-2019)
Laurent David (PhD student 2018-2019)
Ave Kuusk (PhD student 2016-2019)
Marcus Olivecrona (AZ GradProgram 2017)
Alexander Aivazidis (AZ GradProgram 2018)
Dhanushka Weerakoon (AZ GradProgram 2018-2019)
Panagiotis-Christos Kotsias (AZ AI GradProgram 2018-2019)
Edvard Lindelöf (Master Thesis Student 2018-2019)
Simon Johansson (Master Thesis Student 2019)
Oleksii Prykhodko (Master Thesis Student 2019)
Academic Collaborators
Marwin Segler (Munster)
Juergen Bajorath (Bonn)
Jean-Louis Reymond (Bern)
Andreas Bender (Cambridge)
Sepp Hochreiter (Linz)
Gunther Klambauer (Linz)
Sami Kaski (Helsinki)
Discovery Sciences
Garry Pairaudeau
Clive Green
Lars Carlsson
Nidhal Selmi
DSM AI Team
Ernst Ahlberg
Suzanne Winiwarter
Ioana Oprisiu
Ruben Buendia (Postdoc 2018)
PharmSci
Per-Ola Norrby
2018 PoP Pilot Study
Werngard Czechtizky
Ina Terstiege
Christian Tyrchan
Anders Johansson
Jonas Boström
Kun Song
Alex Hird
Neil Grimster
Richard Ward
Jeff Johannes
64. Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus,
Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com
65. Utilize the GDB-13 database (975 Million compounds)
If we train with 1 million compounds and sample 2 billion, what will we get?
Josep Arus
https://chemrxiv.org/articles/Exploring_the_GDB-13_Chemical_Space_Using_Deep_Generative_Models/7172849
66. Utilize the GDB-13 database
80% of the 2B sampled molecules are within GDB-13
70% of GDB-13 sampled
Josep Arus
https://chemrxiv.org/articles/Exploring_the_GDB-13_Chemical_Space_Using_Deep_Generative_Models/7172849
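The two percentages on this slide answer different questions: how many samples fall inside the reference database, and how much of the database the samples cover. A small sketch makes the distinction explicit. The strings below are placeholders, and a real comparison would first canonicalize the SMILES (for example with a cheminformatics toolkit) so that identical molecules compare equal.

```python
def sampling_stats(sampled, reference):
    """Return (fraction of samples inside reference, fraction of reference covered)."""
    reference = set(reference)
    inside = [s for s in sampled if s in reference]
    unique_inside = set(inside)
    return len(inside) / len(sampled), len(unique_inside) / len(reference)

# Placeholder reference set and samples, assumed to be canonical SMILES already.
reference = {"CCO", "CCN", "CCC", "CC=O", "C1CC1"}
sampled = ["CCO", "CCO", "CCN", "CCCl", "CCC"]

in_reference, coverage = sampling_stats(sampled, reference)
```

Repeated samples raise the first number but not the second, which is why sampling 2 billion molecules does not automatically cover the whole database.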
67. Utilize the GDB-13 database
Long tail distribution, 99.5% of molecules sampled at least once
Molecules with uncommon substrings sampled less often
Josep Arus
https://chemrxiv.org/articles/Exploring_the_GDB-13_Chemical_Space_Using_Deep_Generative_Models/7172849
68. Getting Involved
• Suggest future webinar topics & speakers
• Datathon engagement - share and collaborate
• Centre of Excellence Community
• Planning for London March 2019
• New project idea groups
• Register or involve colleagues
69. Poll Question 3:
Where do you see the biggest benefits of AI / ML in Drug Design?
A. Finding novel chemical compounds (unbiased)
B. Using full breadth of available data (ADME, Assay, Target etc)
C. Quicker cycle time & speed to lead compound(s)
D. Ability to cope with data breadth & volume
E. Other