A full description of the molecular autoencoder for automated exploration of chemical compound space using neural nets and machine learning architectures, developed by the Aspuru-Guzik group at Harvard. Talk given to Prof. Peter W. Chung's research group at the University of Maryland, College Park, August 2017.
2. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 2
What is Machine Learning?
"Machine Learning is a field of study that gives computers the ability to learn without
being explicitly programmed" - Arthur Samuel, 1959
"A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with the experience E." - Tom M. Mitchell.
Supervised learning: model y = f(x) to match data (x, y)
• Regression
• Classification
• Parametric models
  • Linear models
  • Polynomial models
  • Logistic models
  • Neural network models
  • Convolutional neural networks
• Non-parametric models
  • Kernel ridge regression
  • Decision trees
  • Gaussian process regression
  • Kernel SVM
Unsupervised learning
• Clustering
• Dimensionality reduction
• Autoencoders
Reinforcement learning
• Robotics, etc.
3. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 3
Supervised learning workflow
Source: scikit-learn.org
4. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 4
What is a neural network?
Dendrites
(input wires)
Axon terminals
(output wires)
5. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 5
What is a neural network?
Input layer, hidden layer, output layer.
Each layer computes a_i = f(W_i · a_{i-1}) (a bias term is usually added as well), where W_i are the weights, a_i are the activations of layer i, a_{i-1} is the input or the activations from layer i-1, and f is the activation function.
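For concreteness, a minimal NumPy sketch of this forward pass (the names and layer sizes are illustrative, and a bias term is included, as is standard):

```python
import numpy as np

def layer_forward(a_prev, W, b, f):
    """One layer: activations a_i = f(W @ a_{i-1} + b)."""
    return f(W @ a_prev + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

h = layer_forward(x, W1, b1, sigmoid)   # hidden-layer activations
y = layer_forward(h, W2, b2, sigmoid)   # output-layer activation
print(h, y)
```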
6. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 6
Activation functions
• Binary step: closest to biological neurons, but provides no gradient information
• Logistic / sigmoid
• arctan()
• Rectified linear unit (ReLU): maintains a nice large gradient
• Exponential linear unit (ELU)
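For reference, each of these is a one-liner in NumPy (a minimal sketch):

```python
import numpy as np

def binary_step(z):        # closest to biological neurons, but the gradient is zero almost everywhere
    return (z >= 0).astype(float)

def sigmoid(z):            # logistic / sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):               # rectified linear unit: gradient is exactly 1 for z > 0
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):     # exponential linear unit: smooth, nonzero gradient for z < 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

# The arctan activation is simply np.arctan(z)
```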
7. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 7
What is convolution?
[Figure: input and output of a 1-dimensional convolution with a filter, aka “kernel”, and of the same convolution with stride = 2]
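A minimal sketch of a 1D “valid” convolution with an adjustable stride (as in most deep learning libraries, this is technically cross-correlation):

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Slide the kernel over x, taking a dot product at each position."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1., 2., 3., 4., 5., 6.])
kernel = np.array([1., 0., -1.])
print(conv1d(x, kernel))            # stride 1 -> 4 outputs
print(conv1d(x, kernel, stride=2))  # stride 2 -> 2 outputs (every other position)
```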
8. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 8
What is convolution?
Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
“Feature map”
2-dimensional convolution of the image with the 3x3 filter:
1 0 1
0 1 0
1 0 1
Note that the edges were lost. There are ways to prevent this, such as padding the edges with zeros.
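The same idea in 2D, using SciPy to show how zero padding preserves the edges (a small illustration; because this filter is symmetric, the kernel flip performed by true convolution does not change the result):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 1.]])

feature_map = convolve2d(image, kernel, mode='valid')  # edges lost: 5x5 -> 3x3
padded_map  = convolve2d(image, kernel, mode='same')   # zero padding keeps the output 5x5
print(feature_map.shape, padded_map.shape)
```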
9. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 9
What are convolutional neural nets?
By most accounts the CNN was invented by Yann LeCun. He developed “LeNet” in 1998 at AT&T Bell Laboratories for reading handwritten digits.
Architecture of LeNet:
10. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 10
What are convolutional neural nets?
“2D” images are actually 3D, because they have 3 color channels.
A 3D diagram conveys best what a CNN actually does. The depth of
the non-input layers is the # of filters. Typically the # of filters in each
successive layer increases while the size of the filters decreases:
11. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 11
What are convolutional neural nets?
By many accounts the current deep learning boom began when Krizhevsky, Sutskever and Hinton used a CNN to win the 2012 ImageNet image classification competition. The resulting publication has 13,000+ citations.
A. Krizhevsky, I. Sutskever and G.E. Hinton, "ImageNet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems, 1097-1105 (2012)
The architecture they used has 60 million parameters and 650,000 neurons.
12. Why do CNNs work so well?
They learn a hierarchical set of features the same way the mammalian visual cortex does!
Dan Elton, P.W. Chung Group Meeting1/24/2018 12
Hubel & Wiesel (1959): receptive fields of single neurons in the cat’s striate cortex.
(Slide from Yann LeCun)
13. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 13
What is an autoencoder?
• The “latent space” is also called the “low dimensional manifold”, “compressed
representation”, or “thought vector”
• See “Decoding the Thought Vector” for amazing examples of how faces are
compressed: http://gabgoh.github.io/ThoughtVectors/
Source: keras blog
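A minimal dense autoencoder in the style of the Keras blog post referenced above (dimensions are illustrative):

```python
from keras.layers import Input, Dense
from keras.models import Model

input_dim, latent_dim = 784, 32                 # e.g. flattened 28x28 images, 32-d latent space

inputs  = Input(shape=(input_dim,))
encoded = Dense(latent_dim, activation='relu')(inputs)     # encoder -> latent "thought vector"
decoded = Dense(input_dim, activation='sigmoid')(encoded)  # decoder -> reconstruction

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, ...)        # trained to reproduce its own input
```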
14. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 14
What is a variational autoencoder?
• During training, the latent vector is sampled from the enforced distribution as mean + random_noise * standard_deviation; during testing the mean is used directly.
• A Kullback–Leibler divergence term, which keeps the learned distribution close to the prior, is minimized as part of the loss.
D.P. Kingma, M. Welling
Auto-Encoding Variational Bayes
The International Conference on Learning Representations (ICLR), Banff, 2014
[arXiv preprint].
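A sketch of the two ingredients described above, in NumPy (standard VAE formulas, assuming a unit-Gaussian prior):

```python
import numpy as np

def sample_latent(mean, log_var, training=True):
    """Reparameterization trick: z = mean + noise * std while training; z = mean at test time."""
    if not training:
        return mean
    std = np.exp(0.5 * log_var)
    return mean + np.random.normal(size=mean.shape) * std

def kl_divergence(mean, log_var):
    """KL( N(mean, var) || N(0, 1) ), summed over latent dimensions; added to the reconstruction loss."""
    return -0.5 * np.sum(1.0 + log_var - mean**2 - np.exp(log_var))
```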
15. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 15
What are recurrent neural networks?
Recurrent Neural Networks (RNNs) have loops.
The simplest RNN is shown on the left; it contains one feedback loop.
The mathematics and calculation of gradients (i.e. backpropagation) can be made isomorphic to that of a feed-forward neural network via time unrolling.
[Figure: an unrolled RNN, labeling the inputs and the output we are interested in]
All of these beautiful figures are taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (copyright Christopher Olah).
16. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 16
What are recurrent neural networks?
RNNs can be run in many different ways:
• Video classification: input all the frames of a video, output a classification for each frame
• Translation (“seq2seq”): input Spanish, output English
• Sentiment analysis: input text, output a positive or negative sentiment
• Image captioning: input an image, output a sequence of words
17. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 17
What is a gated recurrent unit?
RNNs have trouble capturing long-range dependencies.
Suppose we need the output at time t+1 to depend on x0 and x1, which happened in the distant past of the input stream.
Technically this is called the vanishing gradient problem – the dependence (gradient) becomes exponentially small with the number of layers it has to pass through. There is also an exploding gradient problem, where the gradient increases exponentially.
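A one-line numerical illustration: the gradient through n sigmoid layers contains a product of n derivatives, each at most 0.25, so it shrinks exponentially with distance:

```python
sigmoid_max_grad = 0.25                  # maximum derivative of the logistic function
for n in (5, 20, 50):
    print(n, sigmoid_max_grad ** n)      # ~1e-3, ~9e-13, ~8e-31
```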
18. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 18
What is an LSTM?
Sepp Hochreiter and Jürgen Schmidhuber invented the Long Short-Term Memory (LSTM) unit in 1997 to solve the vanishing gradient problem. LSTMs were recently used by Google for near human-level machine translation; Apple uses LSTMs in Siri, etc.
The LSTM looks complicated but it is actually based on an extremely simple idea – add a memory cell:
[Diagram: LSTM unit, showing the memory cell and the output state]
19. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 19
How does an LSTM work?
[Diagram: LSTM internals, showing the “forget” gate, the “input” gate, and the read-out gate, built from sigmoid/logistic and tanh() units]
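For reference, the standard LSTM update equations behind those labels (in the notation of Olah's post cited earlier):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{``forget'' gate} \\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{``input'' gate} \\
\tilde{C}_t &= \tanh\!\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate memory} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{memory cell update} \\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{read-out gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{output state}
\end{aligned}
```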
20. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 20
LSTM vs. Gated Recurrent Unit (GRU)
The GRU unit [1] makes major changes to the LSTM:
• The output and memory cells are merged
• The “forget” and “input” gates are merged into a single “update” gate
• Performance is similar to the LSTM [3] or slightly better [2,4], but with fewer free parameters (6 vs. 12 for a 1D input/output):
1. Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning Phrase
Representations using RNN Encoder-Decoder for Statistical Machine Translation. (2014) arXiv:1406.1078
2. Jozefowicz et al. An Empirical Exploration of Recurrent Network Architectures, Proceedings of the 32nd International Conference on
Machine Learning, 2015
3. Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, LSTM: A Search Space Odyssey (2015)
arXiv:1503.04069
4. Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical Evaluation of Gated Recurrent Neural Networks on
Sequence Modeling. (2014) arXiv:1412.3555
If the dimensionality of the input is n and the dimensionality of the output is d, then:
Unit   # of parameters
LSTM   4*d*(n+d+1)
GRU    3*d*(n+d)
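A quick check of the table using these formulas (biases included for the LSTM, omitted for the GRU, exactly as written above):

```python
def lstm_params(n, d):        # input dimension n, output dimension d
    return 4 * d * (n + d + 1)

def gru_params(n, d):
    return 3 * d * (n + d)

print(lstm_params(1, 1), gru_params(1, 1))   # -> 12 6, the "6 vs. 12" quoted above
```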
21. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 21
What are SMILES strings?
SMILES (simplified molecular-input line-entry system) encode 2D molecular graphs into 1D.
Example
CC(=O)NCCC1=CNc2c1cc(OC)cc2 CN1CCC[C@H]1C2=CN=CC=C2
The only ambiguity in SMILES strings:
• They do not capture 3D structure. However, for small molecules and most application areas this doesn’t matter much: such molecules generally have only one conformation, so the 3D structure is implicitly contained. It would only matter for something like proteins, which might fold into more than one conformation, or for molecules interacting with something like an interface.
FC(F)FCCC(=O)O
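A quick illustration of how a SMILES string maps back to a molecular graph, assuming the RDKit library is available:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CC(=O)NCCC1=CNc2c1cc(OC)cc2')   # first example string above
print(mol.GetNumAtoms())        # number of heavy atoms recovered from the 1D string
print(Chem.MolToSmiles(mol))    # RDKit's canonical SMILES for the same 2D graph
```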
22. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 22
One-hot encoding
There are 35 characters (C, N, O, @, -, =, etc.).
The maximum molecule length is 120 characters; molecules shorter than this are padded with zeros.
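A minimal sketch of such an encoding (the character set below is a shortened, hypothetical one; the real model uses all 35 characters over a 120-character window):

```python
import numpy as np

charset = [' ', 'C', 'N', 'O', 'c', 'n', '1', '2', '(', ')', '=', '@', '[', ']', '-', 'H']
char_to_index = {c: i for i, c in enumerate(charset)}
max_length = 120

def one_hot_encode(smiles):
    """Return a (max_length, len(charset)) one-hot matrix; short strings are padded with blanks."""
    x = np.zeros((max_length, len(charset)))
    for t, ch in enumerate(smiles.ljust(max_length)):
        x[t, char_to_index[ch]] = 1.0
    return x

print(one_hot_encode('CC(=O)O').shape)   # -> (120, 16)
```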
23. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 23
Overall autoencoder architecture
Encoder:
• One-hot inputs
• Three 1-dimensional convolution layers:
  • 9 convolution filters of length 9
  • 9 convolution filters of length 9
  • 11 convolution filters of length 10
• “Flattening” – reshapes a 2D array to a 1D array
• Two dense (fully connected) neural network layers, with 435 and 292 neurons, respectively
• Latent layer: mean and standard deviation units
• Custom layer to sample the Gaussian distributions during training
Decoder:
• Dense (fully connected) neural network layer, 292 neurons
• Gated recurrent unit (GRU) layers with 501-element memory cells
• “Time-distributed dense layer” (a separate dense layer applied to each timestep)
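A heavily simplified Keras-style sketch of the stack above, with dimensions taken from the slide (an illustration only, not the authors' actual implementation; the choice of three GRU layers and tanh activations is an assumption):

```python
from keras.layers import (Input, Conv1D, Flatten, Dense, Lambda,
                          RepeatVector, GRU, TimeDistributed)
from keras.models import Model
from keras import backend as K

max_len, n_chars, latent_dim = 120, 35, 292

# Encoder: one-hot input -> 3 conv layers -> flatten -> dense -> latent mean / log-variance
x = Input(shape=(max_len, n_chars))
h = Conv1D(9, 9, activation='tanh')(x)
h = Conv1D(9, 9, activation='tanh')(h)
h = Conv1D(11, 10, activation='tanh')(h)
h = Flatten()(h)
h = Dense(435, activation='tanh')(h)
z_mean, z_log_var = Dense(latent_dim)(h), Dense(latent_dim)(h)

def sample(args):                       # custom sampling layer (reparameterization trick)
    mean, log_var = args
    eps = K.random_normal(shape=K.shape(mean))
    return mean + K.exp(0.5 * log_var) * eps

z = Lambda(sample)([z_mean, z_log_var])

# Decoder: dense -> repeat across timesteps -> GRU layers -> per-timestep softmax over characters
h = Dense(latent_dim, activation='tanh')(z)
h = RepeatVector(max_len)(h)
h = GRU(501, return_sequences=True)(h)
h = GRU(501, return_sequences=True)(h)
h = GRU(501, return_sequences=True)(h)
out = TimeDistributed(Dense(n_chars, activation='softmax'))(h)

vae = Model(x, out)
```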
24. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 24
How does one determine architecture?
The JSON file for the molecular autoencoder reveals 200+ hyperparameters.
The most important are:
• Number of layers
• Types of layers
• Size and # of filters in CNN layers
• # of hidden cells in GRU layers (also called # of units)
• Number of latent variables
There are various ways of regularizing that can be turned on in several or all
layers:
• L1/ L2 weight regularization
• Weight sharing
• Dropout (currently most popular)
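For example, L2 weight regularization and dropout are each one line per layer in Keras (a minimal sketch):

```python
from keras.layers import Dense, Dropout
from keras.regularizers import l2

dense   = Dense(128, activation='relu', kernel_regularizer=l2(1e-4))  # L2 penalty on this layer's weights
dropout = Dropout(0.5)   # randomly zeroes 50% of activations, during training only
```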
25. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 25
How does one determine architecture?
1. This Week in Machine Learning (TWiML) Podcast, interview with Matthew Zeiler and others.
2. J. Snoek, H. Larochelle, R.P. Adams, "Practical Bayesian optimization of machine learning algorithms", Advances in Neural Information Processing Systems, 2951-2959 (2012)
3. Google Research Blog: "Using Machine Learning to Explore Neural Network Architecture"
4. Sean C. Smithson, Guang Yang, Warren J. Gross, Brett H. Meyer, "Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization", arXiv:1611.02120
• Historically, design for deep networks has been a black art. This is part of the reason deep learning jobs have such high salaries. [1] There are many heuristics but no overarching theory guiding design yet.
• Bayesian optimization is one approach. [2]
• People at Google use reinforcement learning and genetic algorithms to design complex deep networks, like the GoogLeNet shown above; these methods can create designs that perform as well as those from human designers. [3]
• People have even used neural networks to design neural nets. [4]
26. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 26
Latent space projection into 2D via t-SNE
• 250,000 commercially available drug-like molecules from the ZINC database
• 150,000 organic LED molecules, combinatorially generated [1]
1. Rafael Gómez-Bombarelli et al., “Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach”, Nat. Mater. 15, pp. 1120–1127 (2016)
27. Data sets that are available
1/24/2018 Dan Elton, P.W. Chung Group Meeting 27
• GDB-17 set: 50,000,000 molecules (http://gdb.unibe.ch/downloads/)
• GDB-13 C-N molecules
• GDB-13 C-N-O molecules
• ZINC database (zinc.docking.org): 22,724,825 commercially available molecules
28. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 28
Adversarially trained autoencoder
1. Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014)
"Generative Adversarial Networks". arXiv:1406.2661
2. A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, in International Conference on Learning Representations, (2016), arxiv.org:1511.05644
3. Kadurin, Artur et al. “The Cornucopia of Meaningful Leads: Applying Deep Adversarial Autoencoders for New Molecule Development in
Oncology.” Oncotarget 8.7 (2017): 10883–10890. PMC. Web. 2 Aug. 2017.
Generative adversarial networks [1] (GANs) have exploded in popularity since 2014. Adversarial autoencoders [2] (AAEs) apply the GAN framework to variational autoencoder training.
The adversarial autoencoder is an autoencoder that is regularized by matching the aggregated posterior q(z), derived from the data distribution, to an arbitrary prior p(z). Here p(z) is the normal distribution N(5, 1).
Application to oncology molecular lead discovery (2017) [3]
29. 1/24/2018 Dan Elton, P.W. Chung Group Meeting 29
“Molecular Tinder” for screening OLED molecules
From Aspuru-Guzik group: http://chimad.northwestern.edu/docs/DDD_WS_II/12_Aspuru_Guzik.p