Paper review: "HyperNetworks" by David Ha, Andrew Dai, Quoc V. Le (ICLR2017)
Presented at Tensorflow-KR paper review forum (#PR12) by Taesu Kim
Paper link: https://arxiv.org/abs/1609.09106
Video link: https://www.youtube.com/watch?v=-tUQXSdEsMk (in Korean)
http://www.neosapience.com
2. HyperNetworks overview
› An approach that uses one network (the hypernetwork) to generate the weights for another network
› Motivated by HyperNEAT (Stanley et al., 2009); the embedding/weight split is meant to resemble the genotype/phenotype
relationship in nature
› A HyperNetwork can be viewed as a relaxed form of weight sharing across layers.
› It generates non-shared weights for LSTMs and achieves near state-of-the-art
results
› It generates shared weights for CNNs and achieves respectable results with fewer
learnable parameters (see the sketch below)
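A minimal NumPy sketch of the static CNN case. This is not the authors' code: the sizes N_z, f, d and the exact two-layer generator shapes are illustrative assumptions standing in for the paper's layer-embedding scheme.

```python
import numpy as np

# Minimal sketch: a shared two-layer generator g maps each layer's small
# embedding z_j to that layer's full conv kernel, so only the embeddings
# and g are learned. All sizes below are illustrative assumptions.
N_z  = 64        # layer-embedding size
f, d = 7, 16     # kernel size and channel count (in = out = d)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((N_z, d * N_z)) * 0.01   # first projection of g
B1 = np.zeros((d, N_z))
W2 = rng.standard_normal((N_z, f * f * d)) * 0.01 # second projection of g
B2 = np.zeros(f * f * d)

def generate_kernel(z):
    """g(z): layer embedding -> conv kernel of shape (f, f, d, d)."""
    a = (z @ W1).reshape(d, N_z) + B1   # one intermediate row per output channel
    K = a @ W2 + B2                     # (d, f*f*d)
    return K.reshape(f, f, d, d)

z_layer = rng.standard_normal(N_z)      # learnable per-layer embedding
print(generate_kernel(z_layer).shape)   # (7, 7, 16, 16)
```

Sharing g across layers is what makes this a relaxed form of weight sharing: layers differ only through their small embeddings.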
8. Modified HyperRNN
› HyperRNN requires Nz times more memory than a basic RNN
› Goal: make it more scalable and memory-efficient
› Use an intermediate hidden vector to parameterize each weight matrix: d(z) is a
linear projection of z, and diag(d(z)) rescales the rows of a shared weight matrix
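A sketch of this weight-scaling trick. The d(z) = W_hz @ z idea is from the paper; the sizes and variable names are illustrative assumptions.

```python
import numpy as np

# Instead of generating a full (Nh x Nh) matrix from z, emit a small
# vector d(z) = W_hz @ z and use it to rescale the rows of one shared
# weight matrix: diag(d(z)) @ W_h. Sizes are illustrative assumptions.
N_h, N_z = 8, 4
rng = np.random.default_rng(0)

W_h  = rng.standard_normal((N_h, N_h)) * 0.1  # shared recurrent weights
W_hz = rng.standard_normal((N_h, N_z)) * 0.1  # defines d(z), a linear projection

def scaled_matvec(z, h_prev):
    """diag(d(z)) @ W_h @ h_prev without materializing the diagonal."""
    d = W_hz @ z                 # d(z): size N_h, one scale per row
    return d * (W_h @ h_prev)    # element-wise rescaling of the shared matvec

z      = rng.standard_normal(N_z)   # emitted by the hyper cell each timestep
h_prev = rng.standard_normal(N_h)
print(scaled_matvec(z, h_prev).shape)   # (8,)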
11. Character-level Penn Treebank Language Model
› 1000 units of MainLSTM & two versions of HyperLSTM
– 128 units of HyperLSTM cell & embedding size 4
– 128 units of HyperLSTM cell & embedding size 16, with dropout keep probability of 85%
› HyperLSTM outperforms the standard LSTM
› HyperLSTM achieves improvements similar to Layer Normalization; the combination of
Layer Normalization and HyperLSTM achieves the best test perplexity
12. Hutter Prize Wikipedia Language Model
› 1800 units of MainLSTM & 256 units of HyperLSTM cell with embedding size 64 & max sequence length 250
› 2048 units of MainLSTM & 256 units of HyperLSTM cell with embedding size 64 & max sequence length 300
› HyperLSTM achieves improvements similar to Layer Normalization; the combination of Layer Normalization and
HyperLSTM achieves the best test perplexity
› HyperLSTM converges more quickly than both LSTM and Layer Norm LSTM
13. Hutter Prize Wikipedia Language Model
› Visualizing how the weight scaling vectors of the main LSTM change during the character sampling process (a
reproduction sketch follows this list)
› In regions of low intensity, where the weights of the main LSTM are relatively static, the types of phrases
generated seem more deterministic
– For example, the weights do not change much during the words Europeans, possessions and reservation
› Regions of high intensity are where the HyperLSTM cell is making relatively large changes to the weights
of the main LSTM
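A small reproduction sketch of this intensity plot. The approach and the recorded array `d_t` are assumptions; the paper only shows the resulting visualization.

```python
import numpy as np

# Record the weight-scaling vectors d_t emitted at each sampled character,
# then measure how much they change step to step. Flat stretches (low
# intensity) correspond to the "deterministic" phrases noted above; peaks
# mark points where the HyperLSTM rewrites the main LSTM's weights.
d_t = np.random.default_rng(0).standard_normal((200, 128))  # stand-in for logged d(z_t)

intensity = np.linalg.norm(np.diff(d_t, axis=0), axis=1)    # per-step change magnitude
print(intensity.shape)   # (199,): one value per sampling-step transition
```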
14. Hutter Prize Wikipedia Language Model
› Normalized histogram plots of φ(c_t) for different models during sampling
– φ(c_t) is the hidden state of the LSTM before applying the output gate
› Layer Norm reduces the saturation effects compared to the vanilla LSTM
› In HyperLSTM, the cell is saturated most of the time (see the sketch after this list)
– The HyperLSTM cell's dynamic weight-adjustment policy appears to be doing something very different from statistical
normalization
– Yet the policy it came up with ends up providing performance similar to Layer Norm
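A sketch of the saturation diagnostic. The 0.9 threshold and the recorded-states array are assumptions; the paper only shows the histograms.

```python
import numpy as np

# Collect cell states c_t while sampling, then histogram phi(c_t) = tanh(c_t).
# Mass piled near +/-1 means the cell is saturated; mass near 0 means the
# pre-output-gate activation carries little signal.
cell_states = np.random.default_rng(0).standard_normal((1000, 128))  # stand-in for logged c_t

phi = np.tanh(cell_states)
hist, edges = np.histogram(phi.ravel(), bins=50, range=(-1, 1), density=True)
saturated = np.mean(np.abs(phi) > 0.9)   # saturation threshold is an assumption
print(f"fraction of saturated activations: {saturated:.2f}")
```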
15. Handwriting sequence generation
› 12179 handwritten lines from 221 writers
› The LSTM input is the (x, y) coordinate of the pen location plus a binary pen-up/pen-down indicator (see the sketch after this list)
› One can see that many of these weight changes occur at the boundaries between words and between characters
› Dynamically generating the generative model is one of the key advantages of HyperLSTM over a normal LSTM
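An illustrative encoding of one input sequence, following the slide's description; the layout is an assumption and the real preprocessing may differ (e.g. relative offsets instead of absolute coordinates).

```python
import numpy as np

# Each timestep is (x, y, pen): the pen location plus a binary flag
# that is 1 when the pen lifts at the end of a stroke.
strokes = np.array([
    [0.0,  0.0,  0.0],   # pen down, start of a stroke
    [1.2,  0.3,  0.0],   # pen moves while writing
    [0.8, -0.5,  1.0],   # stroke ends here, pen lifts
    [2.0,  1.0,  0.0],   # pen down again for the next stroke
])
print(strokes.shape)     # (4, 3): sequence length x input dimension
```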
16. Machine translation
› WMT'14 En→Fr using the same test/validation set split described in the GNMT paper
– The GNMT network has 8 layers in each of the encoder and decoder
› The HyperLSTM cell improves the performance of the existing GNMT model, achieving state-
of-the-art single-model results on this dataset
› This demonstrates the applicability of HyperNetworks to large-scale models used in
production systems