Strictly Confidential
15/11/2016
Applications of Deep Neural Networks
Dr. Z. Xing
Lead of Deep Learning Taskforce, Data Science & Analytics @ NIO USA, Inc
3200 N 1st St, San Jose, CA 95134
September 13th, 2017 @ Tsinghua University, Beijing, China
2
• The volume and heterogeneity of the data that we deal with nowadays have reached an unprecedented level of complexity and subtlety
Data “science”
• At the end of the day, data “science” all comes down to understanding an intricate representation of the data, so-called representation learning
A typical example of industrial-scale data
• ~ 40 M clicks per second
• ~ 2.5 M servers
• ~ 5.7 terawatt-hours annually ≈ 68 M AC units
• ~ 10-15 exabytes (10^18 bytes) ≈ 30 M personal laptops
• ...
3
• Computers/machines can help with the representation-learning task, but only to a certain extent
• “Conventional” machine learning approaches are limited in various ways; they struggle when data is fed to the system in its raw format, which is exactly how the human learning system processes data
• Feature extraction is a hand-crafted process that needs careful engineering, domain expertise, etc. It is difficult to generalize and to scale up with increasing data/model size
“Conventional” machine learning
• hand-crafted features
• manually created representations of the data
(classification figure, for illustration purposes only)
4
• Artificial intelligence (AI): “The science and engineering of making intelligent machines” (John McCarthy, 1955)
Artificial intelligence
5
• The infant concept of neurons/circuits originates from neuroscience, biophysics, and computational physics
Concept of neural network
• The electrical signals (voltage spikes) that the brain processes do NOT represent the external world at all; how neurons decode such signals is a complicated process, in two respects: (a) time dependencies (transient neuron functions), (b) the electrical functionality of each cell, i.e. activations
• ~10^11 neurons in the human brain, ~10^15 connections (hence “connectionist”)
soma/body
synapses
6
Concept of “deep” learning
• Representation learning is the key advantage: it allows processing data in its raw format, averting the need for hand-crafted features
• Multiple levels of representation of the data, multiple levels of abstraction. Accommodates a flexible rank of the latent space, which locally resembles Euclidean space
• Each level is often a non-linear module; aggregating multiple levels allows the system to learn complicated physics
• Higher/deeper levels amplify the components of the input that are relevant/crucial to the optimization goal while suppressing the less relevant parts
optimization goal
7
“Deep” against “shallow”
• We want our system to be selective for things that are relevant or important, while being invariant to things that are not, for example the orientation of the object, background color, and so on
• A “shallow” or even linear classifier can only carve the input space into over-simplified regions/hyper-planes
Wolf Samoyed
8
• Unsupervised learning, transfer learning (domain adaptation)
• Auto-encoder
• Variational Auto-encoder (VAE)
• Restricted Boltzmann machine
Different learning mechanisms
• Supervised learning
• The objective function measures an error (𝛿) between the system output and the desired target; the internal weights keep getting tuned to minimize this 𝛿, guided by gradients (see the sketch below)
• However, optimization happens at the level of the expected value over many training instances
• Also, the optimization goal is to match two patterns, not taking into account an overall strategic goal (winning a chess game, etc.)
• Stochastic gradient descent (SGD): L. Bottou, Stochastic Gradient Descent Tricks
• Diederik P. Kingma, Auto-Encoding Variational Bayes, arXiv:1312.6114
encoder
decoder
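As a minimal illustration of the supervised-learning loop described above, the following NumPy sketch fits a small linear model by stochastic gradient descent: the objective measures the error 𝛿 between output and target, and the internal weights are tuned along the negative gradient. All data and names here are hypothetical toy values, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets generated from a known linear rule plus noise.
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(3)          # internal weights to be tuned
lr = 0.05                # learning rate

for epoch in range(20):
    for i in rng.permutation(len(X)):      # stochastic: one example at a time
        delta = X[i] @ w - y[i]            # error between output and desired target
        grad = delta * X[i]                # gradient of 0.5 * delta**2 w.r.t. w
        w -= lr * grad                     # follow the negative gradient

print(w)  # approaches true_w in expectation over many training instances
```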
9
Selectivity–invariance dilemma
• Symmetries in the data: many tasks are invariant to transformations of the data; for example, the recognition task is invariant to changes in pose, light, location, etc. (symmetries)
• The human brain can learn to recognize objects after seeing only a few examples (unsupervised), while most machine learning systems need huge amounts of labelled data (supervised)
• Factoring out the symmetries from the data, while retaining selectivity, is the key to building artificial intelligence that can compete with human intelligence
Classical learning theory focuses on supervised learning and
postulates that a suitable hypothesis space is given. In other
words, data representation and how to select and learn it, is
classically not considered to be part of the learning problem,
but rather as a prior information.
(figure: visual cortex)
• Fabio Anselmi, On Invariance and Selectivity in Representation Learning, arxiv 1503.05938
• Attempts at utilizing group theory and group averages have been made, on the theory side, to derive invariant representation learning
10
Local minima for large networks
• Numerical analyses in statistical physics, random matrix theory, and neural network theory show that local minima are rarely an issue for large networks
• The key difference is the dimensionality of the space; the proliferation of saddle points, rather than local minima, becomes the more relevant issue in solving high-dimensional problems
• Yann N. Dauphin, Yoshua Bengio et al. (2014), Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, arXiv:1406.2572
• Anna Choromanska et al., The Loss Surfaces of Multilayer Networks, arXiv:1412.0233
• Similar objective function values at the various saddle points
• Statistics of Critical Points of Gaussian Fields on Large-Dimensional Spaces, Bray and Dean (2007), Phys. Rev. Lett. 98, 150201
• Replica Symmetry Breaking Condition Exposed by Random Matrix Calculation of Landscape Complexity, Fyodorov and Williams (2007), etc.
Fully connected layer
• Makes no assumptions at all about the data features
• Does not preserve any invariance of the input feature map
• Expensive in terms of computation and memory consumption
• Multi-layer perceptron (MLP)
11
nonlinearities (figure label)
12
Backward propagation (BP)
• BP guides the computer to update its internal parameters by using the chain rule of derivatives (see the sketch below)
• The central problem that BP solves is to evaluate the influence of a parameter on a function whose computation involves multiple elementary steps (Lagrangian formalism)
(figure: Lagrange function = objective function + constraints (network dynamics); the Lagrange multiplier takes into account the backward dynamics)
Z. Xing, Measurement of the semileptonic CP violating asymmetry a_sl in B_s decays and the D_s - D_s production asymmetry in 7 TeV pp collisions, CERN-THESIS-2013-078, https://inspirehep.net/record/1296591?ln=en
• Y. LeCun, A Theoretical Framework for Back-Propagation, Proceedings of the 1988 Connectionist Models Summer School, p. 21-28, 1988
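A minimal sketch of the chain rule at work for a hypothetical two-layer network (NumPy only, arbitrary shapes): the influence of each weight matrix on the objective is evaluated by propagating the error backward through the elementary steps of the forward computation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2))            # batch of inputs
t = rng.normal(size=(4, 1))            # desired targets

W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))

# Forward pass: two elementary steps plus the objective function.
h = np.tanh(x @ W1)                    # hidden layer with nonlinearity
y = h @ W2                             # output layer
loss = 0.5 * np.mean((y - t) ** 2)     # objective function

# Backward pass: chain rule applied step by step.
dy = (y - t) / len(x)                  # dL/dy
dW2 = h.T @ dy                         # dL/dW2
dh = dy @ W2.T                         # dL/dh
dW1 = x.T @ (dh * (1 - h ** 2))        # dL/dW1, using tanh'(a) = 1 - tanh(a)^2

# dW1 and dW2 are the gradients a BP-based optimizer would use to update the weights.
```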
Convolutional layer
• Convolutional neural network
• 3-dimensional neurons
• local connectivity at each filter/kernel (local features of the data)
• weight-sharing between all neurons in the same layer, usually organized as a unit of kernel/filter to be convolved with the input volume (invariance of the data)
• input 3D neurons (depth D_i), output 3D neurons (depth of filter bank D_o)
• number of learnable weights = D_o × D_i × F × F (F: kernel size)
• output spatial size: N_o = (N_i − F + 2p) / s + 1 (p: padding, s: stride); see the numeric check below
• the kernels/filters essentially pick up latent features such as image brightness, contrast, RGB color, edges, etc.
13
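A quick numeric check of the two formulas above. The layer dimensions are illustrative values chosen to resemble an AlexNet-like first layer, not figures taken from the slides.

```python
def conv_layer_stats(N_i, D_i, D_o, F, p, s):
    """Spatial output size and learnable-weight count of a convolutional layer."""
    N_o = (N_i - F + 2 * p) // s + 1          # output width/height
    n_weights = D_o * D_i * F * F             # excluding biases
    return N_o, n_weights

# Hypothetical layer: 227x227x3 input, 96 filters of 11x11, no padding, stride 4.
print(conv_layer_stats(N_i=227, D_i=3, D_o=96, F=11, p=0, s=4))   # -> (55, 34848)
```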
Connection to neuro-science
• One route to developing your deep neural net architecture is inspiration from neuroscience, such as the human visual cortex
• Cross-channel information learning (cascaded 1x1 convolutions) is biologically inspired, because the human visual cortex has receptive fields (kernels) tuned to different orientations
- local groups of values are often highly correlated
- invariance to location, weight sharing
14
Charles F. Cadieu et al., Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition, PLOS Computational Biology, December 2014, Volume 10, Issue 12
Translational invariance
• Convolutional layers rely on translational invariance (convolution commutes with translation); see the sketch below
• local input regions
• only relative locations are taken into account
• f_ks (k: kernel size, s: stride) determines the layer type: convolutional, max pooling, or activation function
(figure annotation: translation operator)
15
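A small sketch of the commutation property in one dimension, with an arbitrary kernel and signal: shifting the input simply shifts the (valid-region) convolution output, which is the invariance the convolutional layer relies on.

```python
import numpy as np

kernel = np.array([1.0, -1.0, 2.0])
x = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 0.0])

shifted_x = np.roll(x, 2)                       # translate the input by 2 samples

y = np.convolve(x, kernel, mode="valid")
y_shifted = np.convolve(shifted_x, kernel, mode="valid")

# Away from the boundaries, convolving the shifted input equals shifting the output.
print(np.allclose(y_shifted[2:], y[:-2]))       # True
```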
Recurrent architecture
• Recurrent network structures can be used to learn potential temporal correlations/structures in the data
• Once “unrolled” or “unfolded”, all layers share the same weights and the network can be viewed as a feedforward network, so it can be optimized using BP (BPTT, back-propagation through time); see the sketch below
• However, there is an exploding or vanishing gradient problem along the temporal axis
• Different formalisms and implementations of recurrent activations have been proposed (LSTM, fixed unit recurrent weights, GRU, etc.) to alleviate the issue, along with the gradient-clipping approach
(figure: the network unrolled over time steps t = 0, 1, 2 with inputs x_t, hidden states h_t, and outputs y_t; all recurrent edges share the same synaptic weights, and repeated multiplication by a recurrent weight w > 1 or w < 1 causes exploding or vanishing gradients)
16
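A minimal NumPy sketch of the unrolled recurrence, with hypothetical dimensions: every time step reuses the same weight matrices (the shared synaptic weights of the diagram), which is why the unfolded network can be trained with ordinary backpropagation (BPTT).

```python
import numpy as np

rng = np.random.default_rng(2)
W_xh = rng.normal(scale=0.5, size=(4, 8))   # input -> hidden, shared across time
W_hh = rng.normal(scale=0.5, size=(8, 8))   # hidden -> hidden (the recurrent edge)
W_hy = rng.normal(scale=0.5, size=(8, 2))   # hidden -> output

xs = rng.normal(size=(5, 4))                # a sequence of 5 input vectors
h = np.zeros(8)                             # initial hidden state

ys = []
for x_t in xs:                              # the "unrolled" feedforward view
    h = np.tanh(x_t @ W_xh + h @ W_hh)      # same W_xh, W_hh at every time step
    ys.append(h @ W_hy)
```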
LSTM – long short-term memory
• Special treatment: memory cells
• Novel inclusion of multiplicative nodes; all edges into or out of these nodes have fixed unit weight, which used to be called the “constant error carousel” (see the sketch below)
• A. Graves, Generating Sequences With Recurrent Neural Networks, arXiv:1308.0850
• A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Studies in Computational Intelligence, Springer, 2012
(figure annotations: recurrence components, error flow, memory flushing; the fixed unit weight alleviates vanishing gradient problems)
17
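A single LSTM step written out explicitly as a sketch (hypothetical shapes, biases omitted for brevity): the cell state c is updated additively along the fixed-unit-weight path, which is the "constant error carousel" that alleviates vanishing gradients.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U):
    """One LSTM time step. W: input weights, U: recurrent weights, one matrix per gate."""
    i = sigmoid(x @ W["i"] + h @ U["i"])        # input gate
    f = sigmoid(x @ W["f"] + h @ U["f"])        # forget gate (memory flushing)
    o = sigmoid(x @ W["o"] + h @ U["o"])        # output gate
    g = np.tanh(x @ W["g"] + h @ U["g"])        # candidate memory
    c = f * c + i * g                           # additive update along the carousel
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(3)
W = {k: rng.normal(scale=0.3, size=(4, 8)) for k in "ifog"}
U = {k: rng.normal(scale=0.3, size=(8, 8)) for k in "ifog"}
h, c = np.zeros(8), np.zeros(8)
for x_t in rng.normal(size=(6, 4)):             # run over a short toy sequence
    h, c = lstm_step(x_t, h, c, W, U)
```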
Sequence generation
• Recurrent networks can be used for sequence generation
• a man riding a wave on top of a surfboard . (p=0.040413)
• a person riding a surf board on a wave (p=0.017452)
• a man riding a wave on a surfboard in the ocean (p=0.005743)
(figure labels: training vs. testing/inference)
18
Gated Recurrent Unit (GRU)
• The GRU also utilizes gating units to regulate the temporal flow, but with a simple linear interpolation instead of a memory cell (see the sketch below)
• Kyunghyun Cho, On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, arXiv:1409.1259
(figure: LSTM vs. GRU cells, with reset and update gates acting on the previous states)
• Both LSTM and GRU utilize an additive component when updating the states, which keeps partial influence from the previous timestep
19
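For comparison, a single GRU step as a self-contained sketch (same hypothetical shapes as the LSTM sketch): the update gate z linearly interpolates between the previous state and the candidate state, which is the additive component mentioned above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U):
    """One GRU time step: reset gate r, update gate z, no separate memory cell."""
    z = sigmoid(x @ W["z"] + h @ U["z"])               # update gate
    r = sigmoid(x @ W["r"] + h @ U["r"])               # reset gate
    h_tilde = np.tanh(x @ W["h"] + (r * h) @ U["h"])   # candidate state
    return (1.0 - z) * h + z * h_tilde                 # linear interpolation of states

rng = np.random.default_rng(4)
W = {k: rng.normal(scale=0.3, size=(4, 8)) for k in "zrh"}
U = {k: rng.normal(scale=0.3, size=(8, 8)) for k in "zrh"}
h = np.zeros(8)
for x_t in rng.normal(size=(6, 4)):
    h = gru_step(x_t, h, W, U)
```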
Activation of hidden layers
• A neural network without any activation would simply be a linear regression model. Activation functions accommodate the sophisticated nonlinearities needed for data such as images, videos, speech, etc. (see the sketch below)
o Sigmoid function: saturation causes vanishing gradients, slow convergence, not zero-centered
o Tanh: vanishing gradient problem
o ReLU: avoids and rectifies the vanishing gradient, no need for input normalization, but can result in “dead” neurons
o “Leaky” ReLU, PReLU (parameterized ReLU)
o Human neuron activations can actually be a stochastic process
20
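The common hidden-layer activations listed above, written out in NumPy; the leaky-ReLU slope is an arbitrary illustrative value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # saturates for large |x| -> vanishing gradients

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # no saturation for x > 0, but "dead" for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # keeps a small gradient for x < 0

x = np.linspace(-5, 5, 11)
print(relu(x), leaky_relu(x), sep="\n")
```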
Normalization
• Local response normalization: normalize across neighboring kernels; lateral inhibition, competition for large activations across neurons computed by different kernels/filters
• Batch normalization: reduces internal covariate shift (ICS); see the sketch below
• “Whitening” the input feature map accelerates training speed and convergence
• But a simple normalization procedure may violate the identity transform, depending on the non-linear activation form
21
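A minimal sketch of batch normalization over a mini-batch (training-time statistics only; running averages for inference are omitted). The learnable scale gamma and shift beta are what let the layer recover the identity transform that plain whitening might otherwise violate.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then rescale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # "whitened" feature map
    return gamma * x_hat + beta             # learnable parameters keep expressiveness

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))        # mini-batch of 32, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```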
Pooling
• Summarizes across neighboring groups of neurons in the same kernel map to reduce computation and feature-map size
• Less over-fitting
• Aggregates localized spatial information (see the sketch below)
Alternatives:
• Maximum
• Sum
• Average
• Weighted average with distance from the center pixel
• Overlapped, non-overlapped
• …......
22
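A minimal non-overlapping 2x2 max-pooling sketch over a single feature map, illustrating how pooling summarizes neighboring activations and shrinks the map; the input values are arbitrary.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling over one (H, W) feature map."""
    h, w = fmap.shape
    blocks = fmap[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output: [[5, 7], [13, 15]]
```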
Output layer
• Training a deep neural network is a highly non-convex optimization problem that we usually solve using convex methods
• “Softmax” function: the original motivation was to treat the outputs of the NN as probabilities conditioned on the inputs, normalized to unity (see the sketch below)
p(y = j | z^(i)) = exp(z_j^(i)) / Σ_k exp(z_k^(i))
Anders Øland, Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting, arXiv:1707.04199
• What the output layer generates is actually not a probability distribution, as commonly conjectured
• gradient boosting method: exponentiating the errors from the output layer, non-normalized
23
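A numerically stable version of the softmax formula above; subtracting the maximum logit does not change the result because the expression is shift-invariant.

```python
import numpy as np

def softmax(z):
    """p(y = j | z) = exp(z_j) / sum_k exp(z_k), computed stably."""
    z = z - z.max()               # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # components sum to 1.0
```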
Reduce overfitting
• Data augmentation, ...
• “Drop-out” treatment: randomly drops neurons to prevent overfitting; “re-scaling” is needed when making inference (see the sketch below)
• Nitish Srivastava et. al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958
24
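A small dropout sketch. The slide describes the original convention (re-scale at inference); this sketch uses the equivalent "inverted" variant that re-scales by 1/(1 - p) during training instead, so inference is a no-op. The drop probability and shapes are illustrative.

```python
import numpy as np

def dropout(x, p_drop, training, rng):
    """Inverted dropout: scale at training time so inference needs no re-scaling."""
    if not training:
        return x                                    # identity at inference time
    mask = rng.random(x.shape) >= p_drop            # keep each neuron with prob 1 - p
    return x * mask / (1.0 - p_drop)                # re-scale the surviving activations

rng = np.random.default_rng(6)
h = np.ones(10)
print(dropout(h, p_drop=0.5, training=True, rng=rng))
```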
Image classification
• AlexNet, architecture contributions:
• ReLU: ~6 times faster than saturating approaches
• Local response normalization: ~2% increase in precision
• Reducing overfitting:
• data augmentation
• drop-out: a neuron cannot rely on the presence of particular other neurons and is thus forced to learn in a more robust manner
• A. Krizhevsky, ImageNet Classification with
Deep Convolutional Neural Networks, NIPS
2012
(figure labels: output layer, a 1000-way softmax; embedding vector)
25
Network in network
• GoogLeNet, Inception Net
• Key idea: how to go from dense to sparse to improve computational efficiency
• Local sparsity by using network-in-network
• “1x1 convolution”: dimensionality reduction in the rank of the latent feature manifold (cross-channel pooling layer); see the parameter count below
• Hebbian principle: neurons that fire together, wire together
• Christian Szegedy et al., Going deeper with convolutions, https://arxiv.org/pdf/1409.4842.pdf
• https://arxiv.org/pdf/1512.00567.pdf
• arXiv:1312.4400v3
(figure: NIN / Inception module enhances representational power; parallel branches of 1x1, 3x3, and 5x5 convolutions plus average pooling, with 1x1 reductions between D_i and D_o channels; output size N_o = (N_i − F + 2p) / s + 1)
• replacing one large kernel with several small ones saves weights: 9 × 9 = 81 > 5 × 5 + 3 × 3 + 1 × 1 = 35
26
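A quick parameter count showing why a 1x1 "bottleneck" before a 5x5 convolution reduces cost, in the spirit of the Inception module; the channel sizes are arbitrary illustrative values, not figures from the paper.

```python
def conv_params(d_in, d_out, f):
    """Learnable weights of a conv layer with f x f kernels, biases ignored."""
    return d_out * d_in * f * f

d_in, d_out, d_bottleneck = 256, 128, 32

naive = conv_params(d_in, d_out, 5)                                   # 5x5 applied directly
reduced = conv_params(d_in, d_bottleneck, 1) + conv_params(d_bottleneck, d_out, 5)

print(naive, reduced)   # 819200 vs. 110592: the 1x1 layer reduces the rank first
```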
Benchmark results
9/13/2017 Invited talk at Tsinghua University
27
Object detection
• The object detection task imposes an additional requirement on a classifier: localization of one or multiple objects
• The sliding-window approach (DPM, deformable part model) is computationally too expensive
• Region proposal methods add prior hypotheses on regions that are promising; however, they may involve multiple steps pipelined together (an RPN for objectness scores, a detection network, a classification network)
(figure label: re-purposed classifier)
28
You Only Look Once (YOLO)
• Labeling images for detection is far more expensive than labeling for classification or tagging
• Leveraging classification data expands the scope of current detection systems (transfer learning)
• YOLO is NOT a repurposed classifier
29
YOLO v2 improvements
• Batch norm: removes the need for other regularization such as drop-out
• Anchor-box concept; removes the fully connected layers
• Higher resolution for the classifier part, to better adapt to detection
anchor boxes increase recall, with only a small change in mAP
30
YOLO v2 results
• Results from the most recent YOLO paper
31
Semantic segmentation
• Approaches such as dilated convolutions are utilized to take into account the context in the picture (multi-scale receptive fields)
• ENet https://arxiv.org/pdf/1606.02147.pdf
• SegNet https://arxiv.org/pdf/1511.00561.pdf
• A “dense” prediction problem where per-pixel precision is required
• The context model is crucial in this application
• Typical “encoder-decoder” architecture: the network extracts deeper features while the feature map narrows down
32
Segmentation performances
• Metrics such as intersection over union (IoU) are used to measure segmentation performance (see the sketch below)
• Image quality tends to influence the results significantly
33
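Intersection over union for two binary segmentation masks, as a minimal sketch of the metric mentioned above; the masks are arbitrary toy examples.

```python
import numpy as np

def iou(pred, target):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0      # define IoU = 1 when both masks are empty

pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True                           # predicted region
target = np.zeros((4, 4), dtype=bool)
target[1:3, 0:3] = True                         # ground-truth region
print(iou(pred, target))                        # 0.5: overlap of 4 pixels, union of 8
```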
Three-dimensional data
• 3D segmentation: no “voxelization” or cross-sectional rendering needed, even on unstructured data
• Permutation invariance; learning a transformation matrix of the point cloud
https://arxiv.org/pdf/1704.03847.pdf
https://arxiv.org/pdf/1612.00593.pdf
combining the global and the local per-point embeddings
34
Audio and natural language processing (NLP)
• Audio signals can be represented in a localized format, either in the temporal or the frequency/spectral domain
• Z. Xing et al., Big Data (Big Data), 2016 IEEE International Conference
• Z. Xing et al., https://arxiv.org/pdf/1705.05229.pdf
• Text/words can also be embedded, the so-called “word vectors”
35
music embedding
Generative adversarial network (GAN)
• While discriminative models have enjoyed great success, generative models have had less impact due to difficulties with intractable probabilistic computations (MLE). The “two-player” min-max approach sidesteps this problem (see the sketch below).
• Ian J. Goodfellow et al., Generative Adversarial Nets, arXiv:1406.2661v1
• Phillip Isola, Image-to-Image Translation with Conditional Adversarial Networks, arXiv:1611.07004v1
(figure: discriminator D vs. generator G playing a min-max game; here z is some random noise)
36
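A sketch of the two-player value function evaluated on one batch. The discriminator outputs here are hypothetical numbers standing in for D(x) and D(G(z)); in the min-max game, D ascends this objective while G descends it.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-8):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] estimated over a batch."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# Hypothetical discriminator outputs on a batch of real and generated samples.
d_real = np.array([0.9, 0.8, 0.95])     # D(x): the discriminator pushes these toward 1
d_fake = np.array([0.2, 0.1, 0.3])      # D(G(z)): D pushes toward 0, G pushes toward 1
print(gan_value(d_real, d_fake))
```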
Reinforcement learning concept
• The environment-representation learning framework naturally follows human/animal learning processes (“agent”-“environment” nomenclature). The agent’s actions depend on the state, and may or may not change the future environment
• Deep neural networks re-enable R.L. by learning complex data representations, without any hand-crafted feature extraction
• The agent’s state-to-action mapping is depicted by a policy function, which can be stochastic as well
• Volodymyr Mnih et. al. Human-level control through deep reinforcement learning, nature14236, 2015
37
Formalisms
• Value-based approaches such as the Deep Q-Network (DQN); see the tabular sketch below:
• Learn a value function; implicit policy function (ε-greedy)
• “Experience replay” is utilized to remove the correlations that cause divergence problems in R.L.
• Solely in the context of the MDP assumption
• Policy-based approaches such as the policy gradient method:
• No value function; learn the policy directly
• High-variance issue
• MDP not necessarily assumed
• Actor-Critic
(figure: agent-environment loop with state, action, rewards, policy)
38
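A tabular sketch of the value-based idea behind DQN (ε-greedy action selection and a bootstrapped target), omitting the neural network and experience replay for brevity. The toy environment, its states, and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.3

def step(s, a):
    """Hypothetical chain environment: moving right (a=1) toward the last state pays off."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if (a == 1 and s_next == n_states - 1) else 0.0
    return s_next, reward

s = 0
for _ in range(10000):
    # ε-greedy: mostly exploit the current value estimates, sometimes explore.
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    target = r + gamma * Q[s_next].max()            # bootstrapped TD target
    Q[s, a] += alpha * (target - Q[s, a])           # move the value toward the target
    s = s_next

print(Q)   # action 1 (move right) should end up with the larger value in every state
```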
Applications
• Navigating through an intersection under complicated environments
• David Isele, Navigating Intersections with Autonomous Vehicles using Deep Reinforcement Learning, arXiv:1705.01196
• Motion negotiations between “agents” under a dynamically changing environment
• Shai Shalev-Shwartz et. al. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving, arXiv:1610.03295v1
39
40
Summary and future research
• Unsupervised learning
• humans learn about the world more naturally by discovering, rather than through supervision
• Convolutional networks combined with recurrence to take temporal correlations into account, and thus make predictions in a dynamic fashion
• Reinforcement learning to pre-guide the learning toward the “ROI” (region of interest)
(figure: data representation learning enabling complex reasoning)
Backup slides
41
Thank You