Introduction to Machine Learning and Deep Learning

Terry Taewoong Um (terry.t.um@gmail.com)
University of Waterloo
Department of Electrical & Computer Engineering
MACHINE LEARNING, DEEP LEARNING, AND MOTION ANALYSIS

CAUTION
• I cannot explain everything
• You cannot get every detail
• Try to get a big picture
• Get some useful keywords
• Connect with your research

CONTENTS
1. What is Machine Learning?
(Part 1 Q & A)
2. What is Deep Learning?
(Part 2 Q & A)
3. Machine Learning in Motion Analysis
(Part 3 Q & A)

CONTENTS
1. What is Machine Learning?

WHAT IS MACHINE LEARNING?
"A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E" – T. Mitchell (1997)
Example: A program for soccer tactics
T : Win the game
P : Goals
E : (x) Players’ movements
(y) Evaluation

WHAT IS MACHINE LEARNING?
“Toward learning robot table tennis”, J. Peters et al. (2012)
https://youtu.be/SH3bADiB7uQ

TASKS
classification
discrete target values
x : pixels (28*28)
y : 0,1, 2,3,…,9
regression
real target values
x ∈ (0,100)
y ∈ ℝ
clustering
no target values
x ∈ (-3,3)×(-3,3)

PERFORMANCE
classification
0-1 loss function
regression
L2 loss function
clustering
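The two loss functions named above for classification and regression can be sketched as follows. This is a minimal illustration, assuming both losses are averaged over the examples (the slide does not say whether they are summed or averaged); the function names and the sample values are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss: 0 for a correct label, 1 for a wrong one, averaged over examples."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


def l2_loss(y_true, y_pred):
    """L2 loss: squared difference between target and prediction, averaged."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: discrete digit labels, one of four predictions is wrong.
digit_error = zero_one_loss([3, 7, 1, 9], [3, 1, 1, 9])

# Regression: real-valued targets and predictions.
fit_error = l2_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

print(digit_error, fit_error)
```

The 0-1 loss only counts mistakes, so it is not differentiable; the L2 loss is smooth, which is why it pairs naturally with gradient-based optimization.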

EXPERIENCE
classification
labeled data
(pixels)→(number)
regression
labeled data
(x) → (y)
clustering
unlabeled data
(x1,x2)

A TOY EXAMPLE
[Figure: scatter plot of Weight (kg) [Output Y] against Height (cm) [Input X], with a query point “?” on the height axis]

A TOY EXAMPLE
[Figure: the line Y = aX+b fitted to the Weight (kg) vs. Height (cm) data; Height 180 is marked against Weight 80]
Model : Y = aX+b    Parameter : (a, b)
[Goal] Find (a,b) which best fits the given data

A TOY EXAMPLE
[Analytic Solution]
Least-squares problem
(from AX = b, X = A#b, where A# is the pseudo-inverse of A)
Not always available
[Numerical Solution]
1. Set a cost function
2. Apply an optimization method
(e.g. Gradient Descent (GD) method)
[Figure: cost L plotted over the parameters (a,b)] http://www.yaldex.com/game-development/1592730043_ch18lev1sec4.html
Local minima problem
[Figure] http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm
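Both routes for fitting Y = aX+b can be sketched in a few lines. This is a minimal sketch, not the author's code: the data values are made up, the input is assumed to be centered and rescaled (height offset from 170 cm in units of 10 cm) so that plain gradient descent stays stable, and the cost is assumed to be the mean squared error.

```python
def fit_closed_form(xs, ys):
    """Analytic least-squares solution for the line Y = aX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fit_gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Numerical solution: gradient descent on the mean squared error."""
    a, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical data: x = (height - 170 cm) / 10, y = weight in kg.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [61.0, 64.0, 70.0, 76.0, 79.0]

a_exact, b_exact = fit_closed_form(xs, ys)
a_gd, b_gd = fit_gradient_descent(xs, ys)
print(a_exact, b_exact, a_gd, b_gd)
```

For this convex cost the two answers agree; the local minima problem mentioned above only appears for non-convex costs, where the result of gradient descent depends on the starting point.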

WHAT WOULD BE THE CORRECT MODEL?
[Figure: Running Record (min) plotted against Age (year); Age 32 is marked against a record of 140]
Select a model → Set a cost function → Optimization

“overfitting”
[Figure: a fit of Y against X, with a query point “?”]
1. Regularization 2. Nonparametric model

L2 REGULARIZATION
(e.g. w=(a,b) where Y=aX+b)
Avoid a complicated model!
• Another interpretation :
: Maximum a Posteriori (MAP)
http://goo.gl/6GE2ix
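The regularized cost on this slide did not survive extraction. A common form, assumed here rather than taken from the slide, adds a penalty on the squared norm of the parameters; the symbols λ, σ and τ are introduced only for this illustration.

```latex
% L2-regularized least squares for the toy model Y = aX + b
L(w) = \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)^2 + \lambda \lVert w \rVert_2^2 ,
\qquad w = (a, b)

% MAP view: Gaussian noise y_i \sim \mathcal{N}(a x_i + b, \sigma^2)
% and Gaussian prior w \sim \mathcal{N}(0, \tau^2 I)
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \; p(w \mid \mathrm{Data})
  = \arg\min_w \; \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)^2
    + \frac{1}{2\tau^2} \lVert w \rVert_2^2
```

Multiplying the MAP objective by 2σ² gives the regularized cost with λ = σ²/τ², so a tighter prior (smaller τ) corresponds to stronger regularization.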

L2 REGULARIZATION
• Another interpretation : Maximum a Posteriori (MAP)
• Bayesian inference
$P(\text{Belief} \mid \text{Data}) = \dfrac{P(\text{Belief})\, P(\text{Data} \mid \text{Belief})}{P(\text{Data})}$
(posterior = prior × likelihood / normalization)
ex) fair coin : 50% H, 50% T
falsified coin : 80% H, 20% T
Let’s say we observed ten heads consecutively.
What’s the probability for being a fair coin?
Fair coin? $P(\text{Belief}) = 0.2$, $P(\text{Data} \mid \text{Belief}) = 0.5^{10} \approx 0.001$, $P(\text{Belief} \mid \text{Data}) \propto 0.2 \times 0.001 = 0.0002$
Falsified coin? $P(\text{Belief}) = 0.8$, $P(\text{Data} \mid \text{Belief}) = 0.8^{10} \approx 0.107$, $P(\text{Belief} \mid \text{Data}) \propto 0.8 \times 0.107 = 0.0856$
Normalization : Fair $= \dfrac{0.0002}{0.0002 + 0.0856} = 0.23\%$, Unfair $= 99.77\%$
(you don’t believe this coin is fair)
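The coin example can be checked numerically. The sketch below uses the priors and head probabilities from the slide; the function and variable names are illustrative only.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: multiply prior by likelihood, then normalize over hypotheses."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(Data), the normalization term
    return {h: value / evidence for h, value in unnormalized.items()}


priors = {"fair": 0.2, "falsified": 0.8}
heads_in_a_row = 10
likelihoods = {
    "fair": 0.5 ** heads_in_a_row,       # P(ten heads | fair coin)
    "falsified": 0.8 ** heads_in_a_row,  # P(ten heads | falsified coin)
}

beliefs = posterior(priors, likelihoods)
for hypothesis, probability in beliefs.items():
    print(f"{hypothesis}: {probability:.2%}")
```

Without rounding the intermediate values, the fair-coin posterior comes out slightly below the slide's figure but still rounds to 0.23%.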

1. Regularization 2. Nonparametric model
[Figure: error plotted against training time, with a training error curve and a test error curve, annotated “we should stop here”]
• training set : for training (parameter optimization)
• validation set : for early stopping (avoid overfitting)
• test set : for evaluation (measure the performance)
keep watching the validation error
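Watching the validation error can be sketched as a simple stopping rule. This is one common variant, assumed here: training stops once the validation error has failed to improve for `patience` consecutive epochs, and the parameters from the best epoch are kept. The error values are made up.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the epoch whose parameters should be kept."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: stop training
    return best_epoch


# Hypothetical validation curve: falls at first, then rises as overfitting sets in.
validation_errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.41, 0.50]
stop_at = early_stopping_epoch(validation_errors)
print(stop_at)
```

The test set plays no part in this decision; it is only used once, afterwards, to measure the final performance.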

NONPARAMETRIC MODEL
• It does not assume any parametric model (e.g. Y = aX+b, Y = aX^2+bX+c, etc.)
• It often requires many more samples
• Kernel methods are frequently applied for modeling the data
• Gaussian Process Regression (GPR), a kind of kernel method, is a widely used nonparametric regression method
• Support Vector Machine (SVM), also a kind of kernel method, is a widely used nonparametric classification method
[Figure: a kernel function maps the input space to the feature space]

SUPPORT VECTOR MACHINE (SVM)
“Myo”, Thalmic Labs (2013)
https://youtu.be/oWu9TFJjHaM
[Figure panels: Linear classifiers; Maximum margin] Support Vector Machine Tutorial, J. Weston, http://goo.gl/19ywcj
[Dual formulation] (equation not recovered; annotated “kernel function”)
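The dual formulation itself is missing from the extracted slide. For reference, the standard soft-margin SVM dual is written below; it is the textbook form, not recovered from the slide, and the symbols α, C and K are the usual ones rather than the author's.

```latex
\max_{\alpha} \;\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0

% Decision function for a new input x
f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{N} \alpha_i y_i \, K(x_i, x) + b \Bigr)
```

The data enter only through the kernel function K(x_i, x_j), which is why the feature-space mapping never has to be computed explicitly; the training points with α_i > 0 are the support vectors.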

GAUSSIAN PROCESS REGRESSION (GPR)
https://youtu.be/YqhLnCm0KXY
https://youtu.be/kvPmArtVoFE
• Gaussian Distribution
• Multivariate regression likelihood
posterior
prior
likelihood
prediction conditioning the joint distribution of the observed & predicted values
https://goo.gl/EO54WN
http://goo.gl/XvOOmf

DIMENSION REDUCTION
[Original space] → [Feature space] : X → φ(X)
low dim. → high dim.
high dim. → low dim.
• Principal Component Analysis
: Find the best orthogonal axes (= principal components) which maximize the variance of the data
Y = P X
* The rows of P are the m eigenvectors of $\frac{1}{N} X X^T$ (the covariance matrix) with the largest eigenvalues
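The PCA recipe above can be illustrated for the first principal component. This sketch finds the leading eigenvector of the covariance matrix by power iteration, a technique swapped in here to avoid a linear-algebra library (the slide does not prescribe a solver); the data points are made up.

```python
import math


def covariance(points):
    """Center the data and return (centered points, covariance matrix)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    return centered, cov


def first_principal_component(cov, iterations=100):
    """Power iteration: repeatedly apply cov and renormalize."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D data lying close to a straight line.
points = [(-2.0, -1.8), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 1.9)]
centered, cov = covariance(points)
axis = first_principal_component(cov)

# Y = P X with a single row in P: project each point onto the main axis.
projected = [sum(c[j] * axis[j] for j in range(len(axis))) for c in centered]
print(axis, projected)
```

The projection keeps one number per point instead of two, while retaining more variance than either original coordinate does on its own.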

DIMENSION REDUCTION
http://jbhuang0604.blogspot.kr/2013/04/miss-korea-2013-contestants-face.html

SUMMARY - PART 1
• Machine Learning
- Tasks : Classification, Regression, Clustering, etc.
- Performance : 0-1 loss, L2 loss, etc.
- Experience : labeled data, unlabeled data
• Machine Learning Process
(1) Select a parametric / nonparametric model
(2) Set a performance measure, including a regularization term
(3) Train on the training data (optimize the parameters) until the validation error increases
(4) Evaluate the final performance using the test set
• Nonparametric model : Support Vector Machine, Gaussian Process Regression
• Dimension reduction : used for pre-processing the data

CONTENTS
Questions about Part 1?

CONTENTS
2. What is Deep Learning?

PARADIGM CHANGE
PAST : Knowledge → ML Method (e.g. GPR, SVM)
What is the best ML method for the target task?
PRESENT : Knowledge → Representation
How can we find a good representation?

PARADIGM CHANGE
Knowledge
PRESENT
Representation
How can we find a
kernel function

PARADIGM CHANGE
Knowledge
PRESENT
Representation
(Features)
How can we find a
IMAGE
SPEECH
Hand-Crafted Features

PARADIGM CHANGE
IMAGE
SPEECH
Hand-Crafted Features
Knowledge
PRESENT
Representation
(Features)
Can we learn a good representation
(feature) for the target task as well?

DEEP LEARNING
• What is Deep Learning (DL)?
- Learning methods which have a deep (not shallow) architecture
- It often allows end-to-end learning
- It automatically finds intermediate representations. Thus, it can be regarded as representation learning
- It often contains stacked “neural networks”. Thus, deep learning usually refers to “deep neural networks”
“Deep Gaussian Process” (2013)
https://youtu.be/NwoGqYsQifg
http://goo.gl/fxmmPE
http://goo.gl/5Ry08S

OUTSTANDING PERFORMANCE OF DL
• State-of-the-art results achieved by DL
- Object recognition (Simonyan et al., 2015)
- Neural machine translation (Bahdanau et al., 2014)
- Speech recognition (Chorowski et al., 2014)
- Face recognition (Taigman et al., 2014)
- Emotion recognition (Ebrahimi-Kahou et al., 2014)
- Human pose estimation (Jain et al., 2014)
- Deep reinforcement learning (Mnih et al., 2013)
- Image/Video caption (Xu et al., 2015)
- Particle physics (Baldi et al., 2014)
- Bioinformatics (Leung et al., 2014)
- And so on…
error rate : 28% (2010) → 15% (2012) → 8% (2014)
DL has won most ML challenges!
K. Cho, https://goo.gl/vdfGpu

BIOLOGICAL EVIDENCE
• Somatosensory cortex learns to see
• Why do we need different ML methods for different tasks?
Yann LeCun, https://goo.gl/VVQXJG
• The ventral pathway in the visual cortex has multiple stages
• There exist a lot of intermediate representations
Andrew Ng, https://youtu.be/ZmNOAtZIgIk

BIG MOVEMENT
http://goo.gl/zNbBE2 http://goo.gl/Lk64Q4
Going deeper and deeper….

NEURAL NETWORK (NN)
Hugo Larochelle, http://www.dmi.usherb.ca/~larocheh/index_en.html
• Universal approximation theorem (Hornik, 1991)
- A single-hidden-layer NN with a linear output can approximate any continuous function arbitrarily well, given enough hidden units
- This does not imply that we have a learning method that can train it

TRAINING NN
• First, calculate the output using data & initial parameters (W ,b)
• Activation functions
http://goo.gl/qMQk5H

TRAINING NN
• Then, calculate the error and update the weights from top to bottom
• Parameter gradients
: Backpropagation algorithm


TRAINING NN
• Repeat this process with different subsets of the data (mini-batches)
- Forward propagation (calculate the output values)
- Evaluate the error
- Backward propagation (update the weights)
- Repeat this process until the error converges
• As you can see here, an NN is not a fancy algorithm, but just an iterative gradient descent method with a huge number of parameters
• An NN is likely to get stuck in local minima
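The forward / error / backward loop can be written out for a tiny network. This is a minimal sketch under stated assumptions: one hidden layer with tanh activations, a linear output, mean squared error, and the whole four-point XOR dataset used as a single batch instead of mini-batches. All names and hyperparameters are illustrative.

```python
import math
import random


def forward(params, x):
    """Forward propagation: hidden activations and the scalar output."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return h, sum(w * hj for w, hj in zip(W2, h)) + b2


def loss(params, batch):
    """Mean squared error over the batch."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in batch) / len(batch)


def gradients(params, batch):
    """Backward propagation: chain rule from the output back to the first layer."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1, gW2, gb2 = [0.0] * len(b1), [0.0] * len(W2), 0.0
    for x, y in batch:
        h, out = forward(params, x)
        d_out = 2.0 * (out - y) / len(batch)
        gb2 += d_out
        for j, hj in enumerate(h):
            gW2[j] += d_out * hj
            d_z = d_out * W2[j] * (1.0 - hj * hj)  # derivative of tanh
            gb1[j] += d_z
            for i, xi in enumerate(x):
                gW1[j][i] += d_z * xi
    return gW1, gb1, gW2, gb2


def update(params, grads, lr):
    """One gradient descent step on every parameter."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return ([[w - lr * g for w, g in zip(r, gr)] for r, gr in zip(W1, gW1)],
            [b - lr * g for b, g in zip(b1, gb1)],
            [w - lr * g for w, g in zip(W2, gW2)],
            b2 - lr * gb2)


def train(batch, hidden=3, lr=0.05, steps=500, seed=0):
    rng = random.Random(seed)
    n_in = len(batch[0][0])
    params = ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)],
              [0.0] * hidden,
              [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
              0.0)
    history = [loss(params, batch)]
    for _ in range(steps):
        params = update(params, gradients(params, batch), lr)
        history.append(loss(params, batch))
    return params, history


XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
params, history = train(XOR)
print(history[0], history[-1])
```

The error goes down, but nothing guarantees that it reaches zero: depending on the initialization, a run like this can settle in a poor solution, which is the local minima issue noted above.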

FROM NN TO DEEP NN
• A long winter of NN
- An NN requires expert skill to tune the hyperparameters
- It sometimes gives a good result, but sometimes gives a bad result. The result is highly dependent on the quality of initialization, regularization, hyperparameters, data, etc.
- Local minima are always problematic
• From NN to deep NN (since 2006)
Yann LeCun (NYU, Facebook)
Yoshua Bengio (U. Montreal)
Geoffrey Hinton (U. Toronto, Google)

WHY IS DL SO SUCCESSFUL?
http://t-robotics.blogspot.kr/2015/05/deep-learning.html
• Pre-training with unsupervised learning
• Convolutional Neural Network
• Recurrent Neural Net
• GPGPU (parallel processing) & big data
• Advanced algorithms for optimization, activation, regularization
• Huge research community (Vision, Speech, NLP, Biology, etc.)

UNSUPERVISED LEARNING
• How can we avoid pathological local minima?
(1) First, pre-train on the data with an unsupervised learning method and get a new representation
(2) Stack up these block structures
(3) Train each layer in an end-to-end manner
(4) Fine-tune the final structure with an (ordinary) fully-connected NN
• Unsupervised learning methods
- Restricted Boltzmann Machine (RBM)
→ Deep RBM, Deep Belief Network (DBN)
- Autoencoder
→ Deep Autoencoder
http://goo.gl/QGJm5k
Autoencoder http://goo.gl/s6kmqY

UNSUPERVISED LEARNING
“Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations”, Lee et al., 2012

CONVOLUTIONAL NN
• How can we deal with real images, which are much bigger than MNIST digit images?
- Use a locally-connected NN, not a fully-connected one
- Use convolutions to get various feature maps
- Abstract the results into higher layers by using pooling
- Fine-tune with a fully-connected NN
https://goo.gl/G7kBjI
https://goo.gl/Xswsbd
http://goo.gl/5OR5oH
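The convolution and pooling steps listed above can be sketched on a toy image. This is an illustration only: the kernel is hand-picked rather than learned, the "convolution" is the unflipped sliding-window sum commonly used in deep learning libraries, and the image values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) to get a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size=2):
    """Keep the largest response in each non-overlapping size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]


# A 5x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Each output depends only on a 2x2 neighborhood (local connectivity),
# and the same weights are reused at every position.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = conv2d(image, edge_kernel)   # 4x4 map, responds at the edge
pooled = max_pool(feature_map)             # 2x2 summary of the map
print(feature_map)
print(pooled)
```

Because the kernel weights are shared across positions, the number of parameters does not grow with the image size, which is what makes large real images tractable.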

CONVOLUTIONAL NN
“Visualizing and Understanding Convolutional Networks”, Zeiler et al., 2012

CONVNET + RNN
“Large-scale Video Classification with Convolutional Neural Networks”, A. Karpathy, 2014, https://youtu.be/qrzQ_AB1DZk

RECURRENT NEURAL NETWORK (RNN)
t-1 t t+1
[Neural Network] [Recurrent Neural Network]
http://www.dmi.usherb.ca/~larocheh/index_en.html

RECURRENT NEURAL NETWORK (RNN)
[Neural Network] back propagation
[Recurrent Neural Network] back propagation through time (BPTT)
• Vanishing gradient problem : Can’t have long memory!
“Training Recurrent Neural Networks”, I. Sutskever, 2013
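The vanishing gradient problem can be seen in a one-unit recurrent network. The sketch below assumes the scalar recurrence h_t = tanh(w·h_{t-1} + u·x_t) with made-up weights; back propagation through time multiplies one factor w·(1 − h_t²) per time step, so the influence of an early state shrinks geometrically when |w| < 1.

```python
import math


def rnn_forward(inputs, w=0.5, u=1.0, h0=0.0):
    """Run the scalar recurrence and return every hidden state, h0 included."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states


def gradient_wrt_initial_state(states, w=0.5):
    """d h_T / d h_0 via the chain rule: one factor per time step (BPTT)."""
    gradient = 1.0
    for h in states[1:]:
        gradient *= w * (1.0 - h * h)  # derivative of tanh times the recurrent weight
    return gradient


states = rnn_forward([0.1] * 50)
grad_after_5 = gradient_wrt_initial_state(states[:6])
grad_after_50 = gradient_wrt_initial_state(states)
print(grad_after_5, grad_after_50)
```

After a few dozen steps the gradient is numerically negligible, so the network receives almost no learning signal about events far in the past.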

RNN + LSTM
• Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
“Training Recurrent Neural Networks”, I. Sutskever, 2013

INTERESTING RESULTS FROM RNN
http://pail.unist.ac.kr/carpedm20/poet/
http://cs.stanford.edu/people/karpathy/deepimagesent/
“Generating Sequences With Recurrent Neural Networks”, A. Graves, 2013


CONTENTS
Questions about Part 2?

CONTENTS
3. Machine Learning in Motion Analysis

MOTION DATA
“츄리닝” (“Tracksuit”), comic by 이상신 and 국중록

MOTION DATA
We need to know the state not only at time t but also at time t-1, t-2, t-3, etc.
f = f(x, t)
“츄리닝” (“Tracksuit”), comic by 이상신 and 국중록

MOTION DATA
• Why do motion data need special treatment?
- In general, most machine learning techniques assume an i.i.d. (independent & identically distributed) sampling condition, e.g. coin tossing
- However, motion data are temporally & spatially correlated
[Figures: swing motion, http://goo.gl/LQulvc ; manipulability ellipsoid, https://goo.gl/dHjFO9]

MOTION DATA
http://goo.gl/ll3sq6
We can infer the next state based on the temporal & spatial information
But how can we exploit those benefits in an ML method?

WHAT CAN WE DO WITH MOTION DATA?
Tasks
• Learning the kinematic/dynamic model
• Motion segmentation
• Motion generation / synthesis
• Motion imitation (Imitation learning)
• Activity / Gesture recognition
Data
• Motion capture data
• Vision data
• Dynamic-level data
Applications
• Biomechanics
• Humanoid
• Animation
http://goo.gl/gFOVWL

HIDDEN MARKOV MODEL (HMM)
The probability of the state at (n+1) depends only on the state at n

LIMITATIONS OF HMM
• A common procedure of HMM for motion analysis
1. Extract features (e.g. PCA)
2. Define the HMM structure (e.g. using GMM)
3. Train a separate HMM per class (Baum-Welch algorithm)
4. Evaluate the probability under each HMM (Fwd/Bwd algorithm), or choose the most probable sequence (Viterbi algorithm)
• Limitations & trend change in the speech recognition area
- HMM handles discrete states only!
- HMM has short memory! (using just the previous state)
- HMM has limited expressive power!
- [Trend1] features-GMM → unsupervised learning methods
- [Trend2] features-GMM-HMM → recurrent neural network
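The "evaluate the probability under each HMM" step can be sketched with the forward algorithm. This is a minimal example with two hand-written two-state models and discrete observations; in practice the parameters would come from Baum-Welch training, and every number and name below is hypothetical.

```python
def sequence_likelihood(observations, start, trans, emit):
    """Forward algorithm: P(observations | HMM), summed over all hidden paths."""
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * trans[prev][s] for prev in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(observations, models):
    """Pick the class whose HMM gives the observed sequence the highest probability."""
    return max(models, key=lambda name: sequence_likelihood(observations, *models[name]))


start = [0.5, 0.5]
trans = [[0.9, 0.1],   # each state depends only on the previous state
         [0.1, 0.9]]
models = {
    # (start, transition, emission); observation 0 = slow frame, 1 = fast frame
    "walk": (start, trans, [[0.8, 0.2], [0.7, 0.3]]),
    "run": (start, trans, [[0.2, 0.8], [0.3, 0.7]]),
}

label_slow = classify([0, 0, 0, 0], models)
label_fast = classify([1, 1, 1, 0, 1], models)
print(label_slow, label_fast)
```

The recursion carries only one vector of per-state probabilities from step to step, which is the short memory that the limitations above refer to.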

CAPTURE TEMPORAL INFORMATION
• 3D ConvNet
- “3D Convolutional Neural Networks for Human Action Recognition” (Ji et al., 2010)
- 3D convolution
- Activity recognition / Pose estimation from video
“Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation”, Tompson et al., 2014

CAPTURE TEMPORAL INFORMATION
• Recurrent Neural Network (RNN)
“Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition”, Y. Du et al., 2015
• However, how can we capture the spatial information about motions?

CHALLENGES
We should connect the geometric information with deep neural networks!
• The link transformation from the (i-1)-th link to the i-th link (M : constant, θ : variable)
$X_{i-1,i} = \mathrm{Rot}(z, \theta_i)\,\mathrm{Trans}(z, d_i)\,\mathrm{Trans}(x, a_i)\,\mathrm{Rot}(x, \alpha_i) = e^{[A_i]\theta_i} M_{i-1,i}$
• Forward Kinematics
$X_{0,n} = e^{[A_1]\theta_1} M_{0,1}\, e^{[A_2]\theta_2} M_{1,2} \cdots e^{[A_n]\theta_n} M_{n-1,n} = e^{[S_1]\theta_1} e^{[S_2]\theta_2} \cdots e^{[S_n]\theta_n} M_{0,n}$
where $S_i = \mathrm{Ad}_{M_{0,1} \cdots M_{i-2,i-1}}(A_i), \quad i = 1, \cdots, n$
• Newton-Euler formulation for inverse dynamics
[Equations not recovered; annotated terms: propagated forces, external force acting on the i-th body]
c.f.) Lie group & Lie algebra, http://goo.gl/uqilDV
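Forward kinematics as a product of link transformations can be illustrated with a planar arm. This is a simplified sketch, not the spatial formulation on the slide: it assumes revolute joints about the z axis and links along the local x axis, so each link transformation reduces to Rot(z, θ_i)·Trans(x, a_i) as a 3x3 homogeneous matrix. Names and values are hypothetical.

```python
import math


def rot_z(theta):
    """Planar rotation by theta as a 3x3 homogeneous matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def trans_x(length):
    """Translation along the local x axis by the link length."""
    return [[1.0, 0.0, length], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def forward_kinematics(joint_angles, link_lengths):
    """Chain the link transformations: X_0n = X_01 X_12 ... X_(n-1)n."""
    pose = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for theta, length in zip(joint_angles, link_lengths):
        pose = matmul(pose, matmul(rot_z(theta), trans_x(length)))
    return pose


# Two unit-length links: the first joint turns +90 degrees, the second -90 degrees.
pose = forward_kinematics([math.pi / 2, -math.pi / 2], [1.0, 1.0])
end_x, end_y = pose[0][2], pose[1][2]
print(end_x, end_y)
```

The joint angles are the only variables; the link geometry is constant. A network that reasons about poses has to respect this structure rather than treat joint coordinates as unrelated numbers.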

CHALLENGES
https://www.youtube.com/watch?v=oxA2O-tHftI

Thank you
