© Eric Xing @ CMU, 2006-2010
Machine Learning
Neural Networks
Eric Xing
Lecture 3, August 12, 2010
Reading: Chap. 5, Christopher Bishop (PRML)
© Eric Xing @ CMU, 2006-2010
Learning highly non-linear
functions
f: X → Y
 f might be a non-linear function
 X (vector of) continuous and/or discrete vars
 Y (vector of) continuous and/or discrete vars
Examples: the XOR gate, speech recognition
© Eric Xing @ CMU, 2006-2010
 From biological neuron to artificial neuron (perceptron)
 Activation function
 Artificial neuron networks
 supervised learning
 gradient descent
Perceptron and Neural Nets
[Figure: a biological neuron (soma, dendrites, axon, synapses) next to its artificial counterpart.]
The artificial neuron (perceptron): inputs x1, ..., xn with weights w1, ..., wn feed a linear combiner; a hard limiter with threshold θ then produces the output Y:
X = Σ_{i=1..n} x_i w_i
Y = +1 if X ≥ θ, -1 if X < θ
[Figure: a layered network with an input layer, a middle layer, and an output layer; input signals enter on one side and output signals leave on the other.]
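To make the diagram concrete, here is a minimal sketch of the hard-limiter perceptron described above (my own illustrative Python, not from the slides; the input values, weights, and threshold are made up):

```python
import numpy as np

def perceptron(x, w, theta):
    """Hard-limiter perceptron: Y = +1 if X >= theta, -1 otherwise,
    where X = sum_i x_i * w_i (the linear combiner)."""
    X = np.dot(x, w)                   # linear combiner
    return 1 if X >= theta else -1     # hard limiter with threshold theta

# Illustrative values: two inputs and two weights, as in the figure.
x = np.array([1.0, 0.0])
w = np.array([0.5, 0.5])
print(perceptron(x, w, theta=0.4))     # -> 1
```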
© Eric Xing @ CMU, 2006-2010
Connectionist Models
 Consider humans:
 Neuron switching time ~ 0.001 second
 Number of neurons ~ 10^10
 Connections per neuron ~ 10^4–10^5
 Scene recognition time ~ 0.1 second
 100 inference steps doesn't seem like enough
→ much parallel computation
 Properties of artificial neural nets (ANN)
 Many neuron-like threshold switching units
 Many weighted interconnections among units
 Highly parallel, distributed processes
[Figure: an artificial network of nodes joined by weighted links (the weights play the role of synapses, excitatory '+' or inhibitory '-'), shown next to a biological neuron with dendrites, axon, and synapses.]
© Eric Xing @ CMU, 2006-2010
[Figure: a perceptron for diagnosing abdominal pain. Inputs (with example values): Male, Age, Temp, WBC, Pain Intensity, Pain Duration. Adjustable weights connect the inputs to seven 0/1 output units: Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; in the example, exactly one diagnosis unit is on.]
Abdominal Pain Perceptron
© Eric Xing @ CMU, 2006-2010
[Figure: a network for predicting myocardial infarction. Inputs (with example values): Pain Duration, Pain Intensity, ECG: ST Elevation, Smoker, Age, Male. Output: "Probability" of MI = 0.8 for the example shown.]
Myocardial Infarction Network
© Eric Xing @ CMU, 2006-2010
The "Driver" Network
© Eric Xing @ CMU, 2006-2010
[Figure: a perceptron with input units (Cough, Headache), weighted connections, and output units (No disease, Pneumonia, Flu, Meningitis). The error is what we wanted minus what we got; the ∆ rule changes the weights to decrease the error.]
Perceptrons
© Eric Xing @ CMU, 2006-2010
[Figure: input units i connected to output units j.]
Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))
Perceptrons
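The sigmoid output unit above can be sketched as follows (illustrative Python, not from the lecture; the inputs, weights, and bias are made up):

```python
import numpy as np

def sigmoid_unit(a_in, w, theta_j):
    """Output of unit j: o_j = 1 / (1 + exp(-(a_j + theta_j))),
    where a_j = sum_i w_ij * a_i is the weighted input from units i."""
    a_j = np.dot(w, a_in)
    return 1.0 / (1.0 + np.exp(-(a_j + theta_j)))

# Illustrative: three measured input variables and their weights.
a_in = np.array([34.0, 1.0, 4.0])     # e.g. Age, Gender, Stage
w = np.array([0.01, 0.5, -0.1])       # made-up weights
print(sigmoid_unit(a_in, w, theta_j=-0.5))
```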
© Eric Xing @ CMU, 2006-2010
Jargon Pseudo-Correspondence
 Independent variable = input variable
 Dependent variable = output variable
 Coefficients = “weights”
 Estimates = “targets”
Logistic Regression Model (the sigmoid unit)
[Figure: a single sigmoid unit. Inputs: Age = 34, Gender = 1, Stage = 4; a weighted sum Σ feeds the sigmoid, giving the output 0.6, the "probability of being alive". Correspondence: coefficients a, b, c ↔ weights; independent variables x1, x2, x3 ↔ inputs; dependent variable p ↔ prediction.]
© Eric Xing @ CMU, 2006-2010
The perceptron learning
algorithm
 Recall the nice property of sigmoid function
 Consider the regression problem f: X → Y, for scalar Y:
 Let’s maximize the conditional data likelihood
© Eric Xing @ CMU, 2006-2010
Gradient Descent
x_d = input
t_d = target output
o_d = observed unit output
w_i = weight i
© Eric Xing @ CMU, 2006-2010
The perceptron learning rules
x_d = input
t_d = target output
o_d = observed unit output
w_i = weight i
Batch mode:
Do until converged:
1. compute the gradient ∇E_D[w]
2. update w ← w − η ∇E_D[w]
Incremental mode:
Do until converged:
 For each training example d in D:
1. compute the gradient ∇E_d[w]
2. update w ← w − η ∇E_d[w]
where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)² and E_d[w] = ½ (t_d − o_d)².
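Both modes can be sketched for a single sigmoid unit with the squared error above (illustrative Python; the dataset, learning rate, and iteration counts are my own choices, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, t, eta=0.5, epochs=200):
    """Batch mode: each step uses the gradient of E_D[w] over the whole set D."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)                      # observed outputs o_d
        grad = -X.T @ ((t - o) * o * (1 - o))   # gradient of E_D = 1/2 sum_d (t_d - o_d)^2
        w -= eta * grad
    return w

def incremental_gradient_descent(X, t, eta=0.5, epochs=200):
    """Incremental (stochastic) mode: update after each training example d."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = sigmoid(x_d @ w)
            grad_d = -(t_d - o_d) * o_d * (1 - o_d) * x_d   # gradient of E_d
            w -= eta * grad_d
    return w

# Tiny illustrative dataset (first column is a constant bias input).
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0., 1., 1., 1.])    # OR-like targets, just as an example
print(batch_gradient_descent(X, t))
print(incremental_gradient_descent(X, t))
```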
© Eric Xing @ CMU, 2006-2010
MLE vs MAP
 Maximum conditional likelihood estimate
 Maximum a posteriori estimate
© Eric Xing @ CMU, 2006-2010
What decision surface does a
perceptron define?
x  y  Z (color)
0  0  1
0  1  1
1  0  1
1  1  0
NAND
Some possible values for (w1, w2): (0.20, 0.20), (0.25, 0.40), (0.35, 0.40), (0.30, 0.20)
f(x1·w1 + x2·w2) = y:
f(0·w1 + 0·w2) = 1
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0
[Figure: a single unit y with inputs x1, x2, weights w1, w2, and threshold θ = 0.5.]
f(a) = 1 for a > θ; 0 for a ≤ θ
© Eric Xing @ CMU, 2006-2010
x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0
What decision surface does a perceptron define?
XOR
Some possible values for w1 and w2: none; the required outputs
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0
cannot all be produced by a single threshold unit.
[Figure: a single unit y with inputs x1, x2, weights w1, w2, and threshold θ = 0.5.]
f(a) = 1 for a > θ; 0 for a ≤ θ
© Eric Xing @ CMU, 2006-2010
x  y  Z (color)
0  0  0
0  1  1
1  0  1
1  1  0
What decision surface does a perceptron define?
XOR
f(a) = 1 for a > θ; 0 for a ≤ θ
[Figure: a two-layer network: inputs x1, x2 feed two hidden units through weights w1–w4, and the hidden units feed the output through weights w5, w6; θ = 0.5 for all units.]
a possible set of values for (w1, w2, w3, w4, w5, w6):
(0.6, -0.6, -0.7, 0.8, 1, 1)
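A quick check of that weight set (my own script; the exact wiring of w1 through w6 in the figure is an assumption: here w1 and w3 feed hidden unit 1, w2 and w4 feed hidden unit 2, and w5, w6 connect the hidden units to the output). With this wiring the given values reproduce the XOR table:

```python
def f(a, theta=0.5):
    # threshold unit: 1 for a > theta, 0 for a <= theta
    return 1 if a > theta else 0

w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1.0, 1.0

def two_layer(x1, x2):
    h1 = f(w1 * x1 + w3 * x2)   # hidden unit 1 (assumed wiring)
    h2 = f(w2 * x1 + w4 * x2)   # hidden unit 2 (assumed wiring)
    return f(w5 * h1 + w6 * h2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", two_layer(x1, x2))   # 0, 1, 1, 0
```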
© Eric Xing @ CMU, 2006-2010
[Figure: non-linear separation. Left: the 2-D input space (Cough / No cough × Headache / No headache), coded 00, 01, 10, 11, partitioned into regions for No disease, Pneumonia, Flu, and Meningitis. Right: a 3-D cube of binary codes (000 … 111) split into Treatment vs. No treatment regions.]
Non Linear Separation
© Eric Xing @ CMU, 2006-2010
[Figure: a neural network model. Independent variables (inputs): Age = 34, Gender = 2, Stage = 4. One set of weights (.6, .5, .8, .2, .1, .3) connects the inputs to a hidden layer of Σ-units; a second set (.7, .2, .4, .2) connects the hidden layer to the output Σ-unit. The dependent variable (prediction): "probability of being alive" = 0.6.]
Neural Network Model
© Eric Xing @ CMU, 2006-2010
[Figure: the same network read as "combined logistic models": each hidden unit is a logistic model of the inputs (Age = 34, Gender = 2, Stage = 4), and the output unit is a logistic model of the hidden units, giving "probability of being alive" = 0.6. Only one hidden unit's weights (.6, .5, .8) and its output connection are highlighted.]
"Combined logistic models"
© Eric Xing @ CMU, 2006-2010
[Figure: the same network with a second logistic sub-model highlighted (weights .5, .8, .2, .3, .2), again producing "probability of being alive" = 0.6.]
© Eric Xing @ CMU, 2006-2010
[Figure: the full network again, with all weights shown (.6, .5, .8, .2, .1, .3, .7, .2) and inputs Age = 34, Gender = 1, Stage = 4; output "probability of being alive" = 0.6.]
© Eric Xing @ CMU, 2006-2010
[Figure: the same two-layer network and weights as before.]
Not really,
no target for hidden units...
© Eric Xing @ CMU, 2006-2010
[Figure (repeated from before): the perceptron with input units (Cough, Headache), output units (No disease, Pneumonia, Flu, Meningitis), the error (what we wanted minus what we got), and the ∆ rule: change weights to decrease the error.]
Perceptrons
© Eric Xing @ CMU, 2006-2010
[Figure: a network with input units, hidden units, and output units. The error (what we wanted minus what we got) drives a ∆ rule at the output layer, and a second ∆ rule propagates the correction back to the hidden-layer weights.]
Hidden Units and
Backpropagation
© Eric Xing @ CMU, 2006-2010
Backpropagation Algorithm
 Initialize all weights to small random numbers
Until convergence, do:
1. Input the training example to the network and compute the network outputs
2. For each output unit k, compute its error term δ_k
3. For each hidden unit h, compute its error term δ_h
4. Update each network weight w_i,j
where
x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
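A compact sketch of the algorithm for one hidden layer of sigmoid units, using the standard delta terms (illustrative Python; the network size, data, learning rate, and iteration budget are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
eta = 0.5
W1 = rng.normal(0.0, 0.5, (2, 3))   # hidden weights (2 hidden units, 2 inputs + bias)
W2 = rng.normal(0.0, 0.5, (1, 3))   # output weights (1 output, 2 hidden + bias)

X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])  # leading 1 = bias
T = np.array([[0.], [1.], [1.], [0.]])                                  # XOR, as an example

for _ in range(10000):                        # "until convergence" (fixed budget here)
    for x_d, t_d in zip(X, T):
        # 1. forward pass: compute the network outputs
        h = sigmoid(W1 @ x_d)
        h_aug = np.concatenate(([1.0], h))    # bias unit for the hidden layer
        o = sigmoid(W2 @ h_aug)
        # 2. error term for each output unit k: delta_k = o_k (1 - o_k) (t_k - o_k)
        delta_o = o * (1 - o) * (t_d - o)
        # 3. error term for each hidden unit h: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
        delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)
        # 4. update each network weight: w <- w + eta * delta * input
        W2 += eta * np.outer(delta_o, h_aug)
        W1 += eta * np.outer(delta_h, x_d)

for x_d, t_d in zip(X, T):
    o = sigmoid(W2 @ np.concatenate(([1.0], sigmoid(W1 @ x_d))))
    print(x_d[1:], t_d, np.round(o, 2))   # outputs should approach the targets
                                          # (may need more epochs or another seed)
```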
© Eric Xing @ CMU, 2006-2010
More on Backpropagation
 It is doing gradient descent over entire network weight vector
 Easily generalized to arbitrary directed graphs
 Will find a local, not necessarily global error minimum
 In practice, often works well (can run multiple times)
 Often include weight momentum α
 Minimizes error over training examples
 Will it generalize well to subsequent testing examples?
 Training can take thousands of iterations → very slow!
 Using network after training is very fast
© Eric Xing @ CMU, 2006-2010
Learning Hidden Layer
Representation
 A network:
 A target function:
 Can this be learned?
© Eric Xing @ CMU, 2006-2010
Learning Hidden Layer
Representation
 A network:
 Learned hidden layer representation:
© Eric Xing @ CMU, 2006-2010
Training
© Eric Xing @ CMU, 2006-2010
Training
© Eric Xing @ CMU, 2006-2010
Expressive Capabilities of ANNs
 Boolean functions:
 Every Boolean function can be represented by network with single hidden layer
 But might require exponential (in number of inputs) hidden units
 Continuous functions:
 Every bounded continuous function can be approximated with arbitrarily small
error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
 Any function can be approximated to arbitrary accuracy by a network with two
hidden layers [Cybenko 1988].
© Eric Xing @ CMU, 2006-2010
Application: ANN for Face Reco.
 The model  The learned hidden unit
weights
© Eric Xing @ CMU, 2006-2010
[Figure: regression vs. a neural network over inputs X1, X2, X3. Regression with interaction terms uses X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3, i.e. (2^3 − 1) possible combinations:
Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
In the neural network, hidden units play the role of such terms ("X1", "X2", "X1X3", "X1X2X3").]
Regression vs. Neural Networks
© Eric Xing @ CMU, 2006-2010
[Figure: the error surface over a weight. Starting from w_initial (initial error), a negative derivative produces a positive weight change; training ends at w_trained (final error), a local minimum.]
Minimizing the Error
© Eric Xing @ CMU, 2006-2010
Overfitting in Neural Nets
[Figure: overfitting. Left: CHD vs. age, an overfitted model compared with the "real" model. Right: error vs. training cycles, where training error keeps decreasing while holdout error eventually rises for the overfitted model.]
© Eric Xing @ CMU, 2006-2010
Alternative Error Functions
 Penalize large weights (see the sketch after this list):
 Training on target slopes as well as values
 Tie together weights
 E.g., in phoneme recognition
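A minimal sketch of the weight-penalty idea from the first bullet (my own illustrative code; the penalty coefficient gamma and learning rate are made up, and the data-term gradient is left abstract):

```python
import numpy as np

def penalized_error(w, residuals, gamma=0.01):
    # E(w) = 1/2 * sum of squared training errors + gamma * sum of squared weights
    return 0.5 * np.sum(residuals ** 2) + gamma * np.sum(w ** 2)

def update(w, grad_of_data_term, eta=0.1, gamma=0.01):
    # The penalty adds 2 * gamma * w to the gradient, i.e. plain "weight decay".
    return w - eta * (grad_of_data_term + 2.0 * gamma * w)

w = np.array([1.0, -2.0, 0.5])
print(penalized_error(w, residuals=np.array([0.2, -0.1])))
print(update(w, grad_of_data_term=np.zeros(3)))   # weights shrink toward zero
```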
© Eric Xing @ CMU, 2006-2010
Artificial neural networks – what
you should know
 Highly expressive non-linear functions
 Highly parallel network of logistic function units
 Minimizing sum of squared training errors
 Gives MLE estimates of network weights if we assume zero mean Gaussian noise on output
values
 Minimizing sum of sq errors plus weight squared (regularization)
 MAP estimates assuming weight priors are zero mean Gaussian
 Gradient descent as training procedure
 How to derive your own gradient descent procedure
 Discover useful representations at hidden units
 Local minima are the greatest problem
 Overfitting, regularization, early stopping
Neural Network Approaches for
Visual Recognition
L. Fei-Fei
Computer Science Dept.
Stanford University
Machine learning in computer vision
• Aug 12, Lecture 3: Neural Network
– Convolutional Nets for object recognition
– Unsupervised feature learning via Deep Belief Net
Machine learning in computer vision
• Aug 12, Lecture 3: Neural Network
– Convolutional Nets for object recognition
(slides courtesy of Yann LeCun (NYU))
– Unsupervised feature learning via Deep Belief Net
[Reference paper: Yann LeCun et al., Proc. IEEE, 1998]
Mammalian visual pathway is hierarchical
• The ventral (recognition) pathway in the visual cortex
– Retina → LGN → V1 → V2 → V4 → PIT → AIT (80-100ms)
[picture from Simon Thorpe]
• Efficient Learning Algorithms for multi-stage “Deep Architectures”
– Multi-stage learning is difficult
• Learning good internal representations of the world (features)
– Hierarchy of features that capture the relevant information and eliminate
irrelevant variabilities
• Learning with unlabeled samples (unsupervised learning)
– Animals can learn vision by just looking at the world
– No supervision is required
• Deep learning algorithms that can be applied to other modalities
– If we can learn feature hierarchies for vision, we can learn feature
hierarchies for audition, natural language, ....
Some challenges for machine learning in vision
• The raw input is pre-processed through a hand-crafted feature
extractor
• The features are not learned
• The trainable classifier is often generic (task independent), and
“simple” (linear classifier, kernel machine, nearest neighbor,.....)
• The most common Machine Learning architecture: the Kernel
Machine
[Diagram: Pre-processing / Feature Extraction (this part is mostly hand-crafted) → Internal Representation → "Simple" Trainable Classifier.]
The traditional “shallow” architecture
for recognition
• In Language: hierarchy in syntax and semantics
– Words → Parts of Speech → Sentences → Text
– Objects, Actions, Attributes... → Phrases → Statements → Stories
• In Vision: part-whole hierarchy
– Pixels → Edges → Textons → Parts → Objects → Scenes
[Diagram: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier.]
Good representations are hierarchical
• Deep Learning: learning a hierarchy of internal representations
• From low-level features to mid-level invariant representations, to
object identities
• Representations are increasingly invariant as we go up the layers
• using multiple stages gets around the specificity/invariance
dilemma
[Diagram: Trainable Feature Extractor → Trainable Feature Extractor → Trainable Classifier, built on a Learned Internal Representation.]
"Deep" learning: learning hierarchical representations
• We can approximate any function as close as we
want with shallow architecture. Why would we
need deep ones?
–kernel machines and 2-layer neural net are “universal”.
• Deep learning machines
• Deep machines are more efficient for representing
certain classes of functions, particularly those
involved in visual recognition
–they can represent more complex functions with less
“hardware”
• We need an efficient parameterization of the class
of functions that are useful for “AI” tasks.
Do we really need deep architecture?
[Diagram: Filter Bank → Non-Linearity → Spatial Pooling.]
• Multiple Stages
• Each Stage is composed of
– A bank of local filters (convolutions)
– A non-linear layer (may include harsh non-linearities, such as
rectification, contrast normalization, etc...).
– A feature pooling layer
• Multiple stages can be stacked to produce high-level representations
– Each stage makes the representation more global, and more
invariant
• The systems can be trained with a combination of unsupervised and
supervised methods
[Diagram: (Filter Bank → Non-Linearity → Spatial Pooling) stages stacked, followed by a Classifier.]
Hierarchical/deep architectures for vision
• [Hubel & Wiesel 1962]:
– simple cells detect local features
– complex cells “pool” the outputs of simple cells within a retinotopic
neighborhood.
[Diagram: multiple convolutions ("simple cells") produce retinotopic feature maps; pooling/subsampling ("complex cells") follows.]
Filtering + Non-Linearity + Pooling = 1 stage of a Convolutional Net
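A minimal numpy sketch of one such stage (my own illustrative code: a bank of "simple cell" filters via valid convolution, a pointwise non-linearity, then "complex cell" max pooling over 2x2 neighborhoods; the filters here are random, not trained):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D convolution (really cross-correlation, which is what
    learned filter banks usually compute)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, p=2):
    """Max pooling over non-overlapping p x p neighborhoods."""
    H, W = fmap.shape
    H, W = (H // p) * p, (W // p) * p
    return fmap[:H, :W].reshape(H // p, p, W // p, p).max(axis=(1, 3))

def convnet_stage(image, filters):
    """Filter bank -> pointwise non-linearity (tanh) -> spatial max pooling."""
    return [max_pool(np.tanh(conv2d_valid(image, k))) for k in filters]

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
filters = rng.normal(size=(6, 5, 5))      # 6 random 5x5 filters (untrained)
maps = convnet_stage(image, filters)
print(len(maps), maps[0].shape)           # 6 feature maps of size 14x14
```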
• Biologically-inspired models of low-level feature extraction
– Inspired by [Hubel and Wiesel 1962]
• Only 3 types of operations needed for traditional ConvNets:
– Convolution/Filtering
– Pointwise Non-Linearity
– Pooling/Subsampling
[Diagram: Filter Bank → Non-Linearity → Spatial Pooling.]
Feature extraction by filtering and pooling
Convolutional Network: Multi-Stage Trainable Architecture
Hierarchical Architecture
Representations are more global, more invariant, and more
abstract as we go up the layers
Alternated Layers of Filtering and Spatial Pooling
Filtering detects conjunctions of features
Pooling computes local disjunctions of features
Fully Trainable
All the layers are trainable
[Diagram: Convolutions (filtering) → Pooling (subsampling) → Convolutions (filtering) → Pooling (subsampling) → Convolutions (filtering) → Convolutions (classification).]
Convolutional network architecture
• Building a complete artificial vision system:
– Stack multiple stages of simple cells / complex cells layers
– Higher stages compute more global, more invariant features
– Stick a classification layer on top
– [Fukushima 1971-1982]
• neocognitron
– [LeCun 1988-2007]
• convolutional net
– [Poggio 2002-2006]
• HMAX
– [Ullman 2002-2006]
• fragment hierarchy
– [Lowe 2006]
• HMAX
QUESTION: How do we find
(or learn) the filters?
Convolutional Nets and other Multistage Hubel-Wiesel Architectures
Convolutional Net Training: Gradient Back-Propagation
Convolutional Net Architecture for Hand-writing recognition
input: 1@32x32
Layer 1: 6@28x28 (after 5x5 convolution)
Layer 2: 6@14x14 (after 2x2 pooling/subsampling)
Layer 3: 12@10x10 (after 5x5 convolution)
Layer 4: 12@5x5 (after 2x2 pooling/subsampling)
Layer 5: 100@1x1 (after 5x5 convolution)
Layer 6: 10 outputs
Convolutional net for handwriting recognition (400,000 synapses)
Convolutional layers (simple cells): all units in a feature plane share the same weights
Pooling/subsampling layers (complex cells): for invariance to small distortions.
Supervised gradient-descent learning using back-propagation
The entire network is trained end-to-end. All the layers are trained simultaneously.
[LeCun et al. Proc IEEE, 1998]
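The feature-map sizes above follow from simple arithmetic: a 5x5 "valid" convolution shrinks each side by 4, and 2x2 pooling halves it. A small check script (mine, not from the slides):

```python
def conv5x5(n):      # 5x5 valid convolution: n -> n - 4
    return n - 4

def pool2x2(n):      # 2x2 pooling/subsampling: n -> n / 2
    return n // 2

n = 32
for step, op in [("Layer 1 (5x5 conv)", conv5x5), ("Layer 2 (2x2 pool)", pool2x2),
                 ("Layer 3 (5x5 conv)", conv5x5), ("Layer 4 (2x2 pool)", pool2x2),
                 ("Layer 5 (5x5 conv)", conv5x5)]:
    n = op(n)
    print(step, "->", f"{n}x{n}")
# prints 28x28, 14x14, 10x10, 5x5, 1x1, matching the feature-map sizes above
```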
MNIST Handwritten Digit Dataset
Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples
CLASSIFIER DEFORMATION PREPROCESSING ERROR (%) Reference
linear classifier (1-layer NN) none 12.00 LeCun et al. 1998
linear classifier (1-layer NN) deskewing 8.40 LeCun et al. 1998
pairwise linear classifier deskewing 7.60 LeCun et al. 1998
K-nearest-neighbors, (L2) none 3.09 Kenneth Wilder, U. Chicago
K-nearest-neighbors, (L2) deskewing 2.40 LeCun et al. 1998
K-nearest-neighbors, (L2) deskew, clean, blur 1.80 Kenneth Wilder, U. Chicago
K-NN L3, 2 pixel jitter deskew, clean, blur 1.22 Kenneth Wilder, U. Chicago
K-NN, shape context matching shape context feature 0.63 Belongie et al. IEEE PAMI 2002
40 PCA + quadratic classifier none 3.30 LeCun et al. 1998
1000 RBF + linear classifier none 3.60 LeCun et al. 1998
K-NN, Tangent Distance subsamp 16x16 pixels 1.10 LeCun et al. 1998
SVM, Gaussian Kernel none 1.40
SVM deg 4 polynomial deskewing 1.10 LeCun et al. 1998
Reduced Set SVM deg 5 poly deskewing 1.00 LeCun et al. 1998
Virtual SVM deg-9 poly Affine none 0.80 LeCun et al. 1998
V-SVM, 2-pixel jittered none 0.68 DeCoste and Scholkopf, MLJ2002
V-SVM, 2-pixel jittered deskewing 0.56 DeCoste and Scholkopf, MLJ2002
2-layer NN, 300 HU, MSE none 4.70 LeCun et al. 1998
2-layer NN, 300 HU, MSE, Affine none 3.60 LeCun et al. 1998
2-layer NN, 300 HU deskewing 1.60 LeCun et al. 1998
3-layer NN, 500+ 150 HU none 2.95 LeCun et al. 1998
3-layer NN, 500+ 150 HU Affine none 2.45 LeCun et al. 1998
3-layer NN, 500+ 300 HU, CE, reg none 1.53 Hinton, unpublished, 2005
2-layer NN, 800 HU, CE none 1.60 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Affine none 1.10 Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE Elastic none 0.90 Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE Elastic none 0.70 Simard et al., ICDAR 2003
Convolutional net LeNet-1 subsamp 16x16 pixels 1.70 LeCun et al. 1998
Convolutional net LeNet-4 none 1.10 LeCun et al. 1998
Convolutional net LeNet-5, none 0.95 LeCun et al. 1998
Conv. net LeNet-5, Affine none 0.80 LeCun et al. 1998
Boosted LeNet-4 Affine none 0.70 LeCun et al. 1998
Conv. net, CE Affine none 0.60 Simard et al., ICDAR 2003
Conv. net, CE Elastic none 0.40 Simard et al., ICDAR 2003
Results on MNIST handwritten Digits
Invariance and Robustness to Noise
Face Detection and Pose Estimation with Convolutional Nets
• Training: 52,850, 32x32 grey-level images of faces, 52,850 non-faces.
• Each sample: used 5 times with random variation in scale, in-plane rotation, brightness and
contrast.
• 2nd phase: half of the initial negative set was replaced by false positives of the initial version
of the detector.
Face Detection: Results
Data set →               TILTED          PROFILE         MIT+CMU
False positives/image →  4.42    26.9    0.47    3.36    0.5     1.28
Our Detector             90%     97%     67%     83%     83%     88%
Jones & Viola (tilted)   90%     95%     x               x
Jones & Viola (profile)  x               70%     83%     x
Rowley et al.            x               x               89%     96%
Schneiderman & Kanade    x               x               86%     93%
Face Detection and Pose Estimation: Results
Face Detection with a Convolutional Net
Yann LeCun
Industrial Applications of ConvNets
• AT&T/Lucent/NCR
– Check reading, OCR, handwriting recognition (deployed 1996)
• Vidient Inc
– Vidient Inc's “SmartCatch” system deployed in several airports
and facilities around the US for detecting intrusions, tailgating,
and abandoned objects (Vidient is a spin-off of NEC)
• NEC Labs
– Cancer cell detection, automotive applications, kiosks
• Google
– OCR, face and license plate removal from StreetView
• Microsoft
– OCR, handwriting recognition, speech detection
• France Telecom
– Face detection, HCI, cell phone-based applications
• Other projects: HRL (3D vision)....
Machine learning in computer vision
• Aug 12, Lecture 3: Neural Network
– Convolutional Nets for object recognition
– Unsupervised feature learning via Deep Belief Net
(slides courtesy of Honglak Lee (Stanford))
Machine Learning’s Success
• Data mining
– Web data mining
– Biomedical data mining
– Time series data mining
• Artificial Intelligence
– Computer vision
– Speech recognition
– Autonomous car driving
However, machine learning’s success has relied on having
a good feature representation of the data.
How can we develop good representations automatically?
The Learning Pipeline
[Figure: Input → Learning Algorithm. The input space plots Motorbikes vs. "Non"-Motorbikes directly on raw pixel coordinates (pixel 1 vs. pixel 2), where the classes are hard to separate.]
The Learning Pipeline
[Figure: Input → Low-level features → Learning Algorithm. "Feature engineering" maps pixels into a feature space (e.g., "wheel", "handle"), where Motorbikes and "Non"-Motorbikes separate much more easily.]
Computer vision features
SIFT, Spin image, HoG, RIFT, Textons, GLOH
Drawbacks of feature engineering
1. Needs expert knowledge
2. Time consuming hand-tuning
Feature learning from unlabeled data
• Main idea
– Finding underlying structure (cause) or statistical
correlation from the input data.
• Sparse coding [Olshausen and Field, 1997]
– Objective: Given input data {x}, search for a set of
bases {bj} such that
where aj are mostly zeros.
x ≈ Σ_j a_j b_j
Sparse coding on images
Natural Images Learned bases: “Edges”
= 0.8 * + 0.3 * + 0.5 *
x = 0.8 * b36
+ 0.3 * b42 + 0.5 * b65
[0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
New example
Lee, Ng, et al. 2007
Optimization
Given input data {x(1), …, x(m)}, we want to find good
bases {b1, …, bn}:
min_{b,a} Σ_i || x^(i) − Σ_j a_j^(i) b_j ||² + β Σ_i Σ_j || a_j^(i) ||_1
(reconstruction error)                  (sparsity penalty)
subject to || b_j || ≤ 1 for all j      (normalization constraint)
Solve by alternating minimization:
-- Keep b fixed, find optimal a.
-- Keep a fixed, find optimal b. (Lee & Ng, 2006)
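A rough sketch of the alternating scheme (illustrative code only; the ISTA-style soft-thresholding update for the coefficients and the simple column projection for the bases are standard choices I picked, not necessarily those of Lee & Ng, 2006):

```python
import numpy as np

def sparse_coding(X, n_bases=32, beta=0.1, n_outer=20, n_ista=50):
    """min_{a,B} sum_i ||x_i - B a_i||^2 + beta * ||a_i||_1, s.t. ||b_j|| <= 1."""
    rng = np.random.default_rng(0)
    d, m = X.shape                                 # X: d x m data matrix (columns = examples)
    B = rng.normal(size=(d, n_bases))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    A = np.zeros((n_bases, m))
    for _ in range(n_outer):
        # -- keep B fixed, find (approximately) optimal A via ISTA (soft thresholding)
        L = 2.0 * np.linalg.norm(B, 2) ** 2 + 1e-8  # Lipschitz constant of the quadratic part
        step = 1.0 / L
        for _ in range(n_ista):
            G = 2.0 * B.T @ (B @ A - X)             # gradient of the reconstruction term
            A = A - step * G
            A = np.sign(A) * np.maximum(np.abs(A) - step * beta, 0.0)
        # -- keep A fixed, find optimal B (least squares), then rescale so ||b_j|| <= 1
        B = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-6 * np.eye(n_bases))
        B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return B, A

X = np.random.default_rng(1).normal(size=(64, 200))   # e.g. 200 random 8x8 patches
B, A = sparse_coding(X)
print(B.shape, A.shape, float(np.mean(np.abs(A) < 1e-8)))  # fraction of exact zeros in A
```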
Evaluated on Caltech101 object category dataset.
Image classification
Previously reported results:
Fei-Fei et al, 2004: 16%
Berg et al., 2005: 17%
Holub et al., 2005: 40%
Serre et al., 2005: 35%
Berg et al, 2005: 48%
Lazebnik et al, 2006: 64%
Varma et al. 2008: 78%
Classification
Algorithm
(SVM)
Algorithm Accuracy
Baseline (Fei-Fei et al., 2004) 16%
PCA 37%
Our method 47%
36% error reduction
Input Image Features (coefficients)
Learned
bases
Learning Feature Hierarchy
Input image (pixels)
“Sparse coding”
(edges)
[Related work: Hinton, Bengio, LeCun, and others.]
DBN (Hinton et al., 2006) with additional sparseness constraint.
Higher layer
(Combinations
of edges)
Restricted Boltzmann Machine (RBM)
– Undirected, bipartite graphical model
– Inference is easy
– Training by approximating maximum likelihood
(Contrastive Divergence [Hinton, 2002])
visible nodes (data)
hidden nodes
Weights W (bases) encodes statistical
relationship between h and v.
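A bare-bones sketch of Contrastive Divergence (CD-1) training for a binary RBM (illustrative only; the learning rate, layer sizes, and "data" are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_vis, b_hid, rng, eta=0.1):
    """One Contrastive Divergence (CD-1) step for a binary RBM."""
    # Inference is easy: hidden units are conditionally independent given v.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One Gibbs step: reconstruct v, then recompute the hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # Approximate maximum-likelihood gradient: data statistics - reconstruction statistics.
    W += eta * (v0.T @ p_h0 - v1.T @ p_h1) / v0.shape[0]
    b_vis += eta * (v0 - v1).mean(axis=0)
    b_hid += eta * (p_h0 - p_h1).mean(axis=0)

rng = np.random.default_rng(0)
n_vis, n_hid = 36, 16
V = (rng.random((500, n_vis)) < 0.2).astype(float)   # fake binary "data"
W = 0.01 * rng.normal(size=(n_vis, n_hid))           # weights W (bases) between v and h
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
for epoch in range(20):
    cd1_update(V, W, b_vis, b_hid, rng)
print(W.shape)
```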
Deep Belief Network
• Deep Belief Network (DBN) [Hinton et al., 2006]
– Generative model with multiple hidden layers
– Successful applications
• Recognizing handwritten digits
• Learning motion capture data
• Collaborative filtering
– Bottom-up, layer-wise training
using Restricted Boltzmann machines
[Figure: a DBN stack: visible nodes (data) V, then hidden layers H1, H2, H3, with weight matrices W1, W2, W3 between successive layers.]
Sparse Restricted Boltzmann Machines [NIPS-2008]
• Main idea
– Constrain the hidden layer nodes to have “sparse”
average activation (similar to sparse coding)
– Regularize with a sparsity penalty
Objective: log-likelihood + a sparsity penalty that drives the average activation of each hidden unit toward a target sparsity.
Sparse representation from digits [NIPS 2008]
Sparse RBM bases
“pen-strokes”
Sparse RBMs often give readily interpretable features
and good discriminative power.
Training examples
Learning object representations
• Learning objects and parts in images
• Large image patches contain interesting higher-
level structures.
– E.g., object parts and full objects
Applying DBN to large images
[Figure: a DBN (weights W1, W2, W3) applied to an input image, with "Faces" as the target concept at the top layer.]
Problem:
- Typically, input dimension ~ 1,000 (30x30 pixels)
- Computationally intractable to learn from realistic image sizes (e.g. 200x200 pixels)
Convolutional architectures
 Weight sharing by convolution (e.g.,
[Lecun et al., 1989])
 “Max-pooling”
Invariance
Computational efficiency
Deterministic and feed-forward
 We develop convolutional Restricted
Boltzmann machine (CRBM).
 We define probabilistic max-pooling that combines bottom-up and top-down information.
[Diagram: Input → convolution (filter) → Detection layer → max over a 2x2 grid → Max-pooling layer → convolution → Detection layer → max over a 2x2 grid → Max-pooling layer.]
Convolutional RBM (CRBM) [ICML 2009]
[Figure: input data V (visible layer); a detection layer H of binary hidden nodes, obtained by convolving V with shared "filter" weights W^k; and a max-pooling layer P of binary pooling nodes. For each "filter" k, at most one hidden node per pooled block is active.]
Probabilistic max pooling
[Figure: a pooling node Y on top of detection nodes X1, X2, X3, X4; the Xj are stochastic binary and mutually exclusive.]
Collapse the 2^n configurations into n+1 configurations; this permits bottom-up and top-down inference.
Allowed configurations (Y; X1 X2 X3 X4): (1; 1 0 0 0), (1; 0 1 0 0), (1; 0 0 1 0), (1; 0 0 0 1), (0; 0 0 0 0)
Probabilistic max pooling: bottom-up inference
[Figure: the detection nodes X1 … X4 under pooling node Y receive bottom-up inputs I1 … I4, the outputs of the convolution W*V from below.]
The probability of each configuration can be written as a softmax function; sample from the resulting multinomial distribution.
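A small sketch of that bottom-up sampling step (my own illustration): given the bottom-up inputs to one pooling block, the n+1 allowed configurations get softmax probabilities, with the all-zero configuration corresponding to the pooling unit being off.

```python
import numpy as np

def sample_pooling_block(I, rng):
    """Sample (detection nodes X, pooling node Y) for one pooling block.

    I: bottom-up inputs to the n detection nodes in the block (e.g. n = 4 for 2x2).
    Allowed configurations: exactly one X_j on (Y = 1), or all off (Y = 0).
    """
    logits = np.concatenate((I, [0.0]))   # last entry = the "all off" configuration
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax over the n+1 configurations
    choice = rng.choice(len(p), p=p)      # sample from the multinomial
    X = np.zeros(len(I))
    if choice < len(I):
        X[choice] = 1.0                   # one detection node fires -> Y = 1
    Y = float(choice < len(I))
    return X, Y

rng = np.random.default_rng(0)
I = np.array([0.2, 2.5, -1.0, 0.3])       # e.g. outputs of W*V for one 2x2 block
print(sample_pooling_block(I, rng))
```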
Convolutional Deep Belief Networks
• Bottom-up (greedy), layer-wise training
– Train one layer (convolutional RBM) at a time.
• Inference (approximate)
– Undirected connections for all layers (Markov net)
[Related work: Salakhutdinov and Hinton, 2009]
– Block Gibbs sampling or mean-field
– Hierarchical probabilistic inference
Unsupervised learning of object-parts
Faces Cars Elephants Chairs
Object category classification (Caltech 101)
• Our model is comparable to the results using
state-of-the-art features (e.g., SIFT).
[Figure: bar chart of test accuracy (roughly 40–75%) on Caltech 101 for CDBN (first layer), CDBN (first+second layer), Ranzato et al. (2007), Mutch and Lowe (2006), Lazebnik et al. (2006), and Zhang et al. (2006).]
Unsupervised learning of object-parts
Trained from multiple classes
(cars, faces, motorbikes, airplanes)
Object-specific features
& shared features
“Grouping” the object parts
(highly specific)
Review of main contributions
 Developed efficient algorithms for unsupervised feature learning.
 Showed that unsupervised feature learning is useful for many
machine learning tasks.
 Object recognition, image segmentation, audio classification, text
classification, robotic perception, and others.
Object parts “Filling in”
Weaknesses & Criticisms
• Learning everything. Better to encode prior
knowledge about structure of images.
A: Compare with machine learning vs. linguists
debate in NLP.
• Results not yet competitive with best
engineered systems.
A: Agreed. True for some domains.

Weitere ähnliche Inhalte

Was ist angesagt?

Efficient Neural Network Architecture for Image Classfication
Efficient Neural Network Architecture for Image ClassficationEfficient Neural Network Architecture for Image Classfication
Efficient Neural Network Architecture for Image ClassficationYogendra Tamang
 
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.GeeksLab Odessa
 
ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network 신동 강
 
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...Seonho Park
 
Deep Learning behind Prisma
Deep Learning behind PrismaDeep Learning behind Prisma
Deep Learning behind Prismalostleaves
 
HardNet: Convolutional Network for Local Image Description
HardNet: Convolutional Network for Local Image DescriptionHardNet: Convolutional Network for Local Image Description
HardNet: Convolutional Network for Local Image DescriptionDmytro Mishkin
 
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016Universitat Politècnica de Catalunya
 
Neuroevolution and deep learing
Neuroevolution and deep learing Neuroevolution and deep learing
Neuroevolution and deep learing Accenture
 
Neural Network as a function
Neural Network as a functionNeural Network as a function
Neural Network as a functionTaisuke Oe
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksTianxiang Xiong
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Paper overview: "Deep Residual Learning for Image Recognition"
Paper overview: "Deep Residual Learning for Image Recognition"Paper overview: "Deep Residual Learning for Image Recognition"
Paper overview: "Deep Residual Learning for Image Recognition"Ilya Kuzovkin
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
MLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learningMLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learningCharles Deledalle
 
Review on cs231 part-2
Review on cs231 part-2Review on cs231 part-2
Review on cs231 part-2Jeong Choi
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural NetworkJunho Cho
 
Promises of Deep Learning
Promises of Deep LearningPromises of Deep Learning
Promises of Deep LearningDavid Khosid
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Vincenzo Lomonaco
 

Was ist angesagt? (20)

Efficient Neural Network Architecture for Image Classfication
Efficient Neural Network Architecture for Image ClassficationEfficient Neural Network Architecture for Image Classfication
Efficient Neural Network Architecture for Image Classfication
 
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.
AI&BigData Lab 2016. Александр Баев: Transfer learning - зачем, как и где.
 
ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network ujava.org Deep Learning with Convolutional Neural Network
ujava.org Deep Learning with Convolutional Neural Network
 
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...
Convolutional Neural Network for Alzheimer’s disease diagnosis with Neuroim...
 
Deep Learning behind Prisma
Deep Learning behind PrismaDeep Learning behind Prisma
Deep Learning behind Prisma
 
CNN Tutorial
CNN TutorialCNN Tutorial
CNN Tutorial
 
HardNet: Convolutional Network for Local Image Description
HardNet: Convolutional Network for Local Image DescriptionHardNet: Convolutional Network for Local Image Description
HardNet: Convolutional Network for Local Image Description
 
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
 
Neuroevolution and deep learing
Neuroevolution and deep learing Neuroevolution and deep learing
Neuroevolution and deep learing
 
Neural Network as a function
Neural Network as a functionNeural Network as a function
Neural Network as a function
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
 
Paper overview: "Deep Residual Learning for Image Recognition"
Paper overview: "Deep Residual Learning for Image Recognition"Paper overview: "Deep Residual Learning for Image Recognition"
Paper overview: "Deep Residual Learning for Image Recognition"
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
MLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learningMLIP - Chapter 3 - Introduction to deep learning
MLIP - Chapter 3 - Introduction to deep learning
 
Review on cs231 part-2
Review on cs231 part-2Review on cs231 part-2
Review on cs231 part-2
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural Network
 
Promises of Deep Learning
Promises of Deep LearningPromises of Deep Learning
Promises of Deep Learning
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 

Ähnlich wie Lecture3 xing fei-fei

MPerceptron
MPerceptronMPerceptron
MPerceptronbutest
 
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...AILABS Academy
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryKenta Oono
 
8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.ppt8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.pptssuser7e63fd
 
Back propagation
Back propagationBack propagation
Back propagationNagarajan
 
Convolution Neural Network Lecture Slides
Convolution Neural Network Lecture SlidesConvolution Neural Network Lecture Slides
Convolution Neural Network Lecture SlidesAdnanHaider234505
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsDrBaljitSinghKhehra
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsDrBaljitSinghKhehra
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsDrBaljitSinghKhehra
 
Dance With AI – An interactive dance learning platform
Dance With AI – An interactive dance learning platformDance With AI – An interactive dance learning platform
Dance With AI – An interactive dance learning platformIRJET Journal
 
System Monitoring
System MonitoringSystem Monitoring
System Monitoringbutest
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-feiTianlu Wang
 
On Security and Sparsity of Linear Classifiers for Adversarial Settings
On Security and Sparsity of Linear Classifiers for Adversarial SettingsOn Security and Sparsity of Linear Classifiers for Adversarial Settings
On Security and Sparsity of Linear Classifiers for Adversarial SettingsPluribus One
 
An Information-Theoretic Approach for Clonal Selection Algorithms
An Information-Theoretic Approach for Clonal Selection AlgorithmsAn Information-Theoretic Approach for Clonal Selection Algorithms
An Information-Theoretic Approach for Clonal Selection AlgorithmsMario Pavone
 

Ähnlich wie Lecture3 xing fei-fei (20)

MPerceptron
MPerceptronMPerceptron
MPerceptron
 
Perceptron
PerceptronPerceptron
Perceptron
 
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...
AILABS - Lecture Series - Is AI the New Electricity? Topic:- Classification a...
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistry
 
Neural networks
Neural networksNeural networks
Neural networks
 
8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.ppt8_Neural Networks in artificial intelligence.ppt
8_Neural Networks in artificial intelligence.ppt
 
Back propagation
Back propagationBack propagation
Back propagation
 
Convolution Neural Network Lecture Slides
Convolution Neural Network Lecture SlidesConvolution Neural Network Lecture Slides
Convolution Neural Network Lecture Slides
 
UofT_ML_lecture.pptx
UofT_ML_lecture.pptxUofT_ML_lecture.pptx
UofT_ML_lecture.pptx
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning Models
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning Models
 
Artificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning ModelsArtificial Neural Networks-Supervised Learning Models
Artificial Neural Networks-Supervised Learning Models
 
Dance With AI – An interactive dance learning platform
Dance With AI – An interactive dance learning platformDance With AI – An interactive dance learning platform
Dance With AI – An interactive dance learning platform
 
System Monitoring
System MonitoringSystem Monitoring
System Monitoring
 
Lecture7 xing fei-fei
Lecture7 xing fei-feiLecture7 xing fei-fei
Lecture7 xing fei-fei
 
Lecture15 xing
Lecture15 xingLecture15 xing
Lecture15 xing
 
ann-ics320Part4.ppt
ann-ics320Part4.pptann-ics320Part4.ppt
ann-ics320Part4.ppt
 
ann-ics320Part4.ppt
ann-ics320Part4.pptann-ics320Part4.ppt
ann-ics320Part4.ppt
 
On Security and Sparsity of Linear Classifiers for Adversarial Settings
On Security and Sparsity of Linear Classifiers for Adversarial SettingsOn Security and Sparsity of Linear Classifiers for Adversarial Settings
On Security and Sparsity of Linear Classifiers for Adversarial Settings
 
An Information-Theoretic Approach for Clonal Selection Algorithms
An Information-Theoretic Approach for Clonal Selection AlgorithmsAn Information-Theoretic Approach for Clonal Selection Algorithms
An Information-Theoretic Approach for Clonal Selection Algorithms
 

Mehr von Tianlu Wang

14 pro resolution
14 pro resolution14 pro resolution
14 pro resolutionTianlu Wang
 
13 propositional calculus
13 propositional calculus13 propositional calculus
13 propositional calculusTianlu Wang
 
12 adversal search
12 adversal search12 adversal search
12 adversal searchTianlu Wang
 
11 alternative search
11 alternative search11 alternative search
11 alternative searchTianlu Wang
 
21 situation calculus
21 situation calculus21 situation calculus
21 situation calculusTianlu Wang
 
20 bayes learning
20 bayes learning20 bayes learning
20 bayes learningTianlu Wang
 
19 uncertain evidence
19 uncertain evidence19 uncertain evidence
19 uncertain evidenceTianlu Wang
 
18 common knowledge
18 common knowledge18 common knowledge
18 common knowledgeTianlu Wang
 
17 2 expert systems
17 2 expert systems17 2 expert systems
17 2 expert systemsTianlu Wang
 
17 1 knowledge-based system
17 1 knowledge-based system17 1 knowledge-based system
17 1 knowledge-based systemTianlu Wang
 
16 2 predicate resolution
16 2 predicate resolution16 2 predicate resolution
16 2 predicate resolutionTianlu Wang
 
16 1 predicate resolution
16 1 predicate resolution16 1 predicate resolution
16 1 predicate resolutionTianlu Wang
 
09 heuristic search
09 heuristic search09 heuristic search
09 heuristic searchTianlu Wang
 
08 uninformed search
08 uninformed search08 uninformed search
08 uninformed searchTianlu Wang
 

Mehr von Tianlu Wang (20)

L7 er2
L7 er2L7 er2
L7 er2
 
L8 design1
L8 design1L8 design1
L8 design1
 
L9 design2
L9 design2L9 design2
L9 design2
 
14 pro resolution
14 pro resolution14 pro resolution
14 pro resolution
 
13 propositional calculus
13 propositional calculus13 propositional calculus
13 propositional calculus
 
12 adversal search
12 adversal search12 adversal search
12 adversal search
 
11 alternative search
11 alternative search11 alternative search
11 alternative search
 
10 2 sum
10 2 sum10 2 sum
10 2 sum
 
22 planning
22 planning22 planning
22 planning
 
21 situation calculus
21 situation calculus21 situation calculus
21 situation calculus
 
20 bayes learning
20 bayes learning20 bayes learning
20 bayes learning
 
19 uncertain evidence
19 uncertain evidence19 uncertain evidence
19 uncertain evidence
 
18 common knowledge
18 common knowledge18 common knowledge
18 common knowledge
 
17 2 expert systems
17 2 expert systems17 2 expert systems
17 2 expert systems
 
17 1 knowledge-based system
17 1 knowledge-based system17 1 knowledge-based system
17 1 knowledge-based system
 
16 2 predicate resolution
16 2 predicate resolution16 2 predicate resolution
16 2 predicate resolution
 
16 1 predicate resolution
16 1 predicate resolution16 1 predicate resolution
16 1 predicate resolution
 
15 predicate
15 predicate15 predicate
15 predicate
 
09 heuristic search
09 heuristic search09 heuristic search
09 heuristic search
 
08 uninformed search
08 uninformed search08 uninformed search
08 uninformed search
 

Kürzlich hochgeladen

定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一
定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一
定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一meq5nzfnk
 
Production documentary.ppt. x
Production documentary.ppt.               xProduction documentary.ppt.               x
Production documentary.ppt. x21005760
 
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...Hot Call Girls In Sector 58 (Noida)
 
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Number
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp NumberVip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Number
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Numberkumarajju5765
 
How To Fix Mercedes Benz Anti-Theft Protection Activation Issue
How To Fix Mercedes Benz Anti-Theft Protection Activation IssueHow To Fix Mercedes Benz Anti-Theft Protection Activation Issue
How To Fix Mercedes Benz Anti-Theft Protection Activation IssueTerry Sayther Automotive
 
John deere 425 445 455 Maitenance Manual
John deere 425 445 455 Maitenance ManualJohn deere 425 445 455 Maitenance Manual
John deere 425 445 455 Maitenance ManualExcavator
 
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdf
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdfSales & Marketing Alignment_ How to Synergize for Success.pptx.pdf
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdfAggregage
 
Russian Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...
Russian  Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...Russian  Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...
Russian Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...shivangimorya083
 
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girls
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile GirlsVip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girls
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girlsshivangimorya083
 
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhi
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhiHauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhi
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhiHot Call Girls In Sector 58 (Noida)
 
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
audience feedback draft 3.pptxxxxxxxxxxx
audience feedback draft 3.pptxxxxxxxxxxxaudience feedback draft 3.pptxxxxxxxxxxx
audience feedback draft 3.pptxxxxxxxxxxxMollyBrown86
 
Dubai Call Girls Size E6 (O525547819) Call Girls In Dubai
Dubai Call Girls  Size E6 (O525547819) Call Girls In DubaiDubai Call Girls  Size E6 (O525547819) Call Girls In Dubai
Dubai Call Girls Size E6 (O525547819) Call Girls In Dubaikojalkojal131
 
Chapter-1.3-Four-Basic-Computer-periods.pptx
Chapter-1.3-Four-Basic-Computer-periods.pptxChapter-1.3-Four-Basic-Computer-periods.pptx
Chapter-1.3-Four-Basic-Computer-periods.pptxAnjieVillarba1
 
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls Services
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls ServicesBandra Escorts, (*Pooja 09892124323), Bandra Call Girls Services
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls ServicesPooja Nehwal
 
Innovating Manufacturing with CNC Technology
Innovating Manufacturing with CNC TechnologyInnovating Manufacturing with CNC Technology
Innovating Manufacturing with CNC Technologyquickpartslimitlessm
 
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...shivangimorya083
 

Kürzlich hochgeladen (20)

定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一
定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一
定制多伦多大学毕业证(UofT毕业证)成绩单(学位证)原版一比一
 
Production documentary.ppt. x
Production documentary.ppt.               xProduction documentary.ppt.               x
Production documentary.ppt. x
 
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...
Alina 7042364481 Call Girls Service Pochanpur Colony - independent Pochanpur ...
 
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Mayur Vihar 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls In Kirti Nagar 7042364481 Escort Service 24x7 Delhi
Call Girls In Kirti Nagar 7042364481 Escort Service 24x7 DelhiCall Girls In Kirti Nagar 7042364481 Escort Service 24x7 Delhi
Call Girls In Kirti Nagar 7042364481 Escort Service 24x7 Delhi
 
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Number
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp NumberVip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Number
Vip Hot Call Girls 🫤 Mahipalpur ➡️ 9711199171 ➡️ Delhi 🫦 Whatsapp Number
 
How To Fix Mercedes Benz Anti-Theft Protection Activation Issue
How To Fix Mercedes Benz Anti-Theft Protection Activation IssueHow To Fix Mercedes Benz Anti-Theft Protection Activation Issue
How To Fix Mercedes Benz Anti-Theft Protection Activation Issue
 
John deere 425 445 455 Maitenance Manual
John deere 425 445 455 Maitenance ManualJohn deere 425 445 455 Maitenance Manual
John deere 425 445 455 Maitenance Manual
 
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdf
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdfSales & Marketing Alignment_ How to Synergize for Success.pptx.pdf
Sales & Marketing Alignment_ How to Synergize for Success.pptx.pdf
 
Russian Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...
Russian  Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...Russian  Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...
Russian Call Girls Delhi Indirapuram {9711199171} Aarvi Gupta ✌️Independent ...
 
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girls
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile GirlsVip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girls
Vip Hot🥵 Call Girls Delhi Delhi {9711199012} Avni Thakur 🧡😘 High Profile Girls
 
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhi
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhiHauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhi
Hauz Khas Call Girls ☎ 7042364481 independent Escorts Service in delhi
 
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Vikaspuri 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
audience feedback draft 3.pptxxxxxxxxxxx
audience feedback draft 3.pptxxxxxxxxxxxaudience feedback draft 3.pptxxxxxxxxxxx
audience feedback draft 3.pptxxxxxxxxxxx
 
Dubai Call Girls Size E6 (O525547819) Call Girls In Dubai
Dubai Call Girls  Size E6 (O525547819) Call Girls In DubaiDubai Call Girls  Size E6 (O525547819) Call Girls In Dubai
Dubai Call Girls Size E6 (O525547819) Call Girls In Dubai
 
Chapter-1.3-Four-Basic-Computer-periods.pptx
Chapter-1.3-Four-Basic-Computer-periods.pptxChapter-1.3-Four-Basic-Computer-periods.pptx
Chapter-1.3-Four-Basic-Computer-periods.pptx
 
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls Services
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls ServicesBandra Escorts, (*Pooja 09892124323), Bandra Call Girls Services
Bandra Escorts, (*Pooja 09892124323), Bandra Call Girls Services
 
Innovating Manufacturing with CNC Technology
Innovating Manufacturing with CNC TechnologyInnovating Manufacturing with CNC Technology
Innovating Manufacturing with CNC Technology
 
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...
Hot And Sexy 🥵 Call Girls Delhi Daryaganj {9711199171} Ira Malik High class G...
 
Stay Cool and Compliant: Know Your Window Tint Laws Before You Tint
Stay Cool and Compliant: Know Your Window Tint Laws Before You TintStay Cool and Compliant: Know Your Window Tint Laws Before You Tint
Stay Cool and Compliant: Know Your Window Tint Laws Before You Tint
 

Lecture3 xing fei-fei

  • 1. © Eric Xing @ CMU, 2006-2010 Machine Learning Neural Networks Eric Xing Lecture 3, August 12, 2010 Reading: Chap. 5 CB
  • 2. © Eric Xing @ CMU, 2006-2010 Learning highly non-linear functions f: X  Y  f might be non-linear function  X (vector of) continuous and/or discrete vars  Y (vector of) continuous and/or discrete vars The XOR gate Speech recognition
  • 3. © Eric Xing @ CMU, 2006-2010  From biological neuron to artificial neuron (perceptron)  Activation function  Artificial neuron networks  supervised learning  gradient descent Perceptron and Neural Nets Soma Soma Synapse Synapse Dendrites Axon Synapse Dendrites Axon Threshold Inputs x1 x2 Output Y∑ Hard Limiter w2 w1 Linear Combiner θ ∑ = = n i iiwxX 1    ω<− ω≥+ = 0 0 X X Y if, if, 1 1 Input Layer Output Layer Middle Layer InputSignals OutputSignals
  • 4. © Eric Xing @ CMU, 2006-2010 Connectionist Models  Consider humans:  Neuron switching time ~ 0.001 second  Number of neurons ~ 1010  Connections per neuron ~ 104-5  Scene recognition time ~ 0.1 second  100 inference steps doesn't seem like enough  much parallel computation  Properties of artificial neural nets (ANN)  Many neuron-like threshold switching units  Many weighted interconnections among units  Highly parallel, distributed processes Synapses Axon Dendrites Synapses + + + - - (weights) Nodes
  • 5. © Eric Xing @ CMU, 2006-2010 Male Age Temp WBC Pain Intensity Pain Duration 37 10 11 20 1 adjustable weights 0 1 0 0000 AppendicitisDiverticulitis Perforated Non-specific Cholecystitis Small Bowel PancreatitisObstructionPain Duodenal Ulcer Abdominal Pain Perceptron
  • 6. © Eric Xing @ CMU, 2006-2010 0.8 Myocardial Infarction “Probability” of MI 112 150 MaleAgeSmokerECG: STPain Intensity 4 Pain Duration Elevation Myocardial Infarction Network
  • 7. © Eric Xing @ CMU, 2006-2010 The "Driver" Network
  • 8. © Eric Xing @ CMU, 2006-2010 Perceptrons [Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the ∆ rule changes the weights to decrease the error.]
  • 9. © Eric Xing @ CMU, 2006-2010 Perceptrons  Input to unit i: a_i = measured value of variable i  Input to unit j: a_j = Σ_i w_ij a_i  Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j)) [Figure: input units i feeding output units j.]
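To make the unit on this slide concrete, here is a minimal Python sketch of a sigmoid unit computing o_j = 1/(1 + e^−(Σ_i w_ij a_i + θ_j)); the input values, weights and bias are made-up illustrations, not taken from the lecture.

```python
import numpy as np

def sigmoid_unit_output(a, w, theta=0.0):
    """Output of a sigmoid unit: o = 1 / (1 + exp(-(sum_i w_i * a_i + theta)))."""
    net = np.dot(w, a) + theta          # weighted sum of inputs plus bias/threshold
    return 1.0 / (1.0 + np.exp(-net))   # logistic squashing function

# Example: three inputs with arbitrary weights (illustrative values only)
a = np.array([0.5, 1.0, -0.3])
w = np.array([0.2, -0.4, 0.7])
print(sigmoid_unit_output(a, w, theta=0.1))
```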
  • 10. © Eric Xing @ CMU, 2006-2010 Jargon Pseudo-Correspondence  Independent variable = input variable  Dependent variable = output variable  Coefficients = “weights”  Estimates = “targets” Logistic Regression Model (the sigmoid unit) [Figure: inputs Age = 34, Gender = 1, Stage = 4 with coefficients a, b, c (here 5, 8, 4) feed a Σ unit whose output is the “probability of being alive” = 0.6; independent variables x1, x2, x3, dependent variable p, prediction.]
  • 11. © Eric Xing @ CMU, 2006-2010 The perceptron learning algorithm  Recall the nice property of the sigmoid function  Consider the regression problem f: X → Y, for scalar Y:  Let's maximize the conditional data likelihood
  • 12. © Eric Xing @ CMU, 2006-2010 Gradient Descent x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
  • 13. © Eric Xing @ CMU, 2006-2010 The perceptron learning rules x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i Batch mode: Do until converged: 1. compute the gradient ∇E_D[w] 2. update w ← w − η ∇E_D[w] Incremental mode: Do until converged:  For each training example d in D: 1. compute the gradient ∇E_d[w] 2. update w ← w − η ∇E_d[w] where E_D[w] is the training error summed over all of D and E_d[w] is the error on the single example d.
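A hedged sketch of the two training modes described above, for a single sigmoid unit trained on squared error; the learning rate η = 0.1 and the error E = ½ Σ_d (t_d − o_d)² are my assumptions of the standard setup, not quoted from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(w, X, t, eta=0.1):
    """One batch-mode step: compute the gradient of E_D over all examples, then update once."""
    o = sigmoid(X @ w)                       # outputs o_d for every training example d
    grad = -X.T @ ((t - o) * o * (1 - o))    # dE_D/dw for E_D = 1/2 * sum_d (t_d - o_d)^2
    return w - eta * grad

def incremental_update(w, X, t, eta=0.1):
    """One incremental (stochastic) pass: update after each example d in D."""
    for x_d, t_d in zip(X, t):
        o_d = sigmoid(x_d @ w)
        grad_d = -(t_d - o_d) * o_d * (1 - o_d) * x_d
        w = w - eta * grad_d
    return w
```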
  • 14. © Eric Xing @ CMU, 2006-2010 MLE vs MAP  Maximum conditional likelihood estimate  Maximum a posteriori estimate
  • 15. © Eric Xing @ CMU, 2006-2010 What decision surface does a perceptron define? NAND truth table: x=0, y=0 → Z=1; x=0, y=1 → Z=1; x=1, y=0 → Z=1; x=1, y=1 → Z=0. [Figure: a single threshold unit y with inputs x1, x2, weights w1, w2 and threshold θ = 0.5; f(a) = 1 for a > θ, 0 for a ≤ θ.] Required behaviour: f(0·w1 + 0·w2) = 1, f(0·w1 + 1·w2) = 1, f(1·w1 + 0·w2) = 1, f(1·w1 + 1·w2) = 0. Some candidate values for (w1, w2) listed on the slide: (0.20, 0.20), (0.25, 0.40), (0.35, 0.40), (0.30, 0.20).
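As a quick check of the threshold-unit convention on this slide (fire 1 when x1·w1 + x2·w2 > θ), the snippet below tests whether a given weight/threshold setting reproduces the NAND truth table; the particular values shown passing the test are my own illustration, not the candidates listed on the slide.

```python
def threshold_unit(x1, x2, w1, w2, theta):
    """Hard-threshold unit from the slide: outputs 1 if x1*w1 + x2*w2 > theta, else 0."""
    return 1 if x1 * w1 + x2 * w2 > theta else 0

NAND = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def implements(w1, w2, theta, table):
    """True if the unit reproduces every row of the given truth table."""
    return all(threshold_unit(x1, x2, w1, w2, theta) == z for (x1, x2), z in table.items())

# One weight/threshold setting that reproduces NAND under the "a > theta" convention
# (values chosen here purely for illustration):
print(implements(-0.4, -0.4, -0.5, NAND))   # True
```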
  • 16. © Eric Xing @ CMU, 2006-2010 What decision surface does a perceptron define? XOR truth table (labelled NAND on the slide, but the outputs are XOR): x=0, y=0 → Z=0; x=0, y=1 → Z=1; x=1, y=0 → Z=1; x=1, y=1 → Z=0. Required behaviour: f(0·w1 + 0·w2) = 0, f(0·w1 + 1·w2) = 1, f(1·w1 + 0·w2) = 1, f(1·w1 + 1·w2) = 0. No values of w1, w2 are listed: a single threshold unit y with weights w1, w2, θ = 0.5 and f(a) = 1 for a > θ, 0 for a ≤ θ cannot realize this function.
  • 17. © Eric Xing @ CMU, 2006-2010 What decision surface does a perceptron define? Same XOR truth table: x=0, y=0 → Z=0; x=0, y=1 → Z=1; x=1, y=0 → Z=1; x=1, y=1 → Z=0. [Figure: a two-layer network of threshold units with weights w1…w6 and θ = 0.5 for all units; f(a) = 1 for a > θ, 0 for a ≤ θ.] A possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, −0.6, −0.7, 0.8, 1, 1).
  • 18. © Eric Xing @ CMU, 2006-2010 Non Linear Separation [Figure: the input space spanned by Cough / No cough and Headache / No headache is carved into regions labelled No disease, Meningitis, Flu, Pneumonia, and No treatment / Treatment; the corners are coded as binary patterns (00, 01, 10, 11 and 000 … 111).]
  • 19. © Eric Xing @ CMU, 2006-2010 Neural Network Model [Figure: inputs (independent variables) Age = 34, Gender = 2, Stage = 4 connect through weights (.6, .5, .8, .2, .1, .3, .7, .2) to a hidden layer of Σ units, whose outputs connect through weights (.4, .2) to an output Σ unit; the dependent variable is the “probability of being alive”, here 0.6 (the prediction).]
  • 20. © Eric Xing @ CMU, 2006-2010 “Combined logistic models” [Figure: the same network with one path of weights (.6, .5, .8, .1, .7) highlighted from the inputs through a hidden unit to the output, “probability of being alive” = 0.6.]
  • 21. © Eric Xing @ CMU, 2006-2010 [Figure: the same network with a second path of weights (.5, .8, .2, .3, .2) highlighted, “probability of being alive” = 0.6.]
  • 22. © Eric Xing @ CMU, 2006-2010 [Figure: the same network with Gender = 1 and a third combination of weights (.6, .5, .8, .2, .1, .3, .7, .2) highlighted, “probability of being alive” = 0.6.]
  • 23. © Eric Xing @ CMU, 2006-2010 Not really, no target for hidden units... [Figure: the full network again, raising the question of how to train the hidden-layer weights (.4, .2) when no targets are given for the hidden units.]
  • 24. © Eric Xing @ CMU, 2006-2010 Perceptrons [Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the ∆ rule changes the weights to decrease the error.]
  • 25. © Eric Xing @ CMU, 2006-2010 Hidden Units and Backpropagation [Figure: input units, hidden units and output units; error = what we wanted − what we got; the ∆ rule is applied at the output layer and propagated back to the hidden layer.]
  • 26. © Eric Xing @ CMU, 2006-2010 Backpropagation Algorithm  Initialize all weights to small random numbers. Until convergence, do: 1. Input the training example to the network and compute the network outputs. 2. For each output unit k: δ_k ← o_k (1 − o_k)(t_k − o_k). 3. For each hidden unit h: δ_h ← o_h (1 − o_h) Σ_k w_{h,k} δ_k. 4. Update each network weight: w_{i,j} ← w_{i,j} + η δ_j x_{i,j}. Here x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i.
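The listing below is a minimal NumPy sketch of the stochastic backpropagation loop sketched on this slide, for a network with one hidden layer of sigmoid units and squared error; layer sizes, the learning rate, and the omission of bias terms are simplifying assumptions of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W1, W2, eta=0.05):
    """One pass of stochastic backprop for a net with one hidden sigmoid layer.

    X: (m, n_in) inputs, T: (m, n_out) targets,
    W1: (n_in, n_hid), W2: (n_hid, n_out) weight matrices (biases omitted for brevity).
    """
    for x, t in zip(X, T):
        # Forward pass: hidden and output activations
        h = sigmoid(x @ W1)
        o = sigmoid(h @ W2)
        # Error terms: delta_k for output units, delta_h for hidden units
        delta_o = o * (1 - o) * (t - o)
        delta_h = h * (1 - h) * (W2 @ delta_o)
        # Weight updates: w_ij <- w_ij + eta * delta_j * x_ij
        W2 += eta * np.outer(h, delta_o)
        W1 += eta * np.outer(x, delta_h)
    return W1, W2

# Initialize all weights to small random numbers, as the slide prescribes
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.05, 0.05, size=(3, 4))
W2 = rng.uniform(-0.05, 0.05, size=(4, 2))
```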
  • 27. © Eric Xing @ CMU, 2006-2010 More on Backpropagation  It is doing gradient descent over the entire network weight vector  Easily generalized to arbitrary directed graphs  Will find a local, not necessarily global, error minimum  In practice it often works well (can run multiple times)  Often includes a weight momentum term α  Minimizes error over the training examples  Will it generalize well to subsequent testing examples?  Training can take thousands of iterations → very slow!  Using the network after training is very fast
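The “weight momentum α” mentioned above is usually implemented by blending each update with a fraction of the previous one; a small hedged sketch (the value α = 0.9 is a typical choice, not taken from the lecture):

```python
import numpy as np

def momentum_step(W, grad, velocity, eta=0.05, alpha=0.9):
    """Gradient step with momentum: delta_w(n) = eta*grad + alpha*delta_w(n-1)."""
    velocity = eta * grad + alpha * velocity   # remember a fraction of the previous update
    return W - velocity, velocity

W = np.zeros((3, 4))
velocity = np.zeros_like(W)
grad = 0.1 * np.ones_like(W)                    # illustrative gradient
W, velocity = momentum_step(W, grad, velocity)
```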
  • 28. © Eric Xing @ CMU, 2006-2010 Learning Hidden Layer Representation  A network:  A target function:  Can this be learned?
  • 29. © Eric Xing @ CMU, 2006-2010 Learning Hidden Layer Representation  A network:  Learned hidden layer representation:
  • 30. © Eric Xing @ CMU, 2006-2010 Training
  • 31. © Eric Xing @ CMU, 2006-2010 Training
  • 32. © Eric Xing @ CMU, 2006-2010 Expressive Capabilities of ANNs  Boolean functions:  Every Boolean function can be represented by a network with a single hidden layer  But it might require a number of hidden units exponential in the number of inputs  Continuous functions:  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
  • 33. © Eric Xing @ CMU, 2006-2010 Application: ANN for Face Reco.  The model  The learned hidden unit weights
  • 34. © Eric Xing @ CMU, 2006-2010 Regression vs. Neural Networks [Figure: a network over inputs X1, X2, X3 whose internal units correspond to terms such as “X1”, “X1X3”, “X1X2X3”; with three inputs there are (2^3 − 1) possible combinations X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3, as in the regression Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...]
  • 35. © Eric Xing @ CMU, 2006-2010 Minimizing the Error [Figure: error surface over the weights, showing the initial weight w_initial with its initial error, a negative derivative giving a positive change in w, and the trained weight w_trained at a local minimum with the final error.]
  • 36. © Eric Xing @ CMU, 2006-2010 Overfitting in Neural Nets [Figure: left, CHD vs. age with a “real” model and an overfitted model; right, error vs. training cycles, where training error keeps falling while holdout error rises once the model overfits.]
  • 37. © Eric Xing @ CMU, 2006-2010 Alternative Error Functions  Penalize large weights (e.g. add a term proportional to Σ_{i,j} w_{ji}^2 to the training error)  Training on target slopes as well as values  Tie together weights  E.g., in phoneme recognition
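One common way to “penalize large weights” is to add γ Σ w² to the error, which simply shrinks every weight toward zero at each step (weight decay); the sketch below assumes that form and an arbitrary γ, since the exact penalty used in the lecture is not shown.

```python
import numpy as np

def weight_decay_step(W, grad_E, eta=0.05, gamma=1e-4):
    """Gradient step on E(w) + gamma * sum(w^2): the penalty contributes 2*gamma*w to the
    gradient, i.e. every weight is pulled slightly toward zero ("weight decay")."""
    return W - eta * (grad_E + 2.0 * gamma * W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
print(weight_decay_step(W, grad_E=np.zeros_like(W)))  # weights shrink even with zero data gradient
```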
  • 38. © Eric Xing @ CMU, 2006-2010 Artificial neural networks – what you should know  Highly expressive non-linear functions  Highly parallel network of logistic function units  Minimizing sum of squared training errors  Gives MLE estimates of network weights if we assume zero-mean Gaussian noise on output values  Minimizing sum of squared errors plus squared weights (regularization)  Gives MAP estimates assuming weight priors are zero-mean Gaussian  Gradient descent as training procedure  How to derive your own gradient descent procedure  Discover useful representations at hidden units  Local minima are the greatest problem  Overfitting, regularization, early stopping
  • 39. Neural Network Approaches for Visual Recognition L. Fei-Fei Computer Science Dept. Stanford University
  • 40. Machine learning in computer vision • Aug 12, Lecture 3: Neural Network – Convolutional Nets for object recognition – Unsupervised feature learning via Deep Belief Net 7 August 2010 40 L. Fei-Fei, Dragon Star 2010, Stanford
  • 41. Machine learning in computer vision • Aug 12, Lecture 3: Neural Network – Convolutional Nets for object recognition (slides courtesy of Yann LeCun (NYU)) – Unsupervised feature learning via Deep Belief Net 7 August 2010 41 L. Fei-Fei, Dragon Star 2010, Stanford [Reference paper: Yann LeCun et al., Proc. IEEE, 1998]
  • 42. Mammalian visual pathway is hierarchical • The ventral (recognition) pathway in the visual cortex – Retina → LGN → V1 → V2 → V4 → PIT → AIT (80-100 ms) [picture from Simon Thorpe] 7 August 2010 42 L. Fei-Fei, Dragon Star 2010, Stanford
  • 43. • Efficient Learning Algorithms for multi-stage “Deep Architectures” – Multi-stage learning is difficult • Learning good internal representations of the world (features) – Hierarchy of features that capture the relevant information and eliminates irrelevant variabilities • Learning with unlabeled samples (unsupervised learning) – Animals can learn vision by just looking at the world – No supervision is required • Deep learning algorithms that can be applied to other modalities – If we can learn feature hierarchies for vision, we can learn feature hierarchies for audition, natural language, .... Some challenges for machine learning in vision
  • 44. • The raw input is pre-processed through a hand-crafted feature extractor • The features are not learned • The trainable classifier is often generic (task independent), and “simple” (linear classifier, kernel machine, nearest neighbor,.....) • The most common Machine Learning architecture: the Kernel Machine “Simple” Trainable Classifier Pre-processing / Feature Extraction this part is mostly hand-crafted Internal Representation 7 August 2010 44 L. Fei-Fei, Dragon Star 2010, Stanford The traditional “shallow” architecture for recognition
  • 45. • In Language: hierarchy in syntax and semantics –Words->Parts of Speech->Sentences->Text –Objects,Actions,Attributes...-> Phrases -> Statements -> Stories • In Vision: part-whole hierarchy –Pixels->Edges->Textons->Parts->Objects->Scenes Trainable Feature Extractor Trainable Feature Extractor Trainable Classifier Good representations are hierarchical 7 August 2010 45 L. Fei-Fei, Dragon Star 2010, Stanford
  • 46. • Deep Learning: learning a hierarchy of internal representations • From low-level features to mid-level invariant representations, to object identities • Representations are increasingly invariant as we go up the layers • using multiple stages gets around the specificity/invariance dilemma Trainable Feature Extractor Trainable Feature Extractor Trainable Classifier Learned Internal Representation “Deep” learning: learning hierarchical representations 7 August 2010 46 L. Fei-Fei, Dragon Star 2010, Stanford
  • 47. • We can approximate any function as closely as we want with a shallow architecture. Why would we need deep ones? – kernel machines and 2-layer neural nets are “universal”. • Deep learning machines • Deep machines are more efficient for representing certain classes of functions, particularly those involved in visual recognition – they can represent more complex functions with less “hardware” • We need an efficient parameterization of the class of functions that are useful for “AI” tasks. 7 August 2010 47 L. Fei-Fei, Dragon Star 2010, Stanford Do we really need deep architecture?
  • 48. Filter Bank Spatial Pooling Non- Linearity • Multiple Stages • Each Stage is composed of – A bank of local filters (convolutions) – A non-linear layer (may include harsh non-linearities, such as rectification, contrast normalization, etc...). – A feature pooling layer • Multiple stages can be stacked to produce high-level representations – Each stage makes the representation more global, and more invariant • The systems can be trained with a combination of unsupervised and supervised methods ClassifierFilter Bank Spatial Pooling Non- Linearity Hierarchical/deep architectures for vision 7 August 2010 48 L. Fei-Fei, Dragon Star 2010, Stanford
  • 49. • [Hubel & Wiesel 1962]: – simple cells detect local features – complex cells “pool” the outputs of simple cells within a retinotopic neighborhood. pooling subsampling “Simple cells” “Complex cells” Multiple convolutions Retinotopic Feature Maps Filtering+NonLinearity+Pooling = 1 stage of a Convolutional Net 7 August 2010 49 L. Fei-Fei, Dragon Star 2010, Stanford
  • 50. • Biologically-inspired models of low-level feature extraction – Inspired by [Hubel and Wiesel 1962] • Only 3 types of operations needed for traditional ConvNets: – Convolution/Filtering – Pointwise Non-Linearity – Pooling/Subsampling Filter Bank Spatial Pooling Non- Linearity Feature extraction by filtering and pooling 7 August 2010 50 L. Fei-Fei, Dragon Star 2010, Stanford
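A minimal NumPy sketch of one such stage (filter bank → pointwise non-linearity → pooling/subsampling); the filter values, the tanh non-linearity, and the 2x2 max pooling are illustrative assumptions, and the “convolution” is implemented as cross-correlation, as is conventional in ConvNets.

```python
import numpy as np

def filter_response(image, kernel):
    """'Valid' sliding-window filtering of a single-channel image with one filter
    (cross-correlation, i.e. no kernel flip)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling / subsampling over size x size neighborhoods."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

def convnet_stage(image, filters):
    """One stage: filter bank -> pointwise non-linearity -> spatial pooling."""
    return [max_pool(np.tanh(filter_response(image, f))) for f in filters]

image = np.random.rand(32, 32)
filters = [np.random.randn(5, 5) * 0.1 for _ in range(6)]   # a small random filter bank
feature_maps = convnet_stage(image, filters)                # six 14x14 pooled feature maps
```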
  • 51. Convolutional Network: Multi-Stage Trainable Architecture Hierarchical Architecture Representations are more global, more invariant, and more abstract as we go up the layers Alternated Layers of Filtering and Spatial Pooling Filtering detects conjunctions of features Pooling computes local disjunctions of features Fully Trainable All the layers are trainable Convolutions, Filtering Pooling Subsampling Convolutions, Filtering Pooling Subsampling Convolutions, Filtering Convolutions, Classification 7 August 2010 51 L. Fei-Fei, Dragon Star 2010, Stanford
  • 52. 7 August 2010 52 L. Fei-Fei, Dragon Star 2010, Stanford Convolutional network architecture
  • 53. • Building a complete artificial vision system: – Stack multiple stages of simple cells / complex cells layers – Higher stages compute more global, more invariant features – Stick a classification layer on top – [Fukushima 1971-1982] • neocognitron – [LeCun 1988-2007] • convolutional net – [Poggio 2002-2006] • HMAX – [Ullman 2002-2006] • fragment hierarchy – [Lowe 2006] • HMAX QUESTION: How do we find (or learn) the filters? Convolutional Nets and other Multistage Hubel-Wiesel Architectures 7 August 2010 53 L. Fei-Fei, Dragon Star 2010, Stanford
  • 54. Convolutional Net Training: Gradient Back-Propagation 7 August 2010 54 L. Fei-Fei, Dragon Star 2010, Stanford
  • 55. 7 August 2010 55 L. Fei-Fei, Dragon Star 2010, Stanford
  • 56. 7 August 2010 56 L. Fei-Fei, Dragon Star 2010, Stanford
  • 57. 7 August 2010 57 L. Fei-Fei, Dragon Star 2010, Stanford
  • 58. 7 August 2010 59 L. Fei-Fei, Dragon Star 2010, Stanford
  • 59. Convolutional Net Architecture for Hand-writing recognition input 1@32x32 Layer 1 6@28x28 Layer 2 6@14x14 Layer 3 12@10x10 Layer 4 12@5x5 Layer 5 100@1x1 10 5x5 convolution 5x5 convolution 5x5 convolution2x2 pooling/ subsampling 2x2 pooling/ subsampling Layer 6: 10 Convolutional net for handwriting recognition (400,000 synapses) Convolutional layers (simple cells): all units in a feature plane share the same weights Pooling/subsampling layers (complex cells): for invariance to small distortions. Supervised gradient-descent learning using back-propagation The entire network is trained end-to-end. All the layers are trained simultaneously. [LeCun et al. Proc IEEE, 1998] 7 August 2010 60 L. Fei-Fei, Dragon Star 2010, Stanford
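The layer sizes on this slide (1@32x32 → 6@28x28 → 6@14x14 → 12@10x10 → 12@5x5 → 100 → 10) can be written down as a modern PyTorch module; this is a rough reconstruction under the stated 5x5 convolutions and 2x2 subsampling, not the original LeNet code.

```python
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    """Rough reconstruction of the slide's architecture:
    1@32x32 -> 6@28x28 -> 6@14x14 -> 12@10x10 -> 12@5x5 -> 100 -> 10."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 6@28x28
            nn.AvgPool2d(2),                               # 6@14x14 (subsampling)
            nn.Conv2d(6, 12, kernel_size=5), nn.Tanh(),    # 12@10x10
            nn.AvgPool2d(2),                               # 12@5x5
            nn.Conv2d(12, 100, kernel_size=5), nn.Tanh(),  # 100@1x1
        )
        self.classifier = nn.Linear(100, 10)               # 10 output classes (digits)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

net = LeNetLike()
print(net(torch.zeros(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```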
  • 60. MNIST Handwritten Digit Dataset Handwritten Digit Dataset MNIST: 60,000 training samples, 10,000 test samples 7 August 2010 61 L. Fei-Fei, Dragon Star 2010, Stanford
  • 61. Results on MNIST handwritten digits (error rates of various classifiers; the deformation column gives the training-set augmentation, if any):
CLASSIFIER | DEFORMATION | PREPROCESSING | ERROR (%) | REFERENCE
linear classifier (1-layer NN) | – | none | 12.00 | LeCun et al. 1998
linear classifier (1-layer NN) | – | deskewing | 8.40 | LeCun et al. 1998
pairwise linear classifier | – | deskewing | 7.60 | LeCun et al. 1998
K-nearest-neighbors (L2) | – | none | 3.09 | Kenneth Wilder, U. Chicago
K-nearest-neighbors (L2) | – | deskewing | 2.40 | LeCun et al. 1998
K-nearest-neighbors (L2) | – | deskew, clean, blur | 1.80 | Kenneth Wilder, U. Chicago
K-NN L3, 2-pixel jitter | – | deskew, clean, blur | 1.22 | Kenneth Wilder, U. Chicago
K-NN, shape context matching | – | shape context feature | 0.63 | Belongie et al., IEEE PAMI 2002
40 PCA + quadratic classifier | – | none | 3.30 | LeCun et al. 1998
1000 RBF + linear classifier | – | none | 3.60 | LeCun et al. 1998
K-NN, Tangent Distance | – | subsamp. 16x16 pixels | 1.10 | LeCun et al. 1998
SVM, Gaussian kernel | – | none | 1.40 |
SVM, deg-4 polynomial | – | deskewing | 1.10 | LeCun et al. 1998
Reduced Set SVM, deg-5 poly | – | deskewing | 1.00 | LeCun et al. 1998
Virtual SVM, deg-9 poly | Affine | none | 0.80 | LeCun et al. 1998
V-SVM, 2-pixel jittered | – | none | 0.68 | DeCoste and Scholkopf, MLJ 2002
V-SVM, 2-pixel jittered | – | deskewing | 0.56 | DeCoste and Scholkopf, MLJ 2002
2-layer NN, 300 HU, MSE | – | none | 4.70 | LeCun et al. 1998
2-layer NN, 300 HU, MSE | Affine | none | 3.60 | LeCun et al. 1998
2-layer NN, 300 HU | – | deskewing | 1.60 | LeCun et al. 1998
3-layer NN, 500+150 HU | – | none | 2.95 | LeCun et al. 1998
3-layer NN, 500+150 HU | Affine | none | 2.45 | LeCun et al. 1998
3-layer NN, 500+300 HU, CE, reg | – | none | 1.53 | Hinton, unpublished, 2005
2-layer NN, 800 HU, CE | – | none | 1.60 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Affine | none | 1.10 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, MSE | Elastic | none | 0.90 | Simard et al., ICDAR 2003
2-layer NN, 800 HU, CE | Elastic | none | 0.70 | Simard et al., ICDAR 2003
Convolutional net LeNet-1 | – | subsamp. 16x16 pixels | 1.70 | LeCun et al. 1998
Convolutional net LeNet-4 | – | none | 1.10 | LeCun et al. 1998
Convolutional net LeNet-5 | – | none | 0.95 | LeCun et al. 1998
Conv. net LeNet-5 | Affine | none | 0.80 | LeCun et al. 1998
Boosted LeNet-4 | Affine | none | 0.70 | LeCun et al. 1998
Conv. net, CE | Affine | none | 0.60 | Simard et al., ICDAR 2003
Conv. net, CE | Elastic | none | 0.40 | Simard et al., ICDAR 2003
7 August 2010 62 L. Fei-Fei, Dragon Star 2010, Stanford
  • 62. Invariance and Robustness to Noise 7 August 2010 63 L. Fei-Fei, Dragon Star 2010, Stanford
  • 63. Face Detection and Pose Estimation with Convolutional Nets • Training: 52,850, 32x32 grey-level images of faces, 52,850 non-faces. • Each sample: used 5 times with random variation in scale, in-plane rotation, brightness and contrast. • 2nd phase: half of the initial negative set was replaced by false positives of the initial version of the detector . 7 August 2010 64 L. Fei-Fei, Dragon Star 2010, Stanford
  • 64. Face Detection: Results (detection rates on three test sets, each at two false-positive rates per image; x = not reported):
Data set → | TILTED (4.42 / 26.9 fp/img) | PROFILE (0.47 / 3.36 fp/img) | MIT+CMU (0.5 / 1.28 fp/img)
Our Detector | 90% / 97% | 67% / 83% | 83% / 88%
Jones & Viola (tilted) | 90% / 95% | x | x
Jones & Viola (profile) | x | 70% / 83% | x
Rowley et al. | x | x | 89% / 96%
Schneiderman & Kanade | x | x | 86% / 93%
7 August 2010 65 L. Fei-Fei, Dragon Star 2010, Stanford
  • 65. Face Detection and Pose Estimation: Results 7 August 2010 66 L. Fei-Fei, Dragon Star 2010, Stanford
  • 66. Face Detection with a Convolutional Net 7 August 2010 67 L. Fei-Fei, Dragon Star 2010, Stanford
  • 67. Yann LeCun Industrial Applications of ConvNets • AT&T/Lucent/NCR – Check reading, OCR, handwriting recognition (deployed 1996) • Vidient Inc – Vidient Inc's “SmartCatch” system deployed in several airports and facilities around the US for detecting intrusions, tailgating, and abandoned objects (Vidient is a spin-off of NEC) • NEC Labs – Cancer cell detection, automotive applications, kiosks • Google – OCR, face and license plate removal from StreetView • Microsoft – OCR, handwriting recognition, speech detection • France Telecom – Face detection, HCI, cell phone-based applications • Other projects: HRL (3D vision)....
  • 68. Machine learning in computer vision • Aug 12, Lecture 3: Neural Network – Convolutional Nets for object recognition – Unsupervised feature learning via Deep Belief Net (slides courtesy of Honglak Lee (Stanford)) 7 August 2010 69 L. Fei-Fei, Dragon Star 2010, Stanford
  • 69. Machine Learning’s Success • Data mining – Web data mining – Biomedical data mining – Time series data mining • Artificial Intelligence – Computer vision – Speech recognition – Autonomous car driving However, machine learning’s success has relied on having a good feature representation of the data. How can we develop good representations automatically?7 August 2010 70 L. Fei-Fei, Dragon Star 2010, Stanford
  • 70. The Learning Pipeline Input Input space Motorbikes “Non”-Motorbikes Learning Algorithm pixel 1 pixel2 pixel 1 pixel 2 7 August 2010 71 L. Fei-Fei, Dragon Star 2010, Stanford
  • 71. The Learning Pipeline Input Input space Feature space Motorbikes “Non”-Motorbikes Low-level features Learning Algorithm pixel 1 pixel2 “wheel” “handle” “feature engineering” handle wheel 7 August 2010 72 L. Fei-Fei, Dragon Star 2010, Stanford
  • 72. Computer vision features SIFT Spin image HoG RIFT Textons GLOH Drawbacks of feature engineering 1. Needs expert knowledge 2. Time consuming hand-tuning 7 August 2010 73 L. Fei-Fei, Dragon Star 2010, Stanford
  • 73. Feature learning from unlabeled data • Main idea – Finding underlying structure (cause) or statistical correlation from the input data. • Sparse coding [Olshausen and Field, 1997] – Objective: Given input data {x}, search for a set of bases {b_j} such that x ≈ Σ_j a_j b_j where the coefficients a_j are mostly zeros. 7 August 2010 74 L. Fei-Fei, Dragon Star 2010, Stanford
  • 74. Sparse coding on images Natural Images → Learned bases: “Edges”. New example: [Figure: an image patch decomposed as 0.8 × (basis 36) + 0.3 × (basis 42) + 0.5 × (basis 65)] x ≈ 0.8 · b36 + 0.3 · b42 + 0.5 · b65, i.e. the coefficients (feature representation) are [0, 0, …, 0.8, …, 0.3, …, 0.5, …]. Lee, Ng, et al. 2007. 7 August 2010 75 L. Fei-Fei, Dragon Star 2010, Stanford
  • 75. Optimization Given input data {x^(1), …, x^(m)}, we want to find good bases {b_1, …, b_n}: min_{a,b} Σ_i || x^(i) − Σ_j a_j^(i) b_j ||^2 + β Σ_i Σ_j |a_j^(i)| (reconstruction error + sparsity penalty), subject to the normalization constraint ||b_j|| ≤ 1 for all j. Solve by alternating minimization: -- Keep b fixed, find optimal a. -- Keep a fixed, find optimal b. Lee, Ng, 2006. 7 August 2010 76 L. Fei-Fei, Dragon Star 2010, Stanford
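A hedged sketch of the alternating minimization above: the a-step uses a few proximal-gradient (soft-thresholding) iterations and the b-step uses a projected gradient step back onto ||b_j|| ≤ 1; these solver choices are my simplifications, not the exact algorithm of the cited work.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_coding(X, n_bases=64, beta=0.1, n_outer=20, n_inner=50, seed=0):
    """Alternating minimization for  min_{A,B} ||X - B A||_F^2 + beta * sum|A|,  ||b_j|| <= 1.

    X: (d, m) data matrix, one example per column. Returns bases B (d, n) and codes A (n, m).
    """
    rng = np.random.default_rng(seed)
    d, m = X.shape
    B = rng.standard_normal((d, n_bases))
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)        # enforce ||b_j|| <= 1
    A = np.zeros((n_bases, m))
    for _ in range(n_outer):
        # a-step: keep B fixed, minimize over the codes with proximal gradient (ISTA)
        L = np.linalg.norm(B, 2) ** 2 + 1e-8               # step-size scale from the smooth part
        for _ in range(n_inner):
            grad = B.T @ (B @ A - X)
            A = soft_threshold(A - grad / L, beta / (2 * L))
        # b-step: keep A fixed, take a gradient step on B and re-project onto the norm ball
        G = (B @ A - X) @ A.T
        step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-8)
        B = B - step * G
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
    return B, A
```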
  • 76. Image classification Evaluated on the Caltech101 object category dataset. Pipeline: Input Image → Features (coefficients over the learned bases) → Classification Algorithm (SVM). Previous reported results: Fei-Fei et al., 2004: 16%; Berg et al., 2005: 17%; Holub et al., 2005: 40%; Serre et al., 2005: 35%; Berg et al., 2005: 48%; Lazebnik et al., 2006: 64%; Varma et al., 2008: 78%. Algorithm | Accuracy: Baseline (Fei-Fei et al., 2004) | 16%; PCA | 37%; Our method | 47%; 36% error reduction. 7 August 2010 77 L. Fei-Fei, Dragon Star 2010, Stanford
  • 77. Learning Feature Hierarchy Input image (pixels) “Sparse coding” (edges) [Related work: Hinton, Bengio, LeCun, and others.] DBN (Hinton et al., 2006) with additional sparseness constraint. Higher layer (Combinations of edges) 7 August 2010 78 L. Fei-Fei, Dragon Star 2010, Stanford
  • 78. Restricted Boltzmann Machine (RBM) – Undirected, bipartite graphical model – Inference is easy – Training by approximating maximum likelihood (Contrastive Divergence [Hinton, 2002]) visible nodes (data) hidden nodes Weights W (bases) encodes statistical relationship between h and v. 7 August 2010 79 L. Fei-Fei, Dragon Star 2010, Stanford
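A minimal sketch of one CD-1 (Contrastive Divergence) update for a binary RBM, as referenced on this slide; the layer sizes, learning rate, and batch handling are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample  = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_update(v0, W, b_vis, b_hid, eta=0.01):
    """One CD-1 update for a binary RBM on a batch of visible vectors v0 (batch, n_vis)."""
    # Positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = sample(ph0)
    # Negative phase: one step of Gibbs sampling (reconstruct v, then h)
    pv1 = sigmoid(h0 @ W.T + b_vis)
    v1 = sample(pv1)
    ph1 = sigmoid(v1 @ W + b_hid)
    # Approximate likelihood gradient: <v h>_data - <v h>_reconstruction
    n = v0.shape[0]
    W     += eta * (v0.T @ ph0 - v1.T @ ph1) / n
    b_vis += eta * (v0 - v1).mean(axis=0)
    b_hid += eta * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

n_vis, n_hid = 784, 256
W = rng.normal(0, 0.01, size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
```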
  • 79. Deep Belief Network • Deep Belief Network (DBN) [Hinton et al., 2006] – Generative model with multiple hidden layers – Successful applications • Recognizing handwritten digits • Learning motion capture data • Collaborative filtering – Bottom-up, layer-wise training using Restricted Boltzmann machines [Figure: visible nodes (data) V with hidden layers H1, H2, H3 connected by weights W1, W2, W3.] 7 August 2010 80 L. Fei-Fei, Dragon Star 2010, Stanford
  • 80. Sparse Restricted Boltzmann Machines [NIPS-2008] • Main idea – Constrain the hidden layer nodes to have “sparse” average activation (similar to sparse coding) – Regularize with a sparsity penalty: log-likelihood plus a sparsity penalty that pushes each hidden unit's average activation toward a target sparsity. 7 August 2010 81 L. Fei-Fei, Dragon Star 2010, Stanford
  • 81. Sparse representation from digits [NIPS 2008] Sparse RBM bases “pen-strokes” Sparse RBMs often give readily interpretable features and good discriminative power. Training examples 7 August 2010 82 L. Fei-Fei, Dragon Star 2010, Stanford
  • 82. Learning object representations • Learning objects and parts in images • Large image patches contain interesting higher- level structures. – E.g., object parts and full objects 7 August 2010 83 L. Fei-Fei, Dragon Star 2010, Stanford
  • 83. Applying DBN to large image [Figure: a DBN with weights W1, W2, W3 stacked on an input image, with higher layers responding to faces.] Problem: - Typically, input dimension ~ 1,000 (30x30 pixels) - Computationally intractable to learn from realistic image sizes (e.g. 200x200 pixels) 7 August 2010 84 L. Fei-Fei, Dragon Star 2010, Stanford
  • 84. Convolutional architectures  Weight sharing by convolution (e.g., [LeCun et al., 1989])  “Max-pooling” Invariance Computational efficiency Deterministic and feed-forward  We develop the convolutional Restricted Boltzmann machine (CRBM).  We define probabilistic max-pooling that combines bottom-up and top-down information. [Figure: input, convolution with a filter into a detection layer, maximum over a 2x2 grid into a max-pooling layer, repeated (conv → max → conv → max).] 7 August 2010 85 L. Fei-Fei, Dragon Star 2010, Stanford
  • 85. Convolutional RBM (CRBM) [ICML 2009] [Figure: input data V (visible layer), a detection layer H of binary hidden nodes computed with shared “filter” weights W^k, and a max-pooling layer P whose binary “max-pooling” nodes each cover a block of detection nodes; for each filter k, at most one hidden node in a block is active.] 7 August 2010 86 L. Fei-Fei, Dragon Star 2010, Stanford
  • 86. Probabilistic max pooling The detection nodes X_j are stochastic binary and mutually exclusive; the pooling node Y is on iff one of them is on. This collapses the 2^n configurations of n detection nodes into n+1 configurations, and permits both bottom-up and top-down inference. [Figure: a pooling node Y over detection nodes X1…X4, with the allowed joint configurations (exactly one X_j on, or all off).] 7 August 2010 87 L. Fei-Fei, Dragon Star 2010, Stanford
  • 87. Probabilistic max pooling Bottom-up inference: given the outputs I1…I4 of the convolution W*V from below, the probability of each detection node (and of the pooling node being off) can be written as a softmax function, and a configuration is sampled from the resulting multinomial distribution. [Figure: pooling node Y over detection nodes X1…X4 with bottom-up inputs I1…I4.] 7 August 2010 88 L. Fei-Fei, Dragon Star 2010, Stanford
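A small sketch of that sampling step for a single pooling block, assuming the softmax form P(X_k = 1) = exp(I_k) / (1 + Σ_j exp(I_j)) with an extra “all off” outcome; the block size and input values are illustrative.

```python
import numpy as np

def probabilistic_max_pool_sample(I, rng):
    """Sample the detection units X (at most one active) and pooling unit Y for one block.

    I: bottom-up inputs to the block (e.g. the 4 values of W*V in a 2x2 neighborhood).
    P(X_k = 1) = exp(I_k) / (1 + sum_j exp(I_j)); P(all off) = 1 / (1 + sum_j exp(I_j)).
    """
    logits = np.concatenate([I, [0.0]])                # last slot = "all detection units off"
    p = np.exp(logits - logits.max())                  # shift for numerical stability
    p /= p.sum()
    choice = rng.choice(len(logits), p=p)              # sample from the multinomial
    X = np.zeros_like(I)
    if choice < len(I):
        X[choice] = 1.0
    Y = float(X.any())                                 # pooling unit is ON iff some detector fired
    return X, Y

rng = np.random.default_rng(0)
X, Y = probabilistic_max_pool_sample(np.array([0.2, 1.5, -0.3, 0.1]), rng)
```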
  • 88. Convolutional Deep Belief Networks • Bottom-up (greedy), layer-wise training – Train one layer (convolutional RBM) at a time. • Inference (approximate) – Undirected connections for all layers (Markov net) [Related work: Salakhutdinov and Hinton, 2009] – Block Gibbs sampling or mean-field – Hierarchical probabilistic inference 7 August 2010 89 L. Fei-Fei, Dragon Star 2010, Stanford
  • 89. Unsupervised learning of object-parts Faces Cars Elephants Chairs 7 August 2010 90 L. Fei-Fei, Dragon Star 2010, Stanford
  • 90. Object category classification (Caltech 101) • Our model is comparable to the results using state-of-the-art features (e.g., SIFT). [Figure: bar chart of test accuracy (roughly the 40–75% range) comparing CDBN (first layer), CDBN (first+second layer), Ranzato et al. (2007), Mutch and Lowe (2006), Lazebnik et al. (2006), Zhang et al. (2006) and our method.] 7 August 2010 91 L. Fei-Fei, Dragon Star 2010, Stanford
  • 91. Unsupervised learning of object-parts Trained from multiple classes (cars, faces, motorbikes, airplanes) Object-specific features & shared features “Grouping” the object parts (highly specific) 7 August 2010 92 L. Fei-Fei, Dragon Star 2010, Stanford
  • 92. Review of main contributions  Developed efficient algorithms for unsupervised feature learning.  Showed that unsupervised feature learning is useful for many machine learning tasks.  Object recognition, image segmentation, audio classification, text classification, robotic perception, and others. Object parts “Filling in” 7 August 2010 93 L. Fei-Fei, Dragon Star 2010, Stanford
  • 93. Weaknesses & Criticisms • Learning everything. Better to encode prior knowledge about structure of images. A: Compare with machine learning vs. linguists debate in NLP. • Results not yet competitive with best engineered systems. A: Agreed. True for some domains. 7 August 2010 94 L. Fei-Fei, Dragon Star 2010, Stanford