1. 5 years from now, everyone will learn their features (you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences
and
Center for Neural Science,
New York University
2. I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or in
other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work
(with minor changes) for many modalities
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space
and time) similar to images and videos
Heavy correlation between neighboring variables
Local patches of variables have structure, and are representable
by feature vectors.
I like vision because it's challenging, it's useful, it's fun, we have data, and
the image recognition community is not yet stuck in a deep
local minimum like the speech recognition community.
3. The Unity of Recognition Architectures
4. Most Recognition Systems Are Built on the Same Architecture
[Diagram: Filter Bank → Non-Linearity → Feature Pooling → Normalization → Classifier]
[Diagram: two stages of Filter Bank → Non-Lin → Pool → Norm, followed by a Classifier]
First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders.....
Second stage: K-means, sparse coding, LCC....
Pooling: average, L2, max, max with bias (elastic templates).....
Convolutional Nets: same architecture, but everything is trained.
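To make the shared architecture concrete, here is a minimal NumPy sketch of one such stage (filter bank → non-linearity → pooling → normalization); the filter values, shapes, and the crude global normalization are illustrative assumptions, not any specific published pipeline.

```python
# A sketch of the generic recognition stage: filter bank -> rectification ->
# average pooling -> normalization. Random filters stand in for SIFT/HOG or
# learned filters; everything here is an illustrative assumption.
import numpy as np
from scipy.signal import convolve2d

def stage(image, filters, pool=2):
    """One feature-extraction stage on a single-channel image."""
    maps = np.stack([convolve2d(image, f, mode="valid") for f in filters])
    maps = np.abs(maps)                              # non-linearity (rectification)
    h, w = maps.shape[1] // pool, maps.shape[2] // pool
    maps = maps[:, :h * pool, :w * pool]             # crop to a multiple of pool
    maps = maps.reshape(len(filters), h, pool, w, pool).mean((2, 4))  # pooling
    return (maps - maps.mean()) / (maps.std() + 1e-8)  # crude normalization

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
filters = rng.standard_normal((8, 5, 5))             # stand-in filter bank
features = stage(image, filters)
print(features.shape)                                # (8, 14, 14)
```

Stacking two such stages and ending with any simple classifier reproduces the two-stage diagram above.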
5. Filter Bank + Non-Linearity + Pooling + Normalization
[Diagram: Filter Bank → Non-Linearity → Spatial Pooling]
This model of a feature extraction stage is biologically-inspired
...whether you like it or not (just ask David Lowe)
Inspired by [Hubel and Wiesel 1962]
The use of this module goes back to Fukushima's Neocognitron
(and even earlier models in the 60's).
6. How well does this work?
[Diagram: two stages of Filter Bank → Non-Linearity → Feature Pooling, then a Classifier. Stage 1: oriented edges or SIFT → winner-takes-all → histogram (sum). Stage 2: K-means or sparse coding → pyramid histogram, elastic parts models. Classifier: SVM or another simple classifier]
Some results on C101 (I know, I know....)
SIFT->K-means->Pyramid pooling->SVM intersection kernel: >65%
[Lazebnik et al. CVPR 2006]
SIFT->Sparse coding on Blocks->Pyramid pooling->SVM: >75%
[Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT->Local Sparse coding on Block->Pyramid pooling->SVM: >77%
[Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71%
[rejected from CVPR, ICCV, etc.] REAL TIME
8. Why do two stages work better than one stage?
[Diagram: two stages of Filter Bank → Non-Lin → Pool → Norm, followed by a Classifier]
The second stage extracts mid-level features
Having multiple stages helps the selectivity-invariance dilemma
9. Learning Hierarchical Representations
[Diagram: Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier, with a learned internal representation]
I agree with David Lowe: we should learn the features
It worked for speech, handwriting, NLP.....
In a way, the vision community has been running a ridiculously
inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: Publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you
10. Sometimes, biology gives you good hints. Example: contrast normalization
11. Harsh Non-Linearity + Contrast Normalization + Sparsity
[Diagram of one stage of the ConvNet:
C: convolutions (filter bank)
Rectification: soft thresholding + absolute value
N: subtractive and divisive local contrast normalization
P: pooling/downsampling layer (average or max?)]
THIS IS ONE STAGE OF THE CONVNET
13. Local Contrast Normalization
Performed on the state of every layer, including
the input
Subtractive Local Contrast Normalization
Subtracts from every value in a feature a
Gaussian-weighted average of its
neighbors (high-pass filter)
Divisive Local Contrast Normalization
Divides every value in a layer by the
standard deviation of its neighbors over
space and over all feature maps
Subtractive + Divisive LCN performs a kind of
approximate whitening.
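A sketch of the two LCN operations just described, assuming a Gaussian neighborhood over space and, for simplicity, a single feature map (the version on the slide also pools the divisive statistic across feature maps). Clipping the divisor at its mean is a common variant, an assumption rather than necessarily the exact rule used here.

```python
# Subtractive + divisive local contrast normalization on one feature map.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-8):
    # Subtractive: remove the Gaussian-weighted local mean (high-pass filter).
    x = x - gaussian_filter(x, sigma)
    # Divisive: divide by the Gaussian-weighted local standard deviation.
    local_std = np.sqrt(gaussian_filter(x**2, sigma))
    # Clip the divisor at its mean so flat regions are not amplified
    # (a common variant; an assumption, not necessarily the slide's rule).
    return x / np.maximum(local_std, local_std.mean() + eps)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
print(local_contrast_normalize(patch).std())
```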
14. C101 Performance (I know, I know)
Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, No Rectification, No normalization: 29%
Tanh non-linearity, Rectification, normalization: 65%
Shrink non-linearity, Rectification, normalization, sparsity penalty: 71%
15. Results on Caltech101 with sigmoid non-linearity
[Table of results; the sigmoid configuration behaves like the HMAX model]
16. Feature Learning Works Really Well on everything but C101
17. C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet
Some results on NORB
[Plot comparing: no normalization, random filters, unsupervised filters, supervised filters, unsupervised+supervised filters]
18. Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output
E(Yⁱ, Z) = ∥Yⁱ − W_d Z∥² + ∥Z − g_e(W_e, Yⁱ)∥² + ∑_j |z_j|
Zⁱ = argmin_z E(Yⁱ, z; W)
[Diagram: INPUT Y → encoder g_e(W_e, Yⁱ) → code Z (FEATURES) → decoder W_d Z, with terms ∥Yⁱ − Ỹ∥², ∥Z − Z̃∥², and ∑_j |z_j|]
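A minimal sketch of this energy and of inference by (sub)gradient descent starting from the encoder output, per the formula above; the dimensions, step size, tanh encoder, and sparsity weight `lam` are illustrative assumptions.

```python
# Sparse auto-encoder energy and inference-by-gradient-descent sketch.
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, step = 64, 128, 0.5, 0.01
Wd = rng.standard_normal((d, k)) / np.sqrt(k)    # decoder (dictionary)
We = rng.standard_normal((k, d)) / np.sqrt(d)    # encoder

def g_e(Y):                                      # encoder prediction of the code
    return np.tanh(We @ Y)

def energy(Y, Z):
    return (np.sum((Y - Wd @ Z)**2)              # reconstruction term
            + np.sum((Z - g_e(Y))**2)            # code-prediction term
            + lam * np.sum(np.abs(Z)))           # sparsity term

Y = rng.standard_normal(d)
Z = g_e(Y)                                       # start from the encoder output
for _ in range(200):                             # (sub)gradient descent on Z
    grad = -2 * Wd.T @ (Y - Wd @ Z) + 2 * (Z - g_e(Y)) + lam * np.sign(Z)
    Z -= step * grad
print(energy(Y, Z) <= energy(Y, g_e(Y)))         # inference lowered the energy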
19. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
[Diagram: Y → encoder g_e(W_e, Yⁱ) → code Z (FEATURES) → decoder W_d Z, with terms ∥Yⁱ − Ỹ∥², ∥Z − Z̃∥², and ∑_j |z_j|]
20. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
[Diagram: Y → g_e(W_e, Yⁱ) → |z_j| → FEATURES]
21. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
[Diagram: the first-layer features |z_j| become the input of a second PSD layer with its own encoder, decoder, and sparsity terms]
22. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
[Diagram: Y → g_e(W_e, Yⁱ) → |z_j| → g_e(W_e, Yⁱ) → |z_j| → FEATURES]
23. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation
[Diagram: Y → encoder + abs → encoder + abs → classifier]
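A control-flow sketch of this layer-wise recipe; `train_psd` is a hypothetical stub (it returns random encoder weights where a real implementation would minimize the PSD energy over codes and parameters), and the toy data, sizes, and least-squares classifier are assumptions. Only the phase structure is the point.

```python
# Greedy layer-wise training sketch following Phases 1-6 above.
import numpy as np

rng = np.random.default_rng(0)

def train_psd(X, k):
    # Hypothetical stand-in for PSD training on data matrix X (n x dim);
    # returns random encoder weights instead of optimized ones.
    return rng.standard_normal((k, X.shape[1])) / np.sqrt(X.shape[1])

def extract(We, X):
    # Phases 2 and 4: encoder + absolute value as feature extractor.
    return np.abs(np.tanh(X @ We.T))

X = rng.standard_normal((100, 64))           # toy "patches"
labels = rng.integers(0, 2, 100)

We1 = train_psd(X, 128)                      # Phase 1
H1 = extract(We1, X)                         # Phase 2
We2 = train_psd(H1, 256)                     # Phase 3
H2 = extract(We2, H1)                        # Phase 4
# Phase 5: any supervised classifier on H2; here a least-squares linear map.
w = np.linalg.lstsq(H2, labels.astype(float), rcond=None)[0]
print(((H2 @ w > 0.5) == labels).mean())     # training accuracy of the sketch
# Phase 6 (optional): back-propagate through the whole stack; omitted here.
```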
24. Learned Features on natural patches: V1-like receptive fields
25. Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD
with Linear-Sigmoid-Diagonal Encoder
27. Convolutional Training
Problem:
With patch-level training, the learning algorithm must reconstruct
the entire patch with a single feature vector
But when the filters are used convolutionally, neighboring feature
vectors will be highly redundant
Patch-level training produces lots of filters that are shifted versions of each other.
28. Convolutional Sparse Coding
Replace the dot products with dictionary elements by convolutions.
Input Y is a full image
Each code component Zk is a feature map (an image)
Each dictionary element is a convolution kernel
Regular sparse coding: Y = W Z
Convolutional S.C.: Y = ∑_k W_k ∗ Z_k
“deconvolutional networks” [Zeiler, Taylor, Fergus CVPR 2010]
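A sketch of the convolutional reconstruction Y = ∑_k W_k ∗ Z_k; kernel sizes, map counts, and the thresholding that makes the code maps sparse are all chosen arbitrarily for illustration.

```python
# Convolutional sparse coding reconstruction: the image is a sum of
# convolutions of sparse feature maps with small dictionary kernels.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
K, ksize, H, W = 4, 7, 32, 32
kernels = rng.standard_normal((K, ksize, ksize))           # dictionary elements
Zs = rng.standard_normal((K, H - ksize + 1, W - ksize + 1))
Zs[np.abs(Zs) < 2.0] = 0.0                                 # sparse feature maps

# Each code component Z_k is itself an image; 'full' convolution with the
# kernel brings each map back to the full image size.
Y = sum(convolve2d(Z, Wk, mode="full") for Z, Wk in zip(Zs, kernels))
print(Y.shape)                                             # (32, 32)
```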
29. Convolutional PSD: Encoder with a soft sh() Function
Convolutional formulation: extend sparse coding from PATCH to IMAGE
[Figure: filters from PATCH-based learning vs. CONVOLUTIONAL learning]
30. Cifar-10 Dataset
Dataset of tiny images
Images are 32x32 color images
10 object categories, with 50,000 training and 10,000 test images
Example Images
31. Comparative Results on Cifar-10 Dataset
* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.
32. Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1)
32x32 images
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA
No. 6 is humans!
33. Pedestrian Detection (INRIA Dataset)
[Sermanet et al., rejected from ICCV 2011]
35. Learning Invariant Features
36. Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with suitable non-linearity)
2. Arrange filter outputs on a 2D plane
3. Square filter outputs
4. Minimize sqrt of sum of blocks of squared filter outputs
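A sketch of the four steps above on a toy patch; the 8×8 filter grid and the non-overlapping 2×2 blocks are simplifying assumptions (topographic pooling typically uses overlapping neighborhoods).

```python
# Topographic square pooling sketch: filter outputs on a 2D grid, squared,
# summed over blocks, then square-rooted ("complex cell" pool outputs).
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal(64)              # flattened 8x8 patch
filters = rng.standard_normal((64, 64))      # 64 filters -> an 8x8 output grid

z = (filters @ patch).reshape(8, 8)          # steps 1-2: filter, arrange on grid
z2 = z**2                                    # step 3: square
# Step 4: sqrt of 2x2 block sums; minimizing their total during training
# encourages group sparsity, which is what groups similar filters together.
pooled = np.sqrt(z2.reshape(4, 2, 4, 2).sum((1, 3)))
print(pooled.shape, pooled.sum())            # (4, 4) pool outputs
```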
37. Why just pool over space? Why not over orientation?
The filters arrange
themselves spontaneously so
that similar filters enter the
same pool.
The pooling units can be seen
as complex cells
They are invariant to local
transformations of the input
For some it's translations,
for others rotations, or
other transformations.
38. Pinwheels?
Does that look pinwheely to you?
39. Sparsity through Lateral Inhibition
40. Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix
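One plausible reading of this, sketched below: replace the L1 term ∑_j |z_j| with a pairwise term ∑_ij S_ij |z_i| |z_j|, so that units connected through S suppress each other and zeros in S define groups of units allowed to be co-active. The full-inhibition matrix and the toy codes are assumptions, not the slide's exact formulation.

```python
# Lateral-inhibition sparsity penalty sketch: pairs of units are penalized
# only when both are active and S connects them.
import numpy as np

rng = np.random.default_rng(0)
k = 16
S = np.ones((k, k)) - np.eye(k)              # assumption: full mutual inhibition

def inhibition_penalty(z, S):
    a = np.abs(z)
    return a @ S @ a                         # sum_ij S_ij |z_i| |z_j|

z_shared = rng.standard_normal(k) * 0.25     # activity spread over many units
z_peaked = np.zeros(k); z_peaked[3] = 1.0    # a single active unit
print(inhibition_penalty(z_shared, S), inhibition_penalty(z_peaked, S))
# The concentrated code pays no penalty; carving zeros into S (e.g., a tree
# or a ring, as on the following slides) shapes which units may co-fire.
```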
41. Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure
42. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology
Input patches are high-pass filtered
43. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology
Left: no high-pass filtering of the input
Right: patch-level mean removal
44. Invariant Features: Short-Range Lateral Excitation + L1
45. Disentangling the Explanatory Factors of Images
46. Separating
I used to think that recognition was all about eliminating irrelevant
information while keeping the useful information
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors
of variations:
Separating “what” and “where”
Separating content from instantiation parameters
Hinton's “capsules”; Karol Gregor's what-where auto-encoders
47. Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters
[Hinton 1981]
[Figure: object type vs. object size (small, medium, large)]
[Karol Gregor et al.]
48. Invariant Features through Temporal Constancy
[Diagram: inputs S_t, S_t+1, S_t+2 pass through an encoder (f∘W̄¹) to predicted and inferred codes C¹_t, C¹_t+1, C¹_t+2 and C²; decoder weights W¹ (one per frame) and W² (shared across the window) produce the predicted inputs]
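A structural sketch of one plausible reading of this diagram: each frame S_t gets its own code C1_t while a single code C2 is shared across the temporal window, forcing whatever is constant over the window into C2. The linear decoders and all sizes are assumptions.

```python
# Temporal-constancy what/where sketch: per-frame codes plus one shared code.
import numpy as np

rng = np.random.default_rng(0)
d, k1, k2, T = 32, 8, 8, 3
W1 = rng.standard_normal((d, k1))            # decoder for per-frame codes
W2 = rng.standard_normal((d, k2))            # decoder for the shared code

def reconstruction_error(frames, C1s, C2):
    # Each frame is predicted from its own C1_t plus the shared C2.
    preds = [W1 @ c1 + W2 @ C2 for c1 in C1s]
    return sum(np.sum((s - p)**2) for s, p in zip(frames, preds))

frames = [rng.standard_normal(d) for _ in range(T)]
C1s = [rng.standard_normal(k1) for _ in range(T)]   # varies with t ("where")
C2 = rng.standard_normal(k2)                        # constant over t ("what")
print(reconstruction_error(frames, C1s, C2))
```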
49. Invariant Features through Temporal Constancy
[Figure: code C1 captures “where”; code C2 captures “what”]
51. What is the right criterion to train hierarchical feature extraction architectures?
52. Flattening the Data Manifold?
The manifold of all images of <Category-X> is low-dimensional
and highly curvy
Feature extractors should “flatten” the manifold
53. Flattening the Data Manifold?
54. The Ultimate Recognition System
[Diagram: Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier, with a learned internal representation]
Bottom-up and top-down information
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down
inference
Integrated supervised and unsupervised learning
Capture the dependencies between all observed variables
Compositionality
Each stage has latent instantiation variables