1. 5 years from now, everyone will learn their features (you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences
and
Center for Neural Science,
New York University
2. I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or in
other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work
(with minor changes) for many modalities
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space
and time) similar to images and videos
Heavy correlation between neighboring variables
Local patches of variables have structure, and are representable
by feature vectors.
I like vision because it's challenging, it's useful, it's fun, we have data, and
the image recognition community is not yet stuck in a deep
local minimum like the speech recognition community.
3. The Unity of Recognition Architectures
4. Most Recognition Systems Are Built on the Same Architecture
[Diagram: Filter Bank → Non-Linearity → Feature Pooling → Normalization → Classifier]
[Diagram: two stages of Filter Bank → Non-Lin → Pool → Norm, followed by a Classifier]
First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders.....
Second stage: K-means, sparse coding, LCC....
Pooling: average, L2, max, max with bias (elastic templates).....
Convolutional Nets: same architecture, but everything is trained.
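To make the shared architecture concrete, here is a minimal NumPy sketch of one such stage (filter bank → non-linearity → pooling → normalization); the filter values, shapes, and the crude global normalization are illustrative assumptions, not any specific published pipeline.

```python
# A sketch of the generic recognition stage: filter bank -> rectification ->
# average pooling -> normalization. Random filters stand in for SIFT/HOG or
# learned filters; everything here is an illustrative assumption.
import numpy as np
from scipy.signal import convolve2d

def stage(image, filters, pool=2):
    """One feature-extraction stage on a single-channel image."""
    maps = np.stack([convolve2d(image, f, mode="valid") for f in filters])
    maps = np.abs(maps)                              # non-linearity (rectification)
    h, w = maps.shape[1] // pool, maps.shape[2] // pool
    maps = maps[:, :h * pool, :w * pool]             # crop to a multiple of pool
    maps = maps.reshape(len(filters), h, pool, w, pool).mean((2, 4))  # pooling
    return (maps - maps.mean()) / (maps.std() + 1e-8)  # crude normalization

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
filters = rng.standard_normal((8, 5, 5))             # stand-in filter bank
features = stage(image, filters)
print(features.shape)                                # (8, 14, 14)
```

Stacking two such stages and ending with any simple classifier reproduces the two-stage diagram above.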
5. Filter Bank + Non-Linearity + Pooling + Normalization
[Diagram: Filter Bank → Non-Linearity → Spatial Pooling]
This model of a feature extraction stage is biologically-inspired
...whether you like it or not (just ask David Lowe)
Inspired by [Hubel and Wiesel 1962]
The use of this module goes back to Fukushima's Neocognitron
(and even earlier models in the 60's).
6. How well does this work?
[Diagram: two stages of Filter Bank → Non-Linearity → Feature Pooling, then a Classifier. Stage 1: oriented edges or SIFT → winner-takes-all → histogram (sum). Stage 2: K-means or sparse coding → pyramid histogram, elastic parts models. Classifier: SVM or another simple classifier]
Some results on C101 (I know, I know....)
SIFT->K-means->Pyramid pooling->SVM intersection kernel: >65%
[Lazebnik et al. CVPR 2006]
SIFT->Sparse coding on Blocks->Pyramid pooling->SVM: >75%
[Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT->Local Sparse coding on Block->Pyramid pooling->SVM: >77%
[Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71%
[rejected from CVPR, ICCV, etc.] REAL TIME
8. Why do two stages work better than one stage?
[Diagram: two stages of Filter Bank → Non-Lin → Pool → Norm, followed by a Classifier]
The second stage extracts mid-level features
Having multiple stages helps the selectivity-invariance dilemma
9. Learning Hierarchical Representations
[Diagram: Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier, with a learned internal representation]
I agree with David Lowe: we should learn the features
It worked for speech, handwriting, NLP.....
In a way, the vision community has been running a ridiculously
inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: Publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you
10. Sometimes, biology gives you good hints. Example: contrast normalization
11. Harsh Non-Linearity + Contrast Normalization + Sparsity
[Diagram of one stage of the ConvNet:
C: convolutions (filter bank)
Rectification: soft thresholding + absolute value
N: subtractive and divisive local contrast normalization
P: pooling/downsampling layer (average or max?)]
THIS IS ONE STAGE OF THE CONVNET
13. Local Contrast Normalization
Performed on the state of every layer, including
the input
Subtractive Local Contrast Normalization
Subtracts from every value in a feature a
Gaussian-weighted average of its
neighbors (high-pass filter)
Divisive Local Contrast Normalization
Divides every value in a layer by the
standard deviation of its neighbors over
space and over all feature maps
Subtractive + Divisive LCN performs a kind of
approximate whitening.
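A sketch of the two LCN operations just described, assuming a Gaussian neighborhood over space and, for simplicity, a single feature map (the version on the slide also pools the divisive statistic across feature maps). Clipping the divisor at its mean is a common variant, an assumption rather than necessarily the exact rule used here.

```python
# Subtractive + divisive local contrast normalization on one feature map.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-8):
    # Subtractive: remove the Gaussian-weighted local mean (high-pass filter).
    x = x - gaussian_filter(x, sigma)
    # Divisive: divide by the Gaussian-weighted local standard deviation.
    local_std = np.sqrt(gaussian_filter(x**2, sigma))
    # Clip the divisor at its mean so flat regions are not amplified
    # (a common variant; an assumption, not necessarily the slide's rule).
    return x / np.maximum(local_std, local_std.mean() + eps)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
print(local_contrast_normalize(patch).std())
```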
14. C101 Performance (I know, I know)
Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, No Rectification, No normalization: 29%
Tanh non-linearity, Rectification, normalization: 65%
Shrink non-linearity, Rectification, normalization, sparsity penalty: 71%
15. Results on Caltech101 with sigmoid non-linearity
[Table of results; the sigmoid configuration behaves like the HMAX model]
16. Feature Learning Works Really Well on everything but C101
17. C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet
Some results on NORB
[Plot comparing: no normalization, random filters, unsupervised filters, supervised filters, unsupervised+supervised filters]
18. Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output
E(Yⁱ, Z) = ∥Yⁱ − W_d Z∥² + ∥Z − g_e(W_e, Yⁱ)∥² + ∑_j |z_j|
Zⁱ = argmin_z E(Yⁱ, z; W)
[Diagram: INPUT Y → encoder g_e(W_e, Yⁱ) → code Z (FEATURES) → decoder W_d Z, with terms ∥Yⁱ − Ỹ∥², ∥Z − Z̃∥², and ∑_j |z_j|]
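A minimal sketch of this energy and of inference by (sub)gradient descent starting from the encoder output, per the formula above; the dimensions, step size, tanh encoder, and sparsity weight `lam` are illustrative assumptions.

```python
# Sparse auto-encoder energy and inference-by-gradient-descent sketch.
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, step = 64, 128, 0.5, 0.01
Wd = rng.standard_normal((d, k)) / np.sqrt(k)    # decoder (dictionary)
We = rng.standard_normal((k, d)) / np.sqrt(d)    # encoder

def g_e(Y):                                      # encoder prediction of the code
    return np.tanh(We @ Y)

def energy(Y, Z):
    return (np.sum((Y - Wd @ Z)**2)              # reconstruction term
            + np.sum((Z - g_e(Y))**2)            # code-prediction term
            + lam * np.sum(np.abs(Z)))           # sparsity term

Y = rng.standard_normal(d)
Z = g_e(Y)                                       # start from the encoder output
for _ in range(200):                             # (sub)gradient descent on Z
    grad = -2 * Wd.T @ (Y - Wd @ Z) + 2 * (Z - g_e(Y)) + lam * np.sign(Z)
    Z -= step * grad
print(energy(Y, Z) <= energy(Y, g_e(Y)))         # inference lowered the energy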
19. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
[Diagram: Y → encoder g_e(W_e, Yⁱ) → code Z (FEATURES) → decoder W_d Z, with terms ∥Yⁱ − Ỹ∥², ∥Z − Z̃∥², and ∑_j |z_j|]
20. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
[Diagram: Y → g_e(W_e, Yⁱ) → |z_j| → FEATURES]
21. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
[Diagram: the first-layer features |z_j| become the input of a second PSD layer with its own encoder, decoder, and sparsity terms]
22. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
[Diagram: Y → g_e(W_e, Yⁱ) → |z_j| → g_e(W_e, Yⁱ) → |z_j| → FEATURES]
23. Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation
[Diagram: Y → encoder + abs → encoder + abs → classifier]
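A control-flow sketch of this layer-wise recipe; `train_psd` is a hypothetical stub (it returns random encoder weights where a real implementation would minimize the PSD energy over codes and parameters), and the toy data, sizes, and least-squares classifier are assumptions. Only the phase structure is the point.

```python
# Greedy layer-wise training sketch following Phases 1-6 above.
import numpy as np

rng = np.random.default_rng(0)

def train_psd(X, k):
    # Hypothetical stand-in for PSD training on data matrix X (n x dim);
    # returns random encoder weights instead of optimized ones.
    return rng.standard_normal((k, X.shape[1])) / np.sqrt(X.shape[1])

def extract(We, X):
    # Phases 2 and 4: encoder + absolute value as feature extractor.
    return np.abs(np.tanh(X @ We.T))

X = rng.standard_normal((100, 64))           # toy "patches"
labels = rng.integers(0, 2, 100)

We1 = train_psd(X, 128)                      # Phase 1
H1 = extract(We1, X)                         # Phase 2
We2 = train_psd(H1, 256)                     # Phase 3
H2 = extract(We2, H1)                        # Phase 4
# Phase 5: any supervised classifier on H2; here a least-squares linear map.
w = np.linalg.lstsq(H2, labels.astype(float), rcond=None)[0]
print(((H2 @ w > 0.5) == labels).mean())     # training accuracy of the sketch
# Phase 6 (optional): back-propagate through the whole stack; omitted here.
```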
24. Learned Features on natural patches: V1-like receptive fields
25. Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD
with Linear-Sigmoid-Diagonal Encoder
27. Convolutional Training
Problem:
With patch-level training, the learning algorithm must reconstruct
the entire patch with a single feature vector
But when the filters are used convolutionally, neighboring feature
vectors will be highly redundant
Patch-level training produces lots of filters that are shifted versions of each other.
28. Convolutional Sparse Coding
Replace the dot products with dictionary elements by convolutions.
Input Y is a full image
Each code component Zk is a feature map (an image)
Each dictionary element is a convolution kernel
Regular sparse coding: Y = W Z
Convolutional S.C.: Y = ∑_k W_k ∗ Z_k
“deconvolutional networks” [Zeiler, Taylor, Fergus CVPR 2010]
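A sketch of the convolutional reconstruction Y = ∑_k W_k ∗ Z_k; kernel sizes, map counts, and the thresholding that makes the code maps sparse are all chosen arbitrarily for illustration.

```python
# Convolutional sparse coding reconstruction: the image is a sum of
# convolutions of sparse feature maps with small dictionary kernels.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
K, ksize, H, W = 4, 7, 32, 32
kernels = rng.standard_normal((K, ksize, ksize))           # dictionary elements
Zs = rng.standard_normal((K, H - ksize + 1, W - ksize + 1))
Zs[np.abs(Zs) < 2.0] = 0.0                                 # sparse feature maps

# Each code component Z_k is itself an image; 'full' convolution with the
# kernel brings each map back to the full image size.
Y = sum(convolve2d(Z, Wk, mode="full") for Z, Wk in zip(Zs, kernels))
print(Y.shape)                                             # (32, 32)
```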
29. Convolutional PSD: Encoder with a soft sh() Function
Convolutional formulation: extend sparse coding from PATCH to IMAGE
[Figure: filters from PATCH-based learning vs. CONVOLUTIONAL learning]
30. Cifar-10 Dataset
Dataset of tiny images
Images are 32x32 color images
10 object categories, with 50,000 training and 10,000 test images
Example Images
31. Comparative Results on Cifar-10 Dataset
* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.
32. Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1)
32x32 images
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA
No. 6 is humans!
33. Pedestrian Detection (INRIA Dataset)
[Sermanet et al., rejected from ICCV 2011]
35. Learning Invariant Features
36. Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with suitable non-linearity)
2. Arrange filter outputs on a 2D plane
3. Square filter outputs
4. Minimize sqrt of sum of blocks of squared filter outputs
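A sketch of the four steps above on a toy patch; the 8×8 filter grid and the non-overlapping 2×2 blocks are simplifying assumptions (topographic pooling typically uses overlapping neighborhoods).

```python
# Topographic square pooling sketch: filter outputs on a 2D grid, squared,
# summed over blocks, then square-rooted ("complex cell" pool outputs).
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal(64)              # flattened 8x8 patch
filters = rng.standard_normal((64, 64))      # 64 filters -> an 8x8 output grid

z = (filters @ patch).reshape(8, 8)          # steps 1-2: filter, arrange on grid
z2 = z**2                                    # step 3: square
# Step 4: sqrt of 2x2 block sums; minimizing their total during training
# encourages group sparsity, which is what groups similar filters together.
pooled = np.sqrt(z2.reshape(4, 2, 4, 2).sum((1, 3)))
print(pooled.shape, pooled.sum())            # (4, 4) pool outputs
```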
37. Why just pool over space? Why not over orientation?
The filters arrange
themselves spontaneously so
that similar filters enter the
same pool.
The pooling units can be seen
as complex cells
They are invariant to local
transformations of the input
For some it's translations,
for others rotations, or
other transformations.
38. Pinwheels?
Does that look pinwheely to you?
39. Sparsity through Lateral Inhibition
40. Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix
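One plausible reading of this, sketched below: replace the L1 term ∑_j |z_j| with a pairwise term ∑_ij S_ij |z_i| |z_j|, so that units connected through S suppress each other and zeros in S define groups of units allowed to be co-active. The full-inhibition matrix and the toy codes are assumptions, not the slide's exact formulation.

```python
# Lateral-inhibition sparsity penalty sketch: pairs of units are penalized
# only when both are active and S connects them.
import numpy as np

rng = np.random.default_rng(0)
k = 16
S = np.ones((k, k)) - np.eye(k)              # assumption: full mutual inhibition

def inhibition_penalty(z, S):
    a = np.abs(z)
    return a @ S @ a                         # sum_ij S_ij |z_i| |z_j|

z_shared = rng.standard_normal(k) * 0.25     # activity spread over many units
z_peaked = np.zeros(k); z_peaked[3] = 1.0    # a single active unit
print(inhibition_penalty(z_shared, S), inhibition_penalty(z_peaked, S))
# The concentrated code pays no penalty; carving zeros into S (e.g., a tree
# or a ring, as on the following slides) shapes which units may co-fire.
```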
41. Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure
42. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology
Input patches are high-pass filtered
43. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology
Left: no high-pass filtering of the input
Right: patch-level mean removal
44. Invariant Features: Short-Range Lateral Excitation + L1
45. Disentangling the Explanatory Factors of Images
46. Separating
I used to think that recognition was all about eliminating irrelevant
information while keeping the useful information
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors
of variations:
Separating “what” and “where”
Separating content from instantiation parameters
Hinton's “capsules”; Karol Gregor's what-where auto-encoders
47. Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters
[Hinton 1981]
[Figure: object type vs. object size (small, medium, large)]
[Karol Gregor et al.]
48. Invariant Features through Temporal Constancy
[Diagram: inputs S_t, S_t+1, S_t+2 pass through an encoder (f∘W̄¹) to predicted and inferred codes C¹_t, C¹_t+1, C¹_t+2 and C²; decoder weights W¹ (one per frame) and W² (shared across the window) produce the predicted inputs]
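A structural sketch of one plausible reading of this diagram: each frame S_t gets its own code C1_t while a single code C2 is shared across the temporal window, forcing whatever is constant over the window into C2. The linear decoders and all sizes are assumptions.

```python
# Temporal-constancy what/where sketch: per-frame codes plus one shared code.
import numpy as np

rng = np.random.default_rng(0)
d, k1, k2, T = 32, 8, 8, 3
W1 = rng.standard_normal((d, k1))            # decoder for per-frame codes
W2 = rng.standard_normal((d, k2))            # decoder for the shared code

def reconstruction_error(frames, C1s, C2):
    # Each frame is predicted from its own C1_t plus the shared C2.
    preds = [W1 @ c1 + W2 @ C2 for c1 in C1s]
    return sum(np.sum((s - p)**2) for s, p in zip(frames, preds))

frames = [rng.standard_normal(d) for _ in range(T)]
C1s = [rng.standard_normal(k1) for _ in range(T)]   # varies with t ("where")
C2 = rng.standard_normal(k2)                        # constant over t ("what")
print(reconstruction_error(frames, C1s, C2))
```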
49. Invariant Features through Temporal Constancy
[Figure: code C1 captures “where”; code C2 captures “what”]
51. What is the right criterion to train hierarchical feature extraction architectures?
52. Flattening the Data Manifold?
The manifold of all images of <Category-X> is low-dimensional
and highly curvy
Feature extractors should “flatten” the manifold
53. Flattening the Data Manifold?
54. The Ultimate Recognition System
[Diagram: Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier, with a learned internal representation]
Bottom-up and top-down information
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down
inference
Integrated supervised and unsupervised learning
Capture the dependencies between all observed variables
Compositionality
Each stage has latent instantiation variables