Deep Learning Hardware:
Past, Present, & Future
ISSCC, San Francisco, 2019-02-18
Yann LeCun
Facebook AI Research
New York University
http://yann.lecun.com
Y. LeCun
AI today is mostly supervised learning
Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
PLANE
CAR
Works well for:
Speech→words
Image→categories
Portrait→ name
Photo→caption
Text→topic
Y. LeCun
The History of Neural Nets is Inextricable from Hardware
The McCulloch-Pitts Binary Neuron
Perceptron: weights are motorized potentiometers
Adaline: Weights are electrochemical “memistors”
y = sign(∑_{i=1}^{N} W_i X_i + b)
https://youtu.be/X1G2g3SiCwU
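As a minimal sketch (my addition, not from the slides), the thresholded weighted sum above can be written in a few lines of Python:

import numpy as np

def binary_neuron(x, w, b):
    # y = sign(sum_i W_i X_i + b): a McCulloch-Pitts-style threshold unit
    return np.sign(np.dot(w, x) + b)

y = binary_neuron(np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.1, -0.7]), 0.2)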
Y. LeCun
The Standard Paradigm of Pattern Recognition
...and “traditional” Machine Learning
[Diagram: Feature Extractor (hand-engineered) → Trainable Classifier]
Y. LeCun
1969→1985: Neural Net Winter
No learning for multilayer nets, why?
People used the wrong “neuron”: the McCulloch & Pitts binary neuron
Binary neurons are easier to implement: No multiplication necessary!
Binary neurons prevented people from thinking about gradient-based
methods for multi-layer nets
Early 1980s: The second wave of neural nets
1982: Hopfield nets: fully-connected recurrent binary networks
1983: Boltzmann Machines: binary stochastic networks with hidden units
1985/86: Backprop! Q: Why only then? A: sigmoid neurons!
Sigmoid neurons were enabled by “fast” floating point (Sun Workstations)
Y. LeCun
Multilayer Neural Nets and Deep Learning
Traditional Machine Learning: Feature Extractor (hand-engineered) → Trainable Classifier
Deep Learning: Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier (all stages trainable)
Y. LeCun
Multi-Layer Neural Nets
Multiple layers of simple units
Each unit computes a weighted sum of its inputs
The weighted sum is passed through a non-linear function
The learning algorithm changes the weights
[Diagram: weight matrices and hidden layers; output label “Ceci est une voiture” (“This is a car”)]
ReLU(x) = max(x, 0)
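A minimal sketch of such a two-layer net in PyTorch (my illustration; layer sizes are arbitrary):

import torch

x = torch.randn(256)                                 # input vector
W1, b1 = torch.randn(128, 256), torch.zeros(128)     # weight matrix of the hidden layer
W2, b2 = torch.randn(10, 128), torch.zeros(10)       # weight matrix of the output layer

h = torch.relu(W1 @ x + b1)   # weighted sum + non-linearity: ReLU(x) = max(x, 0)
y = W2 @ h + b2               # output scores; learning would adjust W1, b1, W2, b2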
Y. LeCun
Supervised Machine Learning = Function Optimization
It's like walking in the mountains in a fog
and following the direction of steepest
descent to reach the village in the valley
But each sample gives us a noisy
estimate of the direction. So our path is
a bit random.
[Diagram: Function with adjustable parameters → Objective Function → Error; example label “traffic light: -1”]
W_i ← W_i − η ∂L(W, X)/∂W_i
Stochastic Gradient Descent (SGD)
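A minimal sketch of this SGD update in PyTorch (my illustration; the data and quadratic loss are hypothetical):

import torch

W = torch.randn(10, 256, requires_grad=True)         # adjustable parameters
eta = 0.01                                            # learning rate
x, y = torch.randn(256), torch.randn(10)              # one (noisy) training sample

loss = ((W @ x - y) ** 2).mean()                      # objective function L(W, X)
loss.backward()                                       # compute dL/dW
with torch.no_grad():
    W -= eta * W.grad                                 # W_i <- W_i - eta * dL/dW_i
    W.grad.zero_()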
Y. LeCun
Computing Gradients by Back-Propagation
A practical application of the chain rule
Backprop for the state gradients:
dC/dX_{i−1} = dC/dX_i · dX_i/dX_{i−1} = dC/dX_i · dF_i(X_{i−1}, W_i)/dX_{i−1}
Backprop for the weight gradients:
dC/dW_i = dC/dX_i · dX_i/dW_i = dC/dX_i · dF_i(X_{i−1}, W_i)/dW_i
[Diagram: stack of modules F_1(X_0, W_1) 
 F_i(X_{i−1}, W_i) 
 F_n(X_{n−1}, W_n) feeding a cost C(X, Y, Θ), with input X, desired output Y, and gradients dC/dX_i, dC/dW_i propagated backwards]
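A hand-written sketch of these two recursions for a stack of linear + ReLU modules (my illustration, using NumPy; PyTorch does the same thing automatically):

import numpy as np

def forward(x, weights):
    # X_i = F_i(X_{i-1}, W_i) = relu(W_i X_{i-1}); keep the states for the backward pass
    states = [x]
    for W in weights:
        states.append(np.maximum(W @ states[-1], 0.0))
    return states

def backward(states, weights, dC_dXn):
    # dC/dX_{i-1} = dC/dX_i . dF_i/dX_{i-1}  and  dC/dW_i = dC/dX_i . dF_i/dW_i
    dC_dX, grads = dC_dXn, []
    for W, X_prev, X in zip(reversed(weights), reversed(states[:-1]), reversed(states[1:])):
        dC_dpre = dC_dX * (X > 0)                  # gradient through the ReLU
        grads.append(np.outer(dC_dpre, X_prev))    # dC/dW_i
        dC_dX = W.T @ dC_dpre                      # dC/dX_{i-1}
    return list(reversed(grads))

weights = [np.random.randn(64, 128), np.random.randn(10, 64)]
states = forward(np.random.randn(128), weights)
grads = backward(states, weights, dC_dXn=np.ones(10))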
Y. LeCun
1986-1996 Neural Net Hardware at Bell Labs, Holmdel
1986: 12x12 resistor array
Fixed resistor values
E-beam lithography: 6×6 microns
1988: 54x54 neural net
Programmable ternary weights
On-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights
320GOPS, 1-bit convolver.
1992: ANNA: 64x64 net
ConvNet accelerator: 4GOPS
6-bit weights, 3-bit activations
6 microns
Y. LeCun
Convolutional Network Architecture [LeCun et al. NIPS 1989]
Inspired by [Hubel & Wiesel 1962] &
[Fukushima 1982] (Neocognitron):
simple cells detect local features
complex cells “pool” the outputs of simple
cells within a retinotopic neighborhood.
[Architecture: Filter Bank + non-linearity → Pooling → Filter Bank + non-linearity → Pooling → Filter Bank + non-linearity]
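A minimal sketch of this stack in PyTorch (my illustration; filter and channel counts are arbitrary, not LeNet's):

import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.Tanh(),    # filter bank + non-linearity (simple cells)
    nn.MaxPool2d(2),                  # pooling (complex cells)
    nn.Conv2d(8, 16, 5), nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 5), nn.Tanh(),
)
features = convnet(torch.randn(1, 1, 32, 32))   # e.g. a 32x32 grayscale image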
Y. LeCun
LeNet character recognition demo 1992
Running on an AT&T DSP32C (floating-point DSP, 20 MFLOPS)
Y. LeCun
Convolutional Network (LeNet5, vintage 1990)
Filters-tanh → pooling → filters-tanh → pooling → filters-tanh
Y. LeCun
ConvNets can recognize multiple objects
All layers are convolutional
The network performs simultaneous segmentation and recognition
Y. LeCun
Check Reader (AT&T 1995)
Check amount reader
ConvNet+Language Model
trained at the sequence level.
50% correct, 49% reject,
1% error (detectable later in the
process).
Fielded in 1996, used in many
banks in the US and Europe.
Processed an estimated 10% to
20% of all the checks written in
the US in the early 2000s.
[LeCun, Bottou, Bengio ICASSP1997]
[LeCun, Bottou, Bengio, Haffner 1998]
Y. LeCun
1996→2006: 2nd NN Winter! Few teams could train large NNs
Hardware was slow for floating point computation
Training a character recognizer took 2 weeks on a Sun or SGI workstation
A very small ConvNet by today’s standard (500,000 connections)
Data was scarce and NNs were data-hungry
No large datasets besides character and speech recognition
Interactive software tools had to be built from scratch
We wrote a NN simulator with a custom Lisp interpreter/compiler
SN [Bottou & LeCun 1988] → SN2 [1992] → Lush (open sourced in 2002).
Open sourcing wasn’t common in the pre-Internet days
The “black art” of NN training could not be communicated easily
SN/SN2/Lush gave us superpowers: tools shape research directions
Y. LeCun
Lessons learned #1
1.1: It’s hard to succeed with exotic hardware
Hardwired analog → programmable hybrid → digital
1.2: Hardware limitations influence research directions
It constrains what algorithm designers will let themselves imagine
1.3: Good software tools shape research and give superpowers
But require a significant investment
Common tools for Research and Development facilitate productization
1.4: Hardware performance matters
Fast turn-around is important for R&D
But high-end production models always take 2-3 weeks to train
1.5: When hardware is too slow, software is not readily available, or
experiments are not easily reproducible, good ideas can be abandoned.
The 2nd Neural Net Winter (1995-2005) & Spring (2006-2012)
The Lunatic Fringe and the Deep Learning Conspiracy
Y. LeCun
Semantic Segmentation with ConvNet for off-Road Driving
[Figures: Input image | Stereo Labels | Classifier Output (two examples)]
DARPA LAGR program 2005-2009
[Hadsell et al., J. of Field Robotics 2009]
[Sermanet et al., J. of Field Robotics 2009]
Y. LeCun
LAGR Video
Y. LeCun
Semantic Segmentation with ConvNets (33 categories)
Y. LeCun
FPGA ConvNet Accelerator: NeuFlow [Farabet 2011]
NeuFlow: Reconfigurable Dataflow architecture
Implemented on Xilinx Virtex6 FPGA
20 configurable tiles. 150GOPS, 10 Watts
Semantic Segmentation: 20 frames/sec at 320x240
Exploits the structure of convolutions
NeuFlow ASIC [Pham 2012]
150GOPS, 0.5 Watts (simulated)
Y. LeCun
Driving Cars with Convolutional Nets
MobilEye
NVIDIA
The Deep Learning Revolution
State of the Art
Y. LeCun
Deep ConvNets for Object Recognition (on GPU)
AlexNet [Krizhevsky et al. NIPS 2012], OverFeat [Sermanet et al. 2013]
1 to 10 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.
Y. LeCun
Error Rate on ImageNet
Depth inflation
(Figure: Anirudh Koul)
Y. LeCun
Deep ConvNets (depth inflation)
VGG
[Simonyan 2013]
GoogLeNet
[Szegedy 2014]
ResNet
[He et al. 2015]
DenseNet
[Huang et al 2017]
Y. LeCun
GOPS vs Accuracy on ImageNet vs #Parameters
[Canziani 2016]
ResNet50 and
ResNet100 are used
routinely in
production.
Each of the few billion photos uploaded to Facebook every day goes through a handful of ConvNets within 2 seconds.
Y. LeCun
Progress in Computer Vision
[He 2017]
Y. LeCun
Mask R-CNN: instance segmentation
[He, Gkioxari, Dollar, Girshick
arXiv:1703.06870]
ConvNet produces an object
mask for each region of
interest
Combined ventral and dorsal
pathways
Y. LeCun
RetinaNet, feature pyramid network
One-pass object detection
[Lin et al. ArXiv:1708.02002]
Y. LeCun
Mask-RCNN Results on COCO dataset
Individual
objects are
segmented.
Y. LeCun
Mask R-CNN Results on COCO test set
Y. LeCun
Real-Time Pose Estimation on Mobile Devices
Mask R-CNN
running on
Caffe2Go
Y. LeCun
Detectron: open source vision in PyTorch
https://github.com/facebookresearch/maskrcnn-benchmark
Y. LeCun
3D ConvNet for Medical Image Analysis
Segmentation of the femur from MR images
[Deniz et al. Nature 2018]
Y. LeCun
3D ConvNet for Medical Image Analysis
Y. LeCun
Applications of Deep Learning
Medical image analysis
Self-driving cars
Accessibility
Face recognition
Language translation
Virtual assistants*
Content Understanding for:
Filtering
Selection/ranking
Search
Games
Security, anomaly detection
Diagnosis, prediction
Science!
[Geras 2017]
[Mnih 2015]
[MobilEye]
[Esteva 2017]
Y. LeCun
Lessons learned #2
2.1: Good results are not enough
Making them easily reproducible also makes them credible.
2.2: Hardware progress enables new breakthroughs
General-Purpose GPUs should have come 10 years earlier!
But can we please have hardware that doesn’t require batching?
2.3: Open-source software platforms disseminate ideas
But making platforms that are good for research and production is hard.
2.4: Convolutional Nets will soon be everywhere
Hardware should exploit the properties of convolutions better
There is a need for low-cost, low-power ConvNet accelerators
Cars, cameras, vacuum cleaners, lawn mowers, toys, maintenance robots...
New DL Architectures
With different hardware/software requirements:
Memory-Augmented Networks
Dynamic Networks
Graph Convolutional Nets
Networks with Sparse Activations
Y. LeCun
Augmenting Neural Nets with a Memory Module
Recurrent net memory
Recurrent networks cannot remember things for very long
The cortex only remembers things for 20 seconds
We need a “hippocampus” (a separate memory module)
LSTM [Hochreiter 1997], registers
Memory networks [Weston et al. 2014] (FAIR), associative memory
Stacked-Augmented Recurrent Neural Net [Joulin & Mikolov 2014] (FAIR)
Neural Turing Machine [Graves 2014],
Differentiable Neural Computer [Graves 2016]
Y. LeCun
Differentiable Associative Memory
Used very widely in NLP
MemNN, Transformer Network, ELMO,
GPT, BERT, GPT2, GLoMO
Essentially a “soft” RAM or hash table
[Diagram: input (address) X → dot products with keys K_i → softmax → weighted sum of values V_i]
Y = ∑_i C_i V_i,  where  C_i = exp(K_i^T X) / ∑_j exp(K_j^T X)
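A minimal sketch of this soft lookup in PyTorch (my illustration; sizes are arbitrary):

import torch

def soft_memory_read(X, K, V):
    # X: query/address (d,); K: keys (n, d); V: values (n, m)
    C = torch.softmax(K @ X, dim=0)   # C_i = exp(K_i^T X) / sum_j exp(K_j^T X)
    return C @ V                      # Y = sum_i C_i V_i

K, V = torch.randn(100, 64), torch.randn(100, 32)
Y = soft_memory_read(torch.randn(64), K, V)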
Y. LeCun
Learning to synthesize neural programs for visual reasoning
https://research.fb.com/visual-reasoning-and-dialog-towards-natural-language-conversations-about-visual-data/
Y. LeCun
PyTorch: differentiable programming
Software 2.0:
The operations in a program are only partially specified
They are trainable parameterized modules.
The precise operations are learned from data, only the general structure
of the program is designed.
Dynamic computational graph
Automatic differentiation by recording a “tape” of operations and rolling it
backwards with the Jacobian of each operator.
Implemented in PyTorch 1.0, Chainer
Easy if the front-end language is dynamic and interpreted (e.g. Python)
Not so easy if we want to run without a Python runtime...
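A small sketch of tape-based autodiff over a dynamic, data-dependent graph in PyTorch (my illustration):

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

y = w * x                               # operations are recorded as they execute
steps = 0
while y.norm() < 10 and steps < 8:      # data-dependent control flow: graph built at run time
    y = torch.relu(y) + w * x
    steps += 1
y.sum().backward()                      # roll the tape backwards through each operator's Jacobian
print(w.grad)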
Y. LeCun
ConvNets on Graphs (fixed and data-dependent)
Graphs can represent: Natural
language, social networks, chemistry,
physics, communication networks...
Review paper: “Geometric deep learning: going
beyond euclidean data”, MM Bronstein, J Bruna, Y
LeCun, A Szlam, P Vandergheynst, IEEE Signal
Processing Magazine 34 (4), 18-42, 2017
[ArXiv:1611.08097]
Y. LeCun
Spectral ConvNets / Graph ConvNets
Regular grid graph
Standard ConvNet
Fixed irregular graph
Spectral ConvNet
Dynamic irregular graph
Graph ConvNet
IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/
Y. LeCun
Sparse ConvNets: for sparse voxel-based 3D data
ShapeNet competition results [ArXiv:1710.06104]
Winner: Submanifold Sparse ConvNet
[Graham & van der Maaten arXiv 1706.01307]
PyTorch: https://github.com/facebookresearch/SparseConvNet
Y. LeCun
Lessons learned #3
3.1: Dynamic networks are gaining in popularity (e.g. for NLP)
Dynamicity breaks many assumptions of current hardware
Can’t optimize the compute graph distribution at compile time.
Can’t do batching easily!
3.2: Large-Scale Memory-Augmented Networks...
...Will require efficient associative memory/nearest-neighbor search
3.3: Graph ConvNets are very promising for many applications
Say goodbye to matrix multiplications?
Say goodbye to tensors?
3.4: Large Neural Nets may have sparse activity
How to exploit sparsity in hardware?
What About (Deep)
Reinforcement Learning?
It works great 
 for games and virtual environments
Y. LeCun
Reinforcement Learning works fine for games
RL works well for games
Playing Atari games [Mnih 2013], Go
[Silver 2016, Tian 2018], Doom [Tian
2017], StarCraft...
RL requires too many trials.
100 hours to reach the performance that
a human can reach in 15 minutes on
Atari games [Hessel ArXiv:1710.02298]
RL often doesn’t really work in the real
world
FAIR open-source Go player: OpenGo
https://github.com/pytorch/elf
Y. LeCun
Pure RL is hard to use in the real world
Pure RL requires too many
trials to learn anything
it’s OK in a game
it’s not OK in the real world
RL works in simple virtual worlds that you can run faster than real time on many machines in parallel.
Anything you do in the real world can kill you
You can’t run the real world faster than real time
Y. LeCun
What are we missing to get to “real” AI?
What we can have
Safer cars, autonomous cars
Better medical image analysis
Personalized medicine
Adequate language translation
Useful but stupid chatbots
Information search, retrieval, filtering
Numerous applications in energy,
finance, manufacturing,
environmental protection, commerce,
law, artistic creation, games, ...
What we cannot have (yet)
Machines with common sense
Intelligent personal assistants
“Smart” chatbots
Household robots
Agile and dexterous robots
Artificial General Intelligence
(AGI)
How do Humans
and Animals Learn?
So quickly
Y. LeCun
Babies learn how the world works by observation
Largely by observation, with remarkably little interaction.
Photos courtesy of
Emmanuel Dupoux
Y. LeCun
Early Conceptual Acquisition in Infants [from Emmanuel Dupoux]
[Chart: approximate ages (0-14 months) at which infants acquire concepts in perception/production, physics, actions, objects, and social-communicative domains: face tracking, proto-imitation, emotional contagion, rational goal-directed actions, pointing, biological motion, false perceptual beliefs, helping vs hindering; stability/support, gravity/inertia, conservation of momentum, object permanence, solidity/rigidity, shape constancy, natural kind categories; crawling, walking]
Y. LeCun
Prediction is the essence of Intelligence
We learn models of the world by predicting
The Future:
Self-Supervised Learning
With massive amounts of data
and very large networks
Y. LeCun
Self-Supervised Learning
Predict any part of the input from any
other part.
Predict the future from the past.
Predict the future from the recent past.
Predict the past from the present.
Predict the top from the bottom.
Predict the occluded from the visible
Pretend there is a part of the input you
don’t know and predict that.
[Diagram: time axis with past, present, and future]
Y. LeCun
How Much Information is the Machine Given during Learning?
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar reward given once in a
while.
A few bits for some samples
Supervised Learning (icing)
The machine predicts a category or a few numbers
for each input
Predicting human-supplied data
10→10,000 bits per sample
Self-Supervised Learning (cake génoise)
The machine predicts any part of its input for any
observed part.
Predicts future frames in videos
Millions of bits per sample
Y. LeCun
Self-Supervised Learning: Filling in the Blanks
Y. LeCun
Self-Supervised Learning works well for text
Word2vec
[Mikolov 2013]
FastText
[Joulin 2016]
BERT
Bidirectional Encoder
Representations from
Transformers
[Devlin 2018]
Figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/
Y. LeCun
But it doesn’t really work for high-dim continuous signals
Video prediction:
Multiple futures are possible.
Training a system to make a single prediction results in “blurry” predictions: the average of all the possible futures
Y. LeCun
The Next AI Revolution
THE REVOLUTION WILL NOT BE SUPERVISED
(nor purely reinforced)
With thanks
To
Alyosha Efros
Learning Predictive Models
of the World
Learning to predict, reason, and plan,
Learning Common Sense.
Y. LeCun
Planning Requires Prediction
To plan ahead, we simulate the world
[Diagrams: (left) agent-world loop: the agent receives percepts from the world, maintains an agent state, and emits actions/outputs evaluated by an objective/cost; (right) agent containing a world simulator, actor, and critic, producing action proposals, predicted percepts, predicted cost, inferred world state, and actor state]
Y. LeCun
Training the Actor with Optimized Action Sequences
1. Find action sequence through optimization
2. Use sequence as target to train the actor
Over time we get a compact policy that requires no run-time optimization
[Diagram: agent unrolled over time: perception feeds a repeated chain of world simulator / actor / critic blocks]
Y. LeCun
The Hard Part: Prediction Under Uncertainty
Invariant prediction: The training samples are merely representatives of a
whole set of possible outputs (e.g. a manifold of outputs).
[Diagram: percepts and hidden state of the world]
Y. LeCun
Faces “invented” by a GAN (Generative Adversarial Network)
Random vector → Generator Network → output image [Goodfellow NIPS 2014]
[Karras et al. ICLR 2018] (from NVIDIA)
Y. LeCun
Generative Adversarial Networks for Creation
[Sbai 2017]
Y. LeCun
Self-supervised Adversarial Learning for Video Prediction
Our brains are “prediction machines”
Can we train machines to predict the future?
Some success with “adversarial training”
[Mathieu, Couprie, LeCun arXiv:1511.05440]
But we are far from a complete solution.
Y. LeCun
Predicting Instance Segmentation Maps
[Luc, Couprie, LeCun, Verbeek ECCV 2018]
Mask R-CNN Feature Pyramid Network backbone
Trained for instance segmentation on COCO
Separate predictors for each feature level
Y. LeCun
Predictions
Y. LeCun
Long-term predictions (10 frames, 1.8 seconds)
Y. LeCun
Using Forward Models to Plan (and to learn to drive)
Overhead camera on
highway.
Vehicles are tracked
A “state” is a pixel
representation of a
rectangular window
centered around each
car.
Forward model is
trained to predict how
every car moves relative
to the central car.
steering and acceleration
are computed
Y. LeCun
Forward Model Architecture
[Diagram: encoder-decoder forward model, with a latent-variable predictor and expander whose output is added (+) between encoder and decoder]
Y. LeCun
Predictions
Y. LeCun
Learning to Drive by Simulating it in your Head
Feed initial state
Sample latent variable
sequences of length 20
Run the forward model
with these sequences
Backpropagate gradient of
cost to train a policy
network.
Iterate
No need for planning at
run time.
Y. LeCun
Adding an Uncertainty Cost (doesn’t work without it)
Estimates epistemic
uncertainty
Samples multiple dropout masks in the forward model
Computes variance of
predictions
(differentiably)
Train the policy network
to minimize the
lane&proximity cost plus
the uncertainty cost.
Avoids unpredictable
outcomes
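A generic sketch of this kind of dropout-based epistemic-uncertainty estimate (my illustration, not the actual forward model):

import torch
import torch.nn as nn

forward_model = nn.Sequential(           # stand-in for the trained forward model
    nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 64))
forward_model.train()                    # keep dropout active when predicting

state = torch.randn(1, 64)               # hypothetical state encoding
preds = torch.stack([forward_model(state) for _ in range(20)])   # several dropout masks
uncertainty_cost = preds.var(dim=0).mean()   # variance across samples, differentiable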
Y. LeCun
Driving an Invisible Car in “Real” Traffic
Y. LeCun
Lessons learned #4
4.1: Self-Supervised learning is the future
Networks will be much larger than today, perhaps sparse
4.2: Reasoning/inference through minimization
4.3: DL hardware use cases
A. DL R&D: 32-bit FP, high parallelism, fast inter-node communication,
flexible hardware and software.
B. Routine training: 16-bit FP, some parallelism, moderate cost.
C. Inference in data centers: 8 or 16-bit FP, low latency, low power consumption, standard interface.
D. Inference on embedded devices: low cost, low power, exotic number systems?
AR/VR, consumer items, household robots, toys, manufacturing, monitoring,...
Y. LeCun
Speculations
Spiking Neural Nets, and neuromorphic architectures?
I’m skeptical

No spike-based NN comes close to state of the art on practical tasks
Why build chips for algorithms that don’t work?
Exotic technologies?
Resistor/Memristor matrices, and other analog implementations?
Conversion to and from digital kills us.
No possibility of hardware multiplexing
Spintronics?
Optical implementations?
Thank you