Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Caffe’s expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine, then deploy to commodity clusters or mobile devices.

Caffe’s extensible code fosters active development. In Caffe’s first year, it has been forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors the framework tracks the state of the art in both code and models.

Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference and 4 ms/image for learning. We believe that Caffe is the fastest convnet implementation available.

Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Join our community of brewers on the caffe-users group and GitHub.
This tutorial is designed to equip researchers and developers with the tools and know-how needed to incorporate deep learning into their work. Both the ideas and implementation of state-of-the-art deep learning models will be presented. While deep learning and deep features have recently achieved strong results in many tasks, a common framework and shared models are needed to advance further research and applications and reduce the barrier to entry. To this end we present the Caffe framework, public reference models, and working examples for deep learning. Join our tour from the 1989 LeNet for digit recognition to today’s top ILSVRC14 vision models. Follow along with do-it-yourself code notebooks. While the tutorial focuses on vision, the techniques covered are general.
1. DIY DEEP LEARNING
WITH CAFFE
Kate Saenko
UMASS Lowell
O P E N
D A T A
S C I E N C E
C O N F E R E N C E_
BOSTON 2015
@opendatasci
2. Caffe: Open Source Deep Learning Library
Tutorials by Evan Shelhamer, Jon Long,
Sergio Guadarrama, Jeff Donahue, Ross
Girshick, and Yangqing Jia
caffe.berkeleyvision.org
github.com/BVLC/caffe
Based on tutorials by the Caffe creators at UC Berkeley
3. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev
Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
Caffe crew
...plus the
cold-brew
and over 100 open source contributors!
4. See also: upcoming Caffe Tutorial
at CVPR 2015 in Boston, June 7
http://tutorial.caffe.berkeleyvision.org/
Caffe Tutorial
Deep learning framework tutorial by
Evan Shelhamer, Jeff Donahue, Jon Long, Yangqing Jia and Ross Girshick
5. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
6. Why Deep Learning?
End-to-End Learning for Many Tasks
Pinterest used deep learning-based visual search to find
pinned images of bags similar to the one on the left.
Image Credit: Pinterest
7. Why Deep Learning?
Compositional Models
Learned End-to-End
Hierarchy of Representations
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word
concrete abstract
learning
figure credit Yann LeCun, ICML ‘13 tutorial
8. Why Deep Learning?
Compositional Models
Learned End-to-End
figure credit Yann LeCun, ICML ‘13 tutorial
Back-propagation jointly learns
all of the model parameters to
optimize the output for the task.
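A minimal sketch of the idea in plain NumPy, not Caffe's implementation (the toy network, synthetic data, and learning rate here are invented for illustration): the error measured at the output is propagated backward to jointly adjust every parameter.

```python
import numpy as np

# Tiny two-layer net trained end-to-end by back-propagation (illustrative
# sketch only -- Caffe implements the same idea layer by layer in C++/CUDA).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                 # inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # a simple synthetic target

W1 = rng.normal(scale=0.1, size=(4, 8))      # hidden layer weights
W2 = rng.normal(scale=0.1, size=(8, 1))      # output layer weights

for step in range(1000):
    # forward: each layer makes output given its input
    h = np.maximum(0.0, X @ W1)              # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid output
    # backward: propagate the error from the top down to every parameter
    grad_logit = (p - y[:, None]) / len(X)   # log-loss gradient w.r.t. logits
    grad_W2 = h.T @ grad_logit
    grad_h = grad_logit @ W2.T
    grad_h[h <= 0] = 0.0                     # gate gradients through the ReLU
    grad_W1 = X.T @ grad_h
    # jointly update all parameters for the task
    W2 -= 0.5 * grad_W2
    W1 -= 0.5 * grad_W1

accuracy = ((p > 0.5).ravel() == y).mean()
print(accuracy)
```

Both layers improve together because each receives its share of the output error, which is the "learned end-to-end" property the slide describes.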
12. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
13. What is Caffe?
Prototype Train Deploy
Open framework, models, and worked examples
for deep learning
- 1.5 years
- 300+ citations, 100+ contributors
- 2,000+ forks, >1 pull request / day average
- focus has been vision, but branching out:
sequences, reinforcement learning, speech + text
14. What is Caffe?
Prototype Train Deploy
Open framework, models, and worked examples
for deep learning
- Pure C++ / CUDA architecture for deep learning
- Command line, Python, MATLAB interfaces
- Fast, well-tested code
- Tools, reference models, demos, and recipes
- Seamless switch between CPU and GPU
16. Caffe offers the
● model definitions
● optimization settings
● pre-trained weights
so you can start right away.
The BVLC models are
licensed for unrestricted use.
The community shares
models in our Model Zoo.
Reference Models
GoogLeNet: ILSVRC14 winner
17. Brewing by the Numbers...
● Speed with Krizhevsky's 2012 model:
o 2 ms / image on K40 GPU
o <1 ms inference with Caffe + cuDNN v2 on Titan X
o 72 million images / day with batched IO
o 8-core CPU: ~20 ms/image
● 9k lines of C++ code (20k with tests)
18. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
19. Image Tagging:
Share a Sip of Brewed Models
demo.caffe.berkeleyvision.org
demo code open-source and bundled
21. Visual Style Recognition
Other Styles:
Vintage
Long Exposure
Noir
Pastel
Macro
… and so on.
Karayev et al. Recognizing Image Style. BMVC14. Caffe fine-tuning example.
Demo online at http://demo.vislab.berkeleyvision.org/ (see Results Explorer).
[ Image-Style ]
22. Object Detection
R-CNN: Regions with Convolutional Neural Networks
http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/detection.ipynb
Full R-CNN scripts available at
https://github.com/rbgirshick/rcnn
Ross Girshick et al.
Rich feature hierarchies for accurate
object detection and semantic
segmentation. CVPR14.
23. Recurrent nets (RNNs) and Long Short-Term Memory (LSTM) networks
are sequential models
- video
- language
- dynamics
learned by back-propagation through time.
Sequences
Jeff Donahue et al.
LRCN: Long-term Recurrent Convolutional Network
- activity recognition
- image captioning
- video captioning
arXiv and web page & PR
24. Youtube Captioning
Venugopalan, Xu, Donahue,
Rohrbach, Mooney, Saenko;
Translating Videos to Natural
Language Using Deep Recurrent
Neural Networks,
NAACL, 2015
25. Fully convolutional networks for pixel prediction
applied to semantic segmentation
- end-to-end learning
- efficiency in inference and learning
175 ms per-image prediction
- multi-modal, multi-task
Image Segmentation
Further applications
- depth estimation
- denoising
Jon Long* & Evan Shelhamer*,
Trevor Darrell
arXiv and pre-release
26. Deep Visuomotor Control
Sergey Levine* & Chelsea Finn*,
Trevor Darrell, and Pieter Abbeel
tech report http://tinyurl.com/visuomotor
example experiments
Robotic manipulation
27. Embedded Caffe
Caffe on the NVIDIA Jetson TK1 mobile board
- 10 watts of power
- inference at 35 ms per image
- no need to change the code
28. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
38. Convolutional Networks: 1989
LeNet: a layered model composed of convolution and subsampling
operations followed by a holistic representation and ultimately a
classifier for handwritten digits. [ Yann LeCun; LeNet ]
39. Convolutional Nets: 2012
AlexNet: a layered model composed of convolution,
subsampling, and further operations followed by a holistic
representation and all-in-all a landmark classifier on
ILSVRC12. [ AlexNet ]
+ data
+ gpu
+ non-saturating nonlinearity
+ regularization
40. Convolutional Nets: 2014
ILSVRC14 Winners: ~6.6% Top-5 error
- GoogLeNet: composition of multi-scale dimension-reduced modules (pictured)
- VGG: 16 layers of 3x3 convolution interleaved with
max pooling + 3 fully-connected layers
+ depth
+ data
+ dimensionality reduction
41. [Zeiler-Fergus]
1st layer filters
image patches that strongly activate 1st layer filters
Why Deep Learning?
The Unreasonable Effectiveness of Learning Deep Features
44. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
45. Net
name: "dummy-net"
layers { name: "data" …}
layers { name: "conv" …}
layers { name: "pool" …}
… more layers …
layers { name: "loss" …}
● A network is a set of layers
connected as a DAG:
LogReg ↑
LeNet →
ImageNet, Krizhevsky 2012 →
● Caffe creates and checks the net from
the definition.
● Data and derivatives flow through the
net as blobs – an array interface
46. Forward / Backward the essential Net computations
Caffe models are complete machine learning systems for inference and learning.
The computation follows from the model definition. Define the model and run.
47. Layer
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
}
name, type, and the
connection structure
(input blobs and
output blobs)
layer-specific
parameters
* Nets + Layers are defined by protobuf schema
● Every layer type defines
- Setup
- Forward
- Backward
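As a sanity check on the parameters above, the spatial output size of a convolution follows from input size, kernel size, stride, and padding. A small helper, assuming a 28x28 MNIST input (which this conv1 matches in Caffe's LeNet example):

```python
# Spatial output size of a convolution layer, as a sanity check on the
# conv1 definition above: kernel_size 5, stride 1, no padding.
def conv_out_size(in_size, kernel_size, stride=1, pad=0):
    return (in_size + 2 * pad - kernel_size) // stride + 1

# Assuming a 28x28 MNIST input (invented here for illustration):
out = conv_out_size(28, kernel_size=5, stride=1)
print(out)  # 24 -> the "conv1" top blob is N x 20 x 24 x 24
```

The same formula gives 55 for CaffeNet's conv1 (227 input, 11x11 kernel, stride 4), matching the parameter shapes on the Data slide.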
48. Data
Number x K Channel x Height x Width
256 x 3 x 227 x 227 for ImageNet train input
Blobs are 4-D arrays for storing and
communicating information.
● hold data, derivatives, and parameters
● lazily allocate memory
● shuttle between CPU and GPU
Blob
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
… definition …
top
blob
bottom
blob
Parameter: Convolution Weight
N Output x K Input x Height x Width
96 x 3 x 11 x 11 for CaffeNet conv1
Parameter: Convolution Bias
96 x 1 x 1 x 1 for CaffeNet conv1
49. Blobs provide a unified memory interface.
Reshape(num, channel, height, width)
- declare dimensions
- make SyncedMem, but only lazily allocate
Blob
cpu_data(), mutable_cpu_data()
- host memory for CPU mode
gpu_data(), mutable_gpu_data()
- device memory for GPU mode
{cpu,gpu}_diff(), mutable_{cpu,gpu}_diff()
- derivative counterparts to data methods
- easy access to data + diff in forward / backward
SyncedMem
allocation + communication
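A plain-Python stand-in for the Blob idea (names and details simplified; the real Blob also synchronizes host and device copies through SyncedMem): dimensions are declared up front, but buffers are only allocated on first access.

```python
import numpy as np

# Sketch of a Blob: declare dimensions with reshape, allocate memory only
# on first access (the "lazy" part), and keep a parallel diff array for
# derivatives. Illustrative only -- not Caffe's actual class.
class Blob:
    def __init__(self):
        self.shape = None
        self._data = None
        self._diff = None

    def reshape(self, num, channels, height, width):
        self.shape = (num, channels, height, width)
        self._data = None  # drop any old buffer; reallocate lazily
        self._diff = None

    def data(self):
        if self._data is None:  # lazy allocation on first access
            self._data = np.zeros(self.shape, dtype=np.float32)
        return self._data

    def diff(self):
        if self._diff is None:  # derivative counterpart to data
            self._diff = np.zeros(self.shape, dtype=np.float32)
        return self._diff

b = Blob()
b.reshape(256, 3, 227, 227)  # ImageNet train input shape from the slide
x = b.data()                 # memory is allocated here, not at reshape
print(x.shape)
```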
50. Solving: Training a Net
Optimization, like model definition, is configuration.
train_net: "lenet_train.prototxt"
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
max_iter: 10000
snapshot_prefix: "lenet_snapshot"
All you need to run
things on the GPU.
> caffe train -solver lenet_solver.prototxt -gpu 0
Stochastic Gradient Descent (SGD) + momentum ·
Adaptive Gradient (ADAGRAD) · Nesterov’s Accelerated Gradient (NAG)
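The update rule these solver settings configure can be sketched in a few lines of NumPy (an illustrative stand-alone version with a toy objective, not Caffe's implementation):

```python
import numpy as np

# SGD with momentum and weight decay, as configured by base_lr, momentum,
# and weight_decay above:
#   V <- momentum * V - lr * (grad + weight_decay * W);  W <- W + V
def sgd_momentum_step(W, grad, V, lr=0.01, momentum=0.9, weight_decay=0.0005):
    V[:] = momentum * V - lr * (grad + weight_decay * W)
    W[:] = W + V
    return W, V

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient at w is w itself.
W = np.array([1.0, -2.0])
V = np.zeros_like(W)
for _ in range(500):
    sgd_momentum_step(W, W.copy(), V)
print(W)  # approaches the minimum at the origin
```

In Caffe the solver applies this update to every parameter blob; only the gradient computation (the backward pass) differs per model.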
51. Setup: run once for initialization.
Forward: make output given input.
Backward: make gradient of output
- w.r.t. bottom
- w.r.t. parameters (if needed)
Reshape: set dimensions.
Layer Protocol
Layer Development Checklist
Compositional Modeling
The Net’s forward and backward passes
are composed of the layers’ steps.
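The protocol can be sketched as a plain-Python class (illustrative only; real Caffe layers implement this interface in C++/CUDA, or via caffe.Layer as on the next slide):

```python
import numpy as np

# The layer protocol as a plain-Python inner-product layer. The Net runs
# each layer's forward in order and backward in reverse; this class is a
# sketch, not Caffe's API.
class InnerProduct:
    def setup(self, num_output):        # run once for initialization
        self.num_output = num_output
        self.W = None

    def reshape(self, bottom_shape):    # set dimensions from the input
        n, k = bottom_shape
        if self.W is None:
            self.W = np.random.default_rng(0).normal(
                scale=0.1, size=(k, self.num_output))
        return (n, self.num_output)

    def forward(self, bottom):          # make output given input
        self.bottom = bottom
        return bottom @ self.W

    def backward(self, top_diff):       # gradients w.r.t. params and bottom
        self.W_diff = self.bottom.T @ top_diff
        return top_diff @ self.W.T

layer = InnerProduct()
layer.setup(num_output=3)
layer.reshape((4, 5))
top = layer.forward(np.ones((4, 5)))
bottom_diff = layer.backward(np.ones((4, 3)))
print(top.shape, bottom_diff.shape)
```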
52. Layer Protocol
== Class Interface
Define a class in C++ or Python to
extend Layer.
Include your new layer type in a
network and keep brewing.
layer {
  type: 'Python'
  python_param {
    module: 'layers'
    layer: 'EuclideanLoss'
  }
}
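What such a EuclideanLoss layer computes can be shown with plain NumPy (a stand-in for illustration; a real version subclasses caffe.Layer and reads and writes the top and bottom blobs' data and diff):

```python
import numpy as np

# The math a EuclideanLoss layer implements:
#   forward:  loss = sum((pred - label)^2) / (2N)
#   backward: bottom diff = (pred - label) / N
def euclidean_loss_forward(pred, label):
    diff = pred - label
    loss = (diff ** 2).sum() / (2.0 * pred.shape[0])
    return loss, diff

def euclidean_loss_backward(diff):
    return diff / diff.shape[0]

# Toy inputs invented for illustration:
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
label = np.array([[1.0, 0.0], [0.0, 4.0]])
loss, diff = euclidean_loss_forward(pred, label)
print(loss)  # (0 + 4 + 9 + 0) / (2 * 2) = 3.25
```

Swapping this loss in for a softmax turns the same architecture from a classifier into a regressor, which is the point the notes make below.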
54. Recipe for Brewing
● Convert the data to Caffe-format
o lmdb, leveldb, hdf5 / .mat, list of images, etc.
● Define the Net
● Configure the Solver
● caffe train -solver solver.prototxt -gpu 0
● Examples are your friends
o caffe/examples/mnist,cifar10,imagenet
o caffe/examples/*.ipynb
o caffe/models/*
55. WHY DEEP LEARNING?
WHAT IS CAFFE?
APPLICATIONS OF CAFFE
NEURAL NETWORKS IN A SIP
CAFFE INTRO
STEP-BY-STEP EXAMPLES
58. Latest Roast
Parallelism: open pull requests
- synchronous SGD #2114
- data parallelism
- ~linear speedup on multiple GPUs
for computation >> communication
- Flickr + NVIDIA collaboration on
distributed opt + asynchronous SGD
Python Net Specification #2086
New MATLAB Interface #2505
These will be bundled in the next release shortly.
59. Help Brewing
Documentation
- tutorial documentation
- worked examples
Modeling, Usage, and Install
- caffe-users group
- gitter.im chat
Caffe @ CVPR15
Convolutional Nets
- CS231n online convnet class
by Andrej Karpathy and Fei-Fei Li.
- Deep Learning for Vision
tutorial from CVPR ‘14.
60. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev
Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
Thanks to the Caffe crew
...plus the
cold-brew
and open source contributors!
61. Acknowledgements
Thank you to the Berkeley Vision and Learning Center Sponsors.
Thank you to NVIDIA
for GPU donation and
collaboration on cuDNN.
Thank you to our 75+
open source contributors
and vibrant community.
Thank you to A9 and AWS
for a research grant for Caffe dev
and reproducible research.
62. References
[ DeCAF ] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep
convolutional activation feature for generic visual recognition. ICML, 2014.
[ R-CNN ] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection
and semantic segmentation. CVPR, 2014.
[ Zeiler-Fergus ] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV, 2014.
[ LeNet ] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. IEEE, 1998.
[ AlexNet ] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural
networks. NIPS, 2012.
[ OverFeat ] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional networks. ICLR, 2014.
[ Image-Style ] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, H. Winnemoeller.
Recognizing Image Style. BMVC, 2014.
[ Karpathy14 ] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video
classification with convolutional neural networks. CVPR, 2014.
[ Sutskever13 ] I. Sutskever. Training Recurrent Neural Networks.
PhD thesis, University of Toronto, 2013.
[ Chopra05 ] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to
face verification. CVPR, 2005.
Editor's notes
Evan
over 100 open source contributors
Yangqing
Rich set of tasks that can be done through learning
Can do them better than before without individual solutions
Two essential parts:
compositional, learn hierarchical representations (pixels->textures->parts->objects) etc.
through learning we build layers of abstraction, e.g. recognizing your friend
compositionality helps to break down the complex task
Need to learn layers of representations
End-to-end learning jointly learns all parameters for the task
Given current model’s predictions, can measure error and adjust the model to improve it
Over the course of many iterations the model improves
At the output we know if it’s right or wrong, at the lower layer we propagate the error from the top
this comic looks at why AI problems are hard: what is the issue?
If we want to know where the photo was taken, it is a simple problem: look it up
But if we want to know what is in it, it is hard
Perception is complicated; the answer is not just a lookup of individual pixel values
Yet Flickr did it in a weekend! It tells you where a photo was taken and whether it shows a bird
A few years ago it would have been a whole research project; now it's a side project we can do for fun
(point out basic steps, each step leads to next steps, finally get answer)
Yangqing
Caffe is not only a library for deep learning but also tests, recipes, and models
Can get started from the work of others
Run it wherever you want, start on laptop, send to high end GPU, cluster or cloud
Model is independent of the processor
Changes are contributed more than once a day on average
used in academic research labs, also in industry and start ups, enthusiasts
Get to take advantage of community expertise
The key part of Caffe framework is the reference models
We offer model definitions, settings for optimizing the models, and pretrained weights so you can start right away
In addition to official BVLC models the community shares in model zoo
Good way to see latest research results coming out
The time from being given an image to producing a classification output is less than 1 ms
This translates into 72 million images/day
On CPU, it is still fast
Yangqing
We have an online demo that you could easily deploy
Researchers at MIT made a scene recognizer
Can even recognize how the image is portrayed
All these applications are using the same model, just trained on different data
We will only look at applications in vision but networks can be used for many other tasks (speech, text, robotics)
Learning models end-to-end not only for vision but also for control
Robot learns to do tasks in real time by going directly from camera inputs and joint angles
Instead of being tied to a workstation GPU there are also embedded solutions
Caffe runs out of the box
Yangqing
The human brain is composed of billions of cells called neurons
Neurons are connected in a complex network, each receives electrical signals on ‘input wires’ and sends out signal on ‘output wires’
Convolutional networks work by applying filters
For each patch we can look at strongest response and summarize it
The whole model is trained by measuring errors on supervised data
In 2012 we have the same fundamental steps but more data, deeper models, more computation, and some technical improvements
the structure is more complicated but has the same fundamental pieces
we can define, experiment with, and deploy all these models in Caffe because they're made of the same pieces
Yangqing
Evan
What does a layer do?
Forward is used to make output
Backward is used during learning
Reshape determines the layer dimensions given input dimensions, filter size, stride etc., and checks for mistakes
you can make your own layers! we include a checklist of what you need to define
example code for a layer defined in Python
what defines the models you’re learning is the loss
if we take a simple linear model and add a softmax we get a classifier
if we add euclidean loss we get a regression
the same architecture can learn different tasks
can extend to new tasks by defining new losses
General outline for brewing a model in Caffe
We include many examples for defining models and optimization
include pretrained weights/optimizations settings
Yangqing
Take the model powering the online demo, look at an example of running it for inference on an example image
This demo is more than 1ms per image but can get better efficiency by batching up images
Take a tour of features, show existing structure of the model
define the network in Python but look at the generated prototxt
Caffe models are defined by text, which has pros and cons
instead of writing the verbose model format in the schema, we have a succinct way to define it in Python, layer by layer
look at the prototxt to point out layers
the model is just a set of layers and their interconnections
also need settings for optimization, models are sensitive to all these settings
Start with classic logistic regression and extend it to a nonlinear model
Classic logistic regression predicts a binary output
We will learn a classifier for vectors
generate synthetic data with 2 useful features and 2 noise features
use scikit-learn to learn a classifier
what is the point? we can extend the model and make it more advanced by adding a layer of parameters that learns a nonlinear model
even logistic regression can be seen as a network but we can extend it to be a layered network
we add a hidden layer of dimension 40
if you had done good old logistic regression there is not next step
with caffe you have a way to improve on your results
Fine-tuning rules, since most people don't have an ImageNet's worth of data
this is what you will most likely do
a strength of deep models is they can learn incrementally and adapt to new tasks
we can take lots of data, learn a model on it, and use it to initialize for a new task
Example: a student from South Korea took 10 minutes to place in the top 10 of a cats-vs-dogs competition using a pretrained Caffe model for object recognition
we need a new output layer
ImageNet had a 1000-way output layer
in this task we have 20 labels so we will make a new layer
give it a new name to signal to Caffe that this is a new layer to be learned
start with a 1000 object classifier network and learn to recognize image style e.g. romantic
the learning plots show green (from scratch) and blue (fine-tuned); the blue does remarkably better
underscore how successful finetuning has been even across modalities
important in practice because we don’t have infinite data
important changes on the way for merging into official Caffe
these will all come out shortly in the next release
Join the community for help!
The Caffe crew are attending CVPR ‘15 and giving a tutorial.