Image captioning is the task of generating a textual description of an image. It combines Natural Language Processing and Computer Vision to generate the captions. As in the famous “finger pointing to the moon”, automated image captioning requires the ability to discern what is really going on in a scene and to generate a fluent description of the action taking place. In this talk we present the underlying mechanics of object detection and language generation using Convolutional and Recurrent Neural Networks.
2. Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise ready solutions based on Open Source tech;
● Expertise:
○ DevOps
○ Cloud
○ BigData and many more...
3. This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/
4. Outline
1. Task introduction
2. Object recognition
3. Language generation
4. Putting all together
5. Improving performance
6. Beyond captioning: deep image search
7. Neural object recognition
A solved problem: Convolutional Neural Networks do the trick.
A CNN is an architecture specialized in finding topological invariants in its input.
It finds relationships between atomic elements and infers higher abstractions.
It is highly resistant to noise and spatial transformations.
It learns automatically which features are relevant to extract from an input.
Not limited to images: CNNs can be applied to text, audio, etc.
8. An image as integers
A handwritten “8” can be represented as a matrix of integers:
● 0 for blank
● 1-255 for grayscale values, white to black
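As a minimal sketch (toy values, not a real scanned digit), such a matrix is just a 2D integer array:

```python
import numpy as np

# A toy 5x5 grayscale crop: 0 = blank, up to 255 = fully black.
digit = np.array([
    [  0,  80, 255,  80,   0],
    [  0, 200,   0, 200,   0],
    [  0,  80, 255,  80,   0],
    [  0, 200,   0, 200,   0],
    [  0,  80, 255,  80,   0],
], dtype=np.uint8)

print(digit.shape)  # (5, 5)
print(digit.min(), digit.max())  # 0 255
```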
9. Architecture of a ConvNet
[Diagram: two “Filter Convolution + ReLU → Max Pooling” stages, followed by a Fully Connected layer]
1. Convolution
2. Non Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully Connected Layer)
10. Convolution intuition
Let’s multiply a sliding matrix (the “brushing filter”) with our input matrix.
With the right filter matrix, for example, this operation performs edge detection.
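This can be sketched in a few lines of numpy; the edge-detection kernel below is a classic Laplacian-style filter (illustrative, not from the slides). Note that CNN “convolution” is usually cross-correlation (no kernel flip), which makes no difference for this symmetric kernel:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the image,
    multiplying element-wise and summing at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Laplacian-style edge-detection kernel: its entries sum to zero,
# so flat regions produce 0 and intensity jumps light up.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

flat = np.ones((4, 4))
print(convolve2d(flat, edge_kernel))  # 2x2 output, all zeros
```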
11. Convolution in CNN
Each newly generated image is called a “channel”. A common RGB image has 3 channels.
Channels hold different perspectives on the image.
We start with random filters and tune these matrices as part of our training.
We end up with filters that have learned the perspectives of interest.
14. Max Pooling 1/2
After this, we downsample the image by “hashing” it to fewer values. For example, with a 2x2 window and stride 2:

Input:     Max pooling:   Sum pooling:
1 1 2 4    6 8            13 21
5 6 7 8    3 4             8  8
3 2 1 0
1 2 3 4

● Max: pick only the highest element
● Sum: sum together all the elements
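A small numpy sketch reproduces the numbers above (the `pool2d` helper is ours, not from any library):

```python
import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    """Downsample x by applying op (max, sum, ...) to each window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

print(pool2d(x, op=np.max))  # [[6. 8.] [3. 4.]]
print(pool2d(x, op=np.sum))  # [[13. 21.] [ 8.  8.]]
```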
16. Fully connected layer
After a couple of “convolve, ReLU and pool” cycles, we may have 128 channels
of 14x14-pixel images.
Concatenate and reshape them into a linear array of 25088 cells.
Feed it to a feed-forward neural network that outputs our classes.
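In numpy the reshape-and-classify step looks like this (random weights and a hypothetical 10-class output, purely to show the shapes):

```python
import numpy as np

# Pretend output of the conv/pool stack: 128 channels of 14x14 maps.
features = np.random.rand(128, 14, 14)

flat = features.reshape(-1)  # concatenate into one linear array
print(flat.shape)            # (25088,) -- i.e. 128 * 14 * 14

# A fully connected layer is just a matrix multiply plus a bias.
num_classes = 10             # hypothetical number of output classes
W = np.random.rand(num_classes, 25088)
b = np.zeros(num_classes)
logits = W @ flat + b
print(logits.shape)          # (10,) -- one score per class
```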
17. CNN demo time
Real time web handwritten digit recognition
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of “famous” nets that can be freely downloaded and used off the shelf,
like ResNet, which reaches an error rate of about 3.6% on the 1,000-category ImageNet benchmark.
19. Why MLP suffers
The Multi-Layer Perceptron can actually classify images as plain arrays of pixels.
But it fails if we move and/or rotate the image.
This is because it lacks support for learning the invariant topological properties
that are preserved when the image goes through a spatial transformation.
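A tiny demo of why: shift a one-pixel image by a single column and, to an MLP that sees only flat vectors, the two inputs share nothing at all (toy example of ours):

```python
import numpy as np

# A 5x5 image with one bright pixel, and the same image shifted right by one.
img = np.zeros((5, 5))
img[2, 1] = 1.0
shifted = np.roll(img, 1, axis=1)

# An MLP sees both as flat vectors -- and they have zero overlap.
a, b = img.reshape(-1), shifted.reshape(-1)
print(np.dot(a, b))  # 0.0: identical content, orthogonal inputs
```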
20. Language generation with Recurrent Networks
Language generation is a serial task. We generate words one after another.
This is well modeled by Recurrent Neural Cells: a neuron that uses itself over and
over again to accept serial inputs, outputting each time a new value.
21. Words as integers without embedding
Vocabulary of words.
V = [‘fight’, ‘kill’, ‘queen’, ‘king’, ‘man’, ‘woman’, ‘love’, ...]
“One-hot vector” encoding representation of single words.
‘fight’ = [1 0 0 0 0 0 0 …]
‘kill’ = [0 1 0 0 0 0 0 …]
‘queen’ = [0 0 1 0 0 0 0 …]
Can correlate documents (TF-IDF), but can’t correlate single words to each other:
“I fight the king” = [1 0 0 1 0 0 0 …]
“fight the tyranny” = [1 0 0 0 0 0 0 …]
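The encoding above can be sketched directly (the `one_hot` and `bag_of_words` helpers are ours; out-of-vocabulary words like “I” and “the” are simply dropped):

```python
import numpy as np

V = ['fight', 'kill', 'queen', 'king', 'man', 'woman', 'love']

def one_hot(word):
    """A vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(V), dtype=int)
    vec[V.index(word)] = 1
    return vec

def bag_of_words(sentence):
    """Sum the one-hot vectors of the in-vocabulary words of a sentence."""
    vec = np.zeros(len(V), dtype=int)
    for w in sentence.split():
        if w in V:
            vec += one_hot(w)
    return vec

print(one_hot('fight'))                  # [1 0 0 0 0 0 0]
print(bag_of_words('I fight the king'))  # [1 0 0 1 0 0 0]
```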
22. Words as floats with vector embedding
Word embedding.
Fixed-length, real-valued vector encoding representation of single words.
Close concepts have close vectors.
‘fight’ = [0.17 0.53 0.89 0.03 0.00 0.54 0.11 ]
‘kill’ = [0.17 0.53 0.91 0.06 0.00 0.54 0.12 ]
‘queen’ = [0.22 0.45 0.13 0.53 0.90 0.41 0.00 ]
Vector operations yield coherent results: king - man + woman ≈ queen
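Using the toy vectors from the slide (illustrative values, not a trained model), “close concepts have close vectors” can be checked with cosine similarity:

```python
import numpy as np

# Toy embeddings copied from the slide -- not from a trained model.
emb = {
    'fight': np.array([0.17, 0.53, 0.89, 0.03, 0.00, 0.54, 0.11]),
    'kill':  np.array([0.17, 0.53, 0.91, 0.06, 0.00, 0.54, 0.12]),
    'queen': np.array([0.22, 0.45, 0.13, 0.53, 0.90, 0.41, 0.00]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 'fight' sits much closer to 'kill' than to 'queen'.
print(cosine(emb['fight'], emb['kill']) > cosine(emb['fight'], emb['queen']))  # True
```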
23. How language is generated
[Diagram: an unrolled RNN. Input words “What”, “is”, “the” are turned into input embeddings x1, x2, x3; hidden states h1, h2, h3 are computed from the embeddings (weights Wxh) and the previous hidden state (weights Whh); each step outputs a likelihood y1, y2, y3 (weights Why) over the target words “is”, “the”, “problem”.]
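One time step of the diagram above can be written out in numpy (random untrained weights and toy sizes, just to show the Wxh / Whh / Why wiring):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, vocab_size = 8, 16, 7  # toy sizes

# The three weight matrices from the diagram.
Wxh = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
Why = rng.normal(size=(vocab_size, hidden_dim)) * 0.1

def rnn_step(x, h):
    """One time step: new hidden state, plus a likelihood over the vocabulary."""
    h_new = np.tanh(Wxh @ x + Whh @ h)
    logits = Why @ h_new
    y = np.exp(logits) / np.exp(logits).sum()  # softmax
    return h_new, y

h = np.zeros(hidden_dim)
for _ in range(3):                   # "What" "is" "the" ...
    x = rng.normal(size=embed_dim)   # stand-in for a word embedding
    h, y = rnn_step(x, h)

print(round(float(y.sum()), 6))  # 1.0: a probability distribution over the next word
```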
24. RNN and the problem of memory
All network state is held in a single cell, used over and over again. The internal
state can get really complicated, and moving values around during training can
lead to loss of information.
RNNs have a “plugin” architecture, in which we can use different types of cells:
Simple RNN cell: fastest, but breaks down over long sequences. Outdated.
LSTM cell: slower, supports selectively forgetting and keeping data. The standard.
GRU cell: like LSTM, but faster thanks to a simpler internal architecture. State of the art.
26. Putting all together
This is a classical seq2seq architecture.
An image is fed to the CNN.
The CNN generates a state that models the scene as a cluster of objects.
That state is fed to an LSTM cell, which generates the caption word by word.
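The hand-off can be sketched end to end. Everything below is an illustrative assumption: random untrained weights, a toy vocabulary, a stand-in for the CNN, and a plain RNN cell in place of the LSTM for brevity; the point is only the wiring, not the output:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ['<start>', 'a', 'man', 'rides', 'horse', '<end>']  # toy vocabulary
hidden_dim = 16

def cnn_encode(image):
    # Stand-in for the CNN encoder: collapse the image into a state vector.
    return np.tanh(rng.normal(size=hidden_dim))

# Untrained decoder weights (a real system would train these).
Wxh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
Why = rng.normal(size=(len(vocab), hidden_dim)) * 0.1
embed = rng.normal(size=(len(vocab), hidden_dim)) * 0.1

def caption(image, max_len=10):
    h = cnn_encode(image)  # the CNN state initializes the decoder
    word, out = '<start>', []
    for _ in range(max_len):
        h = np.tanh(Wxh @ embed[vocab.index(word)] + Whh @ h)
        word = vocab[int(np.argmax(Why @ h))]  # greedy decoding
        if word == '<end>':
            break
        out.append(word)
    return out

# Untrained weights produce arbitrary words, but this is the pipeline shape.
print(caption(np.zeros((14, 14))))
```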
27. Avoid getting distracted and Attention
We can train an intermediate network called Attention that emphasizes relationships between different parts of the encoder (the image) and different time steps of the decoder (the current word being generated).
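The core of the idea fits in a few lines. This is a dot-product attention sketch with random stand-in features (region count, dimensions and weights are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# 49 image regions (e.g. a 7x7 CNN feature map), each a 16-dim feature vector.
regions = rng.normal(size=(49, 16))
# Current decoder hidden state (for the word being generated).
h = rng.normal(size=16)

# Dot-product attention: score each region against the decoder state,
# softmax into weights, then take the weighted average of the regions.
scores = regions @ h
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ regions  # what the decoder "looks at" this step

print(round(float(weights.sum()), 6))  # 1.0: a distribution over image regions
print(context.shape)                   # (16,)
```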
30. Beyond captioning: deep image search
Given an image and a question as input, the network outputs an answer.
Chain together CNN and RNN models into fully connected layers outputting over our vocabulary.
http://vqa.cloudcv.org/
[Diagram: the image goes through a CNN and the question “How many wheels has the skate?” through an RNN; their states are merged by a stack of fully connected (FC) layers that output the answer.]