Images and words
mechanics of automated captioning
with neural networks
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise ready solutions based on Open Source tech;
● Expertise:
○ DevOps
○ Cloud
○ BigData and many more...
This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/
Outline
1. Task introduction
2. Object recognition
3. Language generation
4. Putting it all together
5. Improving performance
6. Beyond captioning: deep image search
The task
Generating a description from an image.
“A man jumps over a skateboard”
The challenges
1. Recognize objects in the image
2. Generate a fluent description in natural language
Neural object recognition
A solved problem: Convolutional Neural Networks (CNNs) do the trick.
A CNN is an architecture specialized in finding topological invariants in the input.
It finds relationships between atoms (e.g. neighbouring pixels) and infers higher abstractions from them.
It is highly resistant to noise and spatial transformations.
It automatically learns which features are relevant to extract from an input.
It is not limited to images: CNNs can also be applied to text, audio, etc.
An image as integers
A handwritten “8” can be represented as a matrix of integers.
● 0 for blank
● 1-255 for grayscales, white to black
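As a small illustration, a grayscale image can be held as a nested list of integers. The 5x5 pattern below is a made-up stand-in for a handwritten digit, not data from the talk, and the `render` helper is hypothetical.

```python
# A hypothetical 5x5 grayscale image: 0 is blank, higher values are darker ink.
image = [
    [0,   0, 255,   0, 0],
    [0, 255,   0, 255, 0],
    [0,   0, 255,   0, 0],
    [0, 255,   0, 255, 0],
    [0,   0, 255,   0, 0],
]

def render(img, threshold=128):
    """Draw the matrix as text: '#' for dark pixels, '.' for light ones."""
    return "\n".join(
        "".join("#" if pixel >= threshold else "." for pixel in row)
        for row in img
    )

height, width = len(image), len(image[0])
print(f"{height}x{width} image")
print(render(image))
```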
Architecture of a ConvNet
[Diagram: Filter → Convolution + ReLU → Max Pooling → Filter → Convolution + ReLU → Max Pooling → Fully Connected]
1. Convolution
2. Non Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully Connected Layer)
Convolution intuition
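Let’s slide a small matrix (the “brushing filter”) over our input matrix, multiplying the overlapping values element by element and summing them at each position.
For example, a suitable filter matrix performs edge detection.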
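The sliding multiplication can be sketched in plain Python. This is a minimal sketch, assuming no padding and stride 1; the input image and the 2x2 edge kernel are made up for illustration. Like most CNN libraries, it does not flip the kernel (strictly speaking, a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical input: dark left half (0), bright right half (9).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# A simple vertical-edge kernel: responds where values rise from left to right.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_kernel)
for row in feature_map:
    print(row)  # each row is [0, 18, 0]: the edge sits in the middle column
```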
Convolution in CNN
Each newly generated image is called a “channel”. A common RGB image has 3 channels.
Channels hold different perspectives about the image.
We start with random filters and tune these matrices as part of our training.
We end up with filters that have learned perspectives of interest.
Convolution example
Rectified Linear Unit
We can apply another filter to rule out pixels that don’t contribute: ReLU replaces every negative value with 0.
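ReLU keeps positive values and replaces negative ones with 0. A minimal sketch, with a made-up feature map standing in for the output of a convolution:

```python
def relu(feature_map):
    """Replace every negative value with 0 and keep the rest unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]

# Hypothetical output of a convolution step.
feature_map = [
    [-3, 5, 0],
    [7, -1, 2],
]
print(relu(feature_map))  # [[0, 5, 0], [7, 0, 2]]
```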
Max Pooling 1/2
After this, we downsample the image by “hashing” it to fewer values. We can:
● Max: pick only the highest element
● Sum: sum together all the elements
Input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pooling, 2x2, stride 2:
6 8
3 4
Sum pooling, 2x2, stride 2:
13 21
8 8
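Both pooling variants can be reproduced with one small function. The sketch below uses the 4x4 matrix from this slide; the `pool2d` helper and its parameter names are hypothetical.

```python
def pool2d(image, size, stride, reduce_fn):
    """Apply `reduce_fn` (e.g. max or sum) to each size x size window."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [
            reduce_fn(image[i + di][j + dj]
                      for di in range(size) for dj in range(size))
            for j in cols
        ]
        for i in rows
    ]

image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

max_pooled = pool2d(image, 2, 2, max)
sum_pooled = pool2d(image, 2, 2, sum)
print(max_pooled)  # [[6, 8], [3, 4]]
print(sum_pooled)  # [[13, 21], [8, 8]]
```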
Max Pooling 2/2
Fully connected layer
After a couple of “convolve, ReLU and pool” cycles, we have maybe 128 channels
of 14x14 pixel images.
Concatenate and reshape them into a linear array of 25088 cells (128 × 14 × 14).
Feed it to a feed-forward neural network that will output our classes.
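The flattening arithmetic and a single dense layer can be sketched as follows. This is only a sketch: the softmax at the end, the random weights and the toy 2-channel input are assumptions added for illustration, not details from the talk.

```python
import math
import random

def flatten(channels):
    """Concatenate all channels row by row into one flat list."""
    return [value for channel in channels for row in channel for value in row]

def fully_connected(inputs, weights, biases):
    """One dense layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The shape from the slide: 128 channels of 14x14 pixels.
channels = [[[0.0] * 14 for _ in range(14)] for _ in range(128)]
flat = flatten(channels)
print(len(flat))  # 25088

# Toy-sized classifier so the example stays readable: 2 channels of 2x2, 3 classes.
random.seed(0)
small = [[[1.0, 0.0], [0.5, 2.0]], [[0.0, 1.0], [1.5, 0.0]]]
small_flat = flatten(small)
weights = [[random.uniform(-1, 1) for _ in small_flat] for _ in range(3)]
biases = [0.0, 0.0, 0.0]
probabilities = softmax(fully_connected(small_flat, weights, biases))
print(probabilities)
```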
CNN demo time
Real time web handwritten digit recognition
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of “famous” nets that can be freely downloaded and used off the shelf,
like ResNet, which reached a top-5 error rate of about 3.6% on the 1000-category ImageNet (ILSVRC 2015) classification task.
Why not just use an MLP?
Why the MLP suffers
The Multi Layer Perceptron can actually classify images treated as a plain array of pixels.
But it fails if the image is moved and/or rotated.
This is because it lacks support for learning the invariant topological properties
that are maintained when the image goes through a spatial transformation.
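A toy comparison makes the point for shifts (rotations are not covered here). The sketch below uses 1D signals for brevity and hand-set weights instead of trained ones; `dense_score` and `conv_max_score` are hypothetical names.

```python
def dense_score(signal, weights):
    """MLP-style unit: one fixed weight per input position."""
    return sum(w * x for w, x in zip(weights, signal))

def conv_max_score(signal, kernel):
    """Convolution followed by a global max: the best match at any position."""
    k = len(kernel)
    return max(
        sum(kernel[d] * signal[i + d] for d in range(k))
        for i in range(len(signal) - k + 1)
    )

pattern = [1, 2, 1]
original = [0, 1, 2, 1, 0, 0, 0, 0]  # pattern at position 1
shifted = [0, 0, 0, 0, 1, 2, 1, 0]   # same pattern moved 3 steps to the right

# Dense weights tuned to the pattern at its original position only.
dense_weights = [0, 1, 2, 1, 0, 0, 0, 0]

print(dense_score(original, dense_weights), dense_score(shifted, dense_weights))  # 6 0
print(conv_max_score(original, pattern), conv_max_score(shifted, pattern))        # 6 6
```

The dense unit loses the pattern as soon as it moves, while the sliding filter finds it at either position.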
Language generation with Recurrent Networks
Language generation is a serial task. We generate words one after another.
This is well modeled by Recurrent Neural Cells: a recurrent cell is a neuron that is applied over and
over again to serial inputs, outputting a new value each time.
Words as integers without embedding
Vocabulary of words.
V = [‘fight’, ‘kill’, ‘queen’, ‘king’, ‘man’, ‘woman’, ‘love’,...]
“One hot vector” encoding representation of single words.
‘fight’ = [1 0 0 0 0 0 0 …]
‘kill’ = [0 1 0 0 0 0 0 …]
‘queen’ = [0 0 1 0 0 0 0 …]
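Summing one-hot vectors can correlate documents (bag of words, TF-IDF), but can’t correlate single words to each other, because any two distinct one-hot vectors are orthogonal (only the first seven vocabulary entries are shown):
“I fight the king” = [1 0 0 1 0 0 0 …]
“fight the tyranny” = [1 0 0 0 0 0 0 …]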
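The limitation of one-hot vectors shows up as soon as two words are compared. A minimal sketch, using only the seven vocabulary words visible on the slide (the real vocabulary continues past them):

```python
# Only the seven words the slide shows; the real vocabulary is longer.
vocabulary = ["fight", "kill", "queen", "king", "man", "woman", "love"]

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    return [1 if entry == word else 0 for entry in vocabulary]

def dot(a, b):
    """Dot product, used here as a simple similarity measure."""
    return sum(x * y for x, y in zip(a, b))

print(one_hot("fight"))                         # [1, 0, 0, 0, 0, 0, 0]
print(dot(one_hot("fight"), one_hot("kill")))   # 0
print(dot(one_hot("king"), one_hot("queen")))   # 0
```

Every pair of different words scores 0, so related words look exactly as unrelated as any other pair.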
Words as floats with vector embedding
Word embedding.
Fixed length, real valued vector encoding representation of single words.
Close concepts have close vectors.
‘fight’ = [0.17 0.53 0.89 0.03 0.00 0.54 0.11 ]
‘kill’ = [0.17 0.53 0.91 0.06 0.00 0.54 0.12 ]
‘queen’ = [0.22 0.45 0.13 0.53 0.90 0.41 0.00 ]
Vector arithmetic yields coherent results: king - man + woman ≈ queen
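“Close concepts have close vectors” can be checked with cosine similarity on the three example vectors from this slide. Cosine similarity is one common choice of closeness measure and is an assumption here; the talk does not name one.

```python
import math

embeddings = {
    "fight": [0.17, 0.53, 0.89, 0.03, 0.00, 0.54, 0.11],
    "kill":  [0.17, 0.53, 0.91, 0.06, 0.00, 0.54, 0.12],
    "queen": [0.22, 0.45, 0.13, 0.53, 0.90, 0.41, 0.00],
}

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

close = cosine_similarity(embeddings["fight"], embeddings["kill"])
far = cosine_similarity(embeddings["fight"], embeddings["queen"])
print(f"fight vs kill:  {close:.3f}")
print(f"fight vs queen: {far:.3f}")
```

With these vectors, “fight” and “kill” come out almost identical, while “fight” and “queen” are clearly further apart.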
How language is generated
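[Diagram: an RNN unrolled over three time steps. The input words “What”, “is”, “the” become the input embeddings x1, x2, x3, which update the hidden states h1, h2, h3 and produce the output likelihoods y1, y2, y3; the target words are “is”, “the”, “problem”. Weights: Wxh (input to hidden), Whh (hidden to hidden), Why (hidden to output).]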
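One time step of a simple recurrent cell can be sketched with the three weight matrices named in the diagram. This assumes the common formulation h_t = tanh(Wxh·x_t + Whh·h_{t-1}) and y_t = softmax(Why·h_t); all sizes, weights and input vectors below are made up for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, Wxh, Whh, Why):
    """One time step: mix the new input with the previous hidden state."""
    pre = [a + b for a, b in zip(matvec(Wxh, x), matvec(Whh, h_prev))]
    h = [math.tanh(p) for p in pre]
    y = softmax(matvec(Why, h))
    return h, y

# Hypothetical sizes: 2-d embeddings, 3-d hidden state, 4-word vocabulary.
Wxh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
Whh = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
Why = [[0.4, -0.1, 0.2], [0.0, 0.5, -0.3], [-0.2, 0.1, 0.6], [0.3, 0.3, 0.3]]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for "What", "is", "the"
h = [0.0, 0.0, 0.0]
outputs = []
for x in inputs:
    # The same three matrices are reused at every step; only h changes.
    h, y = rnn_step(x, h, Wxh, Whh, Why)
    outputs.append(y)

print(outputs[-1])  # likelihood of each vocabulary word after the third input
```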
RNN and the problem of memory
All network state is held in a single cell, used over and over again. Internal state
can get really complicated. Moving the values around during training can lead to
loss of data.
RNN has a “plugin” architecture, in which we can use different types of cells:
● Simple RNN cell: fastest, but breaks over long sequences. Outdated.
● LSTM cell: slower, supports selectively forgetting and keeping data. Standard.
● GRU cell: like LSTM, but faster due to a simpler internal architecture. State of the art.
RNN demo
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
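Putting it all together
This is a classical seq2seq architecture, except that the encoder reads an image instead of a sequence.
An image is fed to the CNN.
The CNN generates a state that models the scene as a cluster of objects.
The state is fed to an LSTM cell, which generates the caption one word at a time.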
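The control flow of such a captioner can be sketched as a greedy decoding loop. Everything below is a stand-in: `toy_encode` replaces the CNN, `toy_decoder_step` replaces the trained LSTM cell with a hand-written lookup table, and greedy word selection is an assumption, since the talk does not say how words are picked.

```python
START, END = "<start>", "<end>"

def toy_encode(image):
    """Stand-in for the CNN: a real encoder would return a feature vector."""
    return {"steps": 0}

# Hand-written next-word probabilities, standing in for a trained decoder.
TOY_NEXT_WORD = {
    START: {"a": 0.9, "man": 0.1},
    "a": {"man": 0.8, "skateboard": 0.2},
    "man": {"jumps": 0.7, END: 0.3},
    "jumps": {END: 0.6, "a": 0.4},
}

def toy_decoder_step(previous_word, state):
    """Stand-in for the LSTM cell: returns word probabilities and the new state."""
    new_state = {"steps": state["steps"] + 1}
    return TOY_NEXT_WORD[previous_word], new_state

def greedy_caption(image, encode, decoder_step, max_words=20):
    """Encode the image once, then emit the most likely word at each step."""
    state = encode(image)
    word, caption = START, []
    for _ in range(max_words):
        probabilities, state = decoder_step(word, state)
        word = max(probabilities, key=probabilities.get)
        if word == END:
            break
        caption.append(word)
    return " ".join(caption)

caption = greedy_caption(None, toy_encode, toy_decoder_step)
print(caption)  # a man jumps
```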
Avoiding distraction: Attention
We can train an intermediate network called Attention that emphasizes relationships between different parts of the encoder input (the image) and different time steps of the decoder (the word currently being generated).
How attention works
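One way to sketch the idea: score every image region against the current decoder state, normalize the scores, and build a weighted summary of the regions. The sketch below swaps in plain dot-product scoring for simplicity, whereas the talk describes attention as a trained intermediate network; the region features and decoder state are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, region_features):
    """Weight each image region by its similarity to the decoder state."""
    scores = [
        sum(q * k for q, k in zip(decoder_state, region))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical features for three image regions and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [3.0, 0.0]  # most similar to region 0

weights, context = attend(decoder_state, regions)
print(weights)   # region 0 receives the largest weight
print(context)   # weighted mix of the regions, fed to the decoder for this word
```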
End to end demo
https://www.captionbot.ai/
Beyond captioning: deep image search
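Given an image and a question, the network outputs an answer.
Chain the CNN and RNN models into fully connected (FC) layers that output over our vocabulary.
http://vqa.cloudcv.org/
[Diagram: the image goes through the CNN to a state h; the question “How many wheels has the skate?” goes through the RNN to a state h; both states feed a stack of three FC layers.]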
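The chaining can be sketched as: take the CNN state and the RNN state, join them, and pass the result through a stack of FC layers that scores every answer in the vocabulary. Joining by concatenation, the ReLU between layers, the answer vocabulary and all numbers below are assumptions for illustration; a real model learns its weights.

```python
def dense(inputs, weights, biases, activation=None):
    """One fully connected layer with an optional activation function."""
    outputs = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return [activation(o) for o in outputs] if activation else outputs

def relu(value):
    return max(0.0, value)

answers = ["two", "four", "yes", "no"]  # hypothetical answer vocabulary

image_state = [0.2, 0.9]     # stand-in for the CNN output h
question_state = [0.7, 0.1]  # stand-in for the RNN output h
joint = image_state + question_state  # concatenation (an assumption)

# Hypothetical, hand-written weights for three FC layers.
W1 = [[0.5, 0.1, 0.3, 0.0], [0.0, 0.8, 0.2, 0.4], [0.3, 0.3, 0.3, 0.3]]
W2 = [[0.6, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.5]]
W3 = [[0.1, 0.2, 0.3], [0.9, 0.4, 0.1], [0.2, 0.1, 0.1], [0.0, 0.3, 0.2]]

hidden = dense(joint, W1, [0.0] * 3, relu)
hidden = dense(hidden, W2, [0.0] * 3, relu)
scores = dense(hidden, W3, [0.0] * 4)  # one score per answer
answer = answers[scores.index(max(scores))]
print(answer)
```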
Thank you

Alberto Massidda - Images and words: mechanics of automated captioning with neural networks - Codemotion Milan 2018
