Image captioning with Keras and Tensorflow - Debarko De @ Practo

Let's build a system to describe an image
Debarko De (@debarko)
Engineering Manager @ Practo

Who am I?
● Debarko De
● twitter.com/@debarko
● ex @Hashcube
○ Facebook
○ Mobile platforms
● Engineering Manager @ Practo
○ eCommerce Division of Practo
○ Medicines & Lab Tests
● AI ML Application Developer (Researcher)

Agenda
● Problem Statement
● Basic building blocks for the network
● CNN
● Transfer Learning
● RNN
● LSTM
● How do we wire them together?
● Code
● Other places this can be implemented
● Interaction & Questions

Setting right expectations
● I’ll cover the basic theory needed to understand the network
● This is not a math session about NNs
● It’s more about implementation than theory
● Preferably, I would like to take questions at the end
● Slides and code will be shared separately
● Time is of the essence, so let's begin.

Use cases
● Visual to Text systems for blind people
● Search Engines for searching medical records
● Auto Tagging medical imaging data
● Tagging video consultations

Parts to an Image Captioning System
● CNN
● RNN with LSTM unit
● Training Data
● Training
● Eval with BLEU Score

CNN vs Traditional Imaging Ways

References for CNN
● Image classification using CNN
https://www.slideshare.net/debarko/image-classification-using-cnn
● Deep Learning for Noobs (Part 1 & Part 2)
https://hackernoon.com/supervised-deep-learning-in-image-classification-for-noobs-part-1-9f831b6d430d
● Karpathy’s Article on CNN
http://cs231n.github.io/convolutional-networks/

RNN
● As humans we understand context
● We don’t reset our understanding every single time
● Thoughts have persistence
● Traditional NNs like CNNs don’t have persistence
● Usage: speech recognition, language modeling, translation
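
To make the idea of persistence concrete, here is a minimal sketch of a vanilla RNN step in plain Python. It assumes a scalar input and a scalar hidden state, and the weight values (w_xh, w_hh, b_h) are arbitrary illustrative numbers, not taken from any trained model. The point is only that the hidden state h is carried from one step to the next instead of being reset.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One vanilla RNN step: the new state mixes the current input with the previous state."""
    return math.tanh(w_xh * x + w_hh * h_prev + b_h)


def run_rnn(inputs, w_xh=0.5, w_hh=0.8, b_h=0.0):
    """Feed a sequence through the cell and return the hidden state after every step."""
    h = 0.0  # initial state: nothing remembered yet
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Same inputs at steps 2 and 3, but only the first sequence saw a 1.0 at step 1.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

The two runs end in different states even though their last two inputs are identical, because the first input is still present in the hidden state.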

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.
Clown:
Come, sir, I will make did behold your worship.
VIOLA: I'll drink it.

Long term dependency problem
● RNNs are supposed to remember
● For example, understanding a video scene based on the previous scenes
● Can RNNs do it?

In a long sentence that is really big do you think that RNNs
can really remember the text that happened here? Will it
remember whether the sentence was ____ or ____?

In a long sentence that is really big do you think that RNNs
can really remember the text that happened here? Will it
remember whether the sentence was long or short?

RNNs can’t connect the dots ...

Long term dependency problem
● In theory, RNNs should be capable of storing long-term dependencies
● In practice, information from early inputs gets diluted
● This problem gave birth to LSTM networks
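
The dilution can be shown numerically. The sketch below reuses a scalar vanilla RNN with made-up weights (w_xh = 0.5, w_hh = 0.8 are assumptions for illustration only) and estimates, by finite differences, how much the final hidden state reacts to a small change in the very first input. It is a toy illustration of the effect, not a measurement on a real captioning model.

```python
import math


def influence_of_first_input(num_steps, w_xh=0.5, w_hh=0.8, eps=1e-4):
    """Estimate d(final hidden state) / d(first input) with a finite difference."""

    def final_state(first_input):
        h = 0.0
        for t in range(num_steps):
            x = first_input if t == 0 else 0.0  # only the first step carries a signal
            h = math.tanh(w_xh * x + w_hh * h)
        return h

    return (final_state(1.0 + eps) - final_state(1.0)) / eps


influence_short = influence_of_first_input(2)
influence_long = influence_of_first_input(50)
print(influence_short, influence_long)
```

With these weights the influence after 50 steps is several orders of magnitude smaller than after 2 steps, which is the sense in which the early input gets diluted.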

Long Short Term Memory networks

LSTM
Sigmoid and tanh neural layers
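
As a rough sketch of how the sigmoid and tanh layers fit together, here is one step of a scalar LSTM cell following the standard gate equations (forget gate, input gate, candidate value, output gate). The parameter names in the dictionary (w_f, u_f, b_f and so on) and all numeric values are hypothetical, chosen so the forget gate stays almost fully open and the input gate almost closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input x, hidden state h and cell state c."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate: how much of c_prev to keep
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate: how much new content to write
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate content
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate: how much of the cell to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: large forget bias, very negative input bias.
params = {
    "w_f": 0.5, "u_f": 0.5, "b_f": 10.0,
    "w_i": 0.5, "u_i": 0.5, "b_i": -10.0,
    "w_g": 0.5, "u_g": 0.5, "b_g": 0.0,
    "w_o": 0.5, "u_o": 0.5, "b_o": 0.0,
}

h, c = 0.0, 1.0  # start with something already stored in the cell
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
print(h, c)
```

Because the cell state is updated additively and gated by sigmoids, the stored value survives 50 steps almost unchanged in this setup, unlike the plain RNN state.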

Training Data
1. Flickr 8k Dataset (Link)
2. Flickr 30k Dataset (Link)
3. Microsoft COCO Dataset (Link)
8kDB: M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899 http://www.jair.org/papers/paper3994.html
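
One common way to turn a captioned image into training examples for the language part of the network is to expand each caption into (partial caption, next word) pairs. The sketch below assumes whitespace tokenization, lowercasing, and the marker tokens "<start>" and "<end>"; these are illustrative choices, not necessarily what the talk's code does.

```python
def caption_to_training_pairs(caption, start="<start>", end="<end>"):
    """Expand one caption into (words seen so far, next word to predict) pairs."""
    tokens = [start] + caption.lower().split() + [end]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = caption_to_training_pairs("A dog runs")
for prefix, next_word in pairs:
    print(prefix, "->", next_word)
```

During training, each pair would be combined with the image features so the network learns to predict the next word given the image and the words so far.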

Evaluation
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate
translation of text to one or more reference translations.
For image captioning, the human-written captions are the reference translations.
The text generated by the LSTM network is the candidate translation.
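
A minimal sketch of sentence-level BLEU in plain Python, assuming pre-tokenized captions: clipped n-gram precision against all references, a geometric mean over n-gram orders, and a brevity penalty based on the closest reference length. It has no smoothing, so the score is 0 whenever any n-gram order has no match; library implementations offer smoothing for short sentences. The example captions are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


references = [
    "a dog is running on the grass".split(),
    "a brown dog runs across the grass".split(),
]
candidate = "a dog runs on the grass".split()
score = bleu(candidate, references, max_n=2)
print(score)
```

Here max_n=2 is used because the short candidate shares no 4-gram with either reference, which would make the unsmoothed BLEU-4 score 0.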

DEMO
https://cs.stanford.edu/people/karpathy/deepimagesent/generationdemo/
http://localhost:3000/ - Demo server running locally on my laptop

Other places to use this
● Fashion websites can let users take a photo and then show matching dresses for sale
● Blind people can use a mobile app which reads out the current scene, helping them understand the environment better and also enjoy the beauty of a sunset
● Image search engines can be built on this

Current Production Ready models
● Google Show and Tell
● NeuralTalk2 (https://github.com/karpathy/neuraltalk2)
● https://www.leadergpu.com/

References
● http://colah.github.io/posts/2015-08-Understanding-LSTMs/
● http://www.jair.org/media/3994/live-3994-7274-jair.pdf
● https://github.com/debarko/truview_models
● https://github.com/debarko/truview
● https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
● https://github.com/tensorflow/models/tree/master/research/im2txt
● One of the best GPU server farms (https://www.leadergpu.com/)
● https://en.wikipedia.org/wiki/BLEU
