This slideshow explains how to build an image captioning system modeled on Google's Show and Tell model. It walks you through the training phase and the final prediction file.
Image captioning with Keras and Tensorflow - Debarko De @ Practo
1. Let's build a system to describe an image
Debarko De (@debarko)
Engineering Manager @ Practo
2. Who am I?
● Debarko De
● twitter.com/@debarko
● ex @Hashcube
○ Facebook
○ Mobile platforms
● Engineering Manager @ Practo
○ eCommerce Division of Practo
○ Medicines & Lab Tests
● AI ML Application Developer (Researcher)
3. Agenda
● Problem Statement
● Basic building blocks for the network
● CNN
● Transfer Learning
● RNN
● LSTM
● How do we wire them together?
● Code
● Other places this can be implemented
● Interaction & Questions
4. Setting right expectations
● I’ll cover the basic theory needed to understand the network
● This is not a math session about NNs
● It’s more about implementation rather than theory
● Preferably, I would like to take questions at the end
● Slides and Code will be shared separately
● Time is of the essence, so let's begin.
8. Use cases
● Visual to Text systems for blind people
● Search Engines for searching medical records
● Auto Tagging medical imaging data
● Tagging video consultations
9. Parts of an Image Captioning System
● CNN
● RNN with LSTM unit
● Training Data
● Training
● Eval with BLEU Score
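The way these parts fit together at prediction time can be sketched as a word-by-word loop: the CNN turns the image into a feature vector, and the LSTM decoder repeatedly proposes the next word given those features and the words produced so far. The sketch below is only an illustration under stated assumptions: `greedy_caption`, `toy_decoder`, and the `<start>`/`<end>` marker tokens are hypothetical names, and the toy decoder follows a fixed script in place of a trained network.

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"  # hypothetical marker tokens


def greedy_caption(
    image_features: Sequence[float],
    next_word_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Build a caption one word at a time, always taking the best-scoring word."""
    words = [START]
    for _ in range(max_len):
        scores = next_word_scores(image_features, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return words[1:]  # drop the start marker


def toy_decoder(image_features, words_so_far):
    """Stand-in for a trained LSTM decoder (assumption): follows a fixed script."""
    script = ["a", "dog", "runs", END]
    position = len(words_so_far) - 1  # number of words generated so far
    target = script[min(position, len(script) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in set(script)}


# The feature vector here is made up; a real one would come from the CNN.
caption = greedy_caption([0.1, 0.7, 0.2], toy_decoder)
print(" ".join(caption))
```

In a real system the scoring function would wrap the trained decoder, and beam search is often used instead of the greedy choice shown here.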
15. References for CNN
● Image classification using CNN
https://www.slideshare.net/debarko/image-classification-using-cnn
● Deep Learning for Noobs (Part 1 & Part 2)
https://hackernoon.com/supervised-deep-learning-in-image-classification-for-noobs-part-1-9f831b6d430d
● Karpathy’s Article on CNN
http://cs231n.github.io/convolutional-networks/
18. RNN
● As humans we understand context
● We don’t reset our understanding every single time
● Thoughts have persistence
● Traditional NNs like CNNs don’t have persistence
● Usage: speech recognition, language modeling, translation
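The idea of persistence can be illustrated with a tiny recurrent step: the new state depends on both the current input and the previous state, so history leaks into the present. This is a minimal scalar sketch, not a real RNN layer; the function names and the weights `w_h` and `w_x` are assumptions chosen only for illustration.

```python
import math
from typing import List


def rnn_step(h_prev: float, x: float, w_h: float = 0.5, w_x: float = 1.0) -> float:
    """One step of a scalar vanilla RNN: mix the old state with the new input."""
    return math.tanh(w_h * h_prev + w_x * x)


def run_rnn(inputs: List[float], h0: float = 0.0) -> List[float]:
    """Feed a sequence through the RNN and return the state after each step."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(h, x)
        states.append(h)
    return states


# Both sequences end with the same input (0.0) but have different histories.
after_ones = run_rnn([1.0, 1.0, 0.0])[-1]
after_zeros = run_rnn([0.0, 0.0, 0.0])[-1]
print(after_ones, after_zeros)
```

The two final states differ even though the last input is identical, which is the "memory" a feed-forward network lacks.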
22. PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.
Clown:
Come, sir, I will make did behold your worship.
VIOLA: I'll drink it.
23. Long-term dependency problem
● RNNs are supposed to remember
● Like in video understanding based on previous scenes
● Can RNNs do it?
26. In a long sentence that is really big do you think that RNNs
can really remember the text that happened here? Will it
remember whether the sentence was ____ or ____?
28. In a long sentence that is really big do you think that RNNs
can really remember the text that happened here? Will it
remember whether the sentence was long or short?
30. Long-term dependency problem
● In theory, RNNs should be capable of storing long-term dependencies
● In practice, the information gets diluted over many steps
● This problem gave birth to LSTM networks
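The dilution can be made concrete with a toy experiment: feed one input into a scalar tanh RNN, follow it with a run of zero inputs, and measure how much of the first input is still visible in the final state. This is a sketch under assumptions: the function name and the recurrent weight of 0.5 are made up, and with other weights the signal can saturate or blow up rather than fade.

```python
import math


def final_state(first_input: float, n_later_steps: int, w_h: float = 0.5) -> float:
    """Feed one input, then n_later_steps zero inputs, through a scalar tanh RNN."""
    h = math.tanh(first_input)
    for _ in range(n_later_steps):
        h = math.tanh(w_h * h)  # zero input: only the old state is carried forward
    return h


for n in (1, 5, 20):
    # Difference between "first input was 1.0" and "first input was 0.0".
    trace = final_state(1.0, n) - final_state(0.0, n)
    print(f"steps after first input: {n:2d}  remaining trace: {trace:.2e}")
```

The remaining trace shrinks with every extra step. LSTM units address this by carrying a separate cell state whose updates are controlled by gates.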
36. Training Data
1. Flickr 8k Dataset (Link)
2. Flickr 30k Dataset (Link)
3. Microsoft COCO Dataset (Link)
8kDB: M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899 http://www.jair.org/papers/paper3994.html
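These datasets pair each image with human-written captions. One common way to turn a caption into training examples for a word-by-word decoder is to split it into (words seen so far, next word) pairs. The sketch below assumes this scheme; the function name and the `<start>`/`<end>` marker tokens are hypothetical, and a real pipeline would also map words to integer ids and attach the image features to every pair.

```python
from typing import List, Tuple

START, END = "<start>", "<end>"  # hypothetical marker tokens


def caption_to_pairs(caption: str) -> List[Tuple[List[str], str]]:
    """Split one caption into (words seen so far, next word) training examples."""
    tokens = [START] + caption.lower().split() + [END]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = caption_to_pairs("A dog runs")
for seen, target in pairs:
    print(seen, "->", target)
```

A three-word caption yields four examples, the last of which teaches the decoder when to stop.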
41. Evaluation
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate
translation of text to one or more reference translations.
Human captions are the reference translations
Generated text via the LSTM network is the candidate translation
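To make the score concrete, here is a simplified sentence-level BLEU sketch: clipped n-gram precision against the references, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration rather than a reference implementation: the function names are made up, the default of `max_n=2` is an assumption to suit short captions (BLEU is usually reported with up to 4-grams), and there is no smoothing, so any zero precision gives a score of 0.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str],
                       references: List[Sequence[str]], n: int) -> float:
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate: Sequence[str], references: List[Sequence[str]],
         max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return geo_mean * brevity_penalty


# Made-up example captions, not taken from any dataset.
references = ["a dog runs on the grass".split(),
              "a dog is running through grass".split()]
candidate = "a dog runs through the grass".split()
score = bleu(candidate, references)
print(round(score, 3))
```

For real evaluation, a maintained library implementation with smoothing and corpus-level aggregation is a safer choice than this sketch.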
43. Other places to use this
● Fashion websites can ask users to take photos and then show matching dresses for sale
● Blind people can use a mobile app which reads out the current scene, helping them understand the environment better and also enjoy the beauty of a sunset.
● Image search engines can be built on this
44. Current production-ready models
● Google Show and Tell
● NeuralTalk2 (https://github.com/karpathy/neuraltalk2)
● https://www.leadergpu.com/