
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning


Slides used on 11 November 2017 for the keynote at the International Conference on Document Analysis and Recognition, Workshop on Machine Learning (ICDAR WML 2017, https://icdarwml.wixsite.com/icdarwml2017).

This is a translated and updated version of https://www.slideshare.net/YoshitakaUshiku/deep-learning-73499744, which is written in Japanese.

Published in: Technology


  1. Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
     The University of Tokyo
     Yoshitaka Ushiku (losnuevetoros)
  2. Documents = Vision + Language
     Vision & Language: an emerging topic
     • Integration of CV, NLP, and ML techniques
     • Several backgrounds
       – Impact of Deep Learning: image recognition (CV) and machine translation (NLP)
       – Growth of user-generated content
       – Exploratory research on Vision and Language
  3. 2012: Impact of Deep Learning
     An academic AI startup vs. a famous company
     Many slides refer to the first use of a CNN (AlexNet) on ImageNet.
  4. 2012: Impact of Deep Learning
     Large gap of error rates on ImageNet
     • 1st team: 15.3%
     • 2nd team: 26.2%
  5. 2012: Impact of Deep Learning
     According to the official site:
     • 1st team (w/ DL), error rate: 15%
     • 2nd team (w/o DL), error rate: 26%
     [http://image-net.org/challenges/LSVRC/2012/results.html]
     It's me!!
  6. 2014: Another impact of Deep Learning
     • Deep learning appears in machine translation [Sutskever+, NIPS 2014]
       – LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing-gradient problem in RNNs
         → can handle relations between distant words in a sentence
       – A four-layer LSTM is trained in an end-to-end manner
         → comparable to the state of the art (English to French)
     • Emergence of common techniques such as CNNs/RNNs
       → lowers the barrier to entry into CV+NLP
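To make the encoder-decoder idea concrete, here is a minimal sketch in PyTorch: a stacked LSTM compresses the source sentence into a fixed-size state, and a second LSTM generates the target sentence from that state. Dimensions, vocabulary sizes, and the toy batch below are illustrative choices, not the exact configuration of [Sutskever+, NIPS 2014].

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder MT model in the spirit of [Sutskever+, NIPS 2014]."""
    def __init__(self, src_vocab, tgt_vocab, dim=256, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; only the final (h, c) state is kept, i.e. the
        # whole sentence is compressed into a fixed-size vector.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that state (teacher forcing on the gold prefix).
        hidden, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(hidden)  # per-step logits over the target vocabulary

# Toy usage: 2 source sentences of length 7 -> target prefixes of length 5.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(1000, (2, 7)), torch.randint(1200, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1200])
```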
  7. Growth of user-generated content
     Especially in content posting/sharing services
     • Facebook: 300 million photos per day
     • YouTube: 400 hours of video per minute
     Example post: "Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees looks spectacular."
     Pairs of a sentence + a video/photo → collectable in large quantities
  8. Exploratory research on Vision and Language
     Captioning an image associated with its article [Feng+Lapata, ACL 2010]
     • Input: article + image; Output: caption for the image
     • Dataset: 3,361 sets of article + image + caption
     Example caption: "King Tupou IV died at the age of 88 last week."
  9. As a result of these backgrounds: various research topics, such as …
  10. Image Captioning [Ushiku+, ICCV 2015]
      "Group of people sitting at a table with a dinner."
      "Tourists are standing on the middle of a flat desert."
  11. Video Captioning [Shin+, ICIP 2016]
      "A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food."
  12. Multilingual + Image Caption Translation [Hitschler+, ACL 2016]
      "Ein Masten mit zwei Ampeln für Autofahrer." (German)
      "A pole with two lights for drivers." (English)
  13. Visual Question Answering [Fukui+, EMNLP 2016]
  14. Image Generation from Captions [Zhang+, 2016]
      "This bird is blue with white and has a very short beak."
      "This flower is white and yellow in color, with petals that are wavy and smooth."
  15. Goal of this keynote
      Looking over research on vision & language
      • Historical flow of each area
      • Changes brought by Deep Learning
        × Deep Learning enabled these research themes
        ✓ Deep Learning boosted these research themes
      1. Image Captioning
      2. Video Captioning
      3. Multilingual + Image Caption Translation
      4. Visual Question Answering
      5. Image Generation from Captions
  16. Frontiers of Vision and Language 1: Image Captioning
  17. Every picture tells a story [Farhadi+, ECCV 2010]
      Dataset: images + <object, action, scene> triples + captions
      1. Predict <object, action, scene> for an input image using an MRF
      2. Search for an existing caption associated with a similar <object, action, scene>
      Example triple: <Horse, Ride, Field>
  18. Every picture tells a story [Farhadi+, ECCV 2010]
      <pet, sleep, ground> → "See something unexpected."
      <transportation, move, track> → "A man stands next to a train on a cloudy day."
  19-22. Retrieve? Generate? (built up over four slides)
      Input image; dataset captions: "A small gray dog on a leash." / "A black dog standing in grassy area." / "A small white dog wearing a flannel warmer."
      • Retrieve
        – A small gray dog on a leash.
      • Generate
        – Template-based: dog + stand ⇒ "A dog stands."
        – Template-free: "A small white dog standing on a leash."
  23. Captioning with multi-keyphrases [Ushiku+, ACM MM 2012]
  24. (cont.) Generation proceeds until the end of the sentence [Ushiku+, ACM MM 2012]
  25. Benefits of Deep Learning
      • Refinement of image recognition [Krizhevsky+, NIPS 2012]
      • Deep learning appears in machine translation [Sutskever+, NIPS 2014]
        – LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing-gradient problem in RNNs
          → can handle relations between distant words in a sentence
        – A four-layer LSTM is trained in an end-to-end manner
          → comparable to the state of the art (English to French)
      Emergence of common techniques such as CNNs/RNNs
      → lowers the barrier to entry into CV+NLP
  26. Google NIC [Vinyals+, CVPR 2015]
      Concatenation of Google's methods:
      • GoogLeNet [Szegedy+, CVPR 2015]
      • MT with LSTM [Sutskever+, NIPS 2014]
      Caption (word sequence) $S_0, \dots, S_N$ for image $I$:
      • $S_0$: beginning of the sentence
      • $S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$
      • $S_t = \mathrm{LSTM}(S_{t-1})$, $t = 2, \dots, N-1$
      • $S_N$: end of the sentence
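A minimal sketch of this NIC-style recipe, assuming a ResNet-18 as a stand-in for GoogLeNet and greedy decoding instead of the beam search used in practice; vocabulary size, dimensions, and token ids are illustrative:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class NIC(nn.Module):
    """CNN encoder + LSTM decoder, in the spirit of [Vinyals+, CVPR 2015]."""
    def __init__(self, vocab_size=10000, dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)           # stand-in for GoogLeNet
        cnn.fc = nn.Linear(cnn.fc.in_features, dim)   # project to LSTM input size
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def greedy_caption(self, image, bos=1, eos=2, max_len=20):
        h = c = torch.zeros(1, self.lstm.hidden_size)
        # Step 1: the image feature is the first LSTM input, S1 = LSTM(CNN(I)).
        h, c = self.lstm(self.cnn(image), (h, c))
        word, caption = torch.tensor([bos]), []
        # Steps 2..N-1: feed back the previously generated word.
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)
            if word.item() == eos:                    # S_N: end of sentence
                break
            caption.append(word.item())
        return caption

model = NIC()
print(model.greedy_caption(torch.randn(1, 3, 224, 224)))  # word ids (untrained)
```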
  27. Examples of generated captions [Vinyals+, CVPR 2015]
      [https://github.com/tensorflow/models/tree/master/im2txt]
  28. Comparison to [Ushiku+, ACM MM 2012]
      Shared pipeline: input image → estimation of important words → connect the words with a grammar model
      • Word estimation step
        – [Ushiku+, ACM MM 2012]: conventional object recognition (Fisher Vector + linear classifier)
        – Neural image captioning: Convolutional Neural Network
      • Sentence generation step
        – [Ushiku+, ACM MM 2012]: conventional machine translation (log-linear model + beam search)
        – Neural image captioning: Recurrent Neural Network + beam search
      • Both are trained using only images and captions; the approaches are similar to each other
  29. Current development: Accuracy
      • Attention-based captioning [Xu+, ICML 2015]
        – Focus on some areas of the image when predicting each word!
        – Both the attention and caption models are trained using pairs of an image and a caption
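A minimal sketch of the soft-attention step, assuming L = 196 spatial CNN feature vectors (a flattened 14x14 map) and illustrative dimensions; at each decoding step the weights alpha say where the model "looks":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over image regions, in the spirit of [Xu+, ICML 2015]."""
    def __init__(self, feat_dim=512, hid_dim=512, att_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, features, h):
        # Score each of the L locations against the current decoder state h.
        e = self.score(torch.tanh(self.feat_proj(features)
                                  + self.hid_proj(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)        # weights over L locations
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # weighted summary
        return context, alpha  # context feeds the LSTM; alpha is visualizable

att = SoftAttention()
ctx, alpha = att(torch.randn(2, 196, 512), torch.randn(2, 512))
print(ctx.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```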
  30. Current development: Problem setting
      Dense captioning [Lin+, BMVC 2015] [Johnson+, CVPR 2016]
  31. Current development: Problem setting
      Generating captions for a photo sequence [Park+Kim, NIPS 2015] [Huang+, NAACL 2016]
      "The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water."
  32. Current development: Problem setting
      Captioning using sentiment terms [Mathews+, AAAI 2016] [Shin+, BMVC 2016]
      Neutral caption vs. positive caption
  33. Frontiers of Vision and Language 2: Video Captioning
  34. Before Deep Learning
      • Grounding of language and objects in videos [Yu+Siskind, ACL 2013]
        – Learning from only videos and their captions
        – Experiments on a controlled, small dataset with few objects
      • Deep Learning should suit this problem well
        – Image captioning: single image → word sequence
        – Video captioning: image sequence → word sequence
  35. End-to-end learning by Deep Learning
      • LRCN [Donahue+, CVPR 2015]
        – CNN+RNN for action recognition and image/video captioning
      • Video to Text [Venugopalan+, ICCV 2015]
        – CNNs to recognize objects from RGB frames and actions from flow images
        – RNN for captioning
  36. Video Captioning [Shin+, ICIP 2016]
      "A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food."
  37. Video Captioning [Shin+, ICIP 2016]
      "A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water."
  38. Video Retrieval from Caption [Yamaguchi+, ICCV 2017]
      • Input: captions; Output: a video related to the caption
      • Retrieves a 10-second video clip from a 40-minute database!
      • Video captioning is also addressed
      Example queries: "A woman in blue is playing ping pong in a room." / "A guy is skiing with no shirt on and yellow snow pants." / "A man is water skiing while attached to a long rope."
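The common recipe behind such retrieval (not necessarily the exact model of [Yamaguchi+, ICCV 2017]) is to embed captions and clips into a shared space and rank clips by similarity; a minimal sketch, assuming the two encoders are already trained:

```python
import torch
import torch.nn.functional as F

def retrieve(caption_vec, clip_vecs, k=3):
    """Return indices of the k clips whose embeddings best match the caption."""
    sims = F.cosine_similarity(caption_vec.unsqueeze(0), clip_vecs, dim=1)
    return sims.topk(k).indices.tolist()

# Stand-ins for encoder outputs: e.g. 240 ten-second clips from a 40-min video.
clip_vecs = F.normalize(torch.randn(240, 512), dim=1)
caption_vec = F.normalize(torch.randn(512), dim=0)
print(retrieve(caption_vec, clip_vecs))  # top-3 candidate clips
```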
  39. Frontiers of Vision and Language 3: Multilingual + Image Caption Translation
  40. Towards multiple languages
      Datasets with multilingual captions:
      • IAPR TC12 [Grubinger+, 2006]: English + German
      • Multi30K [Elliott+, 2016]: English + German
      • STAIR Captions [Yoshikawa+, 2017]: English + Japanese
      Development of cross-lingual tasks:
      • Non-English caption generation
      • Image caption translation
        Input: a pair of a caption in language A + an image, or a caption in language A alone
        Output: a caption in language B
  41. Non-English caption generation
  42. Non-English caption generation
      Most research generates English captions; other languages include:
      • Japanese [Miyazaki+Shimizu, ACL 2016]: 柵の中にキリンが一頭立っています
      • Chinese [Li+, ICMR 2016]: 金色头发的小女孩
      • Turkish [Unal+, SIU 2016]: Çimlerde koşan bir köpek
  43. Just collecting non-English captions?
      Transfer learning among languages [Miyazaki+Shimizu, ACL 2016]
      • The vision-language grounding weights $W_{im}$ are transferred
      • Efficient learning using a small number of captions
      Example: the grounding behind "an elephant is …" is reused for 「一匹の象が…」
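A minimal sketch of this kind of transfer, with hypothetical module names (`Captioner`, `img_proj`, `embed`, `out` are illustrative, not the names in [Miyazaki+Shimizu, ACL 2016]): the image-grounding projection of a trained English captioner is copied into a Japanese captioner and frozen, so only the language-specific parts need the scarcer Japanese captions.

```python
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)        # vision-language grounding (shared)
        self.embed = nn.Embedding(vocab_size, dim)  # language-specific
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)       # language-specific

english = Captioner(vocab_size=10000)   # assume: pretrained on large English data
japanese = Captioner(vocab_size=8000)

# Transfer the grounding weights and freeze them for Japanese training.
japanese.img_proj.load_state_dict(english.img_proj.state_dict())
for p in japanese.img_proj.parameters():
    p.requires_grad = False

print([n for n, p in japanese.named_parameters() if not p.requires_grad])
```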
  44. Image Caption Translation
  45. Machine translation via visual data
      Images can boost MT [Calixto+, 2012]
      • Example (English to Portuguese): does the word "seal" mean a "seal" similar to a stamp, or the sea animal? Choosing wrongly yields a mistranslation.
      • [Calixto+, 2012] argue that such mistranslations can be avoided using a related image (without experiments)
  46. Input: caption in language A + image
      • Caption translation via an associated image [Elliott+, 2015] [Hitschler+, ACL 2016]
        – Generate translation candidates
        – Re-rank the candidates using the captions of similar images in language B
      Example (German source): "Eine Person in einem Anzug und Krawatte und einem Rock."
      • Translation without the related image: "A person in a suit and tie and a rock."
      • Translation with the related image: "A person in a suit and tie and a skirt."
  47. Input: caption in language A only
      • Cross-lingual document retrieval via images [Funaki+Nakayama, EMNLP 2015]
      • Zero-shot machine translation [Nakayama+Nishida, 2017]
  48. Frontiers of Vision and Language 4: Visual Question Answering
  49. Visual Question Answering (VQA)
      Proposed in human-computer interfaces:
      • VizWiz [Bigham+, UIST 2010]: questions manually solved on AMT
      • Automated for the first time (without Deep Learning) [Malinowski+Fritz, NIPS 2014]
      • Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
  50. VQA: Visual Question Answering [Antol+, ICCV 2015]
      • Established VQA as an AI problem
        – Provided a benchmark dataset
        – Experimental results with reasonable baselines
      • A portal web site is also organized
        – http://www.visualqa.org/
        – Annual competition for VQA accuracy
      Example questions: "What color are her eyes?" / "What is the mustache made of?"
  51. VQA Dataset
      Questions and answers collected on AMT
      • Over 100K real images and 30K abstract images
      • About 700K questions, with 10 answers each
  52. VQA = multiclass classification
      The integrated feature $z_{I+Q}$ is fed to a standard classifier.
      Example: image $I$ + question $Q$ ("What objects are found on the bed?") → answer $A$ ("bed sheets, pillow")
      Image feature $x_I$ + question feature $x_Q$ → integrated feature $z_{I+Q}$
  53. Development of VQA
      How to calculate the integrated feature $z_{I+Q}$?
      • VQA [Antol+, ICCV 2015]: just concatenate them, $z_{I+Q} = [x_I; x_Q]$
      • Summation, $z_{I+Q} = x_I + x_Q$, e.g. summing an attended image feature and a question feature [Xu+Saenko, ECCV 2016]
      • Multiplication, $z_{I+Q} = x_I \odot x_Q$, e.g. bilinear multiplication using the DFT [Fukui+, EMNLP 2016]
      • Hybrid of summation and multiplication, $z_{I+Q} = [x_I + x_Q;\ x_I \odot x_Q]$, e.g. concatenation of sum and product [Saito+, ICME 2017]
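These fusion schemes are easy to state in code. A minimal sketch over precomputed features of matching dimensionality (note that [Fukui+, EMNLP 2016] actually approximates a full bilinear product compactly via FFTs; the elementwise product below is only the simplest multiplicative variant):

```python
import torch

def fuse(x_i, x_q, mode="concat"):
    """Build the integrated feature z_{I+Q} from image/question features."""
    if mode == "concat":   # [Antol+, ICCV 2015]
        return torch.cat([x_i, x_q], dim=-1)
    if mode == "sum":      # e.g. [Xu+Saenko, ECCV 2016] (after attention)
        return x_i + x_q
    if mode == "mul":      # simplest multiplicative variant; MCB [Fukui+,
        return x_i * x_q   # EMNLP 2016] approximates a bilinear product via FFT
    if mode == "hybrid":   # e.g. [Saito+, ICME 2017]
        return torch.cat([x_i + x_q, x_i * x_q], dim=-1)
    raise ValueError(mode)

x_i, x_q = torch.randn(2, 512), torch.randn(2, 512)
for m in ["concat", "sum", "mul", "hybrid"]:
    print(m, fuse(x_i, x_q, m).shape)
# z_{I+Q} then feeds a standard multiclass classifier over frequent answers.
```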
  54. VQA Challenge
      Examples from competition results:
      • Q: "What is the woman holding?" GT A: laptop. Machine A: laptop.
      • Q: "Is it going to rain soon?" GT A: yes. Machine A: yes.
  55. VQA Challenge
      Examples from competition results:
      • Q: "Why is there snow on one side of the stream and clear grass on the other?" GT A: shade. Machine A: yes.
      • Q: "Is the hydrant painted a new color?" GT A: yes. Machine A: no.
  56. Frontiers of Vision and Language 5: Image Generation from Captions
  57. Image generation from an input caption
      Photo-realistic image generation itself is difficult
      • [Mansimov+, ICLR 2016]: incrementally draw the image using an LSTM
      • N.B. photo synthesis from existing photographs is well studied [Hays+Efros, 2007]
  58-62. Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014] (built up over five slides)
      Before conditional generative models:
      • Unconditional generative model
      • Adversarial learning of a Generator and a Discriminator
        – Generator: random vector → image
        – Discriminator: discriminates real from fake ("This is a fake image from the Generator!" … "This is a … hmm")
      • GAN using convolutions: DCGAN [Radford+, ICLR 2016]
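A minimal GAN training step [Goodfellow+, NIPS 2014], with tiny fully connected networks standing in for the convolutional generator and discriminator of DCGAN; all sizes are illustrative:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One adversarial step; `real` is a batch of flattened images in [-1, 1]."""
    n = real.size(0)
    fake = G(torch.randn(n, 64))  # Generator: random vector -> image
    # Discriminator: push real toward "1" and (detached) fake toward "0".
    loss_d = bce(D(real), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool the Discriminator into calling fakes "1".
    loss_g = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.rand(8, 784) * 2 - 1))
```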
  63. Add a caption to the Generator and Discriminator [Reed+, ICML 2016]
      Conditional generative models:
      • The Generator tries to generate an image that is photo-realistic and related to the caption
      • The Discriminator tries to detect an image that is fake or unrelated to the caption
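Conditioning changes little in code. In this sketch, in the spirit of [Reed+, ICML 2016], a caption embedding (from some pretrained text encoder, assumed given) is concatenated with the noise in the Generator and with the image in the Discriminator, so the Discriminator can reject images that are fake or unrelated to the caption; sizes are illustrative:

```python
import torch
import torch.nn as nn

class CondG(nn.Module):
    """Generator conditioned on a caption embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64 + 128, 256), nn.ReLU(),
                                 nn.Linear(256, 784), nn.Tanh())
    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=1))

class CondD(nn.Module):
    """Discriminator scoring both realism and image-caption matching."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784 + 128, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))
    def forward(self, image, text_emb):
        return self.net(torch.cat([image, text_emb], dim=1))

z, txt = torch.randn(4, 64), torch.randn(4, 128)
img = CondG()(z, txt)
print(CondD()(img, txt).shape)  # torch.Size([4, 1]): real-and-matching score
```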
  64. Examples of generated images [Reed+, ICML 2016]
      • Birds (CUB) / Flowers (Oxford-102)
        – About 10K images, with 5 captions for each image
        – 200 kinds of birds / 102 kinds of flowers
      Example captions: "A tiny bird, with a tiny beak, tarsus and feet, a blue crown, blue coverts, and black cheek patch" / "Bright droopy yellow petals with burgundy streaks, and a yellow stigma"
  65. Towards more realistic image generation
      StackGAN [Zhang+, 2016]: two-step GANs
      • The first GAN generates a small, fuzzy image
      • The second GAN enlarges and refines it
  66-67. Examples of generated images [Zhang+, 2016]
      "This bird is blue with white and has a very short beak." / "This flower is white and yellow in color, with petals that are wavy and smooth."
      N.B. These are results on datasets specialized in birds/flowers → further breakthroughs are necessary to generate general images
  68. Take-home Messages
      • Looked over research on vision and language:
        1. Image Captioning
        2. Video Captioning
        3. Multilingual + Image Caption Translation
        4. Visual Question Answering
        5. Image Generation from Captions
      • Contributions of Deep Learning
        – Most research themes existed before Deep Learning
        – Commodity techniques for processing images, videos, and natural language
        – Evolution of recognition and generation
      Towards a new stage of vision and language!
