We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Visual Storytelling (NAACL 2016, Poster)
DII (Description-in-isolation):
A black frisbee is sitting on top of a roof.
A man playing soccer outside of a white house with a red door.
The boy is throwing a soccer ball by the red door.
A soccer ball is over a roof by a frisbee in a rain gutter.
Two balls and a frisbee are on top of a roof.
SIS (Stories-in-sequence):
A discus got stuck up on the roof.
Why not try getting it down with a soccer ball?
Up the soccer ball goes.
It didn't work so we tried a volley ball.
Now the discus, soccer ball, and volleyball are all stuck on the roof.
*Ting-Hao (Kenneth) Huang1, *Francis Ferraro2, Nasrin Mostafazadeh3, Ishan Misra1, Jacob Devlin6, Aishwarya Agrawal4, Ross Girshick5,
Xiaodong He6, Pushmeet Kohli6, Dhruv Batra4, Larry Zitnick5, Devi Parikh5, Lucy Vanderwende6, Michel Galley6 and Margaret Mitchell6
1 Carnegie Mellon University, 2 Johns Hopkins University, 3 University of Rochester, 4 Virginia Tech, 5 Facebook AI Research, 6 Microsoft Research
Motivation
Stories ≠ Consecutive Captions ≠ Descriptive Text
Data set                         Text/Image Pairs (K)   Vocab Size (K)   Words/Sent.   Web Ppl. (30B words)
Brown (comparison only)          52.1 (text only)       47.7             20.8          194.0
DII (Description-in-isolation)   151.8                  13.8             11.0          147.0
SIS (Stories-in-sequence)        252.9                  18.2             10.2          116.0
Getting Humans to Tell Stories
Metric         Pearson's r
BLEU           0.08
SkipThoughts   0.18
METEOR         0.22
Caption Output: This is a picture of a family. This is a picture of a cake. This is a picture of a dog. This is a picture of a beach. This is a picture of a beach.
Stories (Greedy): The family gathered together for a meal. The food was delicious. The dog was excited to be there. The dog was enjoying the water. The dog was happy to be in the water.
Stories (-Dups): The family gathered together for a meal. The food was delicious. The dog was excited to be there. The kids were playing in the water. The boat was a little too much to drink.
Stories (+Grounded): The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.
A solid next move in Artificial Intelligence is to go beyond basic
description of visual scenes towards human-like understanding of
grounded event structure and subjective expression. We introduce the
first dataset for sequential vision-to-language and explore how
modeling concrete description as well as figurative and social language
enables visual storytelling. Our data is at sind.ai.
Get Better Stories with Uniqueness & Visually Grounded Constraints
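The uniqueness constraint can be illustrated with a toy decoder. The sketch below is not the authors' implementation: it assumes precomputed per-position word scores (a real sequence-to-sequence decoder would recompute scores after each emitted word), and the stopword list, function name, and example scores are all hypothetical. It only shows the idea of greedy selection in which a content word may appear at most once per story.

```python
# Hypothetical function-word list; function words may repeat freely.
STOPWORDS = {"the", "a", "was", "were", "is", "to", "be", "in", "of", "and"}


def greedy_decode_no_dups(step_scores, stopwords=STOPWORDS):
    """Pick the best-scoring word at each position, skipping any
    content word that was already emitted earlier in the story.

    step_scores: list of dicts mapping word -> score, one per position.
    Positions where every candidate is blocked are skipped.
    """
    used_content_words = set()
    output = []
    for scores in step_scores:
        # Try candidates from highest to lowest score.
        for word in sorted(scores, key=scores.get, reverse=True):
            if word in stopwords or word not in used_content_words:
                output.append(word)
                if word not in stopwords:
                    used_content_words.add(word)
                break
    return output


# Toy scores (invented for illustration only).
steps = [
    {"the": 0.9},
    {"dog": 0.6, "kids": 0.4},
    {"was": 0.9},
    {"happy": 0.7},
    {"the": 0.9},
    {"dog": 0.6, "kids": 0.4},
]
print(" ".join(greedy_decode_no_dups(steps)))
```

On these toy scores, plain greedy decoding would emit "dog" twice; with the constraint, the second occurrence falls back to the next-best candidate.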
Automatic Evaluation and Results
See our paper for the description-in-sequence tier (DIS) and more!
We define 80-5-5-10 train-dev-validation-test splits for all three tiers.
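A minimal sketch of how an 80-5-5-10 partition could be produced is given below. It is an illustration only: the poster does not say which unit is split or how, so the use of album IDs, the shuffling, the seed, and the function name `split_ids` are all assumptions.

```python
import random


def split_ids(ids, fractions=(0.80, 0.05, 0.05, 0.10), seed=0):
    """Shuffle IDs deterministically and cut them into four parts.

    The last split takes whatever remains, so every ID is assigned
    exactly once even when the fractions do not divide evenly.
    """
    names = ("train", "dev", "validation", "test")
    shuffled = sorted(ids)                 # fixed starting order
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n = len(shuffled)
    splits, start, cumulative = {}, 0, 0.0
    for i, (name, fraction) in enumerate(zip(names, fractions)):
        cumulative += fraction
        end = n if i == len(names) - 1 else round(cumulative * n)
        splits[name] = shuffled[start:end]
        start = end
    return splits


# Hypothetical album IDs, for illustration only.
splits = split_ids(range(100))
print({name: len(part) for name, part in splits.items()})
```

Splitting at the level of whole albums, rather than individual photos, would keep all images from one album in the same partition; whether the released splits do this should be checked against the paper.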
Data
Analysis
       Beam = 10   Beam = 1   -Dups   +Grounded
DII    23.55       19.10      19.21   ----
SIS    23.13       27.76      30.11   31.42
Correlations of automatic scores against human judgments on 3K random SIS training stories.
All values are statistically significant (p < 1e-5).
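For readers unfamiliar with the statistic, the sketch below computes Pearson's r between a metric's scores and human ratings. The numbers are invented for illustration and are not taken from the dataset; the function name and variable names are likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-story values: automatic metric score vs. human rating.
metric_scores = [0.10, 0.25, 0.30, 0.22, 0.41]
human_ratings = [2, 3, 4, 2, 5]
print(round(pearson_r(metric_scores, human_ratings), 3))
```

A value near 0 means the metric tells us little about how humans rate a story; the reported correlations of 0.08 to 0.22 are weak in absolute terms, which is why the comparison between metrics matters more than any single value.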
METEOR scores on the validation split, using a sequence-to-sequence
NN with gated recurrent units.
[Figure for "Getting Humans to Tell Stories". Labels: Flickr Album; Description for Images in Isolation & in Sequences; Storytelling; Re-telling; Preferred Photo Sequence; Story 1 to Story 5.]
Conclusion
Several strong baselines for the task of visual storytelling demonstrate that intelligent machines
can now begin to generate inferential, conceptual, and evaluative language to share human-like
experience. METEOR serves as an automatic metric for evaluation, best correlated with
human judgments. Combining a fully grounded model with a model free to dream yields the best
automatically generated stories to date, but much more work remains to be done.