1. Im2Text: Describing Images Using 1 Million Captioned Photographs
Vicente Ordonez (presenter), Girish Kulkarni, Tamara L. Berg
Stony Brook University
[Image: an example photo labeled with typical computer vision outputs: sky, trees, water, building, bridge]
Our Goal
An old bridge over dirty green water.
One of the many stone bridges in town that carry the gravel carriage roads.
A stone bridge over a peaceful river.
2. Harness the Web!
SBU Captioned Photo Dataset: 1 million captioned images!
Matching using Global Image Features (GIST + Color)
Smallest house in paris between red (on right) and beige (on left).
Bridge to temple in Hoan Kiem lake.
A walk around the lake near our house with Abby.
Hangzhou bridge in West lake.
The daintree river by boat.
The water is clear enough to see fish swimming around in it.
...
Transfer Caption(s), e.g. "The water is clear enough to see fish swimming around in it."
3. Use High Level Content to Rerank
(Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street.
Iron bridge over the Duck river.
The Daintree river by boat.
Bridge over Cacapon river.
...
Transfer Caption(s), e.g. "The bridge over the lake on Suzhou Street."
4. Results
Good:
A female Mallard duck in the lake at Luukki Espoo.
Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
The boat ended up a kilometre from the water in the middle of the airstrip.
Cat in sink.

Bad:
The cat in the window.
Fresh fruit and vegetables at the market in Port Louis Mauritius.
Editor's Notes
Most computer vision methods identify individual pieces of information, but they do not produce the kind of description you would expect from a human. For this picture, a good computer vision system would identify sky, trees, water, building, perhaps even bridge, but a person would say something like "a stone bridge over a peaceful river". So our goal in this paper is to generate image descriptions, as opposed to the individual pieces of information that computer vision methods usually output.
We approach this task in a data-driven manner by first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we keep those most likely to refer to visual content. We then use standard global image descriptors, GIST and TinyImage color, to retrieve similar images from which we can directly transfer captions.
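A minimal sketch of this retrieve-and-transfer baseline, assuming NumPy and a precomputed array of dataset descriptors. The tiny-image color descriptor and plain L2 matching below are illustrative stand-ins, not the paper's exact features or distance; in the paper GIST is combined with the color representation.

```python
import numpy as np

def tiny_color_descriptor(image, size=32):
    """Downsample an HxWx3 uint8 image to size x size and flatten
    (a TinyImage-style color descriptor; GIST would be concatenated)."""
    h, w, _ = image.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    small = image[ys][:, xs].astype(np.float32) / 255.0
    return small.ravel()

def transfer_captions(query_image, dataset_descriptors, dataset_captions, k=4):
    """Return the captions of the k nearest dataset images under L2 distance.
    dataset_descriptors: (N, D) array of tiny_color_descriptor outputs."""
    q = tiny_color_descriptor(query_image)
    dists = np.linalg.norm(dataset_descriptors - q, axis=1)
    nearest = np.argsort(dists)[:k]
    return [dataset_captions[i] for i in nearest]
```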
Additionally, we incorporate high-level information to rerank the images retrieved by the baseline method: we run object detectors, scene classification, stuff detection, and people and action detection, and we compute text statistics. In this example we have bridge and water detections, which we match against similar detections in the retrieved set of images. As you can see here, we run object detectors on a retrieved image only if a relevant keyword is mentioned in its caption. Text statistics are also relevant: if many images in the retrieved set agree that there is a bridge, those images are rewarded in the final ranking as well. Then, again, we can transfer captions from this reranked set of images.
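A hedged sketch of this reranking step, continuing the code above. The candidate dictionaries, the caption-keyword gate, and the uniform weighting are all assumptions for illustration; the paper's actual scoring function combines the detector, scene, stuff, people, and text terms differently.

```python
def rerank(candidates, query_detections):
    """Rescore retrieved candidates using high-level content.
    candidates: list of dicts with 'caption' (str) and 'detections' (set of labels).
    query_detections: labels detected in the query image, e.g. {'bridge', 'water'}."""
    # Text statistics: how often each query label is mentioned across the
    # retrieved captions; labels the retrieved set agrees on count more.
    mention_freq = {
        label: sum(label in c['caption'].lower() for c in candidates) / len(candidates)
        for label in query_detections
    }

    def score(c):
        s = 0.0
        for label in query_detections:
            # Only count a candidate's detection if its caption mentions the
            # relevant keyword (mirroring the keyword gate described above).
            if label in c['caption'].lower() and label in c['detections']:
                s += 1.0 + mention_freq[label]  # detector match, boosted by consensus
        return s

    return sorted(candidates, key=score, reverse=True)
```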
Finally, here are some good and bad results obtained using our full approach. The first picture says "Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind." The captions are very human-like because they were written by actual humans, and the approach works surprisingly well for some types of images. On the other hand, even with 1 million images we cannot generalize to all possible observable images, and our image matching can also fail, leading to bad results. If you would like to see our quantitative results in more detail, please come to our poster. Thanks.