https://imatge-upc.github.io/synthref/
Integrating computer vision with natural language processing has seen significant progress in recent years, owing to the continuous evolution of deep learning. A novel vision-and-language task, which is tackled in the present Master thesis, is referring video object segmentation, in which a language query defines which instance to segment from a video sequence. One of the biggest challenges for this task is the lack of relatively large annotated datasets, since a tremendous amount of time and human effort is required for annotation. Moreover, existing datasets suffer from poor-quality annotations, in the sense that approximately one out of ten language expressions fails to uniquely describe the target object.
The purpose of the present Master thesis is to address these challenges by proposing a novel method for generating synthetic referring expressions for an image (video frame). This method produces synthetic referring expressions by using only the ground-truth annotations of the objects as well as their attributes, which are detected by a state-of-the-art object detection deep neural network. One of the advantages of the proposed method is that its formulation allows it to be applied to any object detection or segmentation dataset.
By using the proposed method, the first large-scale dataset with synthetic referring expressions for video object segmentation is created, based on an existing large benchmark dataset for video instance segmentation. A statistical analysis and comparison of the created synthetic dataset with existing ones is also provided.
The conducted experiments on three different datasets used for referring video object segmentation demonstrate the effectiveness of the generated synthetic data. More specifically, the obtained results show that by pre-training a deep neural network with the proposed synthetic dataset, one can improve the ability of the network to generalize across different datasets. This outcome is even more important taking into account that no additional annotation cost is involved.
4. Vision & Language
● Recently emerged research area
● Owing to deep learning revolution and independent success in CV and NLP
○ CNNs, Object detection/segmentation models
○ LSTMs, Word embeddings
● Many applications
○ Autonomous driving
○ Assistance of visually impaired individuals
○ Interactive video editing
○ Navigation from vision and language etc.
5. Vision & Language Tasks
● Visual Question Answering, Agrawal et al. 2015
● Caption Generation, Vinyals et al. 2015
● Text to Images, Zhang et al. 2016
● And many more!
6. ● Referring Expression
○ An accurate description of a specific object, but not of any other object in the current scene
○ Example:
■ “a woman” ❌
■ “a woman in red” ❌
■ “a woman in red on the right” ✅
■ “a woman in red top and blue shorts” ✅
● Object Segmentation
○ Assign a label to every pixel corresponding to the target object
Object Segmentation with Referring Expressions
9. Many works on images
● First work: “Segmentation from Natural Language Expressions”, Hu et al. 2016
● Subsequent works tried to jointly model vision and language features and leverage
attention to better capture dependencies between visual and linguistic features
● Most of these works use the Refer-It collection of datasets for training and evaluation
○ Three large-scale image datasets with referring expressions and segmentation masks
○ Collected on top of Microsoft COCO (Common objects in Context)
○ RefCOCO, RefCOCO+ and RefCOCOg
11. Few works on videos
● “Video Object Segmentation with
Language Referring Expressions”,
Khoreva et al. 2018
○ DAVIS-2017: Big set of 78 object classes
○ Too few videos (150 in total)
○ They use a frame-based model
○ Pre-training on RefCOCO is used
● “Actor and action segmentation from a
sentence”, Gavrilyuk et al. 2018
○ A2D: Small set of object classes (only 8 actors)
○ J-HMDB: Single object in each video
13. Main Challenges
● Models
○ Temporal consistency across frames
○ Models’ size and complexity
● Data
○ No large-scale datasets for videos
○ Poor quality of crowdsourced referring expressions
■ ~10% fail to correctly describe the target object (i.e. they are not valid referring expressions)
(Analysis from Bellver et al. 2020 on the A2D and DAVIS-2017 datasets)
14. Method Inspiration
(Figures: A2D and DAVIS-2017)
● Existing datasets include trivial cases where a single object from each class
appears
● In such cases an object can be identified using only its class e.g. saying “a
person” or “a horse”
● Existing large datasets for video object segmentation are labeled in terms of
object classes
● Annotating a large dataset with referring expressions requires tremendous
human effort
15. Basic Idea
Generate (automatically) synthetic referring expressions, starting from an object’s class and enriching them with other cues, without any human annotation cost
16. Thesis Purpose
1. Propose a method for generating synthetic referring expressions for a large-scale video object segmentation dataset
2. Evaluate the effectiveness of the generated synthetic referring expressions for the task of video object segmentation with referring expressions
18. YouTube-VIS Dataset
YouTube-VOS
→ Large-scale dataset for video object segmentation
→ Short YouTube videos of 3-6 seconds
→ 4,453 videos in total
→ 94 object categories
YouTube-VIS
→ Created on top of YouTube-VOS
→ 2,883 videos
→ 40 object classes
→ Exhaustively annotated = All objects belonging to
the 40 classes are labeled with pixel-wise masks.
● The formulation of our method allows its application to any other object
detection/segmentation dataset
● We apply our proposed method on the YouTube-VIS dataset
19. Overview
1. Ground-truth annotations
● Object class
● Bounding boxes
○ Relative size
○ Relative location
2. Faster R-CNN, Ren et al. 2015
● Enhanced with attribute head by Tang et al. 2020
● Pre-trained on Visual Genome dataset for attribute detection
○ Able to detect a predefined set of 201 attributes
○ Includes color and non-color attributes
○ Non-color attributes can be adjectives (“large”, “spotted”) or verbs (“surfing”); a sketch of the resulting per-object record is given below
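For concreteness, here is a hedged sketch of the kind of per-object record the expression generator can be thought to consume. The class name, field names, and layout are illustrative assumptions, not the thesis code:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectAnnotation:
    """Illustrative per-object record: ground-truth class and box, plus detected attributes."""
    category: str                                  # ground-truth object class, e.g. "rabbit"
    box: tuple                                     # (x1, y1, x2, y2) bounding box from the annotations
    attributes: set = field(default_factory=set)   # attributes matched from Faster R-CNN detections

frame_objects = [
    ObjectAnnotation("rabbit", (0, 0, 50, 50), {"white"}),
    ObjectAnnotation("rabbit", (60, 0, 110, 50), {"brown"}),
]
```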
20. Cues
1. Object Class (e.g. “a person”)
○ It can be enough only if one object of this class is present in the video frame
○ However, in most cases more cues are necessary
21. Cues
2. Relative Size
○ The areas At and Ao of the target and the other object’s bounding boxes are computed:
■ At >= 2Ao : “bigger” is added to the ref. expression
■ At <= 0.5Ao : “smaller” respectively
■ 0.5Ao < At < 2Ao : relative size not applicable
○ Similarly for more objects: “biggest”/“smallest” if the target is “bigger”/“smaller” than all other objects (a minimal sketch of this rule follows below)
“a bigger dog”
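A minimal sketch of this size rule, assuming axis-aligned boxes given as (x1, y1, x2, y2); the function names and the exact handling of multiple objects are illustrative, not the thesis implementation:

```python
def box_area(box):
    """Area of an axis-aligned bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def relative_size_cue(target_box, other_boxes):
    """Return a size word for the target object, or None if the sizes are too similar."""
    a_t = box_area(target_box)
    areas = [box_area(b) for b in other_boxes]
    if all(a_t >= 2.0 * a for a in areas):
        return "biggest" if len(areas) > 1 else "bigger"
    if all(a_t <= 0.5 * a for a in areas):
        return "smallest" if len(areas) > 1 else "smaller"
    return None  # sizes too similar: the size cue is not applicable

# The target dog's box is more than twice as large as the other dog's box
print(relative_size_cue((10, 10, 110, 110), [(120, 30, 170, 80)]))  # -> "bigger"
```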
22. Cues
3. Relative Location (1 or 2 other objects of the same class)
○ The most discriminative axis (X or Y) is determined using the bounding box boundaries
○ The maximum non-overlapping distance between the bounding boxes is calculated
○ If the distance is above a certain threshold, the relative location is computed according to the axis found:
■ If X-axis: “on the left” / “on the right”
■ If Y-axis: “in the front” / “in the back”
○ For 3 objects, the relative locations of each pair of objects are combined (e.g. “in the middle”, “in the front left”, etc.); see the sketch below
(Example: three rabbits in the frame; the target is described as “rabbit on the left”)
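A hedged sketch of the two-object case. The threshold value and the image-coordinate convention are assumptions (here the Y axis grows downwards, so a larger y is read as “in the front”), and the helper name is illustrative:

```python
def relative_location_cue(target_box, other_box, min_gap=20.0):
    """Pick the axis that best separates two boxes and return a location phrase, or None."""
    tx1, ty1, tx2, ty2 = target_box
    ox1, oy1, ox2, oy2 = other_box
    # Non-overlapping gap along each axis (negative if the boxes overlap on that axis)
    gap_x = max(ox1 - tx2, tx1 - ox2)
    gap_y = max(oy1 - ty2, ty1 - oy2)
    if max(gap_x, gap_y) < min_gap:
        return None  # boxes too close or overlapping: location cue not applicable
    if gap_x >= gap_y:  # X is the more discriminative axis
        return "on the left" if tx2 <= ox1 else "on the right"
    # Y axis: assume larger y (lower in the image) reads as "in the front"
    return "in the back" if ty2 <= oy1 else "in the front"

print(relative_location_cue((0, 0, 50, 50), (100, 0, 150, 50)))  # -> "on the left"
```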
23. Cues
4. Attributes
○ Faster R-CNN detection is matched to the target object using Intersection-over-Union
○ An attribute is added to the referring expression only if it is unique for the target object
○ Attributes can be colors, other adjectives (“spotted”, “large”) and verbs (“walking”, “surfing”)
○ We select up to 2 color attributes (e.g. “brown and black dog”) and 1 non-color (e.g. “walking”); a sketch of the matching step follows the example below
Detected Attributes:
'white' : 0.9250
'black' : 0.8844
'brown' : 0.8062
“a white rabbit”
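A sketch of the IoU matching and uniqueness filter. The IoU threshold, the data layout, and the helper names are assumptions; in the thesis the detections come from the attribute-head Faster R-CNN, and the selection is additionally capped at two color and one non-color attribute, which this sketch omits:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def unique_attributes(target_box, detections, iou_thr=0.5):
    """detections: list of (box, attribute_set) pairs for one frame.
    Keep only attributes of the best-matching detection that no other detection shares."""
    matches = [(iou(target_box, box), attrs) for box, attrs in detections]
    best_iou, best_attrs = max(matches, key=lambda m: m[0], default=(0.0, set()))
    if best_iou < iou_thr:
        return set()
    others = set()
    for score, attrs in matches:
        if attrs is not best_attrs:
            others |= attrs
    return best_attrs - others  # only attributes unique to the target are kept

dets = [((0, 0, 50, 50), {"white", "furry"}), ((60, 0, 110, 50), {"brown", "furry"})]
print(unique_attributes((2, 2, 48, 48), dets))  # -> {'white'}
```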
26. We use the RefVOS model (Bellver et al. 2020) for the experiments
● Frame-based model
● DeepLabv3 visual encoder
● BERT language encoder
● Multi-modal embedding obtained via multiplication (a sketch follows below)
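A minimal PyTorch sketch of the multiplicative fusion step. The tensor shapes, the projection layer, and the class name are assumptions; in RefVOS the visual features come from DeepLabv3 and the sentence embedding from BERT:

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    """Fuse a visual feature map with a sentence embedding by element-wise product."""
    def __init__(self, visual_dim=256, lang_dim=768):
        super().__init__()
        self.project = nn.Linear(lang_dim, visual_dim)  # map the language embedding to visual channels

    def forward(self, visual_feats, lang_embedding):
        # visual_feats: (B, C, H, W); lang_embedding: (B, lang_dim), e.g. a BERT sentence embedding
        lang = self.project(lang_embedding)        # (B, C)
        lang = lang.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1), broadcast over H x W
        return visual_feats * lang                 # multi-modal feature map, (B, C, H, W)

fusion = MultiplicativeFusion()
out = fusion(torch.randn(2, 256, 60, 60), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 256, 60, 60])
```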
27. Training Details
● Batch size of 8 video frames (2 GPUs)
● Frames are cropped/padded to 480x480
● SGD optimizer
● Learning rate policy depends on the target dataset (an illustrative setup is sketched below)
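For reference, a hedged sketch of an SGD setup with a polynomial learning-rate decay, a common choice for DeepLabv3-based models; all values and the stand-in model are illustrative, since the thesis tunes the schedule per target dataset:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)   # stand-in for the segmentation network
total_steps = 10_000                     # illustrative number of training iterations
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1.0 - step / total_steps) ** 0.9  # "poly" learning-rate decay
)
```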
28. Evaluation Metrics
1. Region Similarity (J)
Jaccard Index (Intersection-over-Union) between the predicted and ground-truth masks
2. Contour Accuracy (F)
F1-score of the contour-based precision Pc and recall Rc between the contour points of the predicted mask c(M) and the ground-truth c(G), computed via a bipartite graph matching.
3. Precision@X
Given a threshold X in the range [0.5, 0.9], a predicted mask for an object is counted as a true positive if its J is larger than X, and as a false positive otherwise. Precision is then computed as the ratio between the number of true positives and the total number of instances (a sketch of J and Precision@X follows below).
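A NumPy-based sketch of J and Precision@X on binary masks; the contour accuracy F, which needs boundary matching, is omitted. Function names and the mask layout are assumptions:

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index J between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def precision_at(j_scores, threshold):
    """Fraction of instances whose J exceeds the threshold (e.g. 0.5 ... 0.9)."""
    return float((np.asarray(j_scores) > threshold).mean())

pred = np.ones((4, 4), dtype=np.uint8)                          # predicted mask
gt = np.pad(np.ones((2, 4), dtype=np.uint8), ((0, 2), (0, 0)))  # ground-truth mask
scores = [region_similarity(pred, gt)]
print(scores, precision_at(scores, 0.5))  # [0.5] 0.0  (J = 0.5 is not larger than 0.5)
```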
29. Experiments
1. Extra pre-training of the model using the generated synthetic data and
evaluating on DAVIS-2017 and A2D Sentences datasets
31. Qualitative Results on DAVIS-2017
(Qualitative comparison: pre-trained only on RefCOCO vs. pre-trained on RefCOCO + SynthRef-YouTube-VIS)
32. Results on A2D Sentences
Referring expressions of A2D Sentences are focused on actions, containing mostly verbs and fewer attributes
33. Experiments
1. Pre-training the model using the generated synthetic data and evaluating on
DAVIS-2017 and A2D Sentences datasets
2. Training on human vs. synthetic referring expressions on the same videos
34. Refer-YouTube-VOS
● Seo et al. 2020 annotated the YouTube-VOS dataset with referring expressions
● This allowed a direct comparison of our synthetic referring expressions with human-produced
ones
35. Human vs Synthetic
Training:
1. Synthetic referring expressions from SynthRef-YouTube-VIS (our synthetic dataset)
2. Human-produced referring expressions from Refer-YouTube-VOS
Evaluation: On the test split of SynthRef-YouTube-VIS using human-produced referring expressions
from Refer-YouTube-VOS
36. Experiments
1. Pre-training the model using the generated synthetic data and evaluating on
DAVIS-2017 and A2D Sentences datasets
2. Training on human vs. synthetic referring expressions on the same videos
3. Ablation study
37. Ablation Study
● Impact of Synthetic Referring Expression Information (DAVIS-2017)
● Freezing the language branch for synthetic pre-training
39. Conclusions
1. Pre-training a model on the synthetic referring expressions, in addition to training it on real ones, increases its ability to generalize across different datasets.
2. Gains are higher when no fine-tuning is performed on the target dataset.
3. Synthetic referring expressions do not achieve better results than human-produced ones, but they can be used as a complement without any additional annotation cost.
4. More information in the referring expressions yields better segmentation accuracy.
40. Future Work
● Extend the proposed method by adding more cues
○ Use scene-graph generation models to add relationships between objects (image from Xu et al. 2017)
● Apply the proposed method to other existing object detection/segmentation datasets
○ Create synthetic expressions for Microsoft COCO images to be used interchangeably with RefCOCO