5. Visual Relationships
Figure 3: Qualitative examples of our scene graph detection results. Green boxes and edges are correct predictions, and red boxes and edges are false negatives. … plays a key role in scene graph generation, leading our model to outperform previous state-of-the-art …
[Figure 3 panels: detected scene-graph triples such as (man-1, wearing, glass-1), (woman-GT1, standing on, sidewalk-1), (cup-1, on, table-1); examples grouped as Human-object, Object-object, and Object-attribute relationships.]
6. Why Graphs?
• Interactions among objects can be easily modeled by a graph
• More expressive than a text description
• More suitable as an intermediate data structure
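The bullets above can be made concrete: a scene graph is just a set of object nodes plus labeled, directed edges between them. A minimal Python sketch (hypothetical classes, not tied to any particular library):

```python
from dataclasses import dataclass, field

# A minimal scene-graph data structure: objects are nodes,
# relationships are labeled directed edges between node indices.
@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)   # object labels
    edges: list = field(default_factory=list)   # (subj_idx, predicate, obj_idx)

    def add_object(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1              # index of the new node

    def add_relation(self, subj, predicate, obj):
        self.edges.append((subj, predicate, obj))

    def triples(self):
        # Render edges as human-readable (subject, predicate, object) triples.
        return [(self.nodes[s], p, self.nodes[o]) for s, p, o in self.edges]

g = SceneGraph()
man = g.add_object("man")
glass = g.add_object("glass")
g.add_relation(man, "wearing", glass)
print(g.triples())  # [('man', 'wearing', 'glass')]
```

This is exactly why graphs work as an intermediate structure: downstream tasks (VQA, captioning, generation) can consume the triples directly instead of re-parsing free text.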
7. Scene Graph Applications
• Visual question answering (VQA)
• Image to text (Image captioning)
• Text to image (via scene graphs)
• Group activity recognition
Q: Which side of the image is the plate on? A: Right
Q: What is the dark piece of furniture to the right of the rug called? A: Cabinet
[Hudson and Manning, 2019]
8. Scene Graph Applications
• Visual question answering (VQA)
• Image to text (Image captioning)
• Text to image (via scene graphs)
• Group activity recognition
[Image captioning examples comparing BASE, BASE+MGCN, and SGAE models:
Scene graph nodes: Motorbike, Road, Park, Dirty.
BASE: "a motorcycle parked on the side of a road"
BASE+MGCN: "a motorcycle parked on the side of a road"
SGAE: "a motorcycle is parked on the …"
Scene graph node: Green.
BASE: "a couple of elephants walking in a field"
BASE+MGCN: "two elephants walking in the grass in a field"
SGAE: "a couple of elephants walking through a lush green forest"
GT: "two elephants standing in grassy area with trees around"]
9. Scene Graph Applications
• Visual question answering (VQA)
• Image to text (Image captioning)
• Text to image (via scene graphs)
• Group activity recognition
Input: "A sheep by another sheep standing on the grass with sky above and a boat in the ocean by a tree behind the sheep"
[Figure panels: our result vs. StackGAN [59] and [47]; scene graph over sheep, grass, sky, ocean, tree with relations by, behind, above.]
Figure 1. State-of-the-art methods for generating images from sentences, such as StackGAN [59], struggle to faithfully depict complex sentences with many objects. We overcome this limitation by generating images from scene graphs, allowing our method to reason explicitly about objects and their relationships.
[Johnson+ 2018]
10. Applications of Scene Graphs
• Visual question answering (VQA)
• Image to text (Image captioning)
• Text to image (via scene graphs)
• Group activity recognition
[Figure: actor relation graph, "Left Spike" example.]
Figure 1: Understanding group activity in a multi-person scene requires accurately determining relevant relations between actors. Our model learns to represent the scene by an actor relation graph, and performs reasoning about group activity ("left spike" in the illustrated example) according to the graph structure and node features. Each node denotes an actor, and each edge represents the relation between two actors.
[Wu+ 2019]
12. VQA: Strong real-world bias
Q: How many plastic containers are there? A: Two
Q: How many cups are there? A: Two
A blind model (one that ignores the image entirely) can achieve ~67% accuracy on binary questions
13. Scene Graph Datasets: Visual Genome
• 100k images from the COCO dataset annotated with scene graphs
• 1M instances of objects and 600k relations
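As a sketch of how such annotations are consumed, the snippet below flattens a Visual Genome-style relationship record into (subject, predicate, object) triples. The field names only approximate the public JSON dumps and should be treated as an assumption:

```python
# Flatten Visual Genome-style relationship records into triples.
# NOTE: the dict schema here is an approximation of the dataset's
# JSON dumps, simplified for illustration.
def extract_triples(relationships):
    triples = []
    for rel in relationships:
        subj = rel["subject"]["name"]
        obj = rel["object"]["name"]
        triples.append((subj, rel["predicate"], obj))
    return triples

record = [
    {"subject": {"name": "woman"}, "predicate": "riding",
     "object": {"name": "motorcycle"}},
    {"subject": {"name": "man"}, "predicate": "has",
     "object": {"name": "shirt"}},
]
print(extract_triples(record))
# [('woman', 'riding', 'motorcycle'), ('man', 'has', 'shirt')]
```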
[Figure: ground-truth scene graph; nodes include woman, motorcycle, man, shirt, shorts, helmet, glove, boot, wheel, seat; edges include riding and has.]
Figure 1. A ground truth scene graph containing entities, such as man, bike or helmet, that are localized in the image with bounding boxes, color coded above, and the relationships between these entities, such as riding, the relation between woman and motorcycle, or has, the relation between man and shirt.
[Krishna+ 2016]
14. Scene Graph Datasets - GQA
• Images from Visual Genome annotated with question-answer pairs
• Each image is annotated with a scene graph to represent its semantics
15. Scene Graph Generation
• Different settings of this problem, distinguished by the inputs given:
• Scene graph prediction - the image only
• Scene graph detection - the image plus object proposals
• Predicate classification - the image plus bounding boxes with labels
[Figure: additional qualitative examples in the Scene Graph Detection setting, e.g. (woman-GT1, standing on, sidewalk-1), (bus-1, has, windshield-1), (cup-1, on, table-1). Green boxes are predicted and overlap with the ground truth; … no match. Green edges are true positives predicted by our model at the R@20 setting …]
16. Scene Graph Prediction - Demo
[Demo output: predicted scene graph with objects (man, chair, table, shirt, jean, hair, book, desk, pillow, cup, laptop, phone, paper) and relations such as sitting on, wearing, on, near, has.]
17. Scene Graph Prediction - Demo
[Demo output: predicted scene graph with objects (man, boy, shirt, face, hand, hair, glass, short, head) and relations wearing, has, on.]
18. Evaluation of Scene Graph Models
• Recall@K is the standard metric for benchmarking scene graph prediction performance
• Human-annotated graphs can be incomplete, so recall (rather than precision) avoids penalizing correct predictions that annotators missed
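A minimal sketch of how Recall@K can be computed for a single image, assuming predictions are triples already ranked by model confidence (real benchmarks additionally match predicted boxes to ground truth by IoU, which is omitted here):

```python
# Recall@K for one image: fraction of ground-truth triples that appear
# among the top-K ranked predictions. This is a simplified sketch; box
# IoU matching used by real benchmarks is intentionally left out.
def recall_at_k(pred_triples, gt_triples, k):
    top_k = set(pred_triples[:k])
    hits = sum(1 for t in gt_triples if t in top_k)
    return hits / len(gt_triples)

preds = [("man", "wearing", "glass"),   # ranked by confidence, best first
         ("man", "has", "hand"),
         ("cup", "on", "table"),
         ("dog", "has", "ear")]
gt = [("man", "wearing", "glass"),
      ("cup", "on", "table"),
      ("sheep", "near", "tree")]
print(recall_at_k(preds, gt, k=2))  # only 1 of 3 GT triples in the top-2
```

Because the metric only rewards recovering annotated triples, a model is never punished for predicting extra plausible relations, which is exactly the property needed when the annotation is incomplete.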
19. SOTA in Scene Graph Prediction
• The Visual Genome dataset is used for comparisons
• NeuralMotifs (CVPR 2018)
• LinkNet (NeurIPS 2018)
• Graphical Contrastive Losses for Scene Graph Generation
(CVPR2019) - Winner of OpenImages Relationship Detection
Challenge 2018
20. MotifNet [Zellers+ 2018]
• Faster R-CNN detector pre-trained on Visual Genome
• Global context learned using bidirectional LSTMs
[Figure: detected regions on a dog image (dog, head, eye, eye, nose, ear); VGG16 features and an RPN feed object-context and edge-context bidirectional LSTMs (hidden states h1..h6, d1..d6), yielding predictions such as <dog has head>, <dog has eye>, <background>.]
Figure 5. A diagram of a Stacked Motif Network (MOTIFNET). The model breaks scene graph parsing into stages predicting bounding
regions, labels for regions, and then relationships. Between each stage, global context is computed using bidirectional LSTMs and is then
used for subsequent stages. In the first stage, a detector proposes bounding regions and then contextual information among bounding
regions is computed and propagated (object context). The global context is used to predict labels for bounding boxes. Given bounding
boxes and labels, the model constructs a new representation (edge context) that gives global context for edge predictions. Finally, edges
are assigned labels by combining contextualized head, tail, and union bounding region information with an outer product.
Edge label distribution: Pr(x_{i→j} | B, O) = softmax(W_r g_{i,j} + w_{o_i,o_j}), with separate feature branches for objects and edges.
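MotifNet's edge classifier, Pr(x_{i→j} | B, O) = softmax(W_r g_{i,j} + w_{o_i,o_j}), can be sketched numerically. All dimensions and weights below are illustrative toys, not values from the paper:

```python
import numpy as np

# Toy sketch of the edge classifier: a linear map W_r over the fused
# head/tail/union feature g_{i,j}, plus a per-object-label-pair bias
# w_{o_i,o_j}, followed by a softmax over predicate classes.
rng = np.random.default_rng(0)
n_predicates, feat_dim, n_classes = 5, 8, 3

W_r = rng.normal(size=(n_predicates, feat_dim))                  # predicate weights
w_bias = rng.normal(size=(n_classes, n_classes, n_predicates))   # pair bias table

def edge_distribution(g_ij, o_i, o_j):
    logits = W_r @ g_ij + w_bias[o_i, o_j]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

p = edge_distribution(rng.normal(size=feat_dim), o_i=0, o_j=2)
print(p)  # a probability distribution over the 5 predicate classes
```

The bias table is what lets the model exploit label statistics (e.g. that a (man, shirt) pair is usually linked by a wearing-like predicate) independently of the visual features.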
21. Limitations of SOTA Models
• Simple heuristic baselines work better than previous SOTA models
• Most of the models are overly complex and large
• Not suitable for real-time prediction
• Object detection (e.g. Faster R-CNN) acts as a bottleneck
22. Open Images 2019 - Visual Relationship
Google Research
https://www.kaggle.com/c/open-images-2019-visual-relationship/overview
23. Future Directions
• More intuitive, smaller models
• Scene graphs on videos
• Scene graphs for higher-level understanding of images
• Use of external knowledge graphs to incorporate common sense
24. References
• Wu, Jianchao, et al. "Learning Actor Relation Graphs for Group
Activity Recognition." Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2019.
• Zellers, Rowan, et al. "Neural motifs: Scene graph parsing with
global context." Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2018.
• Zhang, Ji, et al. "Graphical Contrastive Losses for Scene Graph
Generation." Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2019.
• Krishna, Ranjay, et al. "Visual genome: Connecting language
and vision using crowdsourced dense image
annotations." International Journal of Computer Vision 123.1
(2017): 32-73.