Graphs for Visual Understanding
Kaushalya Madhawa

Murata Group
9th July 2019
Slides available at: http://bitly.ws/4muD
Outline
• The problem: scene understanding

• Why scene graphs?

• Scene graph datasets

• State-of-the-art

• Challenges
Object detection
[Two example images, each labeled with the detected objects: Man, Dog]
Object detection isn’t enough!
[The same two images, now labeled with the relationship: man petting dog; man chased by dog]
Visual Relationships
[Figure: qualitative examples of scene graph detection results. Green boxes and edges are correct predictions, and red boxes and edges are false negatives. The example graphs link entities such as man, jacket, phone, bus, windshield, skateboard, cup and table through predicates such as wearing, holding, has, on, near, behind and standing on.]
Types of relationships:
Human-object
Object-object
Object-attribute
Why Graphs?
• Interactions among objects can be easily modeled by a graph

• More expressive than a text description

• More suitable as an intermediate data structure
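The points above can be made concrete with a small sketch of a scene graph as a data structure. This is only an illustration, not code from the talk: the class names `Relation` and `SceneGraph` and the example triples are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    """One directed edge of a scene graph: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


class SceneGraph:
    """Hypothetical minimal scene graph: a node set plus outgoing edges per node."""

    def __init__(self):
        self.nodes = set()
        self.out_edges = defaultdict(list)

    def add(self, subject, predicate, obj):
        # Register both endpoints and store the directed, labeled edge.
        self.nodes.update((subject, obj))
        self.out_edges[subject].append(Relation(subject, predicate, obj))

    def relations_of(self, subject):
        # All (predicate, object) pairs that start at the given subject.
        return [(r.predicate, r.obj) for r in self.out_edges[subject]]


# Illustrative triples only; they are not taken from any dataset.
graph = SceneGraph()
graph.add("man", "petting", "dog")
graph.add("man", "wearing", "jacket")
graph.add("dog", "on", "sidewalk")
```

A downstream task (for example question answering) can then query `graph.relations_of("man")` instead of parsing a free-text caption, which is one sense in which the graph works as an intermediate data structure.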
Scene Graph Applications
• Visual question answering (VQA)
• Image to text (Image captioning)

• Text to image (via scene graphs)

• Group activity recognition
Example (visual question answering):
Q: Which side of the image is the plate on? A: Right
Q: What is the dark piece of furniture to the right of the rug called? A: Cabinet
[Hudson and Manning, 2019]
Example (image captioning), comparing generated captions with the ground truth (GT):
BASE: a couple of elephants walking in a field
BASE+MGCN: two elephants walking in the grass in a field
SGAE: a couple of elephants walking through a lush green forest
GT: two elephants standing in grassy area with trees around
Example (text to image via scene graphs):
Input sentence: "A sheep by another sheep standing on the grass with sky above and a boat in the ocean by a tree behind the sheep"
[Figure: the sentence is converted to a scene graph with nodes such as grass, sky, ocean and tree and edges such as by, behind and above. Caption: State-of-the-art methods for generating images from sentences, such as StackGAN, struggle to faithfully depict complex sentences with many objects. We overcome this limitation by generating images from scene graphs, allowing our method to reason explicitly about objects and their relationships.]
[Johnson+ 2018]
Example (group activity recognition):
[Figure: an actor relation graph over the players in a volleyball scene, with the predicted activity "left spike". Caption: Understanding group activity in multi-person scene requires accurately determining relevant relation between actors. Our model learns to represent the scene by actor relation graph, and performs reasoning about group activity (“left spike” in the illustrated example) according to the graph structure and nodes features. Each node denotes an actor, and each edge represents the relation between two actors.]
[Wu+ 2019]
Visual Question Answering (VQA)
• VQA 2.0 [Goyal+ 2017] contains 200k images with question-answer pairs
VQA: Strong real-world bias
Q: How many plastic containers are there? A: Two
Q: How many cups are there? A: Two
A blind model (one that never sees the image) can reach about 67% accuracy on binary questions
Scene Graph Datasets: Visual Genome
• 100k images from COCO dataset annotated with scene graphs

• 1M instances of objects and 600k relations
[Figure: a ground truth scene graph whose entities (for example motorcycle, helmet, shirt) are localized in the image with bounding boxes, together with the relationships between those entities, such as riding (between woman and motorcycle) or has (between man and shirt).]
[Krishna+ 2016]
Scene Graph Datasets - GQA
• Images from Visual Genome annotated with question-answer pairs

• Each image is annotated with a scene graph to represent its semantics
Scene Graph Generation
• Different settings of this problem, distinguished by the input given to the model

• Scene graph prediction - image only

• Scene graph detection - image and object proposals

• Predicate classification - image, bounding boxes with labels
[Figure: qualitative examples in the scene graph detection setting. Green boxes are predicted boxes that overlap with the ground truth, and green edges are true positives predicted by the model at the R@20 setting.]
Scene Graph Prediction - Demo
[Demo output 1: a predicted scene graph with nodes such as man, chair, table, desk, book, laptop, phone, cup, pillow, paper, shirt, jean and hair, linked by the predicates sitting on, wearing, has, on and near.]
[Demo output 2: a predicted scene graph with nodes such as man, boy, shirt, short, face, hand, head, hair and glass, linked by the predicates wearing, has and on.]
Evaluation of Scene Graph Models
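• Recall@K is used in benchmarking scene graph prediction performance

• Human-annotated graphs can be incomplete, so a correct prediction may be missing from the ground truth; recall does not penalize such predictions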
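A minimal sketch of Recall@K for a single image is given below. It makes a simplifying assumption that is not in the slides: a predicted triple counts as a hit when its labels match a ground-truth triple exactly, whereas benchmark implementations typically also require the predicted boxes to overlap the annotated ones. The function name and the toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the k highest-scoring predictions.

    predictions: list of (score, (subject, predicate, object)) pairs
    ground_truth: set of (subject, predicate, object) triples
    Assumption: matching is on labels only, with no bounding-box check.
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one triple")
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triple for _, triple in top_k} & set(ground_truth)
    return len(hits) / len(ground_truth)


# Toy data: ("man", "has", "hand") is plausible but was never annotated.
gt = {("man", "wearing", "shirt"), ("man", "holding", "phone")}
preds = [
    (0.9, ("man", "wearing", "shirt")),
    (0.8, ("man", "has", "hand")),
    (0.4, ("man", "holding", "phone")),
]
r2 = recall_at_k(preds, gt, 2)  # 0.5
r3 = recall_at_k(preds, gt, 3)  # 1.0
```

In the toy data the unannotated triple takes up one of the top-K slots but is not counted as an error, whereas a precision-based metric would treat it as a false positive.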
SOTA in Scene Graph Prediction
• VisualGenome dataset is used for comparisons

• NeuralMotifs (CVPR 2018)

• LinkNet (NeurIPS 2018)

• Graphical Contrastive Losses for Scene Graph Generation (CVPR 2019) - Winner of OpenImages Relationship Detection Challenge 2018
MotifNet [Zellers+ 2018]
• Faster-RCNN pre-trained on VisualGenome

• Global context learned using bidirectional LSTMs
[Figure: MotifNet pipeline. VGG16 features and an RPN propose regions (for example dog, head, eye, nose, ear); an object context layer and an edge context layer then yield edge predictions such as <dog has head>, <dog has eye> and <background>.]
Figure 5. A diagram of a Stacked Motif Network (MOTIFNET). The model breaks scene graph parsing into stages predicting bounding
regions, labels for regions, and then relationships. Between each stage, global context is computed using bidirectional LSTMs and is then
used for subsequent stages. In the first stage, a detector proposes bounding regions and then contextual information among bounding
regions is computed and propagated (object context). The global context is used to predict labels for bounding boxes. Given bounding
boxes and labels, the model constructs a new representation (edge context) that gives global context for edge predictions. Finally, edges
are assigned labels by combining contextualized head, tail, and union bounding region information with an outer product.
Predicate distribution for the edge from object i to object j, given bounding boxes B and labels O: $\Pr(x_{i \to j} \mid B, O) = \operatorname{softmax}(\mathbf{W}_r \mathbf{g}_{i,j} + \mathbf{w}_{o_i,o_j})$
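The predicate formula above, a softmax over W_r g_{i,j} plus a bias w_{o_i,o_j}, can be illustrated numerically. The sketch below is not the MotifNet implementation: the dimensions and all numbers are made up, `predicate_distribution` is a hypothetical helper, and the bias is assumed to be a vector with one entry per predicate, chosen by the pair of object labels.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicate_distribution(W_r, g_ij, w_bias):
    """softmax(W_r g_ij + w_bias): one probability per candidate predicate.

    W_r: one row of weights per predicate
    g_ij: edge feature vector for the object pair (i, j)
    w_bias: bias vector assumed to be selected by the labels (o_i, o_j)
    """
    logits = [
        sum(w * g for w, g in zip(row, g_ij)) + b
        for row, b in zip(W_r, w_bias)
    ]
    return softmax(logits)


# Toy example: 3 candidate predicates, 2-dimensional edge feature.
W_r = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
g_ij = [2.0, 1.0]
w_bias = [0.0, 0.0, 0.0]
probs = predicate_distribution(W_r, g_ij, w_bias)
```

With these toy values the logits are 2, 1 and 0, so the first predicate receives the highest probability; changing `w_bias` shifts the distribution without touching the edge feature.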
Limitations of SOTA Models
• Simple heuristic baselines work better than previous SOTA models

• Most of the models are too complicated and huge

• Not suitable for real-time predictions

• Object detection (e.g. Faster-RCNN) acts as a bottleneck
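The slides do not say which heuristic baselines are meant. One plausible instance, assumed here purely for illustration, is a frequency baseline: for each pair of object labels, predict the predicate seen most often with that pair in the training annotations. The function names, the `default` fallback and the toy triples are hypothetical.

```python
from collections import Counter, defaultdict


def fit_frequency_baseline(training_triples):
    """Count how often each predicate occurs for every (subject, object) label pair."""
    counts = defaultdict(Counter)
    for subj, pred, obj in training_triples:
        counts[(subj, obj)][pred] += 1
    return counts


def predict_predicate(counts, subj, obj, default="on"):
    """Return the most frequent training predicate for the label pair.

    Falls back to `default` (an arbitrary choice for this sketch) for unseen pairs.
    """
    seen = counts.get((subj, obj))
    if not seen:
        return default
    return seen.most_common(1)[0][0]


# Toy training annotations, invented for this example.
train = (
    [("man", "wearing", "shirt")] * 3
    + [("man", "has", "shirt")]
    + [("cup", "on", "table")]
)
counts = fit_frequency_baseline(train)
```

Such a baseline never looks at pixels once the object labels are known, which is why it is cheap and why it is a useful sanity check for larger models.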
Open Images 2019 - Visual Relationship
Google Research
https://www.kaggle.com/c/open-images-2019-visual-relationship/overview
Future Directions
• Smaller, more intuitive models

• Scene graphs on videos

• Scene graphs for higher-level understanding of images

• Use of external knowledge graphs to incorporate common sense
References
• Wu, Jianchao, et al. "Learning Actor Relation Graphs for Group Activity Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

• Zellers, Rowan, et al. "Neural Motifs: Scene Graph Parsing with Global Context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

• Zhang, Ji, et al. "Graphical Contrastive Losses for Scene Graph Generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

• Krishna, Ranjay, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." International Journal of Computer Vision 123.1 (2017): 32-73.

Graphs for Visual Understanding

  • 1. Graphs for Visual Understanding. Kaushalya Madhawa, Murata Group, 9th July 2019. Slides available at: http://bitly.ws/4muD
  • 2. Outline
      • The problem: scene understanding
      • Why scene graphs?
      • Scene graph datasets
      • State-of-the-art
      • Challenges
  • 3. Object detection [Figure: two example images; object detection labels both of them "Man, Dog"]
  • 4. Object detection isn’t enough! [Figure: the two "Man, Dog" images, now annotated with the relationships "petting" and "chased by"]
  • 5. Visual Relationships: human-object, object-object, object-attribute
    [Figure: qualitative examples of scene graph detection results. Green boxes and edges are correct predictions; red boxes and edges are false negatives. Graph nodes include man-1, jacket-1, phone-GT1, bus-1, windshield-1, sidewalk-1 and table-1; edge labels include wearing, holding, has, on, near, behind, in front of and standing on]
  • 6. Why Graphs?
      • Interactions among objects can be easily modeled by a graph
      • More expressive than a text description
      • More suitable as an intermediate data structure
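To make the "intermediate data structure" point concrete, a scene graph can be held as a set of labeled nodes plus (subject, predicate, object) edges. The following is a minimal sketch only: the class names `SceneGraph` and `Relation` are hypothetical, and the example objects and relations are illustrative rather than taken from any dataset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    subject: str    # node id, e.g. "man-1"
    predicate: str  # edge label, e.g. "petting"
    obj: str        # node id, e.g. "dog-1"


@dataclass
class SceneGraph:
    labels: dict = field(default_factory=dict)     # node id -> class label
    relations: list = field(default_factory=list)  # list of Relation

    def add_object(self, node_id, label):
        self.labels[node_id] = label

    def add_relation(self, subject, predicate, obj):
        # Both endpoints must already be nodes of the graph.
        if subject not in self.labels or obj not in self.labels:
            raise KeyError("unknown node id")
        self.relations.append(Relation(subject, predicate, obj))

    def triples(self):
        # Resolve node ids to class labels for readability.
        return [(self.labels[r.subject], r.predicate, self.labels[r.obj])
                for r in self.relations]


graph = SceneGraph()
graph.add_object("man-1", "man")
graph.add_object("dog-1", "dog")
graph.add_object("hat-1", "hat")
graph.add_relation("man-1", "petting", "dog-1")
graph.add_relation("man-1", "wearing", "hat-1")
print(graph.triples())
```

Keeping node ids separate from class labels lets one image contain several instances of the same class (for example two sheep), which a plain caption cannot disambiguate as easily.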
  • 7. Scene Graph Applications
      • Visual question answering (VQA)
      • Image to text (image captioning)
      • Text to image (via scene graphs)
      • Group activity recognition
    VQA examples [Hudson and Manning, 2019]:
      Q: Which side of the image is the plate on? A: Right
      Q: What is the dark piece of furniture to the right of the rug called? A: Cabinet
  • 8. Scene Graph Applications: image to text (image captioning)
    Example captions from the slide (elephant image):
      BASE: a couple of elephants walking in a field
      BASE+MGCN: two elephants walking in the grass in a field
      SGAE: a couple of elephants walking through a lush green forest
      GT: two elephants standing in grassy area with trees around
    [Figure: further caption comparisons (motorcycle, city street, person with umbrella) are truncated in the extracted text]
  • 9. Scene Graph Applications: text to image (via scene graphs) [Johnson+ 2018]
    Example input sentence: "A sheep by another sheep standing on the grass with sky above and a boat in the ocean by a tree behind the sheep"
    [Figure 1 caption: State-of-the-art methods for generating images from sentences, such as StackGAN, struggle to faithfully depict complex sentences with many objects. We overcome this limitation by generating images from scene graphs, allowing our method to reason explicitly about objects and their relationships.]
  • 10. Applications of Scene Graphs: group activity recognition [Wu+ 2019]
    [Figure 1 caption: Understanding group activity in multi-person scene requires accurately determining relevant relation between actors. Our model learns to represent the scene by actor relation graph, and performs reasoning about group activity ("left spike" in the illustrated example) according to the graph structure and nodes features. Each node denotes an actor, and each edge represents the relation between two actors.]
  • 11. Visual Question Answering (VQA)
      • VQA 2.0 [Goyal+ 2017] contains 200k images with question-answer pairs
  • 12. VQA: Strong real-world bias
      Q: How many plastic containers are there? A: Two
      Q: How many cups are there? A: Two
      • A blind model (one that answers without looking at the image) can reach ~67% accuracy on binary questions
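One way to see how a "blind" model can score well is to answer from the question wording alone, using the most frequent training answer for questions that start the same way. The sketch below assumes a toy training set invented for illustration (it is not VQA 2.0 data) and a hypothetical `BlindBaseline` class; it does not reproduce the ~67% figure from the slide.

```python
from collections import Counter, defaultdict


class BlindBaseline:
    """Answers from question text only; the image is never used."""

    def __init__(self, prefix_len=2):
        self.prefix_len = prefix_len
        self.by_prefix = defaultdict(Counter)  # question prefix -> answer counts
        self.overall = Counter()               # fallback answer counts

    def _prefix(self, question):
        return tuple(question.lower().split()[: self.prefix_len])

    def fit(self, qa_pairs):
        for question, answer in qa_pairs:
            self.by_prefix[self._prefix(question)][answer] += 1
            self.overall[answer] += 1

    def predict(self, question):
        # Fall back to the global answer distribution for unseen prefixes.
        counts = self.by_prefix.get(self._prefix(question)) or self.overall
        return counts.most_common(1)[0][0]


# Toy question-answer pairs, invented for this example.
toy_train = [
    ("How many cups are there?", "two"),
    ("How many plastic containers are there?", "two"),
    ("How many dogs are there?", "one"),
    ("Is the man smiling?", "yes"),
    ("Is the dog asleep?", "yes"),
    ("Is the sky cloudy?", "no"),
]
blind = BlindBaseline()
blind.fit(toy_train)
print(blind.predict("How many plates are there?"))
print(blind.predict("Is the cat black?"))
```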
  • 13. Scene Graph Datasets: Visual Genome [Krishna+ 2016]
      • 100k images from the COCO dataset annotated with scene graphs
      • 1M object instances and 600k relations
    [Figure: a ground-truth scene graph whose entities (e.g. bike, helmet) are localized in the image with color-coded bounding boxes, with relationships such as riding (between woman and motorcycle) and has (between man and shirt)]
  • 14. Scene Graph Datasets: GQA
      • Images from Visual Genome annotated with question-answer pairs
      • Each image is annotated with a scene graph to represent its semantics
  • 15. Scene Graph Generation
      • Different settings of this problem, distinguished by what is given as input:
          • Scene graph prediction: image
          • Scene graph detection: image and object proposals
          • Predicate classification: image, bounding boxes with labels
    [Figure: qualitative examples in the Scene Graph Detection setting; green edges are true positives predicted by the model at the R@20 setting]
  • 16. Scene Graph Prediction - Demo [Figure: predicted scene graph; nodes include man, chair, table, shirt, jean, hair, book, desk, pillow, cup, laptop, phone and paper; edge labels include near, sitting on, wearing, has and on]
  • 17. Scene Graph Prediction - Demo [Figure: predicted scene graph; nodes include man, boy, shirt, face, hand, hair, glass, short and head; edge labels include wearing, has and on]
  • 18. Evaluation of Scene Graph Models
      • Recall@K is used to benchmark scene graph prediction performance
      • Human-annotated graphs can be incomplete, so recall is preferred over metrics that would penalize correct but unannotated relations
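A minimal sketch of Recall@K for a single image: the fraction of annotated triples that appear among the K highest-scoring predictions. It matches triples by label only, which is a simplifying assumption; benchmark implementations typically also check that the predicted boxes overlap the annotated ones. The function name `recall_at_k` and the example triples are illustrative.

```python
def recall_at_k(predicted, ground_truth, k):
    """predicted: list of (score, (subject, predicate, object)) tuples.
    ground_truth: iterable of (subject, predicate, object) tuples."""
    gt = set(ground_truth)
    if not gt:
        return 0.0
    # Rank predictions by score, highest first, and keep the top K.
    ranked = sorted(predicted, key=lambda item: item[0], reverse=True)
    top_k = {triple for _, triple in ranked[:k]}
    return len(gt & top_k) / len(gt)


predictions = [
    (0.9, ("man", "wearing", "shirt")),
    (0.8, ("man", "sitting on", "chair")),  # plausible, but not annotated
    (0.6, ("book", "on", "desk")),
    (0.3, ("cup", "on", "desk")),
]
annotated = [
    ("man", "wearing", "shirt"),
    ("book", "on", "desk"),
    ("laptop", "on", "desk"),
]
print(recall_at_k(predictions, annotated, 2))
print(recall_at_k(predictions, annotated, 4))
```

In this toy example the unannotated "man sitting on chair" prediction takes up a slot in the top K but is not counted as an error.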
  • 19. SOTA in Scene Graph Prediction
      • The Visual Genome dataset is used for comparisons
      • NeuralMotifs (CVPR 2018)
      • LinkNet (NeurIPS 2018)
      • Graphical Contrastive Losses for Scene Graph Generation (CVPR 2019), winner of the OpenImages Relationship Detection Challenge 2018
  • 20. MotifNet [Zellers+ 2018]
      • Faster-RCNN pre-trained on Visual Genome
      • Global context learned using bidirectional LSTMs
    [Figure 5 caption: A diagram of a Stacked Motif Network (MOTIFNET). The model breaks scene graph parsing into stages predicting bounding regions, labels for regions, and then relationships. Between each stage, global context is computed using bidirectional LSTMs and is then used for subsequent stages. In the first stage, a detector proposes bounding regions and then contextual information among bounding regions is computed and propagated (object context). The global context is used to predict labels for bounding boxes. Given bounding boxes and labels, the model constructs a new representation (edge context) that gives global context for edge predictions. Finally, edges are assigned labels by combining contextualized head, tail, and union bounding region information with an outer product.]
    Edge label distribution: $\Pr(x_{i \to j} \mid B, O) = \mathrm{softmax}(W_r g_{i,j} + w_{o_i,o_j})$
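The edge-labeling formula on this slide, softmax(W_r g_{i,j} + w_{o_i,o_j}), can be sketched in a few lines. This is a toy illustration under stated assumptions: g_{i,j} is read as the edge feature for the object pair (i, j), and w_{o_i,o_j} as a bias vector selected by the two object labels. The weights, features, bias values and the three-predicate vocabulary are all made up.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicate_distribution(W_r, g_ij, w_bias):
    """Return softmax(W_r g_ij + w_bias) over predicate classes."""
    logits = [sum(w * g for w, g in zip(row, g_ij)) + b
              for row, b in zip(W_r, w_bias)]
    return softmax(logits)


PREDICATES = ["background", "has", "wearing"]
W_r = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # one row per predicate (toy values)
g_dog_head = [0.5, 0.5]                      # toy edge feature for (dog, head)
bias = {("dog", "head"): [0.0, 2.0, -1.0]}   # toy label-pair bias

probs = predicate_distribution(W_r, g_dog_head, bias[("dog", "head")])
no_bias = predicate_distribution(W_r, g_dog_head, [0.0, 0.0, 0.0])
print(PREDICATES[probs.index(max(probs))])
```

With these toy numbers the edge feature alone cannot separate "has" from "wearing" (their scores tie in `no_bias`); the label-pair bias breaks the tie in favor of "has".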
  • 21. Limitations of SOTA Models
      • Simple heuristic baselines work better than previous SOTA models
      • Most of the models are too complicated and huge
      • Not suitable for real-time predictions
      • Object detection (e.g. Faster-RCNN) acts as a bottleneck
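A simple heuristic baseline of the kind the first bullet mentions can be as small as a lookup table: for a given (subject label, object label) pair, predict the predicate seen most often in training. The sketch below uses a toy training list invented for illustration; `FrequencyBaseline` is a hypothetical name, and this is not claimed to be the exact baseline the slide refers to.

```python
from collections import Counter, defaultdict


class FrequencyBaseline:
    """Predicts the most frequent training predicate for a label pair."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # (subject, object) -> predicate counts

    def fit(self, triples):
        for subject, predicate, obj in triples:
            self.counts[(subject, obj)][predicate] += 1

    def predict(self, subject, obj, default="background"):
        counts = self.counts.get((subject, obj))
        if not counts:
            return default  # label pair never seen in training
        return counts.most_common(1)[0][0]


# Toy training triples, invented for this example.
train_triples = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "has", "shirt"),
    ("cup", "on", "table"),
    ("dog", "has", "head"),
]
baseline = FrequencyBaseline()
baseline.fit(train_triples)
print(baseline.predict("man", "shirt"))
print(baseline.predict("sheep", "boat"))
```

Such a table needs no image features at all once the object labels are known, which is part of why it is a useful sanity check for larger models.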
  • 22. Open Images 2019 - Visual Relationship (Google Research): https://www.kaggle.com/c/open-images-2019-visual-relationship/overview
  • 23. Future Directions
      • More intuitive smaller models
      • Scene graphs on videos
      • Scene graphs for higher level understanding of images
      • Use of external knowledge graphs to incorporate common sense
  • 24. References
      • Wu, Jianchao, et al. "Learning Actor Relation Graphs for Group Activity Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
      • Zellers, Rowan, et al. "Neural Motifs: Scene Graph Parsing with Global Context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
      • Zhang, Ji, et al. "Graphical Contrastive Losses for Scene Graph Generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
      • Krishna, Ranjay, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." International Journal of Computer Vision 123.1 (2017): 32-73.