Video + Language
Jiebo Luo
Department of Computer Science
What Is Computer Vision?
• A major branch of Artificial Intelligence
• Vision is the process of discovering from images what
is present in the world, and where it is.
-- David Marr, Vision (1982)
Computer vision: 50 years of progress
feature engineering + model learning: $\mathbf{f} = f(I)$, $\mathbf{y} = g(\mathbf{f}, \theta)$
deep learning: $\mathbf{y} = g(I, \theta)$, where feature extraction and prediction are learned jointly
[Figure: video datasets plotted by #Class (x-axis) vs #Example (y-axis): HMDB51, Hollywood, UCF101, CCV, ActivityNet, FCVID, MSR-VTT, Sports-1M, YouTube-8M]
Computer vision: 50 years of progress
[Figure: image and image-text datasets plotted by #Class (x-axis) vs #Example (y-axis): Caltech 101, Caltech 256, Pascal, SUN, ImageNet, COCO, Visual Genome, Open Images, KBK-1M, Flickr 8K, Flickr 30K, SBU, MSVD, MPII-MD, M-VAD, TGIF, YFCC 100M]
Introduction
• Video has become ubiquitous on the Internet, on TV, and on personal devices.
• Recognition of video content has been a fundamental challenge
in computer vision for decades, where previous research
predominantly focused on understanding videos using a
predefined yet limited vocabulary.
• Thanks to the recent development of deep learning techniques,
researchers in both computer vision and multimedia
communities are now striving to bridge video with
natural language, which can be regarded as the ultimate goal of
video understanding.
• We present recent advances in exploring the synergy of video
understanding and language processing, including video-
language alignment, video captioning, and video emotion
analysis.
From Classification to Description
Recognizing Realistic Actions from Videos "in the Wild"
UCF-11 to UCF-101
(CVPR 2009)
Similarity between Videos
Cross-Domain Learning
Visual Event Recognition in Videos by
Learning from Web Data
(CVPR2010 Best Student Paper)
Heterogeneous Feature Machine For Visual Recognition
(ICCV 2009)
From Classification to Description
Semantic Video Entity Linking
(ICCV2015)
Aligning Language Descriptions
with Videos
Iftekhar Naim, Young Chol Song, Qiguang Liu, Jiebo Luo, Dan Gildea, Henry Kautz
Overview
 Unsupervised alignment of video with text
 Motivations
 Generate labels from data (reduce burden of manual labeling)
 Learn new actions from only parallel video+text
 Extend noun/object matching to verbs and actions
[Figure: overview of the text and video alignment framework (Naim et al., 2015): matching nouns to objects and verbs to actions; example sentence: "The person takes out a knife and cutting board"]
Hyperfeatures for Actions
 High-level features are required for alignment with text
→ Motion features are generally low-level
 Hyperfeatures, originally used for image recognition, extended for use with motion features
→ Use the temporal domain instead of the spatial domain for vector quantization (clustering)
Originally described in "Hyperfeatures: Multilevel Local Coding for Visual Recognition", Agarwal, A. (ECCV 2006), for images
Hyperfeatures for Actions
 From low-level motion features, create high-level representations that can easily align with verbs in text
[Figure: constructing hyperfeatures for actions] Each color code is a vector-quantized STIP point → accumulate STIP points over the frame at time t and cluster → vector-quantized STIP-point histogram at time t → accumulate cluster histograms over the window (t-w/2, t+w/2] and conduct vector quantization of the windowed histogram → first-level hyperfeatures (e.g., clusters 3, 5, ..., 5, 20 at time t quantize to hyperfeature 6) → align hyperfeatures with verbs from text (using LCRF)
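A minimal sketch of this two-level coding, assuming precomputed per-frame STIP codeword histograms; the function name, k-means quantizer, and default cluster counts are illustrative rather than the paper's exact settings:

import numpy as np
from sklearn.cluster import KMeans

def hyperfeatures(frame_histograms, w=150, n_hyper=64, seed=0):
    # frame_histograms: (T, d) array; row t is the vector-quantized
    # STIP-point histogram for frame t (the first-level coding).
    T, _ = frame_histograms.shape
    # Accumulate histograms over the temporal window (t - w/2, t + w/2].
    windowed = np.stack([
        frame_histograms[max(0, t - w // 2): min(T, t + w // 2)].sum(axis=0)
        for t in range(T)
    ])
    # Second-level vector quantization in the *temporal* domain: cluster
    # the windowed histograms; the cluster index at time t is the
    # hyperfeature that is later aligned with verbs via the LCRF.
    km = KMeans(n_clusters=n_hyper, n_init=10, random_state=seed)
    return km.fit_predict(windowed)   # (T,) hyperfeature labels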
Latent-variable CRF Alignment
 CRF where the latent variable is the alignment
 $N$ pairs of video/text observations $\{(x_i, y_i)\}_{i=1}^{N}$ (indexed by $i$)
 $x_{i,m}$ represents the nouns and verbs extracted from the $m$-th sentence
 $y_{i,n}$ represents the blobs and actions in interval $n$ of the video
 Conditional likelihood: obtained by summing the conditional probability of each latent alignment, scored by a feature function over co-occurrences
 Learning weights $w$: stochastic gradient descent
More details in the Naim et al. 2015 NAACL paper, "Discriminative unsupervised alignment of natural language instructions with corresponding video segments"
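As a toy illustration only, the posterior over latent alignments can be computed by brute force; here phi and the unconstrained alignment space are stand-ins for the paper's features and its efficiently searched, constrained alignments:

import itertools
import numpy as np

def alignment_posterior(w, phi, M, N):
    # h assigns each of the M sentences to one of the N video intervals;
    # phi(h) returns the alignment's feature vector (e.g., verb/action and
    # noun/blob co-occurrence counts), and w holds the CRF weights learned
    # by stochastic gradient descent. Brute-force enumeration for clarity.
    alignments = list(itertools.product(range(N), repeat=M))
    scores = np.array([float(w @ phi(h)) for h in alignments])
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return alignments, p / p.sum()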
Experiments: Wetlab Dataset
 RGB-Depth video with lab protocols in text
 Compare addition of hyperfeatures generated from motion features to previous results (Naim et al. 2015)
 Small improvement over previous results
 Activities already highly correlated with object-use
[Figure: detection of objects in 3D space using color and point cloud. Table: previous results using object/noun alignment only vs. the addition of different types of motion features (2DTraj: dense trajectories). *Using hyperfeature window size w=150]
Experiments: TACoS Dataset
 RGB video with crowd-sourced text descriptions
 Activities such as "making a salad," "baking a cake"
 No object recognition, alignment using actions only
 Uniform: Assume each sentence takes the same amount of time over the entire sequence
 Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels
 Unsupervised LCRF: Both segmentation and alignment are unknown
 Effect of window size and number of clusters
 Consistent with average action length: 150 frames
*Using hyperfeature window size w=150; d(2)=64
Experiments: TACoS Dataset
 Segmentation from a sequence in the dataset
Crowd‐sourced descriptions
Example of text and video alignment generated
by the system on the TACoS corpus for sequence s13‐d28
Image Captioning with Semantic
Attention (CVPR 2016)
Quanzeng You, Jiebo Luo
Hailin Jin, Zhaowen Wang and Chen Fang
Image Captioning
• Motivations
– Real-world Usability
• Help visually impaired people, learning-impaired
– Improving Image Understanding
• Classification, Object detection
– Image Retrieval
1. a young girl inhales with the intent of blowing
out a candle
2. girl blowing out the candle on an ice cream
1. A shot from behind home
plate of children playing
baseball
2. A group of children playing
baseball in the rain
3. Group of baseball players
playing on a wet field
Introduction of Image Captioning
• Machine learning as an approach to solve the problem
Overview
• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results
Brief Introduction of Recurrent Neural Network
• Different from CNN
$h_t = f(x_t, h_{t-1}) = \phi(A x_t + B h_{t-1})$
$y_t = C h_t$
• Unfolding over time
 Feedforward network
 Backpropagation through time
[Figure: the network and its unfolding over time. Input x_t enters through weights A; hidden units h_t feed back through B (h_{t-1}, h_{t-2}, ...); output y_t is read out through C. The same weights A, B, C are shared across time steps.]
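A minimal sketch of the recurrence equations above, assuming tanh as the nonlinearity φ; the function names are illustrative:

import numpy as np

def rnn_step(x_t, h_prev, A, B, C):
    # h_t = tanh(A x_t + B h_{t-1});  y_t = C h_t
    h_t = np.tanh(A @ x_t + B @ h_prev)
    return h_t, C @ h_t

def unroll(xs, h0, A, B, C):
    # Unfolding over time: the same weights A, B, C are reused at every
    # step, which is what backpropagation through time exploits.
    h, ys = h0, []
    for x in xs:
        h, y = rnn_step(x, h, A, B, C)
        ys.append(y)
    return h, ys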
Applications of Recurrent Neural Networks
• Machine Translation
• Reads input sentence “ABC” and produces “WXYZ”
Encoder RNN → Decoder RNN
Encoder-Decoder Framework for Captioning
• Inspired by neural network based machine
translation
• Loss function
$L = -\log p(\mathbf{w} \mid I) = -\sum_{t=1}^{N} \log p(w_t \mid I, w_0, w_1, \ldots, w_{t-1})$
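This per-word negative log-likelihood can be written directly; in this sketch step_probs stands in for the decoder's softmax outputs:

import numpy as np

def caption_nll(step_probs, caption_ids):
    # step_probs: (N, V) array; row t is the decoder's vocabulary
    # distribution p(. | I, w_0, ..., w_{t-1}) at step t.
    # caption_ids: the N ground-truth word indices w_1, ..., w_N.
    return -sum(np.log(step_probs[t, w]) for t, w in enumerate(caption_ids))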
Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
– Success of low-level tasks
• Visual attributes detection
Image Captioning with Semantic Attention
• Big idea
First Idea
• Provide additional knowledge at each input node
• Concatenate the input word and the extra attributes K
• Each image has a fixed keyword list
$h_t = f(x_t, h_{t-1}) = f([w_t; W_k K + b], h_{t-1})$
Visual features: 1024-d GoogLeNet
LSTM hidden states: 512
Training details:
1. Batches of 256 image/sentence pairs
2. RMSProp optimizer
Using Attributes along with Visual Features
• Provide additional knowledge at each input node
• Concatenate the visual embedding and keywords for h0
$h_0 = f(v, h_{-1})$, with $v = [W_v v_i; W_k K + b]$
Attention Model on Attributes
• Instead of using the same set of attributes at every
step
• At each step, select the attributes (attention)
$att(w_t, K) = \sum_m \alpha_{t,m} k_m$
$\alpha_t = \mathrm{softmax}(w_t^{T} V K)$
$h_t = f(x_t, h_{t-1}) = f([x_t; att(w_t, K)], h_{t-1})$
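A minimal sketch of this attribute attention, following the reconstructed formulas above; the shapes and names are assumptions:

import numpy as np

def attribute_attention(w_t, K, V):
    # w_t: (d,) embedding of the current word; K: (m, d) keyword
    # embeddings; V: (d, d) bilinear weight matrix.
    # alpha_t = softmax(w_t^T V K);  att(w_t, K) = sum_m alpha_{t,m} k_m
    scores = K @ (V @ w_t)                    # (m,) bilinear relevance
    scores -= scores.max()                    # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha, alpha @ K                   # weights, attended vector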
Overall Framework
• Training with a bilinear/bilateral attention model
[Figure: overall framework. A CNN encodes the image into visual feature v; attribute detectors AttrDet 1...N produce the attribute set {A_i}; at t = 0 the RNN is initialized with v; at each step the input attention φ and output attention ϕ combine the word x_t, hidden state h_t, and attributes to yield p_t and the predicted word Ỹ_t.]
Visual Attributes
• A secondary contribution
• We try different approaches
Performance
• Examples showing the impact of visual attributes on captions
Performance on the Testing Dataset
• Publicly available split
Performance
• MS-COCO Image Captioning Challenge
Captioning with Emotion and Style
A Simple Framework
Examples
Integrating Scene Text and Visual Appearance
for Fine-Grained Image Classification (Xiang Bai et al.)
TGIF: A New Dataset and Benchmark on
Animated GIF Description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault,
Larry Goldberg, Jiebo Luo
Overview
Comparison with Existing Datasets
Examples
a skate boarder is
doing trick on his skate
board.
a gloved hand opens to
reveal a golden ring.
a sport car is swinging on
the race playground
the vehicle is moving fast
into the tunnel
Contributions
• A large scale animated GIF description dataset for promoting image
sequence modeling and research
• Performing automatic validation to collect natural language descriptions
from crowd workers
• Establishing baseline image captioning methods for future
benchmarking
• Comparison with existing datasets, highlighting the benefits of animated GIFs
In Comparison with Existing Datasets
• The language in our dataset is closer to common language
• Our dataset has an emphasis on the verbs
• Animated GIFs are more coherent and self contained
• Our dataset can be used to tackle the more difficult movie description problem
Machine Generated Sentence Examples
Comparing Professionals and Crowd-workers
Crowd worker: two people are kissing on a boat.
Professional: someone glances at a kissing couple
then steps to a railing overlooking the ocean an
older man and woman stand beside him.
Crowd worker: two men got into their car
and not able to go anywhere because the
wheels were locked.
Professional: someone slides over the
camaros hood then gets in with his partner
he starts the engine the revving vintage
car starts to backup then lurches to a halt.
Crowd worker: a man in a shirt and tie sits beside a person who is
covered in a sheet.
Professional: he makes eye contact with the woman for only a second.
More: http://beta-web2.cloudapp.net/lsmdc_sentence_comparison.html
Movie Descriptions versus TGIF
• Crowd workers are encouraged to describe the major visual content
directly, and not to use overly descriptive language
• Because our animated GIFs are presented to crowd workers without
any context, the sentences in our dataset are more self-contained
• Animated GIFs are perfectly segmented since they are carefully
curated by online users to create a coherent visual story
Code & Dataset
• Yahoo! webscope (Coming soon!)
• Animated GIFs and sentences
• Code and models for LSTM baseline
• Pipeline for syntactic and semantic validation to collect natural language descriptions from crowd workers
Image Captioning with Unpaired Data
• Supervised and semi-supervised captioning methods rely on paired
training data, which are expensive to acquire (i.e., the bottleneck)
• Image captioning trained with unpaired data is a promising direction
because unpaired training data are much easier to obtain
[Figure: (a) supervised captioning: fully paired images and captions; (b) semi-supervised captioning: partially paired; (c) unpaired captioning: disjoint sets of images and captions]
Image Captioning with Unpaired Data
• How to align images with sentences?
Example sentence: "Little girl looking down at leaves with her bicycle with training wheels parked next to her."
[Figure: the same image with object detection applied]
59 60
61 62
63 64
2019/5/10
10
Image Captioning with Unpaired Data
Bench, bike, girl.
How to
generate a
sentence?
Image Captioning with Unpaired Data
Bench, bike, girl.
…
A young child in a park next to a red
bench and red bicycle that as training
wheels.
A woman that is standing in front of an oven.
Real sentences
Discriminator Real/Fake
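A hedged sketch of the adversarial objective implied by this slide; D and the sentence features it consumes are stand-ins, and this is the generic GAN loss rather than the paper's exact formulation:

import numpy as np

def discriminator_loss(D, real_feats, fake_feats):
    # D maps a sentence feature vector to a probability in (0, 1) that
    # the sentence came from the real (unpaired) text corpus rather than
    # from the captioner. Standard GAN objective; the generator is in
    # turn trained to maximize log D(fake) on its own outputs.
    real = np.mean([np.log(D(s) + 1e-9) for s in real_feats])
    fake = np.mean([np.log(1.0 - D(s) + 1e-9) for s in fake_feats])
    return -(real + fake)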
Image Captioning with Unpaired Data
Unpaired Image Captioning
detected concepts person, tree, helmet
con2sen a person in a helmet is riding a tree .
feat2sen a person wearing a helmet is skateboarding on a ramp .
adv a man riding a skateboard down a street .
adv + con a person on a court with a racket
adv + con + im a man riding a skateboard on a ramp .
Ours without init a man riding a surfboard on top of a wave .
Ours a man on a skateboard performing a trick .
BRNN a man riding a skateboard down the side of a ramp
detected concepts boat, watercraft
con2sen a group of people in a boat with a lighthouse .
feat2sen a boat that is floating in the water .
adv a man riding a wave on top of a surfboard .
adv + con a large boat that is in the water .
adv + con + im a large boat in the middle of a lake .
Ours without init a boat that is sitting in the water
Ours a view of a boat in the ocean
BRNN a boat floating on top of a body of water
Unpaired Image Captioning
detected concepts boat
con2sen a boat that is floating in the water .
feat2sen a boat that is floating in the water .
adv a small boat is floating in the water .
adv + con a boat that is floating in the water .
adv + con + im a boat that is parked in the water .
Ours without init a boat that is sitting in the water .
Ours a boat is docked in a harbor .
BRNN a boat floating on top of a body of water
detected concepts animal, frame
con2sen a stuffed animal sitting on top of a frame .
feat2sen a group of people standing around a table with a picture of a large piece of cloth and
adv a man riding a wave on top of a surfboard .
adv + con a group of people that are standing on the beach
adv + con + im a person riding a surf board on the waves
Ours without init a child is flying a kite on the beach .
Ours a group of people surfing on a wave .
BRNN a man standing on top of a sandy beach
Sports Video Captioning
by Attentive Motion Representation
based Hierarchical Recurrent Neural Networks
Mengshi Qi¹, Yunhong Wang¹, Annan Li¹, Jiebo Luo²
¹Beijing Advanced Innovation Center for Big Data and Brain Computing, School of Computer Science and Engineering, Beihang University
²Department of Computer Science, University of Rochester
Introduction
 Sports video captioning is a task of automatically generating a
textual description for sports events (e.g., soccer, basketball or
volleyball games).
Soccer Basketball Volleyball
Problem Definition
Challenges
How to design an effective sports video captioning system?
conventional video captioning methods can only generate coarse descriptions,
and overlook the motion details of players’ actions and group-level activities
How to capture fine-grained motion information of sports
players?
player’s articulated movements/pose estimation/motion trajectory
the key player’s action is critical in a sports event
How to build a dataset that focuses on sports video captioning?
different sports games
Contributions
We introduce a novel framework for sports video captioning with
attentive motion representation-based hierarchical recurrent neural
networks.
A motion representation module is employed to extract player’s pose
and trajectory information, where we capture semantic attributes from
player’s skeletons and cluster trajectories from dynamic movements.
We annotate a new dataset called the Sports Video Captioning Dataset-Volleyball (SVCDV).
Framework
71 72
73 74
75 76
2019/5/10
12
(1) Action Proposal Module
Given a video, retrieving and localizing temporal segments that
likely contain spatio-temporal group events (e.g., attack, defend) or individual actions (e.g., spiking, passing) is the first task.
Deep Action Proposals (DAP) method [1]
Motion Representation Module
(1) Pose Attribute Detection Part
(2) Trajectory Clustering Part
Pose Attribute Detection Part
Extract pose-based CNN (Convolutional Neural Network) feature [2] from each
body part of a player
Build an attribute vocabulary from the annotated sentences in the dataset (e.g.,
UCF-101 [3] , Volleyball [4]), and use the top 𝑘 high-frequency words
Trajectory Clustering Part
Extract the dense point trajectories $Tra = \{Tra_1, \ldots, Tra_M\}$ for a sequence of frames using trajectory-pooled deep-convolutional descriptors [5], where $M$ is the number of trajectories
Then partition all the detected trajectories into groups: compute the affinity matrix between each trajectory pair and apply the graph clustering method [6] to obtain $m$ clusters
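A rough sketch of this step; Gaussian affinities and scikit-learn's spectral clustering are substitutes for the descriptors of [5] and the graph clustering method of [6]:

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_trajectories(traj, m, sigma=1.0):
    # traj: (M, d) array with one descriptor per trajectory; m: number
    # of clusters to produce.
    d2 = ((traj[:, None, :] - traj[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-d2 / (2.0 * sigma ** 2))   # pairwise affinity matrix
    sc = SpectralClustering(n_clusters=m, affinity='precomputed',
                            random_state=0)
    return sc.fit_predict(affinity)               # (M,) cluster labels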
Databases
General Video Captioning:
1) Microsoft Video Description Dataset (MSVD)
2) MSR Video-to-Text Dataset (MSR-VTT)
Sports Video Captioning:
1) Sports Video Captioning Dataset-Volleyball (SVCDV)
Sports Video Captioning Dataset-Volleyball
(SVCDV)
55 videos with 4,830 short clips from Olympic Volleyball Games
About 44,436 sentences, 9.2 sentences per clip on average
Verb ratio is nearly 16.2%, with 1.72 verbs per sentence
9 individual action labels:
waiting/setting/digging/falling/spiking/blocking/jumping/moving/standing
8 group activity labels:
right set/right spike/right pass/right winpoint/left winpoint/left pass/left
spike/left set
General Video Captioning Task
Table 1. Captioning Performance on MSVD and MSR-VTT Datasets
General Video Captioning Task
Sports Video Captioning and Components Analysis
Table 2. Captioning Performance on SVCDV
Sports Video Captioning
Demo Video
Video Re-localization (ECCV 2018)
• How to localize the part semantically relevant to the clip on the
left in the long video on the right?
Video Re-localization
Possible solutions
• Copy detection
–The contents can be largely different in both foreground and
background
• Action localization
–Only one sample is available
Video Re-localization
• We propose a matching framework
Cross Gating & Bilinear Matching
• We propose
• Cross gating to remove irrelevant contents
• Bilinear matching to capture the interactions between the query and
reference
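A minimal sketch of the two operations; the shapes, gating form, and names are assumptions rather than the paper's exact design:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_gate(q, r, Wq, Wr):
    # Each stream is gated by the other: query features irrelevant to
    # the reference (and vice versa) are suppressed before matching.
    return q * sigmoid(Wr @ r), r * sigmoid(Wq @ q)

def bilinear_match(q, r, W):
    # Bilinear matching: a learned interaction matrix W scores every
    # pairing of query and reference feature dimensions.
    return float(q @ W @ r)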
Video Re-localization
• Results (predicted [start, end] frame indices in the reference video)
Shuffleboard (query length = 14, reference length = 49): Ground truth 17-27; Ours 17-27; Frame-level baseline 15-23; Video-level baseline 17-19
Pole vault (query length = 15, reference length = 49): Ground truth 10-37; Ours 10-38; Frame-level baseline 20-32; Video-level baseline 17-38
Video Re-localization
Localizing Language in Video (AAAI 2019)
• How to localize the part semantically relevant to the clip on the
left in the long video on the right?
Localizing Language in Video (AAAI 2019)
Increasing Attention (CVPR 2019)

Weitere ähnliche Inhalte

Was ist angesagt?

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat Politècnica de Catalunya
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Universitat Politècnica de Catalunya
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Universitat Politècnica de Catalunya
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaUniversitat Politècnica de Catalunya
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Universitat Politècnica de Catalunya
 

Was ist angesagt? (20)

Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
Deep Audio and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
 
Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)Learning with Videos  (D4L4 2017 UPC Deep Learning for Computer Vision)
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)
 
One Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and VisionOne Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and Vision
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
Self-supervised Audiovisual Learning - Xavier Giro - UPC Barcelona 2019
 
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
Welcome (D1L1 2017 UPC Deep Learning for Computer Vision)
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
Advanced Deep Architectures (D2L6 Deep Learning for Speech and Language UPC 2...
 
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
Closing, Course Offer 17/18 & Homework (D5 2017 UPC Deep Learning for Compute...
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
Language and Vision (D3L5 2017 UPC Deep Learning for Computer Vision)
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
 
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
Deep Video Object Tracking - Xavier Giro - UPC Barcelona 2019
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
 
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
Speaker ID II (D4L1 Deep Learning for Speech and Language UPC 2017)
 

Ähnlich wie Video + Language

Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptxcomputerscience98
 
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future DirectionsToru Tamaki
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosCodiax
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobileAnirudh Koul
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videosabidhavp
 
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...Edureka!
 
Large-scale Video Classification with Convolutional Neural Net.docx
Large-scale Video Classification with Convolutional Neural Net.docxLarge-scale Video Classification with Convolutional Neural Net.docx
Large-scale Video Classification with Convolutional Neural Net.docxcroysierkathey
 
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...AI Frontiers
 
Tagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationTagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationEditor IJCATR
 
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...Amazon Web Services
 
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...Piyush Yadav
 
Anomaly Detection with Azure and .net
Anomaly Detection with Azure and .netAnomaly Detection with Azure and .net
Anomaly Detection with Azure and .netMarco Parenzan
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri
 
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017 Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017 MLconf
 

Ähnlich wie Video + Language (20)

Video+Language: From Classification to Description
Video+Language: From Classification to DescriptionVideo+Language: From Classification to Description
Video+Language: From Classification to Description
 
Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
 
Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptx
 
2019 Triangle Machine Learning Day - Stacking Audience Models - Adaptive Deep...
2019 Triangle Machine Learning Day - Stacking Audience Models - Adaptive Deep...2019 Triangle Machine Learning Day - Stacking Audience Models - Adaptive Deep...
2019 Triangle Machine Learning Day - Stacking Audience Models - Adaptive Deep...
 
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions
論文紹介:Temporal Sentence Grounding in Videos: A Survey and Future Directions
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videos
 
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
 
Large-scale Video Classification with Convolutional Neural Net.docx
Large-scale Video Classification with Convolutional Neural Net.docxLarge-scale Video Classification with Convolutional Neural Net.docx
Large-scale Video Classification with Convolutional Neural Net.docx
 
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
 
Machine Learning and Deep Software Variability
Machine Learning and Deep Software VariabilityMachine Learning and Deep Software Variability
Machine Learning and Deep Software Variability
 
Tagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationTagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event Categorization
 
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
 
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
 
Anomaly Detection with Azure and .net
Anomaly Detection with Azure and .netAnomaly Detection with Azure and .net
Anomaly Detection with Azure and .net
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
 
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017 Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
Serena Yeung, PHD, Stanford, at MLconf Seattle 2017
 

Mehr von Goergen Institute for Data Science

Mehr von Goergen Institute for Data Science (9)

How to properly review AI papers?
How to properly review AI papers?How to properly review AI papers?
How to properly review AI papers?
 
Video + Language 2019
Video + Language 2019Video + Language 2019
Video + Language 2019
 
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
 
Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?
 
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
 
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
 
Big Data Better Life
Big Data Better LifeBig Data Better Life
Big Data Better Life
 
Social Multimedia as Sensors
Social Multimedia as SensorsSocial Multimedia as Sensors
Social Multimedia as Sensors
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
 

Kürzlich hochgeladen

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Kürzlich hochgeladen (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Video + Language

  • 1. 2019/5/10 1 Video + Language Jiebo Luo Department of Computer Science What Is Computer Vision? • A major branch of Artificial Intelligence • Vision is the process of discovering from images what is present in the world, and where it is. -- David Marr, Vision (1982) Computer vision: 50 years of progress 6 feature engineering + model learning 𝐟 = 𝑓 𝐼 𝒚 𝑔 𝐟, 𝜃 deep le 𝒚 𝑔 𝐼, 𝜃 , wher MSR-VTT CCV UCF101 HMDB51 ActivityNet FCVID Hollywood Sports-1M YouTube- 8M 1,000 10,000 100,000 1,000,000 10,000,000 10 100 1,000 10,000 #Example #Class Computer vision: 50 years of progress 7 COCO KBK-1M Open Images Visual Genome Caltech 101 Caltech 256 SUN ImageNet ImageNet… Pascal 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1 10 100 1,000 10,000 100,000 #Example #Class Flickr 8K Flickr 30K MPII-MD M-VAD MSVD SBU TGIF YFCC 100M Introduction • Video has become ubiquitous on the Internet, TV, as well as personal devices. • Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary. • Thanks to the recent development of deep learning techniques, researchers in both computer vision and multimedia communities are now striving to bridge video with natural language, which can be regarded as the ultimate goal of video understanding. • We present recent advances in exploring the synergy of video understanding and language processing, including video- language alignment, video captioning, and video emotion analysis. From Classification to Description Recognizing Realistic Actions from Videos "in the Wild" UCF-11 to UCF-101 (CVPR 2009) Similarity btw Videos Cross-Domain Learning Visual Event Recognition in Videos by Learning from Web Data (CVPR2010 Best Student Paper) Heterogeneous Feature Machine For Visual Recognition (ICCV 2009) 1 5 6 7 9 10
  • 2. 2019/5/10 2 From Classification to Description 11 Semantic Video Entity Linking (ICCV2015) Aligning Language Descriptions with Videos Iftekhar Naim, Young Chol Song, Qiguang Liu Jiebo Luo, Dan Gildea, Henry Kautz link OverviewOverview  Unsupervised alignment of video with text  Motivations  Generate labels from data (reduce burden of manual labeling)  Learn new actions from only parallel video+text  Extend noun/object matching to verbs and actions  Unsupervised alignment of video with text  Motivations  Generate labels from data (reduce burden of manual labeling)  Learn new actions from only parallel video+text  Extend noun/object matching to verbs and actions Matching Verbs to Actions The person takes out a knife and cutting board Matching Nouns to Objects [Naim et al., 2015] An overview of the text and video alignment framework Hyperfeatures for ActionsHyperfeatures for Actions  High-level features required for alignment with text → Motion features are generally low-level  Hyperfeatures, originally used for image recognition extended for use with motion features → Use temporal domain instead of spatial domain for vector quantization (clustering)  High-level features required for alignment with text → Motion features are generally low-level  Hyperfeatures, originally used for image recognition extended for use with motion features → Use temporal domain instead of spatial domain for vector quantization (clustering) Originally described in “Hyperfeatures: Multilevel Local Coding for Visual Recog‐ nition” Agarwal, A. (ECCV 06), for images Hyperfeatures for actions Hyperfeatures for ActionsHyperfeatures for Actions  From low-level motion features, create high-level representations that can easily align with verbs in text  From low-level motion features, create high-level representations that can easily align with verbs in text Cluster 3 at time t Accumulate over frame at time t & cluster Conduct vector quantization of the histogram at time t Cluster 3, 5, …,5,20 = Hyperfeature 6 Each color code is a vector quantized STIP point Vector quantized STIP point histogram at time t Accumulate clusters over window (t‐w/2, t+w/2] and conduct vector quantization → first‐level hyperfeatures Align hyperfeatures with verbs from text (using LCRF) Latent-variable CRF AlignmentLatent-variable CRF Alignment  CRF where the latent variable is the alignment  N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)  Xi,m represents nouns and verbs extracted from the mth sentence  Yi,n represents blobs and actions in interval n in the video  Conditional likelihood  conditional probability of  Learning weights w  Stochastic gradient descent  CRF where the latent variable is the alignment  N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)  Xi,m represents nouns and verbs extracted from the mth sentence  Yi,n represents blobs and actions in interval n in the video  Conditional likelihood  conditional probability of  Learning weights w  Stochastic gradient descent where feature function More details in Naim et al. 2015 NAACL Paper ‐ Discriminative unsupervised alignment of natural language instructions with corresponding video segments N 11 14 16 17 18 19
Experiments: Wetlab Dataset
• RGB-Depth video paired with lab protocols in text; objects are detected in 3D space using color and point-cloud data.
• Compare the addition of hyperfeatures generated from motion features against the previous object/noun-only alignment results (Naim et al. 2015), including different types of motion features (2DTraj: dense trajectories; hyperfeature window size w = 150).
• Small improvement over the previous results: the wetlab activities are already highly correlated with object use.

Experiments: TACoS Dataset
• RGB video with crowd-sourced text descriptions of activities such as "making a salad" or "baking a cake."
• No object recognition here; alignment uses actions only.
• Baselines (the Uniform baseline is sketched in code below):
  – Uniform: assume each sentence takes the same amount of time over the entire sequence.
  – Segmented LCRF: assume the segmentation of actions is known, and infer only the action labels.
  – Unsupervised LCRF: both segmentation and alignment are unknown.
• Effect of window size and number of clusters (w = 150, d(2) = 64): the best window size is consistent with the average action length of 150 frames.
• Example of text and video alignment generated by the system on the TACoS corpus for sequence s13-d28, showing the crowd-sourced descriptions against the inferred segmentation.

Image Captioning with Semantic Attention (CVPR 2016)
Quanzeng You, Jiebo Luo, Hailin Jin, Zhaowen Wang, and Chen Fang

Image Captioning
• Motivations
  – Real-world usability: helping visually impaired or learning-impaired people.
  – Improving image understanding: classification, object detection.
  – Image retrieval.
• Example reference captions: (1) "a young girl inhales with the intent of blowing out a candle"; (2) "girl blowing out the candle on an ice cream". Another image: (1) "A shot from behind home plate of children playing baseball"; (2) "A group of children playing baseball in the rain"; (3) "Group of baseball players playing on a wet field".

Introduction of Image Captioning
• Machine learning as an approach to solve the problem.
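As referenced in the TACoS comparison above, the Uniform baseline is simple enough to state in a few lines. A minimal sketch, assuming alignment is expressed as one (start, end) frame interval per sentence; this is not the authors' evaluation code.

```python
def uniform_alignment(n_sentences, n_frames):
    """Uniform baseline: split the video into equal-duration intervals,
    one per sentence, in order. Returns (start, end) pairs, end-exclusive."""
    bounds = [round(i * n_frames / n_sentences) for i in range(n_sentences + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_sentences)]

# Example: 4 sentences over a 600-frame sequence.
print(uniform_alignment(4, 600))  # [(0, 150), (150, 300), (300, 450), (450, 600)]
```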
Overview
• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results

Brief Introduction of Recurrent Neural Networks
• Unlike a CNN, an RNN carries a hidden state from step to step:
  $h_t = f(x_t, h_{t-1}) = A x_t + B h_{t-1}, \qquad y_t = C h_t$
• Unfolding over time turns the recurrence into a feedforward network, with the weights $A$ (input-to-hidden), $B$ (hidden-to-hidden), and $C$ (hidden-to-output) shared across steps $t-2, t-1, t, \ldots$, trained by backpropagation through time. A minimal sketch appears at the end of this section.

Applications of Recurrent Neural Networks
• Machine translation: an encoder RNN reads the input sentence "ABC" and a decoder RNN produces "WXYZ".

Encoder-Decoder Framework for Captioning
• Inspired by neural-network-based machine translation.
• Loss function: $L = -\log p(\mathbf{w} \mid I) = -\sum_{t=1}^{N} \log p(w_t \mid I, w_0, w_1, \ldots, w_{t-1})$

Our Motivation
• Additional textual information:
  – an image's own noisy title, tags, or captions (from the Web);
  – captions of visually similar nearest-neighbor images.
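A minimal sketch of the unfolded recurrence and the captioning loss above, using the plain linear cell as written on the slide (real captioning models use an LSTM). The vocabulary size, dimensions, and the way the image enters through the initial hidden state are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Dimensions are illustrative assumptions.
D_IN, D_H, VOCAB = 300, 512, 10000
A = torch.randn(D_H, D_IN) * 0.01   # input-to-hidden
B = torch.randn(D_H, D_H) * 0.01    # hidden-to-hidden
C = torch.randn(VOCAB, D_H) * 0.01  # hidden-to-output

def caption_nll(word_embs, target_ids, h0):
    """L = -sum_t log p(w_t | I, w_0..w_{t-1}); the image conditions the
    decoder through h0, and h_t = A x_t + B h_{t-1}, y_t = C h_t."""
    h, loss = h0, 0.0
    for t in range(target_ids.shape[0]):
        x_t = word_embs[t]                  # embedding of previous word
        h = A @ x_t + B @ h                 # the slide's linear recurrence
        logits = C @ h                      # y_t = C h_t
        loss = loss - F.log_softmax(logits, dim=0)[target_ids[t]]
    return loss

# Example: a 5-word caption conditioned on an "image" vector.
h0 = torch.randn(D_H)                    # e.g., a projected CNN feature
words = torch.randn(5, D_IN)             # embeddings of w_0..w_4
targets = torch.randint(0, VOCAB, (5,))  # ids of w_1..w_5
print(caption_nll(words, targets, h0).item())
```

Note the teacher forcing: at each step the ground-truth previous word is fed in, and only the log-probability of the next target word contributes to the loss.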
Our Motivation (continued)
• Besides noisy Web text and nearest-neighbor captions, a third source of additional knowledge is the success of low-level tasks such as visual attribute detection.

Image Captioning with Semantic Attention
• Big idea: feed the caption generator not only the global visual feature but also a set of detected attributes, attended to at each step.

First Idea
• Provide the additional knowledge at each input node by concatenating the input word with the extra attributes $K$ (each image has a fixed keyword list):
  $h_t = f(x_t, h_{t-1}) = f([w_t;\, W_k K + b_k],\, h_{t-1})$
• Setup: 1024-d GoogLeNet visual features; LSTM with 512 hidden states. Training details: mini-batches of 256 image/sentence pairs, RMSProp.

Using Attributes along with Visual Features
• Provide the additional knowledge once, by concatenating the visual embedding and the keywords for $h_0$:
  $h_0 = f(v, K) = [W_v v;\, W_k K + b]$

Attention Model on Attributes
• Instead of using the same set of attributes at every step, select the attributes (attention) at each step (sketched in code below):
  $\mathrm{att}(w_t, K) = \sum_m \alpha_{t,m} k_m, \qquad \alpha_t = \mathrm{softmax}(w_t^{\top} V K)$
  $h_t = f(x_t, h_{t-1}) = f([x_t;\, \mathrm{att}(w_t, K)],\, h_{t-1})$

Overall Framework
• Training with a bilinear/bilateral attention model: the image CNN provides the visual feature $v$ at $t = 0$; attribute detectors AttrDet 1…N provide the attribute set $\{A_i\}$; an input attention $\varphi$ and an output attention $\phi$ mix the attributes with the input word $x_t$ and the prediction $p_t$ at every RNN step to produce $Y_t$.
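A minimal PyTorch sketch of the attribute-attention step above. The dimensions, the single bilinear matrix V, and the LSTMCell decoder are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttention(nn.Module):
    """att(w_t, K) = sum_m alpha_{t,m} k_m with alpha_t = softmax(w_t^T V K),
    concatenated with the word embedding before the RNN step."""
    def __init__(self, d_word=300, d_hidden=512):
        super().__init__()
        self.V = nn.Parameter(torch.randn(d_word, d_word) * 0.01)  # bilinear weights
        self.rnn = nn.LSTMCell(2 * d_word, d_hidden)

    def forward(self, w_t, K, state):
        # w_t: (B, d_word) current word embedding
        # K:   (B, M, d_word) embeddings of the image's detected attributes
        scores = torch.einsum('bd,de,bme->bm', w_t, self.V, K)  # w_t^T V k_m
        alpha = F.softmax(scores, dim=1)                        # attention weights
        att = (alpha.unsqueeze(-1) * K).sum(dim=1)              # attended attribute vector
        return self.rnn(torch.cat([w_t, att], dim=1), state)

# Example step: a batch of 2 images with 5 detected attributes each.
model = AttributeAttention()
w_t = torch.randn(2, 300)
K = torch.randn(2, 5, 300)
state = (torch.zeros(2, 512), torch.zeros(2, 512))
h, c = model(w_t, K, state)
print(h.shape)  # torch.Size([2, 512])
```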
Visual Attributes
• A secondary contribution: we try several different approaches to detecting the attributes.

Performance
• Examples showing the impact of visual attributes on the generated captions.

Performance on the Testing Dataset
• Results on the publicly available split.

Performance
• Results in the MS-COCO Image Captioning Challenge.

Captioning with Emotion and Style
• A simple framework.
Examples
• (Two slides of qualitative captioning examples with emotion and style.)

Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
• Xiang Bai et al.

TGIF: A New Dataset and Benchmark on Animated GIF Description
• Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Jiebo Luo

Overview

Comparison with Existing Datasets
Examples
• "a skate boarder is doing trick on his skate board."
• "a gloved hand opens to reveal a golden ring."
• "a sport car is swinging on the race playground"
• "the vehicle is moving fast into the tunnel"

Contributions
• A large-scale animated GIF description dataset for promoting image sequence modeling and research.
• Automatic validation to collect natural language descriptions from crowd workers.
• Baseline image captioning methods for future benchmarking.
• A comparison with existing datasets, highlighting the benefits of animated GIFs.

In Comparison with Existing Datasets
• The language in our dataset is closer to common language.
• Our dataset has an emphasis on verbs.
• Animated GIFs are more coherent and self-contained.
• Our dataset can be used to solve the more difficult movie description problem.

Machine Generated Sentence Examples
• (Three slides of machine-generated sentence examples.)
Comparing Professionals and Crowd-workers
• Crowd worker: "two people are kissing on a boat." Professional: "someone glances at a kissing couple then steps to a railing overlooking the ocean an older man and woman stand beside him."
• Crowd worker: "two men got into their car and not able to go anywhere because the wheels were locked." Professional: "someone slides over the camaros hood then gets in with his partner he starts the engine the revving vintage car starts to backup then lurches to a halt."
• Crowd worker: "a man in a shirt and tie sits beside a person who is covered in a sheet." Professional: "he makes eye contact with the woman for only a second."
• More examples: http://beta-web2.cloudapp.net/lsmdc_sentence_comparison.html

Movie Descriptions versus TGIF
• Crowd workers are encouraged to describe the major visual content directly, and not to use overly descriptive language.
• Because our animated GIFs are presented to crowd workers without any context, the sentences in our dataset are more self-contained.
• Animated GIFs are perfectly segmented, since they are carefully curated by online users to create a coherent visual story.

Code & Dataset
• To be released via Yahoo! Webscope (coming soon): the animated GIFs and sentences, code and models for the LSTM baseline, and the pipeline for syntactic and semantic validation used to collect natural language from crowd workers.

Image Captioning with Unpaired Data
• Supervised and semi-supervised captioning methods rely on paired training data, which are expensive to acquire (the bottleneck).
• Image captioning trained with unpaired data is a promising direction, because unpaired training data are much easier to obtain: (a) supervised captioning pairs every image with captions; (b) semi-supervised captioning pairs only some of them; (c) unpaired captioning has separate image and caption collections.
• How to align images with sentences? Run object detection on the image and tie the detected objects to the words of sentences such as "Little girl looking down at leaves with her bicycle with training wheels parked next to her." (a rough sketch of this matching follows below).
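As referenced above, the simplest way to tie an image to unpaired sentences is to compare detector labels against sentence words. A minimal sketch under obvious assumptions (plain token matching, no lemmatization; `detect_objects` is a hypothetical stand-in for any off-the-shelf detector):

```python
def concept_overlap(detected_labels, sentence):
    """Score a sentence by how many detected visual concepts it mentions."""
    words = set(sentence.lower().replace('.', '').split())
    return sum(1 for label in detected_labels if label in words)

detected = ['girl', 'bicycle', 'bench']  # e.g., from a hypothetical detect_objects(image)
corpus = [
    'Little girl looking down at leaves with her bicycle with training '
    'wheels parked next to her.',
    'A woman that is standing in front of an oven.',
]
best = max(corpus, key=lambda s: concept_overlap(detected, s))
print(best)  # the girl/bicycle sentence wins with an overlap of 2
```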
Image Captioning with Unpaired Data (continued)
• From the detected concepts (bench, bike, girl), how do we generate a sentence?
• Adversarial training: a generator produces captions from the concepts, and a discriminator, trained against real sentences such as "A young child in a park next to a red bench and red bicycle that as training wheels." and "A woman that is standing in front of an oven.", judges each generated sentence as real or fake (a sketch of such a discriminator appears at the end of this section).

Unpaired Image Captioning: Qualitative Results
• Detected concepts: person, tree, helmet.
  – con2sen: "a person in a helmet is riding a tree ."
  – feat2sen: "a person wearing a helmet is skateboarding on a ramp ."
  – adv: "a man riding a skateboard down a street ."
  – adv + con: "a person on a court with a racket"
  – adv + con + im: "a man riding a skateboard on a ramp ."
  – Ours without init: "a man riding a surfboard on top of a wave ."
  – Ours: "a man on a skateboard performing a trick ."
  – BRNN: "a man riding a skateboard down the side of a ramp"
• Detected concepts: boat, watercraft.
  – con2sen: "a group of people in a boat with a lighthouse ."
  – feat2sen: "a boat that is floating in the water ."
  – adv: "a man riding a wave on top of a surfboard ."
  – adv + con: "a large boat that is in the water ."
  – adv + con + im: "a large boat in the middle of a lake ."
  – Ours without init: "a boat that is sitting in the water"
  – Ours: "a view of a boat in the ocean"
  – BRNN: "a boat floating on top of a body of water"
• Detected concepts: boat.
  – con2sen: "a boat that is floating in the water ."
  – feat2sen: "a boat that is floating in the water ."
  – adv: "a small boat is floating in the water ."
  – adv + con: "a boat that is floating in the water ."
  – adv + con + im: "a boat that is parked in the water ."
  – Ours without init: "a boat that is sitting in the water ."
  – Ours: "a boat is docked in a harbor ."
  – BRNN: "a boat floating on top of a body of water"
• Detected concepts: animal, frame.
  – con2sen: "a stuffed animal sitting on top of a frame ."
  – feat2sen: "a group of people standing around a table with a picture of a large piece of cloth and"
  – adv: "a man riding a wave on top of a surfboard ."
  – adv + con: "a group of people that are standing on the beach"
  – adv + con + im: "a person riding a surf board on the waves"
  – Ours without init: "a child is flying a kite on the beach ."
  – Ours: "a group of people surfing on a wave ."
  – BRNN: "a man standing on top of a sandy beach"
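A minimal sketch of a sentence discriminator of the kind described above, assuming a GRU encoder over word ids; the sizes and architecture are illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    """Scores a word-id sequence as real (from the text corpus) versus fake
    (produced by the captioning generator)."""
    def __init__(self, vocab_size=10000, d_emb=256, d_hid=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.gru = nn.GRU(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, 1)  # logit: real vs. fake

    def forward(self, word_ids):            # word_ids: (B, T)
        _, h = self.gru(self.emb(word_ids))
        return self.out(h[-1]).squeeze(-1)  # (B,) real/fake logits

D = SentenceDiscriminator()
real = torch.randint(0, 10000, (4, 12))  # 4 corpus sentences, length 12
fake = torch.randint(0, 10000, (4, 12))  # 4 generator outputs
loss = nn.functional.binary_cross_entropy_with_logits(
    torch.cat([D(real), D(fake)]),
    torch.cat([torch.ones(4), torch.zeros(4)]))
print(loss.item())
```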
Sports Video Captioning by Attentive Motion Representation Based Hierarchical Recurrent Neural Networks
Mengshi Qi¹, Yunhong Wang¹, Annan Li¹, Jiebo Luo²
¹Beijing Advanced Innovation Center for Big Data and Brain Computing, School of Computer Science and Engineering, Beihang University
²Department of Computer Science, University of Rochester

Introduction
• Sports video captioning is the task of automatically generating a textual description for sports events (e.g., soccer, basketball, or volleyball games).

Problem Definition and Challenges
• How do we design an effective sports video captioning system? Conventional video captioning methods can only generate coarse descriptions; they overlook the motion details of players' actions and group-level activities.
• How do we capture fine-grained motion information about the players? Through the players' articulated movements, pose estimation, and motion trajectories; the key player's action is critical in a sports event.
• How do we build a dataset that focuses on sports video captioning across different sports games?

Contributions
• A novel framework for sports video captioning with attentive motion-representation-based hierarchical recurrent neural networks (a high-level sketch follows below).
• A motion representation module that extracts players' pose and trajectory information: semantic attributes are captured from players' skeletons, and trajectories are clustered from dynamic movements.
• A new annotated dataset, the Sports Video Captioning Dataset-Volleyball (SVCDV).

Framework
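A high-level sketch of a hierarchical recurrent decoder of the kind named above, assuming a clip-level GRU whose states feed a sentence-level GRU; this is an illustrative skeleton under those assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    """Two-level recurrence: a low-level GRU summarizes motion features per
    detected segment; a high-level GRU carries paragraph state across
    segments and drives a word decoder, one sentence per segment."""
    def __init__(self, d_feat=1024, d_hid=512, vocab=10000):
        super().__init__()
        self.seg_rnn = nn.GRU(d_feat, d_hid, batch_first=True)  # within a segment
        self.para_rnn = nn.GRUCell(d_hid, d_hid)                # across segments
        self.word_rnn = nn.GRUCell(d_hid, d_hid)                # words in a sentence
        self.out = nn.Linear(d_hid, vocab)

    def forward(self, segments, max_words=12):
        # segments: list of (T_i, d_feat) motion-feature tensors, one per event
        para_h = torch.zeros(1, self.para_rnn.hidden_size)
        sentences = []
        for seg in segments:
            _, seg_h = self.seg_rnn(seg.unsqueeze(0))  # summarize the segment
            para_h = self.para_rnn(seg_h[-1], para_h)  # update paragraph state
            word_h, words = para_h, []
            for _ in range(max_words):  # greedy readout (a real decoder would
                word_h = self.word_rnn(para_h, word_h)  # feed back the word embedding)
                words.append(self.out(word_h).argmax(dim=-1))
            sentences.append(torch.stack(words, dim=1))  # (1, max_words) word ids
        return sentences

model = HierarchicalCaptioner()
clips = [torch.randn(40, 1024), torch.randn(25, 1024)]  # two detected events
print([s.shape for s in model(clips)])  # [torch.Size([1, 12]), torch.Size([1, 12])]
```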
(1) Action Proposal Module
• Given a video, the first task is retrieving and localizing temporal segments that likely contain spatio-temporal group events (e.g., attack, defend) or individual actions (e.g., spiking, passing).
• We use the Deep Action Proposals (DAP) method [1].

(2) Motion Representation Module
• Two parts: a pose attribute detection part and a trajectory clustering part.

Pose Attribute Detection Part
• Extract a pose-based CNN feature [2] from each body part of a player.
• Build an attribute vocabulary from the annotated sentences in the dataset (e.g., UCF-101 [3], Volleyball [4]), using the top k high-frequency words.

Trajectory Clustering Part
• Extract the dense point trajectories $\{Tra_1, \ldots, Tra_M\}$ for a sequence of frames using trajectory-pooled deep-convolutional descriptors [5], where $M$ is the number of trajectories.
• Partition all detected trajectories into groups by computing the affinity matrix between each trajectory pair and applying a graph clustering method [6], obtaining m clusters (sketched in code at the end of this section).

Datasets
• General video captioning: the Microsoft Video Description Dataset (MSVD) and the MSR Video-to-Text Dataset (MSR-VTT).
• Sports video captioning: the Sports Video Captioning Dataset-Volleyball (SVCDV).

Sports Video Captioning Dataset-Volleyball (SVCDV)
• 55 videos with 4,830 short clips from Olympic volleyball games.
• About 44,436 sentences, 9.2 sentences per clip on average.
• The verb ratio is nearly 16.2%, with 1.72 verbs per sentence.
• 9 individual action labels: waiting, setting, digging, falling, spiking, blocking, jumping, moving, standing.
• 8 group activity labels: right set, right spike, right pass, right winpoint, left winpoint, left pass, left spike, left set.
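As referenced above, trajectory grouping can be sketched as an affinity matrix plus off-the-shelf graph clustering. A minimal illustration, assuming a Gaussian affinity over mean point distances and spectral clustering in place of the specific graph method of [6]:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_trajectories(trajs, m_clusters=4, sigma=20.0):
    """trajs: (M, T, 2) array of M trajectories, each a T-step (x, y) track
    over the same frames. Returns one cluster id per trajectory."""
    M = trajs.shape[0]
    affinity = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            # Mean point-wise distance between the two tracks.
            d = np.linalg.norm(trajs[i] - trajs[j], axis=1).mean()
            affinity[i, j] = np.exp(-(d ** 2) / (2 * sigma ** 2))
    return SpectralClustering(n_clusters=m_clusters,
                              affinity='precomputed').fit_predict(affinity)

# Example: 60 synthetic tracks scattered around 4 court positions.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 300, size=(4, 1, 2))
trajs = np.concatenate([c + rng.normal(scale=5, size=(15, 30, 2)) for c in centers])
print(cluster_trajectories(trajs, m_clusters=4)[:20])
```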
General Video Captioning Task
• Table 1: captioning performance on the MSVD and MSR-VTT datasets (see the paper for the full numbers).

Sports Video Captioning and Component Analysis
• Table 2: captioning performance on SVCDV, with an ablation over the framework's components.

Demo Video

Video Re-localization (ECCV 2018)
• How do we localize, in a long reference video, the part semantically relevant to a given query clip?
Video Re-localization: Possible Solutions
• Copy detection? The contents can be largely different in both foreground and background.
• Action localization? Only one sample is available.

Video Re-localization
• We propose a matching framework.

Cross Gating & Bilinear Matching
• Cross gating removes contents irrelevant to the other video (sketched in code below).
• Bilinear matching captures the interactions between the query and the reference.

Results
• Shuffleboard (query length 14, reference length 49): ground truth 17-27; Ours 17-27; Frame 15-23; Video 17-19.
• Pole vault (query length 15, reference length 49): ground truth 10-37; Ours 10-38; Frame 20-32; Video 17-38.
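A minimal sketch of the two operations named above, with the dimensions and the sigmoid gating form as assumptions; it shows the shape of the computation rather than the paper's exact layers.

```python
import torch
import torch.nn as nn

class CrossGateBilinearMatch(nn.Module):
    """Gate each video's features by the other video's summary, then score
    query/reference interactions with a bilinear form."""
    def __init__(self, d=256):
        super().__init__()
        self.gate_q = nn.Linear(d, d)  # reference summary -> gate on query
        self.gate_r = nn.Linear(d, d)  # query summary -> gate on reference
        self.W = nn.Parameter(torch.randn(d, d) * 0.01)  # bilinear weights

    def forward(self, q, r):
        # q: (Tq, d) query clip features; r: (Tr, d) reference video features
        q_gated = q * torch.sigmoid(self.gate_q(r.mean(dim=0)))  # cross gating
        r_gated = r * torch.sigmoid(self.gate_r(q.mean(dim=0)))
        return q_gated @ self.W @ r_gated.t()  # (Tq, Tr) matching scores

model = CrossGateBilinearMatch()
scores = model(torch.randn(14, 256), torch.randn(49, 256))
print(scores.shape)  # torch.Size([14, 49]); high-scoring columns suggest the match
```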
Localizing Language in Video (AAAI 2019)
• How do we localize the part of a long video that is semantically relevant to the query on the left?
• (Five slides of framework details and qualitative results.)

Increasing Attention (CVPR 2019)