Non-local Neural Networks
Xiaolong Wang¹,²*, Ross Girshick², Abhinav Gupta¹, Kaiming He²
¹Carnegie Mellon University, ²Facebook AI Research
Abstract
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object de-
Figure 1. A spacetime non-local operation in our network trained for video classification in Kinetics. A position x_i's response is computed by the weighted average of the features of all positions x_j (only the highest weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.
a non-local operation computes the response at a position
as a weighted sum of the features at all positions in the in-
[cs.CV] 13 Apr 2018
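[Figure: patches V_p and V_q around positions p and q]
$s(i, j) = \exp\left(-\frac{\lVert V_p - V_q \rVert^2}{h^2}\right)$
$w(i, j) = \frac{s(i, j)}{\sum_{l \in I} s(i, l)}$
$y(i) = \sum_{j \in I} w(i, j) \cdot x(j)$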
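The three formulas above read like the classical non-local means weighting that the abstract cites. A minimal NumPy sketch for a 1-D signal may make them concrete. It assumes that V_p and V_q are patches centered at positions i and j, that borders are handled by reflection padding, and that `patch_radius` and `h` are free parameters; none of these choices is stated on the slide.

```python
import numpy as np

def nl_means_1d(x, patch_radius=1, h=0.5):
    """Sketch of y(i) = sum_j w(i, j) * x(j) for a 1-D signal.

    Assumption: the patch around position i is the window of length
    2 * patch_radius + 1 centered at i, with reflection padding at the ends.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    size = 2 * patch_radius + 1
    padded = np.pad(x, patch_radius, mode="reflect")
    # patches[i] is the patch centered at position i.
    patches = np.stack([padded[i:i + size] for i in range(n)])
    # Squared patch distances for every pair (i, j).
    diff = patches[:, None, :] - patches[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    s = np.exp(-dist2 / h ** 2)            # similarity s(i, j)
    w = s / s.sum(axis=1, keepdims=True)   # weights w(i, j); each row sums to 1
    return w @ x, w

signal = np.array([0.0, 0.1, 1.0, 0.05, 0.0, 1.1])
denoised, weights = nl_means_1d(signal)
```

Each output value is a weighted average over all positions, so positions with similar surroundings contribute most regardless of how far apart they are.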
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
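The two attention formulas can be sketched in a few lines of NumPy. This is an illustrative sketch only: the sizes `n`, `d_model`, and `h`, the random projection matrices, and the use of the same input for Q, K, and V are assumptions made for the example, not values from the slides.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the row maximum for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Concatenate per-head attention outputs, then project with W_O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Hypothetical sizes: 4 positions, model width 8, 2 heads of width 4.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)  # shape (n, d_model)
```

Each row of the softmax output is a set of weights over all positions, which is the same weighted-sum structure as y(i) = Σ_j w(i, j) · x(j).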
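$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
$y(i) = \sum_{j \in I} w(i, j) \cdot x(j)$
$y_i = \sum_{j \in k \times k} w_j \cdot g(x_j)$
$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j) \cdot g(x_j)$
[Figure: mapping from layer L − 1 to layer L, with labels k × k, w_j, g(x_j), f(·), x_i, x_j, y_i]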
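To contrast the k × k (local) sum with the non-local sum, here is a small loop-based sketch on a 1-D sequence of scalar features. It assumes an odd window length, zero contribution from out-of-range neighbors, an identity g by default, and the normalizer C(x) = Σ_j f(x_i, x_j); the slides do not define C(x), so that choice is an assumption.

```python
import numpy as np

def local_op(x, w, g=lambda v: v):
    """1-D analogue of y_i = sum_{j in window} w_j * g(x_j).

    Assumption: len(w) is odd and out-of-range neighbors are skipped.
    """
    x = np.asarray(x, dtype=float)
    n, r = len(x), len(w) // 2
    y = np.zeros(n)
    for i in range(n):
        for offset in range(-r, r + 1):
            j = i + offset
            if 0 <= j < n:
                y[i] += w[offset + r] * g(x[j])
    return y

def non_local_op(x, f, g=lambda v: v):
    """y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) over ALL positions j.

    Assumption: C(x) = sum_j f(x_i, x_j).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        weights = np.array([f(x[i], x[j]) for j in range(n)])
        values = np.array([g(x[j]) for j in range(n)])
        y[i] = np.dot(weights, values) / weights.sum()
    return y

x = np.array([0.0, 1.0, 2.0, 3.0])
y_local = local_op(x, w=[0.25, 0.5, 0.25])
y_non_local = non_local_op(x, f=lambda a, b: np.exp(a * b))
```

The local weights w_j depend only on the offset between positions, while f(x_i, x_j) depends on the feature values themselves and ranges over every position.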
[Figure: five diagrams of the non-local block. In each, the input x and the output z are T × H × W × 1024, the 1 × 1 × 1 convolutions produce T × H × W × 512 maps, and these are reshaped to THW × 512.]

Generic form, with f(·) and g(·), followed by Softmax:
$z = \frac{1}{C(x)} \sum f(x) \cdot g(x)$

Gaussian, with Softmax:
$z = \frac{1}{C(x)} \sum e^{x_i^T x_j} \cdot W_g x_j$

Embedded Gaussian, with Softmax (θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, T × H × W × 512 each; pairwise map THW × THW):
$z = \frac{1}{C(x)} \sum e^{\theta(x_i)^T \phi(x_j)} \cdot W_g x_j$

Dot product, with 1/N in place of Softmax (θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, T × H × W × 512 each; pairwise map THW × THW):
$z = \frac{1}{C(x)} \sum \theta(x_i)^T \phi(x_j) \cdot W_g x_j$

Concatenation, with ReLU & 1/N (θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, T × H × W × 512 each; pairwise map THW × THW):
$z = \frac{1}{C(x)} \sum \mathrm{ReLU}\left(w_f^T [\theta(x_i), \phi(x_j)]\right) \cdot W_g x_j$
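The shapes in the diagrams suggest a simple matrix formulation: flatten the T × H × W positions to THW rows, treat each 1 × 1 × 1 convolution as a per-position linear map, and form a THW × THW pairwise map. The sketch below follows that reading. The function name, the weight names `W_theta`, `W_phi`, `W_g`, `W_z`, `w_f`, and the optional `residual` flag are assumptions introduced for illustration; the residual addition in particular is not visible in the formulas above.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, W_theta, W_phi, W_g, W_z, kind="embedded_gaussian",
                    w_f=None, residual=False):
    """Sketch of a non-local block on a (T, H, W, C) feature map."""
    T, H, W, C = x.shape
    N = T * H * W
    flat = x.reshape(N, C)        # THW x C
    theta = flat @ W_theta        # THW x C' (a 1x1x1 conv is a per-position linear map)
    phi = flat @ W_phi            # THW x C'
    g = flat @ W_g                # THW x C'
    if kind == "gaussian":
        attn = softmax(flat @ flat.T, axis=1)      # exp(x_i^T x_j), normalized
    elif kind == "embedded_gaussian":
        attn = softmax(theta @ phi.T, axis=1)      # exp(theta^T phi), normalized
    elif kind == "dot_product":
        attn = (theta @ phi.T) / N                 # 1/N in place of softmax
    elif kind == "concatenation":
        d = theta.shape[1]
        # w_f^T [theta_i, phi_j] splits into a theta term plus a phi term.
        pair = (theta @ w_f[:d])[:, None] + (phi @ w_f[d:])[None, :]
        attn = np.maximum(pair, 0.0) / N           # ReLU & 1/N
    else:
        raise ValueError(f"unknown kind: {kind}")
    y = attn @ g                  # THW x C'
    z = y @ W_z                   # project back to C channels
    if residual:
        z = z + flat
    return z.reshape(T, H, W, C)

# Small hypothetical sizes instead of 1024 and 512 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 3, 8))
W_theta, W_phi, W_g = (rng.normal(size=(8, 4)) for _ in range(3))
W_z = rng.normal(size=(4, 8))
z = non_local_block(x, W_theta, W_phi, W_g, W_z)
```

The THW × THW pairwise map is the costly term: its size grows quadratically with the number of spacetime positions.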
[Figure panels: (a) headbanging, (b) stretching leg]
model, R50         top-1   top-5
C2D baseline       71.8    89.7
Gaussian           72.5    90.2
Gaussian, embed    72.7    90.5
dot-product        72.9    90.3
concatenation      72.8    90.5
(a) Instantiations: 1 non-local block of different types is added into the C2D baseline. All entries are with ResNet-50.

model, R50         top-1   top-5
baseline           71.8    89.7
res2               72.7    90.3
res3               72.9    90.4
res4               72.7    90.5
res5               72.3    90.1
(b) Stages: 1 non-local block is added into different stages. All entries are with ResNet-50.

model              top-1   top-5
R50   baseline     71.8    89.7
R50   1-block      72.7    90.5
R50   5-block      73.8    91.0
R50   10-block     74.3    91.2
R101  baseline     73.1    91.0
R101  1-block      74.3    91.3
R101  5-block      75.1    91.7
R101  10-block     75.1    91.6
(c) Deeper non-local models: we compare 1, 5, and 10 non-local blocks added to the C2D baseline. We show ResNet-50 (top) and ResNet-101 (bottom) results.

model              top-1   top-5
R50   baseline     71.8    89.7
R50   space-only   72.9    90.8
R50   time-only    73.1    90.5
R50   spacetime    73.8    91.0
R101  baseline     73.1    91.0
R101  space-only   74.4    91.3
R101  time-only    74.4    90.5
R101  spacetime    75.1    91.7
(d) Space vs. time vs. spacetime: we compare non-local operations applied along space, time, and spacetime dimensions [...]

model, R101        params  FLOPs   top-1   top-5
C2D baseline       1×      1×      73.1    91.0
I3D 3×3×3          1.5×    1.8×    74.1    91.2
I3D 3×1×1          1.2×    1.5×    74.4    91.1
NL C2D, 5-block    1.2×    1.2×    75.1    91.7
(e) Non-local vs. 3D Conv: A 5-block non-local C2D vs. inflated 3D ConvNet (I3D) [7]. All entries are with ResNet-101. The numbers of parameters and FLOPs are relative to the C2D baseline (43.2M and 34.2B).

model              top-1   top-5
R50   C2D baseline 71.8    89.7
R50   I3D          73.3    90.7
R50   NL I3D       74.9    91.6
R101  C2D baseline 73.1    91.0
R101  I3D          74.4    91.1
R101  NL I3D       76.0    92.1
(f) Non-local 3D ConvNet: 5 non-local blocks are added on top of our best I3D models. These results show that non-local operations are complementary to 3D convolutions.

model              top-1
R101  C2D baseline 75.3
R101  I3D          76.4
R101  NL I3D       77.7
(g) Longer clips: we fine-tune models in Table 2f on the 128[...] The gains of our non-local operations [...] consistent.
Table 2. Ablations on Kinetics action classification. We show top-1 and top-5 classification accuracy (%).
[Figure: training and validation error (%) versus iterations (K), from 0 to 400, for the C2D baseline and NL C2D, 5-block.]
Table 2 shows the ablation results, analyzed a[...]
Instantiations. Table 2a compares different types [...] single non-local block added to the C2D baseline (r[...] the last residual block of res4). Even adding one [...] block can lead to ~1% improvement over the baseline.
Interestingly, the embedded Gaussian, dot-product, [...] concatenation versions perform similarly, up to so[...] variations (72.7 to 72.9). As discussed in Sec. 3[...]local operations with Gaussian kernels become si[...] self-attention module [49]. However, our experiments [...] that the attentional (softmax) behavior of this mo[...]
Figure 3. Examples of the behavior of a non-local block in res3 computed by [...] are from held-out validation videos. The starting point of arrows represen[...]
model                     backbone             modality            top-1 val  top-5 val  top-1 test  top-5 test  avg test†
I3D in [7]                Inception            RGB                 72.1       90.3       71.1        89.3        80.2
2-Stream I3D in [7]       Inception            RGB + flow          75.7       92.0       74.2        91.3        82.8
RGB baseline in [3]       Inception-ResNet-v2  RGB                 73.0       90.9       -           -           -
3-stream late fusion [3]  Inception-ResNet-v2  RGB + flow + audio  74.9       91.6       -           -           -
3-stream LSTM [3]         Inception-ResNet-v2  RGB + flow + audio  77.1       93.2       -           -           -
3-stream SATT [3]         Inception-ResNet-v2  RGB + flow + audio  77.7       93.2       -           -           -
NL I3D [ours]             ResNet-50            RGB                 76.5       92.6       -           -           -
NL I3D [ours]             ResNet-101           RGB                 77.7       93.3       -           -           83.8
Table 3. Comparisons with state-of-the-art results in Kinetics, reported on the val and test sets. We include the Kinetics 2017 competition winner's results [3], but their best results exploited audio signals (marked in gray) so were not vision-only solutions. †: "avg" is the average of top-1 and top-5 accuracy; individual top-1 or top-5 numbers are not available from the test server at the time of submitting this manuscript.
model                 modality    train/val  trainval/test
2-Stream [43]         RGB + flow  18.6       -
2-Stream +LSTM [43]   RGB + flow  17.8       -
Asyn-TF [43]          RGB + flow  22.4       -
I3D [7]               RGB         32.9       34.4
I3D [ours]            RGB         35.5       37.2
NL I3D [ours]         RGB         37.5       39.5
Table 4. Classification mAP (%) in the Charades dataset [44], on the train/val split and the trainval/test split. Our results are based on ResNet-101. Our NL I3D uses 5 non-local blocks.
method          AP^box  AP^box_50  AP^box_75  AP^mask  AP^mask_50  AP^mask_75
R50   baseline  38.0    59.6       41.0       34.6     56.4        36.5
R50   +1 NL     39.0    61.1       41.9       35.5     58.0        37.4
R101  baseline  39.5    61.4       42.9       36.0     58.1        38.3
R101  +1 NL     40.8    63.1       44.5       37.1     59.9        39.2
X152  baseline  44.1    66.4       48.4       39.7     63.2        42.2
X152  +1 NL     45.0    67.8       48.9       40.3     64.4        42.8
Table 5. Adding 1 non-local block to Mask R-CNN for COCO object detection and instance segmentation. The backbone is ResNet-50/101 or ResNeXt-152 [53], both with FPN [32].
Experiments on Charades
Charades [44] is a video dataset with ~8k training, ~1.8k validation, and ~2k testing videos. It is a multi-label classification task with 157 action categories. We use a per-category sigmoid output to handle the multi-label property.
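A per-category sigmoid treats each of the 157 categories as an independent yes/no decision, so a clip can carry several labels at once. The sketch below illustrates this with NumPy. The binary cross-entropy loss, the 0.5 decision threshold, and the random logits are assumptions for the example; the text above only states that a per-category sigmoid output is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over clips and categories.

    Assumption: this loss is a common pairing with sigmoid outputs,
    not something the source specifies.
    """
    p = sigmoid(logits)
    loss = -(targets * np.log(p + eps) + (1.0 - targets) * np.log(1.0 - p + eps))
    return loss.mean()

num_classes = 157                       # action categories in Charades
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, num_classes))   # hypothetical scores for 2 clips
targets = np.zeros((2, num_classes))
targets[0, [3, 10]] = 1.0               # first clip has two active labels
targets[1, 42] = 1.0                    # second clip has one

probs = sigmoid(logits)                 # independent per-category probabilities
predicted = probs > 0.5                 # hypothetical decision threshold
loss = multilabel_bce(logits, targets)
```

Unlike a softmax, the per-category probabilities are not forced to sum to 1, which is what allows multiple categories to be active for the same clip.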
We initialize our models pre-trained on Kinetics (128-frame). The mini-batch size is set to 1 clip per GPU. We train [...] models for 200k iterations, starting from a learning rate [...]00125 and reducing it by 10 every 75k iterations. We use [...]ering strategy similar to that in Kinetics to determine the
model                            AP^kp  AP^kp_50  AP^kp_75
R101 baseline                    65.1   86.8      70.4
NL, +4 in head                   66.0   87.1      71.7
NL, +4 in head, +1 in backbone   66.5   87.3      72.8
Table 6. Adding non-local blocks to Mask R-CNN for COCO keypoint detection. The backbone is ResNet-101 with FPN [32].
graded from R50/101 to X152. This comparison suggests
that non-local dependency has not been sufficiently captured
by existing models despite increased depth/capacity.
In addition, the above gain is at a very small cost. The

Weitere ähnliche Inhalte

Was ist angesagt?

Gaussian Image Blurring in CUDA C++
Gaussian Image Blurring in CUDA C++Gaussian Image Blurring in CUDA C++
Gaussian Image Blurring in CUDA C++Darshan Parsana
 
Circular Convolution
Circular ConvolutionCircular Convolution
Circular ConvolutionSarang Joshi
 
A Note on Correlated Topic Models
A Note on Correlated Topic ModelsA Note on Correlated Topic Models
A Note on Correlated Topic ModelsTomonari Masada
 
On approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials withinOn approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials withineSAT Publishing House
 
Trident International Graphics Workshop 2014 5/5
Trident International Graphics Workshop 2014 5/5Trident International Graphics Workshop 2014 5/5
Trident International Graphics Workshop 2014 5/5Takao Wada
 
Convex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPTConvex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPTandrewmart11
 
Solutions Manual for Calculus Early Transcendentals 10th Edition by Anton
Solutions Manual for Calculus Early Transcendentals 10th Edition by AntonSolutions Manual for Calculus Early Transcendentals 10th Edition by Anton
Solutions Manual for Calculus Early Transcendentals 10th Edition by AntonPamelaew
 
Diffusion kernels on SNP data embedded in a non-Euclidean metric
Diffusion kernels on SNP data embedded in a non-Euclidean metricDiffusion kernels on SNP data embedded in a non-Euclidean metric
Diffusion kernels on SNP data embedded in a non-Euclidean metricGota Morota
 
Comparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationComparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationPranamesh Chakraborty
 
Lecture note4coordinatedescent
Lecture note4coordinatedescentLecture note4coordinatedescent
Lecture note4coordinatedescentXudong Sun
 
Kumaraswamy disribution
Kumaraswamy disributionKumaraswamy disribution
Kumaraswamy disributionPankaj Das
 
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeks
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeksBeginning direct3d gameprogramming10_shaderdetail_20160506_jintaeks
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeksJinTaek Seo
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationPranamesh Chakraborty
 
Paper Introduction "Density-aware person detection and tracking in crowds"
Paper Introduction "Density-aware person detection and tracking in crowds"Paper Introduction "Density-aware person detection and tracking in crowds"
Paper Introduction "Density-aware person detection and tracking in crowds"壮 八幡
 

Was ist angesagt? (17)

Gaussian Image Blurring in CUDA C++
Gaussian Image Blurring in CUDA C++Gaussian Image Blurring in CUDA C++
Gaussian Image Blurring in CUDA C++
 
Circular Convolution
Circular ConvolutionCircular Convolution
Circular Convolution
 
A Note on Correlated Topic Models
A Note on Correlated Topic ModelsA Note on Correlated Topic Models
A Note on Correlated Topic Models
 
On approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials withinOn approximate bounds of zeros of polynomials within
On approximate bounds of zeros of polynomials within
 
Trident International Graphics Workshop 2014 5/5
Trident International Graphics Workshop 2014 5/5Trident International Graphics Workshop 2014 5/5
Trident International Graphics Workshop 2014 5/5
 
Convex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPTConvex Optimization Modelling with CVXOPT
Convex Optimization Modelling with CVXOPT
 
Solutions Manual for Calculus Early Transcendentals 10th Edition by Anton
Solutions Manual for Calculus Early Transcendentals 10th Edition by AntonSolutions Manual for Calculus Early Transcendentals 10th Edition by Anton
Solutions Manual for Calculus Early Transcendentals 10th Edition by Anton
 
Image transforms
Image transformsImage transforms
Image transforms
 
Diffusion kernels on SNP data embedded in a non-Euclidean metric
Diffusion kernels on SNP data embedded in a non-Euclidean metricDiffusion kernels on SNP data embedded in a non-Euclidean metric
Diffusion kernels on SNP data embedded in a non-Euclidean metric
 
Comparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationComparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimization
 
Section5 stochastic
Section5 stochasticSection5 stochastic
Section5 stochastic
 
Lecture note4coordinatedescent
Lecture note4coordinatedescentLecture note4coordinatedescent
Lecture note4coordinatedescent
 
Kumaraswamy disribution
Kumaraswamy disributionKumaraswamy disribution
Kumaraswamy disribution
 
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeks
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeksBeginning direct3d gameprogramming10_shaderdetail_20160506_jintaeks
Beginning direct3d gameprogramming10_shaderdetail_20160506_jintaeks
 
Image restoration and reconstruction
Image restoration and reconstructionImage restoration and reconstruction
Image restoration and reconstruction
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimization
 
Paper Introduction "Density-aware person detection and tracking in crowds"
Paper Introduction "Density-aware person detection and tracking in crowds"Paper Introduction "Density-aware person detection and tracking in crowds"
Paper Introduction "Density-aware person detection and tracking in crowds"
 

Ähnlich wie Non-local Neural Network

NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)Igalia
 
Trident International Graphics Workshop 2014 1/5
Trident International Graphics Workshop 2014 1/5Trident International Graphics Workshop 2014 1/5
Trident International Graphics Workshop 2014 1/5Takao Wada
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingMark Kilgard
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSE
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSEAU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSE
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSEThiyagarajan G
 
Dsoop (co 221) 1
Dsoop (co 221) 1Dsoop (co 221) 1
Dsoop (co 221) 1Puja Koch
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
 
Idea for ineractive programming language
Idea for ineractive programming languageIdea for ineractive programming language
Idea for ineractive programming languageLincoln Hannah
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)WoochulShin10
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortStefan Marr
 

Ähnlich wie Non-local Neural Network (20)

NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)
 
Trident International Graphics Workshop 2014 1/5
Trident International Graphics Workshop 2014 1/5Trident International Graphics Workshop 2014 1/5
Trident International Graphics Workshop 2014 1/5
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
 
ERTS UNIT 3.ppt
ERTS UNIT 3.pptERTS UNIT 3.ppt
ERTS UNIT 3.ppt
 
slides.07.pptx
slides.07.pptxslides.07.pptx
slides.07.pptx
 
sheet6.pdf
sheet6.pdfsheet6.pdf
sheet6.pdf
 
doc6.pdf
doc6.pdfdoc6.pdf
doc6.pdf
 
paper6.pdf
paper6.pdfpaper6.pdf
paper6.pdf
 
lecture5.pdf
lecture5.pdflecture5.pdf
lecture5.pdf
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSE
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSEAU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSE
AU QP Answer key NOv/Dec 2015 Computer Graphics 5 sem CSE
 
Dsoop (co 221) 1
Dsoop (co 221) 1Dsoop (co 221) 1
Dsoop (co 221) 1
 
Gate-Cs 1993
Gate-Cs 1993Gate-Cs 1993
Gate-Cs 1993
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Idea for ineractive programming language
Idea for ineractive programming languageIdea for ineractive programming language
Idea for ineractive programming language
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Clojure And Swing
Clojure And SwingClojure And Swing
Clojure And Swing
 
ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Building High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low EffortBuilding High-Performance Language Implementations With Low Effort
Building High-Performance Language Implementations With Low Effort
 

Kürzlich hochgeladen

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 

Kürzlich hochgeladen (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
University management System project report..pdf
Non-local Neural Network

  • 1.
  • 2.
  • 3. Non-local Neural Networks. Xiaolong Wang (1,2), Ross Girshick (2), Abhinav Gupta (1), Kaiming He (2). (1) Carnegie Mellon University, (2) Facebook AI Research. arXiv stamp: [cs.CV] 13 Apr 2018.
      Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object de…
      Figure 1. A spacetime non-local operation in our network trained for video classification in Kinetics. A position x_i's response is computed by the weighted average of the features of all positions x_j (only the highest weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.
      Body text (cut off): a non-local operation computes the response at a position as a weighted sum of the features at all positions in the in…
  • 4.
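  • 5. Figure labels: pixels p and q with patches $V_p$ and $V_q$. $s(i, j) = \exp\left(-\frac{\|V_p - V_q\|^2}{h^2}\right)$, $w(i, j) = \frac{s(i, j)}{\sum_{l \in I} s(i, l)}$, $y(i) = \sum_{j \in I} w(i, j) \cdot x(j)$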
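The three formulas on slide 5 have the form of the classical non-local means filter that the abstract mentions. A minimal sketch on a 1-D signal follows, assuming that the patch of a position is a small window around it and that borders are handled by clamping indices. The names `nonlocal_means_1d` and `patch_radius` and the default `h` are illustrative choices, not taken from the slides.

```python
import math

def nonlocal_means_1d(x, patch_radius=1, h=0.5):
    """Non-local means on a 1-D signal (illustrative sketch).

    s(i, j) = exp(-||V_i - V_j||^2 / h^2)   patch similarity
    w(i, j) = s(i, j) / sum_l s(i, l)       normalized weight
    y(i)    = sum_j w(i, j) * x(j)          weighted average over ALL positions
    """
    n = len(x)

    def patch(center):
        # Window around `center`; indices are clamped at the borders (assumption).
        return [x[min(max(center + d, 0), n - 1)]
                for d in range(-patch_radius, patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    y = []
    for i in range(n):
        s = [math.exp(-sum((a - b) ** 2
                           for a, b in zip(patches[i], patches[j])) / h ** 2)
             for j in range(n)]
        total = sum(s)  # normalizer: sum over l of s(i, l)
        y.append(sum(s_ij / total * x[j] for j, s_ij in enumerate(s)))
    return y

signal = [0.0, 0.1, 0.0, 1.0, 0.9, 1.0, 0.0, 0.1, 0.0]
denoised = nonlocal_means_1d(signal)
```

Each output is a convex combination of the inputs, so positions with similar surroundings are averaged together no matter how far apart they are.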
  • 6. Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$. 3.2.1 Scaled Dot-Product Attention: We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of… $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$, where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
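The attention formula on slide 6 can be sketched in plain Python as below. This is only an illustration on nested lists: `Q`, `K` and `V` are lists of row vectors, and the function names are hypothetical rather than any library's API.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Every query row yields a weighted average of all value rows, which is the same "weighted sum over all positions" structure as the non-local operation.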
  • 7. L L − 1 k × k
  • 8. $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$ and $y(i) = \sum_{j \in I} w(i, j) \cdot x(j)$
  • 9. $y_i = \sum_{j \in k \times k} w_j \cdot g(x_j)$ and $y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \cdot g(x_j)$. Figure labels: $g(x_j)$, $w_j$, $y_i$, $x_i$, $x_j$, $f(\cdot)$.
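The second formula on slide 9 leaves the pairwise function f and the transform g open. A small sketch of that generic form is given below, assuming that C(x) is the sum of f(x_i, x_j) over j; the helper names `non_local` and `dot` and the toy input are illustrative only.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def non_local(x, f, g):
    """y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j), with j over ALL positions.

    Assumption of this sketch: C(x) = sum_j f(x_i, x_j).
    """
    gx = [g(xj) for xj in x]
    dim = len(gx[0])
    y = []
    for xi in x:
        weights = [f(xi, xj) for xj in x]
        c = sum(weights)
        y.append([sum(w * v[d] for w, v in zip(weights, gx)) / c
                  for d in range(dim)])
    return y

def gaussian(xi, xj):
    # Gaussian pairwise function: exp(x_i^T x_j).
    return math.exp(dot(xi, xj))

def identity(v):
    return v

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = non_local(x, gaussian, identity)
```

Unlike the k × k sum in the first formula, the loop over `x` visits every position, so the response at one position can depend on any other position.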
  • 10. Non-local block diagram. Labels: x, f(·), g(·), z, Softmax; tensor shapes T × H × W × 1024, T × H × W × 512, THW × 512; 1 × 1 × 1 convolution. $z = \frac{1}{C(x)} \sum f(x) \cdot g(x)$
  • 11. Non-local block diagram. Labels: x, z, Softmax; tensor shapes T × H × W × 1024, T × H × W × 512, THW × 512; 1 × 1 × 1 convolutions. $z = \frac{1}{C(x)} \sum_j e^{x_i^T x_j} \cdot W_g x_j$
  • 12. Non-local block diagram. Labels: x, z, Softmax; θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, 1 × 1 × 1; tensor shapes T × H × W × 1024, T × H × W × 512, THW × 512, THW × THW. $z = \frac{1}{C(x)} \sum_j e^{\theta(x_i)^T \phi(x_j)} \cdot W_g x_j$
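The "Softmax" label on slide 12 can be read off the formula directly. The short derivation below assumes that the normalizer is $C(x) = \sum_k e^{\theta(x_i)^T \phi(x_k)}$, which the slide does not spell out, and writes $y_i$ for the normalized sum that the slide calls z.

```latex
\frac{1}{C(x)}\, e^{\theta(x_i)^T \phi(x_j)}
  = \frac{e^{\theta(x_i)^T \phi(x_j)}}{\sum_{k} e^{\theta(x_i)^T \phi(x_k)}}
  = \operatorname{softmax}_j\!\left(\theta(x_i)^T \phi(x_j)\right)
\quad\Longrightarrow\quad
y_i = \sum_j \operatorname{softmax}_j\!\left(\theta(x_i)^T \phi(x_j)\right) W_g x_j
```

Under that assumption the THW × THW matrix in the diagram is a row-wise softmax over pairwise scores, which has the same form as the attention formula on slide 8 without the $\sqrt{d_k}$ scaling.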
  • 13. Non-local block diagram. Labels: x, z, 1/N (in place of Softmax); θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, 1 × 1 × 1; tensor shapes T × H × W × 1024, T × H × W × 512, THW × 512, THW × THW. $z = \frac{1}{C(x)} \sum_j \theta(x_i)^T \phi(x_j) \cdot W_g x_j$
  • 14. Non-local block diagram. Labels: x, z, ReLU & 1/N; θ: 1 × 1 × 1, ϕ: 1 × 1 × 1, 1 × 1 × 1; tensor shapes T × H × W × 1024, T × H × W × 512, THW × 512, THW × THW. $z = \frac{1}{C(x)} \sum_j \mathrm{ReLU}\left(w_f^T [\theta(x_i), \phi(x_j)]\right) \cdot W_g x_j$
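Slides 13 and 14 replace the softmax with a 1/N factor. A rough sketch of those two pairwise functions follows, assuming that θ, ϕ and g are plain matrix multiplications (standing in for the 1 × 1 × 1 convolutions) and that C(x) = N, the number of positions. All function and parameter names here are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [dot(row, v) for row in W]

def f_dot(xi, xj, W_theta, W_phi):
    """Dot-product form: theta(x_i)^T phi(x_j)."""
    return dot(matvec(W_theta, xi), matvec(W_phi, xj))

def f_concat(xi, xj, W_theta, W_phi, w_f):
    """Concatenation form: ReLU(w_f^T [theta(x_i), phi(x_j)])."""
    joined = matvec(W_theta, xi) + matvec(W_phi, xj)  # list concatenation
    return max(0.0, dot(w_f, joined))

def non_local_over_n(x, f, W_g):
    """y_i = (1 / N) * sum_j f(x_i, x_j) * W_g x_j, with N = len(x)."""
    n = len(x)
    gx = [matvec(W_g, xj) for xj in x]
    dim = len(gx[0])
    return [[sum(f(xi, xj) * v[d] for xj, v in zip(x, gx)) / n
             for d in range(dim)]
            for xi in x]

eye = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
y_dot = non_local_over_n(x, lambda a, b: f_dot(a, b, eye, eye), eye)
```

Dividing by N instead of a sum of exponentials keeps the scale of the output independent of the number of positions without requiring the weights to sum to one.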
  • 15. (a) headbanging (b) stretching leg
  • 16. Ablations on Kinetics (top-1 and top-5 accuracy, %):
      (a) Instantiations: 1 non-local block of different types is added into the C2D baseline. All entries are with ResNet-50.
          model, R50         top-1   top-5
          C2D baseline       71.8    89.7
          Gaussian           72.5    90.2
          Gaussian, embed    72.7    90.5
          dot-product        72.9    90.3
          concatenation      72.8    90.5
      (b) Stages: 1 non-local block is added into different stages. All entries are with ResNet-50.
          model, R50         top-1   top-5
          baseline           71.8    89.7
          res2               72.7    90.3
          res3               72.9    90.4
          res4               72.7    90.5
          res5               72.3    90.1
      (c) Deeper non-local models: we compare 1, 5, and 10 non-local blocks added to the C2D baseline. We show…
          model              top-1   top-5
          R50   baseline     71.8    89.7
                1-block      72.7    90.5
                5-block      73.8    91.0
                10-block     74.3    91.2
          R101  baseline     73.1    91.0
                1-block      74.3    91.3
                5-block      75.1    91.7
                10-block     75.1    91.6
      (d) Space vs. time vs. spacetime: we compare non-local operations applied along space, time, and spacetime dimensions…
          model              top-1   top-5
          R50   baseline     71.8    89.7
                space-only   72.9    90.8
                time-only    73.1    90.5
                spacetime    73.8    91.0
          R101  baseline     73.1    91.0
                space-only   74.4    91.3
                time-only    74.4    90.5
                spacetime    75.1    91.7
      C2D vs. I3D vs. NL I3D:
          model                  top-1   top-5
          R50   C2D baseline     71.8    89.7
                I3D              73.3    90.7
                NL I3D           74.9    91.6
          R101  C2D baseline     73.1    91.0
                I3D              74.4    91.1
                NL I3D           76.0    92.1
  • 17. Ablations on Kinetics (top-1 and top-5 accuracy, %):
      (a) Instantiations: 1 non-local block of different types is added into the C2D baseline. All entries are with ResNet-50.
          model, R50         top-1   top-5
          C2D baseline       71.8    89.7
          Gaussian           72.5    90.2
          Gaussian, embed    72.7    90.5
          dot-product        72.9    90.3
          concatenation      72.8    90.5
      (b) Stages: 1 non-local block is added into different stages. All entries are with ResNet-50.
          model, R50         top-1   top-5
          baseline           71.8    89.7
          res2               72.7    90.3
          res3               72.9    90.4
          res4               72.7    90.5
          res5               72.3    90.1
      (c) Deeper non-local models: we compare 1, 5, and 10 non-local blocks added to the C2D baseline. We show ResNet-50 (top) and ResNet-101 (bottom) results.
          model              top-1   top-5
          R50   baseline     71.8    89.7
                1-block      72.7    90.5
                5-block      73.8    91.0
                10-block     74.3    91.2
          R101  baseline     73.1    91.0
                1-block      74.3    91.3
                5-block      75.1    91.7
                10-block     75.1    91.6
      (d) Space vs. time vs. spacetime: we compare non-local operations applied…
          model              top-1   top-5
          R50   baseline     71.8    89.7
                space-only   72.9    90.8
                time-only    73.1    90.5
                spacetime    73.8    91.0
          R101  baseline     73.1    91.0
                space-only   74.4    91.3
                time-only    74.4    90.5
                spacetime    75.1    91.7
      (e) Non-local vs. 3D Conv: A 5-block non-local C2D vs. inflated 3D ConvNet (I3D) [7]. All entries are with ResNet-101. The numbers of parameters and FLOPs are…
          model, R101        params   FLOPs   top-1   top-5
          C2D baseline       1×       1×      73.1    91.0
          I3D 3×3×3          1.5×     1.8×    74.1    91.2
          I3D 3×1×1          1.2×     1.5×    74.4    91.1
          NL C2D, 5-block    1.2×     1.2×    75.1    91.7
      (f) Non-local 3D ConvNet: 5 non-local blocks are added on top of our best I3D models. These results show that non-local opera…
          model                  top-1   top-5
          R50   C2D baseline     71.8    89.7
                I3D              73.3    90.7
                NL I3D           74.9    91.6
          R101  C2D baseline     73.1    91.0
                I3D              74.4    91.1
                NL I3D           76.0    92.1
  • 18. Ablations on Kinetics, panels (e) to (g) (rows visible on the slide):
      (e) Non-local vs. 3D Conv: A 5-block non-local C2D vs. inflated 3D ConvNet (I3D) [7]. All entries are with ResNet-101. The numbers of parameters and FLOPs are relative to the C2D baseline (43.2M and 34.2B).
          model, R101        params   FLOPs   top-1   top-5
          NL C2D, 5-block    1.2×     1.2×    75.1    91.7
      (f) Non-local 3D ConvNet: 5 non-local blocks are added on top of our best I3D models. These results show that non-local operations are complementary to 3D convolutions.
          model                  top-1   top-5
          R101  C2D baseline     73.1    91.0
                I3D              74.4    91.1
                NL I3D           76.0    92.1
      (g) Longer clips: we fine-tune models in Table 2f on the 128… The gains of our non-local operat… …sistent.
          R101  C2D baseline     75.3
                I3D              76.4
                NL I3D           77.7
      Table 2. Ablations on Kinetics action classification. We show top-1 and top-5 classification accuracy (%).
      [Figure: training curves, error (%) vs. iterations (K), for C2D baseline (train), C2D baseline (val), NL C2D, 5-block (train), NL C2D, 5-block (val)]
      Text (right edge cut off): Table 2 shows the ablation results, analyzed a… Instantiations. Table 2a compares different typ… …gle non-local block added to the C2D baseline (r… the last residual block of res4). Even adding on… block can lead to ∼1% improvement over the ba… Interestingly, the embedded Gaussian, dot-p… concatenation versions perform similarly, up to so… variations (72.7 to 72.9). As discussed in Sec. 3… local operations with Gaussian kernels become si… self-attention module [49]. However, our experim… that the attentional (softmax) behavior of this mo…
      Figure 3. Examples of the behavior of a non-local block in res3 computed by… are from held-out validation videos. The starting point of arrows represen…
  • 19. Table 3. Comparisons with state-of-the-art results in Kinetics, reported on the val and test sets. We include the Kinetics 2017 competition winner's results [3], but their best results exploited audio signals (marked in gray) so were not vision-only solutions. †: "avg" is the average of top-1 and top-5 accuracy; individual top-1 or top-5 numbers are not available from the test server at the time of submitting this manuscript.
      model                      backbone              modality             top-1 val   top-5 val   top-1 test   top-5 test   avg test†
      I3D in [7]                 Inception             RGB                  72.1        90.3        71.1         89.3         80.2
      2-Stream I3D in [7]        Inception             RGB + flow           75.7        92.0        74.2         91.3         82.8
      RGB baseline in [3]        Inception-ResNet-v2   RGB                  73.0        90.9        -            -            -
      3-stream late fusion [3]   Inception-ResNet-v2   RGB + flow + audio   74.9        91.6        -            -            -
      3-stream LSTM [3]          Inception-ResNet-v2   RGB + flow + audio   77.1        93.2        -            -            -
      3-stream SATT [3]          Inception-ResNet-v2   RGB + flow + audio   77.7        93.2        -            -            -
      NL I3D [ours]              ResNet-50             RGB                  76.5        92.6        -            -            -
      NL I3D [ours]              ResNet-101            RGB                  77.7        93.3        -            -            83.8
  • 20. Table 4. Classification mAP (%) in the Charades dataset [44], on the train/val split and the trainval/test split. Our results are based on ResNet-101. Our NL I3D uses 5 non-local blocks.
      model                  modality     train/val   trainval/test
      2-Stream [43]          RGB + flow   18.6        -
      2-Stream +LSTM [43]    RGB + flow   17.8        -
      Asyn-TF [43]           RGB + flow   22.4        -
      I3D [7]                RGB          32.9        34.4
      I3D [ours]             RGB          35.5        37.2
      NL I3D [ours]          RGB          37.5        39.5
      Table 5. Adding 1 non-local block to Mask R-CNN for COCO object detection and instance segmentation. The backbone is ResNet-50/101 or ResNeXt-152 [53], both with FPN [32].
      method           APbox   APbox50   APbox75   APmask   APmask50   APmask75
      R50   baseline   38.0    59.6      41.0      34.6     56.4       36.5
            +1 NL      39.0    61.1      41.9      35.5     58.0       37.4
      R101  baseline   39.5    61.4      42.9      36.0     58.1       38.3
            +1 NL      40.8    63.1      44.5      37.1     59.9       39.2
      X152  baseline   44.1    66.4      48.4      39.7     63.2       42.2
            +1 NL      45.0    67.8      48.9      40.3     64.4       42.8
      Table 6. Adding non-local blocks to Mask R-CNN for COCO keypoint detection. The backbone is ResNet-101 with FPN [32].
      model                             APkp   APkp50   APkp75
      R101 baseline                     65.1   86.8     70.4
      NL, +4 in head                    66.0   87.1     71.7
      NL, +4 in head, +1 in backbone    66.5   87.3     72.8
      Text (left edge cut off): Experiments on Charades. Charades [44] is a video dataset with ∼8k training, ∼1.8k validation, and ∼2k testing videos. It is a multi-label classification task with 157 action categories. We use a per-category sigmoid output to handle the multi-label property. We initialize our models pre-trained on Kinetics (128-…). The mini-batch size is set to 1 clip per GPU. We train … models for 200k iterations, starting from a learning rate …00125 and reducing it by 10 every 75k iterations. We use …ering strategy similar to that in Kinetics to determine the…
      Text (right column, starts mid-sentence): …graded from R50/101 to X152. This comparison suggests that non-local dependency has not been sufficiently captured by existing models despite increased depth/capacity. In addition, the above gain is at a very small cost. The