An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor, arXiv 2021
橋口凌大 (Tamaki Lab, Nagoya Institute of Technology)
English paper introduction, 2021/11/19
Overview
■ Proposes a video recognition model that uses only Transformers
  • Space Time Attention Model (STAM)
■ Training and inference on sparsely sampled frames (see the sampling sketch below)
  • High accuracy
  • Low computational cost
■ Improved inference procedure
  • Inference with a single clip
  • 30x fewer frames per video
  • 40x faster inference than previous methods
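The single-clip, sparse sampling referred to above boils down to picking F frame indices spread evenly over the whole video. A minimal sketch (not the authors' code; the function name and defaults are illustrative):

```python
# Minimal sketch (not the authors' code): pick F frame indices spread
# uniformly over a video, so a single clip covers the whole temporal extent.
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Return `num_samples` indices spaced evenly from the first to the last frame."""
    return np.linspace(0, num_video_frames - 1, num=num_samples).round().astype(int)

# Example: a 300-frame (~10 s) video sampled with 16 frames.
print(uniform_frame_indices(300, 16))
```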
An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir Asaf Noy Lihi Zelnik-Manor
DAMO Academy, Alibaba Group
Abstract (excerpt): Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient [...]

Figure 1. Kinetics-400 top-1 accuracy vs runtime, measured on an Nvidia V100 GPU and presented in log scale. Marker sizes are proportional to the number of frames used per video by leading methods. Our method provides a dominating trade-off for those [...]
Related work
■ BERT + 3D CNN [Kalfaoglu+, ECCV2020]
  • A 3D CNN captures the temporal information
  • The 3D CNN output is passed to BERT
  • Requires dense sampling and many frames
■ SlowFast [Feichtenhofer+, ICCV2019]
  • Pathways with different temporal resolutions
  • 3D CNN based
  • Requires dense sampling
(Excerpt from the BERT + 3D CNN paper [Kalfaoglu+, ECCV2020], Section 3.1 "BERT-based Temporal Modeling with 3D CNNs for Action Recognition"; Fig. 1: BERT-based late temporal modeling.)

Thirdly, the proposed BERT-based temporal modeling implementations on the SlowFast architecture are examined in Section 3.3. The reason to re-consider BERT-based late temporal modeling on the SlowFast architecture is its different two-stream structure compared to other 3D CNN architectures.

Bi-directional Encoder Representations from Transformers (BERT) [4] is a bidirectional self-attention method, which has provided unprecedented success in many downstream Natural Language Processing (NLP) tasks. The bidirectional property enables BERT to fuse the contextual information from both directions, instead of relying upon only a single direction, as in former recurrent neural networks or other self-attention methods, such as the Transformer [29]. Moreover, BERT introduces challenging unsupervised pre-training tasks which leads to useful representations for many tasks. Our architecture utilizes BERT-based temporal pooling as shown in Fig. 1: the selected K frames from the input sequence are propagated through a 3D CNN architecture without applying temporal global average [pooling; excerpt clipped].
SlowFast Networks for Video Recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
Facebook AI Research (FAIR)

Abstract (excerpt): We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video [...]

Figure 1. A SlowFast network has a low-frame-rate, low-temporal-resolution Slow pathway and a high-frame-rate, α× higher-temporal-resolution Fast pathway [caption clipped]. The diagram labels the pathway tensors with T / αT (frames), C / βC (channels), and H, W (spatial size), feeding a shared prediction head.
Proposed method
■ A Transformer model over space and time
■ Space
  • Uses the Vision Transformer (ViT) [Dosovitskiy+, ICLR2021]
■ Time
  • A Transformer model over temporal information only
Figure 3. Our proposed Transformer Network for video. [Only the caption is legible; the adjacent text column is clipped in the screenshot.]

Vision Transformer (ViT) overview figure (Dosovitskiy+, ICLR 2021): the input image is split into patches, each patch is flattened and linearly projected ("Linear Projection of Flattened Patches"), an extra learnable [class] embedding is prepended and position embeddings (0, 1, 2, ...) are added, and the token sequence is processed by a Transformer Encoder (L× blocks of Norm, Multi-Head Attention, Norm, MLP, with residual connections), followed by an MLP head that outputs the class (Bird, Ball, Car, ...).
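As a reference point for the ViT backbone sketched in the figure above, here is a minimal, hedged sketch of the patch-embedding stage; the patch size, width, and the conv-as-projection trick are standard ViT choices, not details confirmed by this paper:

```python
# Minimal sketch of the ViT-style patch embedding illustrated above
# (hyperparameters are illustrative; this is not the authors' implementation).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Flatten + project each patch; a strided conv is the standard trick.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # extra learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # patch + position embedding

    def forward(self, x):                                 # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                    # prepend class token
        return x + self.pos_embed                         # add position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                       # torch.Size([2, 197, 768])
```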
Proposed method
■ Spatial attention
Multi-head Self-Attention block (MSA). STAM consists of L MSA blocks. At each block $\ell \in \{1, \dots, L\}$ and head $a \in \{1, \dots, A\}$, each patch representation is transformed into query, key, and value vectors. The representation produced by the previous block, $z^{(\ell-1)}_{(p,t)}$, is used as input:

$$q^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{Q}\,\mathrm{LN}\!\left(z^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \tag{2}$$
$$k^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{K}\,\mathrm{LN}\!\left(z^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \tag{3}$$
$$v^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_{V}\,\mathrm{LN}\!\left(z^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \tag{4}$$

where LN represents a LayerNorm [2]. The dimension of each attention head is given by $D_h = D/A$.

The attention weights are computed by a dot-product comparison between queries and keys. The self-attention weights $\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} \in \mathbb{R}^{NF+F}$ for patch $(p,t)$ are given by:

$$\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{q^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\,k^{(\ell,a)}_{(0,t)} \;\; \bigl\{k^{(\ell,a)}_{(p',t')}\bigr\}_{\substack{p'=1,\dots,N \\ t'=1,\dots,F}}\right]\right) \tag{5}$$

where SM() denotes the softmax activation function, and $k^{(\ell,a)}_{(0,t)}$ is the key value associated with the class token of frame $t$.

Spatial attention. Applying attention over all the patches of the sequence is computationally expensive; therefore, an alternative configuration is required in order to make the spatio-temporal attention computationally tractable. A reduction in computation can be achieved by disentangling the spatial and temporal dimensions. For the spatial attention, we apply attention between patches of the same frame:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(\frac{q^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\,k^{(\ell,a)}_{(0,t)},\; \bigl\{k^{(\ell,a)}_{(p',t)}\bigr\}_{p'=1,\dots,N}\right]\right) \tag{6}$$

The attention-vector entries are used as coefficients in a weighted sum over the values for each attention head:

$$s^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,t)}\,v^{(\ell,a)}_{(0,t)} + \sum_{p'=1}^{N} \alpha^{(\ell,a)}_{(p,t),(p',t)}\,v^{(\ell,a)}_{(p',t)} \tag{7}$$

These outputs from the attention heads are concatenated and passed through a two-layer Multi-Layer Perceptron (MLP) with GeLU [13] activations:

$$z'^{(\ell)}_{(p,t)} = W_O \begin{bmatrix} s^{(\ell,1)}_{(p,t)} \\ \vdots \\ s^{(\ell,A)}_{(p,t)} \end{bmatrix} + z^{(\ell-1)}_{(p,t)} \tag{8}$$
$$z^{(\ell)}_{(p,t)} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(z'^{(\ell)}_{(p,t)}\right)\right) + z'^{(\ell)}_{(p,t)} \tag{9}$$

The MSA and MLP layers operate as residual operators thanks to the added skip connections.

After passing through the spatial Transformer layers, the class embedding from each frame is used to produce an embedding vector $f_t$. This frame embedding is fed into the temporal attention:

$$f_t = \mathrm{LN}\!\left(z^{(L_{\mathrm{space}})}_{(0,t)}\right)_{t=1,\dots,F} \in \mathbb{R}^{D} \tag{10}$$

where $L_{\mathrm{space}}$ is the number of layers of the spatial Transformer.

Temporal attention. The spatial attention provides a powerful representation for each individual frame by applying attention between patches in the same image. However, in order to capture the temporal information across the frame sequence, a temporal attention mechanism is required. The effect of temporal modeling can be seen in Table 2: the spatial-attention backbone provides a good representation of the videos, yet the additional temporal attention provides a significant improvement over it.
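To make eqns (6)-(7) concrete, here is a minimal sketch of the per-frame spatial attention, where every patch attends only to the tokens of its own frame. Tensor shapes and names are illustrative assumptions, not the authors' implementation, and the output projection W_O and the MLP of eqns (8)-(9) are omitted:

```python
# Minimal sketch of the per-frame spatial attention in eqns (5)-(7):
# each patch of frame t attends only to the tokens (class token + patches)
# of the same frame t.
import torch
import torch.nn.functional as F

def spatial_attention(z, w_q, w_k, w_v, num_heads):
    """z: (B, T, N+1, D) tokens per frame; index 0 of each frame is its class token."""
    B, T, N1, D = z.shape
    Dh = D // num_heads
    zn = F.layer_norm(z, (D,))                        # LN(z^{(l-1)}) as in eqns (2)-(4)
    q = (zn @ w_q).view(B, T, N1, num_heads, Dh)
    k = (zn @ w_k).view(B, T, N1, num_heads, Dh)
    v = (zn @ w_v).view(B, T, N1, num_heads, Dh)
    # Attention over the patch axis only (same frame t), per head: eqn (6).
    attn = torch.einsum('btphd,btqhd->bthpq', q, k) / Dh ** 0.5
    alpha = attn.softmax(dim=-1)                      # SM(.)
    s = torch.einsum('bthpq,btqhd->btphd', alpha, v)  # weighted sum of values, eqn (7)
    return s.reshape(B, T, N1, D)                     # heads concatenated (before W_O)

D, A = 768, 12
z = torch.randn(2, 16, 197, D)                        # 16 frames, 196 patches + class token
w = [torch.randn(D, D) * D ** -0.5 for _ in range(3)]
print(spatial_attention(z, *w, num_heads=A).shape)    # torch.Size([2, 16, 197, 768])
```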
■ Temporal attention
In our model, the temporal attention layers are applied on the representations produced by the spatial attention layers.

For the temporal blocks of our model, we use the frame embedding vectors from eqn. (10), stacked into a matrix $X_{\mathrm{time}} \in \mathbb{R}^{F \times D}$, as the input sequence. As before, we add a trainable classification token at index $t = 0$. This input sequence is projected into query/key/value vectors $q^{(\ell,a)}_{t}$, $k^{(\ell,a)}_{t}$, $v^{(\ell,a)}_{t}$. The temporal attention is then computed only over the frame index:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{time}}_{t} = \mathrm{SM}\!\left(\frac{q^{(\ell,a)\top}_{t}}{\sqrt{D_h}} \cdot \left[\,k^{(\ell,a)}_{0},\; \bigl\{k^{(\ell,a)}_{t'}\bigr\}_{t'=1,\dots,F}\right]\right) \tag{11}$$

Next, we apply the same equations of the attention block, eqns. (7) through (9), with a single axis describing the frame indices instead of the double $(p,t)$ index which was used in those equations. The embedding vector for a video sequence is given by applying the layer norm on the classification embedding from the top layer:

$$y = \mathrm{LN}\!\left(z^{(L_{\mathrm{time}})}_{(0)}\right) \in \mathbb{R}^{D} \tag{12}$$

where $L_{\mathrm{time}}$ is the number of temporal attention layers. An additional single-layer MLP is applied as the classifier, outputting a vector of dimension equal to the number of classes.

[The rest of the excerpt is clipped in the screenshot; the readable fragments cover the added cost of the temporal encoder and the implementation details: a spatial Transformer from the ViT family with 12 self-attention layers and ImageNet-21K pretraining, a smaller temporal Transformer trained from scratch, uniform frame sampling with the same crop location for all frames of a clip, the resize/crop/augmentation settings, and Kinetics-400 training for 30 epochs with a cosine learning-rate schedule.]
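A hedged sketch of the temporal head described by eqns (10)-(12): the frame embeddings plus a classification token pass through a small temporal Transformer, and a single-layer classifier reads out the class-token embedding. The 6-layer / 8-head configuration follows the experimental-setup slide further below; using torch.nn.TransformerEncoder is an approximation of the paper's block, not the authors' code:

```python
# Minimal sketch of the temporal head (eqns 10-12): attention over frames only,
# followed by LN of the top-layer class token and a single-layer classifier.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, dim=768, num_layers=6, num_heads=8, num_classes=400):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)   # single-layer MLP classifier

    def forward(self, frame_embs):                      # (B, F, dim), one f_t per frame
        cls = self.cls_token.expand(frame_embs.size(0), -1, -1)
        x = torch.cat([cls, frame_embs], dim=1)         # X_time with class token at t = 0
        x = self.encoder(x)                             # temporal attention over frames only
        y = self.norm(x[:, 0])                          # eqn (12): LN of top-layer class token
        return self.classifier(y)                       # logits over action classes

logits = TemporalHead()(torch.randn(2, 16, 768))        # 16 frame embeddings per video
print(logits.shape)                                     # torch.Size([2, 400])
```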
Training and inference
■ Proposed method
Figure 4. Training and inference commonly used by other methods: in the common approach multiple clips are sampled per video, i.e., dense sampling of frames. Furthermore, training and inference have a different structure. [Caption clipped in the screenshot.]

Figure 3. Our proposed Transformer Network for video: [...] attention applied to the sequence of frame embedding vectors.

This separation between the spatial and temporal attention components has several advantages. First, it reduces the computation by breaking down the input sequence into two shorter sequences: in the first stage each patch is compared to N other patches within a frame, and in the second stage each frame embedding vector is compared to F other vectors, resulting in less overall computation than comparing each patch to NF other patches. The second advantage stems from the understanding that temporal information is better exploited at a higher (more abstract) level of the network. In many works where 2D and 3D convolutions are used in the same network, the 3D components are only used in the top layers. Using the same reasoning, we apply the temporal attention on frame embeddings rather than on individual patches, since frame-level representations convey more of what is going on in a video than individual patches do.

Input embeddings. The input to the spatio-temporal transformer, $X \in \mathbb{R}^{H \times W \times 3 \times F}$, consists of F RGB frames of size H x W sampled from the original video. Each frame [...]
■ Conventional methods
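To contrast the two pipelines in Figure 4, here is a minimal sketch of dense multi-view inference (as used by the 3D-CNN baselines) versus a single sparse clip. `model`, `decode_frames`, and the 10 x 3 = 30 view split are illustrative assumptions, not details taken from any specific baseline's code:

```python
# Minimal sketch: multi-view inference averages predictions over many densely
# sampled clips/crops; the proposed scheme runs one forward pass on one clip
# of frames spread uniformly over the whole video.
import numpy as np

def multi_view_inference(model, decode_frames, num_video_frames,
                         clip_len=16, num_temporal_clips=10, num_spatial_crops=3):
    """Conventional: e.g. 10 temporal clips x 3 spatial crops = 30 views, averaged."""
    logits = []
    starts = np.linspace(0, num_video_frames - clip_len, num_temporal_clips).astype(int)
    for s in starts:                                   # dense, consecutive frames per clip
        frames = decode_frames(range(s, s + clip_len))
        for crop in range(num_spatial_crops):
            logits.append(model(frames, crop=crop))
    return np.mean(logits, axis=0)

def single_clip_inference(model, decode_frames, num_video_frames, clip_len=16):
    """Proposed: one clip of frames spread uniformly over the whole video."""
    idx = np.linspace(0, num_video_frames - 1, clip_len).round().astype(int)
    return model(decode_frames(idx), crop="center")

# Toy demo with stand-ins (random logits, no real video decoding):
demo_model = lambda frames, crop: np.random.rand(400)
demo_decode = lambda idx: list(idx)
print(multi_view_inference(demo_model, demo_decode, 300).shape)   # (400,)
print(single_clip_inference(demo_model, demo_decode, 300).shape)  # (400,)
```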
Experimental setup
■ Spatial transformer (ViT)
  • Pretraining
    • ImageNet-21k
■ Temporal transformer
  • 6 layers
  • 8 heads
  • Randomly initialized
■ Dataset
  • Kinetics-400
■ Training
  • Resize the shorter side to [256, 320]
  • Random 224 x 224 crop
  • Flip augmentation
  • Cutout
  • ImageNet auto-augmentation
■ Inference
  • Resize the shorter side to 256
  • 224 x 224 center crop
■ Uniform frame sampling in both training and inference (a preprocessing sketch follows this slide)
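A minimal sketch of the preprocessing listed above, using torchvision's functional transforms. Applying one random crop location to every frame of a clip is an assumption consistent with the clipped implementation excerpt earlier ("take a random crop ... the same location for all frames"); Cutout and auto-augment are omitted for brevity:

```python
# Minimal sketch of the training/inference preprocessing from the slide,
# applied with a single crop location shared by all frames of a clip.
import random
import torch
import torchvision.transforms.functional as TF

def preprocess_clip(frames, train=True):
    """frames: list of (3, H, W) tensors for one sampled clip."""
    if train:
        short_side = random.randint(256, 320)             # resize shorter side to [256, 320]
        frames = [TF.resize(f, short_side) for f in frames]
        _, h, w = frames[0].shape
        top, left = random.randint(0, h - 224), random.randint(0, w - 224)
        frames = [TF.crop(f, top, left, 224, 224) for f in frames]   # same 224x224 crop
        if random.random() < 0.5:                          # random horizontal flip
            frames = [TF.hflip(f) for f in frames]
    else:
        frames = [TF.resize(f, 256) for f in frames]       # shorter side -> 256
        frames = [TF.center_crop(f, 224) for f in frames]  # 224x224 center crop
    return torch.stack(frames)                             # (F, 3, 224, 224)

clip = [torch.rand(3, 360, 480) for _ in range(16)]
print(preprocess_clip(clip).shape)                         # torch.Size([16, 3, 224, 224])
```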
Experimental results
■ Effect of the spatial recognition model on accuracy
  • Comparison against CNN models as the backbone
  • The spatial transformer (ViT) is a strong backbone
Table 1. Model comparison on Kinetics-400 (clipped in this screenshot; the full table is reproduced on the later results slide).

Table 2. Temporal Transformer vs. Mean: comparing the Temporal Transformer representation vs. the mean of frame embeddings.
  Backbone + temporal aggregation       Top-1 Accuracy [%]   Flops [G]
  ViT + Temporal-Transformer            77.8                 270
  TResNet-M + Temporal-Transformer      75.7                 93
  ViT + Mean                            75.1                 265
  TResNet-M + Mean                      71.9                 88

Table 3. Using different backbones with the Temporal Transformer (TResNet and ViT models use ImageNet-21K pretraining; ResNet50 uses ImageNet-1K pretraining).
  Model (num. of frames)        Top-1 Accuracy [%]   Flops [G]
  TResNet-M + Temporal (16)     75.7                 93
  TResNet-L + Temporal (8)      75.9                 77
  ResNet50 + Temporal (16)      72.5                 70
  ViT-B + Temporal (16)         79.3                 270
■ Effectiveness of the temporal transformer
  • Comparison of
    • Averaging over frames
    • Temporal transformer
  • Improves performance with both ViT and TResNet-M backbones (see the sketch below)
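A small sketch of the two aggregation schemes compared in Table 2: mean pooling of frame embeddings versus a temporal Transformer over them. Dimensions and the encoder configuration are illustrative, not the paper's exact head:

```python
# Minimal sketch: mean pooling vs. temporal-Transformer aggregation of the
# per-frame embeddings, each followed by a linear classifier.
import torch
import torch.nn as nn

frame_embs = torch.randn(2, 16, 768)            # (batch, frames, embedding dim)

# (a) Mean baseline: average the frame embeddings, then classify.
logits_mean = nn.Linear(768, 400)(frame_embs.mean(dim=1))

# (b) Temporal Transformer: let frames attend to each other before classifying
#     a prepended class token (as in the TemporalHead sketch earlier).
layer = nn.TransformerEncoderLayer(768, 8, batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
cls = torch.zeros(2, 1, 768, requires_grad=True)
z = encoder(torch.cat([cls, frame_embs], dim=1))
logits_tt = nn.Linear(768, 400)(z[:, 0])

print(logits_mean.shape, logits_tt.shape)       # torch.Size([2, 400]) torch.Size([2, 400])
```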
Experimental results
■ Effect of the number of input frames
  • High recognition accuracy even with few frames
  • More frames give better recognition
(Tables 2 and 3 are repeated in this screenshot; see the previous slide.)

Table 4. Number of frames used for prediction: comparing different numbers of frames sampled uniformly from the video as input, using the ViT+Temporal model.
  Number of frames   Top-1 Accuracy [%]   Flops [G]
  4                  74.1                 67
  8                  77.1                 135
  16                 79.3                 270
  32                 79.9                 540
  64                 80.5                 1080

Table 5. Evaluating X3D and SlowFast with one clip (16 frames), i.e., applying the same sparse sampling strategy to other methods. (In parentheses: accuracy at 30 views per video.)
  Method               Top-1 Accuracy [%]
  X3D-L                72.25 (77.5)
  SlowFast 8x8 R50     65.5 (77.0)
■ Applying the proposed inference scheme to previous methods
  • Sparse input fed to existing methods
  • Performance drops compared to dense input
Experimental results
■ More accurate recognition than X3D-XL
■ Inference with a single view without losing accuracy
  • Improved runtime (40x faster)
Table 1. Model comparison on Kinetics-400. Time measurements were done on Nvidia V100 GPUs with mixed precision. Runtime [hrs] measures inference on the Kinetics-400 validation set (using 8 GPUs), while the videos-per-second (VPS) measurement was done on a single GPU. Results of the other methods are as reported in the relevant publications. The proposed STAM is an order of magnitude faster while providing SOTA accuracy.
  Model                  Top-1 Accuracy [%]   Flops x views [G]   Param [M]   Runtime [hrs]   Runtime [VPS]
  Oct-I3D + NL [5]       75.7                 28.9 x 30           33.6        --              --
  SlowFast 8x8, R50      77.0                 65.7 x 30           34.4        2.75            1.4
  X3D-M                  76.0                 6.2 x 30            3.8         1.47            1.3
  X3D-L                  77.5                 24.8 x 30           6.1         2.06            0.46
  X3D-XL                 79.1                 48.4 x 30           11.0        --              --
  TSM R50                74.7                 65 x 10             24.3        --              --
  Nonlocal R101          77.7                 359 x 30            54.3        --              --
  STAM (16 frames)       79.3                 270 x 1             96          0.05            20.0
  STAM (64 frames)       80.5                 1080 x 1            96          0.21            4.8
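A quick back-of-the-envelope check of the "Flops x views" column: even though one STAM clip costs more than one X3D clip, the 30-view evaluation multiplies the baselines' cost per video, which is where the overall speed-up comes from. The numbers are copied from Table 1:

```python
# Per-video compute = per-view GFLOPs x number of views, using Table 1 values.
per_view_gflops = {"X3D-L": 24.8, "X3D-XL": 48.4, "SlowFast 8x8 R50": 65.7, "STAM-16": 270}
views = {"X3D-L": 30, "X3D-XL": 30, "SlowFast 8x8 R50": 30, "STAM-16": 1}

for name in per_view_gflops:
    total = per_view_gflops[name] * views[name]
    print(f"{name:>18}: {total:7.1f} GFLOPs per video")
# X3D-L: 744.0, X3D-XL: 1452.0, SlowFast 8x8 R50: 1971.0, STAM-16: 270.0
```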
Summary
■ A video recognition model that uses only Transformers
■ Attention over the whole temporal extent
  • Temporal information is captured globally
■ High recognition accuracy from sparse frames
  • The entire video is covered by a single clip
  • 30x fewer frames per video
  • 40x faster inference