!"#"$%&'"()"*%+,"(-.,(/--&$&"0%(
1$%&'&%2(3"$.40&%&.0(
50-","0$"
!"#$"#%&'%#(&)%$*#&'%(&+,-&!".$(&/,0%1.&2-1-3-(&4-5.6"&7%8".
9!!:;<;=
仁田智也(名工大玉木研)
Chunhui Liu*, Xinyu Li*, Hao Chen, Davide Modolo, Joseph Tighe
Amazon Web Services
{chunhliu, xxnl, hxen, dmodolo, tighej}@amazon.com

Abstract
Most action recognition solutions rely on dense sampling to precisely cover the informative temporal clip. Extensively searching temporal region is expensive for a real-world application. In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that an action model can accurately classify an action with a single pass over the video unlike the multi-clip sampling common with SOTA by learning to drop non-informative features. We present Selective Feature Compression (SFC), an action recognition inference strategy that greatly increases model inference efficiency without compromising accuracy. Different from previous works that compress kernel size and decrease the channel dimension, we propose to compress features along the spatio-temporal dimensions without the need to change backbone parameters. Our experiments on Kinetics-400, UCF101 and ActivityNet show that SFC is able to reduce inference speed by 6-7x and memory usage by 5-6x compared [...]
[Figure 1 of the paper: eight frames (1)-(8) of a fast "jumping into water" action. Uniform sampling (1 2 3 4)+(5 6 7 8) and dense sampling (1 2 3 4)+(3 4 5 6)+(5 6 7 8) run a 3D model per clip and fuse per-clip scores for running / catching frisbee / jumping into water; SFC feeds the entire clip once and selects the key features (3)(4)(5)(6) before classification. Caption: "With fast motion, uniform sampling (row 1, left) is not guaranteed to precisely locate the key region..."]
Introduction
■ Existing inference procedure for video recognition
• Infer from clips with a small number of frames
• Multiple clips are sampled from one video
• Denser sampling gives better accuracy
• Memory usage is high
■ Selective Feature Compression (SFC)
• Infer from a clip with a large number of frames
• A single clip is used per video
• More efficient inference
• Easy to apply to a variety of models
Efficient inference methods
■ Change the input size
• Feed every frame of the video
• Cannot handle long videos
• Feed multiple short clips
■ Make the model lightweight
• TSM (Lin+, ICCV2019)
• 2D-based
• ThiNet (Luo+, ICCV2017)
• ECO (Zolfaghari+, ECCV2018)
■ Sample only the important frames
• SCSampler (Korbar+, ICCV2019)
• MARL (Wu+, ICCV2019)
(Wu+, ICCV2019)
[Embedded first page of the MARL paper (Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Shilei Wen; "... Untrimmed Video Recognition", ICCV2019): its abstract formulates frame sampling as multiple parallel Markov decision processes solved with multi-agent reinforcement learning, and its Figure 1 shows three agents observing an untrimmed video step by step until all agents stop and the video is predicted as Hopscotch.]
Method
[Figure 2 of the paper:
(a) Dense Sampling (conventional): a long video chunk (3×T×H×W) is densely sampled into 30 short clips, each clip runs through the full I3D Head + I3D Tail, and the 30 per-clip predictions are late-fused.
(b) Feature Compression (proposed): the same long video chunk is fed once through the I3D Head (features 1×C×T×H′×W′), Selective Feature Compression shrinks the temporal axis to T/τ (1×C×T/τ×H′×W′), and the I3D Tail makes a one-pass prediction.]
• Conventional method: sample multiple clips and fuse their scores
• Proposed method: feed the input as-is, split the model and insert SFC, compress the temporal axis
A minimal code sketch of the two inference procedures follows this slide.
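Below is a minimal, hypothetical PyTorch-style sketch of the two procedures. The names head, sfc, tail and the clip/video tensors are placeholders rather than the authors' code; it only illustrates the single-pass structure described above.

```python
import torch

@torch.no_grad()
def dense_sampling_inference(head, tail, clips):
    """(a) Conventional inference: run the full backbone (head + tail) on every
    sampled clip and late-fuse (average) the per-clip class scores."""
    scores = [tail(head(clip)) for clip in clips]   # one forward pass per clip
    return torch.stack(scores).mean(dim=0)

@torch.no_grad()
def sfc_inference(head, sfc, tail, video):
    """(b) Proposed inference: feed the whole video once and compress the head
    features along the temporal axis before the expensive tail."""
    features = head(video)       # e.g. (1, C, T, H', W')
    compressed = sfc(features)   # (1, C, T/tau, H', W'): temporal axis shrunk
    return tail(compressed)      # single pass, one prediction
```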
!"#"$%&'"()"*%+,"(-./0,"11&.2
3D Backbone
Head
3D Backbone
Tail
Selective Feature
Compression
Conv3D θq
( 3×1×1)
Conv3D θk
( 3×1×1)
Fhead
TopK
Pooling
×
×
SFC Module
v
k
q M !
"
, !
$, !, %, &
$, !,
%
2
,
&
2
$,
!
"
,
%
2
,
&
2
$, !,
%
2
,
&
2
Γ
Γ
!,
$%&
4
$,
!
"
,
%
2
,
&
2
!
"
,
$%&
4
Γ
!, $%& !
"
, $%&
Γ!"
$,
!
"
, %, &
Abstraction
Layer
$, !,
%
2
,
&
2
Fabs
Reshape
Function
Γ Γ!"
× Matrix
Multiplication
Data Flow
$, !, %, &
特徴量が大きい時間を抽出
!は"#$%からの値
圧縮と入力の時間軸の
アテンション
畳み込み層
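A minimal sketch of the SFC module under the reading above. This is not the authors' implementation: the abstraction layer is assumed here to be a simple strided spatial pooling, TopK pooling is read as selecting the T/τ time steps with the largest mean response, and the softmax scaling is an assumption.

```python
import torch
import torch.nn as nn

class SFC(nn.Module):
    """Minimal sketch of the SFC module (shapes follow the Figure 3 description above)."""

    def __init__(self, channels, tau=2):
        super().__init__()
        self.tau = tau
        # Abstraction layer: assumed to be a strided spatial pooling that halves
        # H and W (the paper's layer may be a learned block).
        self.abstraction = nn.AvgPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
        # Temporal-only 3x1x1 convolutions theta_q / theta_k applied to F_abs.
        self.theta_q = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        self.theta_k = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, f_head):                                  # (B, C, T, H, W)
        B, C, T, H, W = f_head.shape
        f_abs = self.abstraction(f_head)                        # (B, C, T, H/2, W/2)

        # q, k: one feature vector per time step (the reshape Gamma in the figure).
        q = self.theta_q(f_abs).permute(0, 2, 1, 3, 4).reshape(B, T, -1)
        k = self.theta_k(f_abs).permute(0, 2, 1, 3, 4).reshape(B, T, -1)

        # TopK pooling, read as keeping the T/tau time steps with the largest
        # mean response ("extract the time steps with large features").
        keep = max(T // self.tau, 1)
        idx = q.abs().mean(dim=-1).topk(keep, dim=1).indices    # (B, T/tau)
        q = q.gather(1, idx.unsqueeze(-1).expand(-1, -1, q.size(-1)))

        # Attention between the compressed and the input temporal axes.
        M = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)  # (B, T/tau, T)

        # v is the uncompressed head feature; M compresses it along time.
        v = f_head.permute(0, 2, 1, 3, 4).reshape(B, T, -1)     # (B, T, C*H*W)
        out = M @ v                                             # (B, T/tau, C*H*W)
        return out.reshape(B, keep, C, H, W).permute(0, 2, 1, 3, 4)
```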
Experimental setup
■ Models
• Applied to pre-trained models
• All weights except the SFC module are frozen
■ Training
• The whole video is used as input
• 15 epochs
• Data augmentation: spatial cropping and flipping
■ Datasets
• UCF101 (Soomro+, arXiv2012)
• Kinetics-400 (Kay+, arXiv2017)
• ActivityNet (Heilbron+, CVPR2015)
■ Evaluation metrics
• Recognition accuracy
• FLOPs
• Throughput: the number of videos one GPU can process in one second (a timing sketch follows this list)
• Number of parameters
• Batch size: the number of videos one GPU can process at once
• GPU: Tesla V100, 16GB
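Since throughput is defined above as videos per second on a single GPU, a measurement could look like the hedged sketch below; model and videos are placeholders, not the authors' benchmarking code.

```python
import time
import torch

@torch.no_grad()
def videos_per_second(model, videos, n_runs=10, device="cuda"):
    """Rough throughput estimate: videos one GPU processes per second."""
    model = model.to(device).eval()
    videos = videos.to(device)          # (batch, C, T, H, W)
    model(videos)                       # warm-up pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(videos)
    torch.cuda.synchronize()
    return n_runs * videos.size(0) / (time.time() - start)
```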
Experimental results

Backbone               Trained      Inference  Input   FLOPS   Throughput  #Params.  #Videos  Top1  Top5
Slow Only I3D-50 8×8   K400         30 crops   64×30   1643G    2.4        32.5M      3       74.4  91.4
                                    SFC        288×1    247G   14.8 (6×)   35.9M     19       74.6  91.4
Slow Only I3D-50 16×4  K400         30 crops   64×30   3285G    1.2        32.5M      1       76.4  92.3
                                    SFC        288×1    494G    7.4 (6×)   35.9M      8       76.9  92.5
Slow Only I3D-50 32×2  K400         30 crops   64×30   6570G    0.6        32.4M     <1       75.7  92.3
                                    SFC        288×1    988G    3.6 (6×)   35.9M      4       75.8  92.1
TPN-50 8×8             K400         30 crops   64×30   1977G    2.1        71.8M    2.7       76.0  92.2
                                    SFC        288×1    359G   12.5 (6×)   85.7M     17       76.1  92.1
R(2+1)D-152            IG-65M→K400  30 crops   64×30   9874G    0.4       118.2M     <1       79.4  94.1
                                    SFC        288×1   1403G    2.8 (7×)  121.7M      5       80.0  94.5
Table 2: Comparison of different backbones with dense sampling inference and with SFC on Kinetics 400. Differently from the other entries, R(2+1)D was initially pre-trained on the IG-65M dataset and later trained on K400. Given any backbone, we freeze it and train our SFC module only on the K400 dataset. Finally, we compare the results with dense sampling (30 crops).
• R(2+1)D-152: the backbone was pre-trained on an external dataset (IG-65M) before K400
• For every backbone, SFC raises inference efficiency without lowering accuracy
• Note that the input size and the number of crops differ between the two inference settings
Inference on long videos
■ Dataset
• ActivityNet
• Most videos are 5-10 minutes long at 30 FPS
■ Inference methods
• SFC
• uniformly sample 8 clips (32×8 or 128×8 frames)
• concatenate them and feed them as one video (256 or 1024 frames)
• compression ratio τ = 2, 4, 8
• Existing methods
• sample 64×1, 64×30, or 64×180 frames
Method             Input Size  Usage   FLOPS   Top1
Single Crop        64×1        100%    54G     64.8
Uniform Sampling   64×30       100%    1620G   76.7
Dense Sampling     64×180      100%    9720G   78.2
SFC, τ = 2         256×1       50%     247G    77.4
SFC, τ = 4         256×1       25%     195G    75.2
SFC, τ = 8         256×1       12.5%   167G    68.2
SFC, τ = 2         1024×1      50%     988G    78.5
SFC, τ = 4         1024×1      25%     780G    77.2
SFC, τ = 8         1024×1      12.5%   668G    74.0
Table 7: Results on ActivityNet v1.3 using Slow I3D 8×8.
Visualization
[Attention heatmaps (compressed dimension vs. input dimension, colored by feature importance from 0.0 to 1.0) for four actions: Extinguishing Fire, Swinging on Something, Plastering, and Bungee Jumping.]
• The maps show the attention between the compressed and the input temporal axes
Summary
■ Proposed SFC
• Enables efficient inference
• 6-7× faster inference
• 5-6× lower memory usage
• The compression ratio τ can be tuned to trade accuracy for efficiency
• Useful for embedding recognition into applications
• Applicable to a variety of models
• No retraining of the backbone is needed
• Enables inference on long videos
Supplementary slides
!"#34*%%"2%&.2との違い
conv conv conv
𝑇𝐻𝑊×𝐶!
×
𝐶!×𝑇𝐻𝑊
𝑇𝐻𝑊×𝑇𝐻𝑊
×
𝑇𝐻𝑊×𝐶
𝑇×𝐻×𝑊×𝐶
𝑇×𝐻×𝑊×𝐶
_
reshape reshape reshape
reshape
conv
conv
𝑇×𝐶!𝐻!𝑊!
×
𝐶!𝐻!𝑊!×
𝑇
𝜏
×
𝑇×𝐶𝐻𝑊
𝑇×𝐻×𝑊×𝐶
𝑇
𝜏
×𝐻×𝑊×𝐶
reshape
reshape
reshape
reshape
_
𝑇
𝜏
×𝑇
pooling
` a 0
0
a
`
>.3MV,@@.$@%-$ >A!
-#."$/#の軸が違う
Comparison of inference methods

Kinetics-400 (input sampling methods, Slow I3D-50 8×8):
Method                     Throughput  Top1
Baseline: 30 crops         1×          74.5
Single Crop                30×         -7.2
TSN Single Sampling [27]   30×         -5.7
TSN Dense Sampling [27]    3×          -4.9
SCSampler [12]             2×          +2.5
SFC                        6.2×        +0.2
Table 4: Comparison with input sampling methods.

UCF101:
Method                   FLOPS    #Params.  Top1
TSM [15]                 64G×10   24M       95.9
I3D [2]                  65G×30   44M       92.9
NonLocal R50 [28]        65G×30   62M       94.6
Slow I3D 8×8, 1 Crop     54G×1    32.5M     93.8
Slow I3D 8×8, 30 Crops   54G×30   32.5M     94.5
Slow I3D 8×8 + SFC       247G     35.9M     94.5
Table 5: Results on UCF101.

• TSN: accuracy drops
• SCSampler: throughput is low
Comparison by compression ratio
■ Setup
• Vary the compression ratio τ: 1, 4/3, 2, 4
• Dataset: Kinetics-400
• Metrics: recognition accuracy, FLOPs, throughput
■ Results
• Accuracy: τ = 2 gives the best result
• FLOPs and throughput: trade off against accuracy
τ     Data Used  FLOPs  Throughput  Top1  Top5
1     100%       352G    9.6        73.8  91.2
4/3   75%        299G   12.1        74.5  91.4
2     50%        247G   14.8        74.6  91.4
4     25%        195G   17.8        72.0  90.1
Table 3: The impact of different data usage ratios on the speed-accuracy tradeoff, using I3D-50 8×8.
Accuracy and inference efficiency
[Figure 4 of the paper: inference speed (GFLOPS) vs. performance (Top1 accuracy) vs. model size (number of parameters, bubble size). Solid lines link the backbones evaluated in Table 2 (TSM, Ip-CSN-152, NonLocal R50/R101, TPN R50/R101, SlowFast R50 8×8, SlowOnly I3D R50 8×8 and 16×4, R(2+1)D-152), and dashed arrows show the improvement from 30-crop dense sampling (blue) to SFC (yellow) with those backbones.]
[Figure 5 of the paper: speed-accuracy trade-off, accuracy vs. throughput (videos/sec), comparing the multi-crop baseline (K = 1 to 30 crops), SCSampler, and SFC (τ = 1, 2, 4) combined with SlowI3D R50 8×8 / 16×4, TPN R50 8×8, and R(2+1)D-152.]
• For the proposed method the compression ratio τ is varied; for the existing methods the number of crops K is varied
Backbone split
■ Backbone
• I3D-50 8×8
■ Vary where the backbone is split
• Split at residual-block (Res) boundaries:
• Res 1234 / Res 5
• Res 123 / Res 45
• Res 12 / Res 345
• Res 1 / Res 2345
■ Metrics
• Recognition accuracy
• FLOPs
• Throughput
SFC Insertion Point   FLOPs  Throughput  Top1  Top5
Res 1234 / Res 5      305G   14.0        73.7  90.7
Res 123 / Res 45      247G   14.8        74.6  91.4
Res 12 / Res 345      211G   15.2        73.2  90.5
Res 1 / Res 2345      163G   21.2        73.1  90.9
(a) Backbone Split Choices
• Accuracy: Res 123 / Res 45 is the highest
• FLOPs and throughput: the closer the split is to the input, the better
• If inference speed matters most, use Res 1 / Res 2345
A sketch of splitting a backbone into head and tail follows this slide.
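The split choices in Table 6a correspond to where the SFC module is inserted between the residual stages. Below is a hedged sketch; it assumes a torchvision-style 3D ResNet with attributes stem, layer1..layer4, avgpool, and fc, which are placeholders rather than the authors' module names.

```python
import torch.nn as nn

def split_backbone(backbone, split_after="layer2"):
    """Split a 3D ResNet-style backbone into head (stages up to and including
    `split_after`) and tail (remaining stages + pooling + classifier)."""
    stages = ["stem", "layer1", "layer2", "layer3", "layer4"]
    cut = stages.index(split_after) + 1
    head = nn.Sequential(*[getattr(backbone, s) for s in stages[:cut]])
    tail = nn.Sequential(*[getattr(backbone, s) for s in stages[cut:]],
                         backbone.avgpool, nn.Flatten(1), backbone.fc)
    return head, tail

# If "Res 1" denotes the stem, the best split "Res 123 / Res 45" corresponds to
# cutting after the second residual stage:
# head, tail = split_backbone(i3d_r50, split_after="layer2")
```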
Ablation studies on SFC
■ Abstraction Layer
• Vary which tensors go through the Abstraction Layer:
• no Abstraction Layer (q, k, v are all taken from the head)
• apply the Abstraction Layer to q, k and v
• apply the Abstraction Layer to q and k only
■ Pooling
• Vary the pooling method for q:
• average pooling
• max pooling
• TopK pooling
■ Convolution
• Vary the convolution kernel size
Ablation studies on SFC (results)

pool(q)           Top1
Average Pooling   72.9
Max Pooling       72.6
TopK Pooling      74.6
(b) Pooling Strategy

Model                      Top1
k = q = v = Fhead          72.7
k = q = v = Fabs           63.2
k = q = Fabs, v = Fhead    74.6
(c) KQV Choices

Model          FLOPS  #Params.  Top1
Conv (3x3x3)   356G   48.52M    72.0
Conv (3x1x1)   247G   35.93M    74.6
Conv (1x1x1)   238G   34.89M    72.5
(d) Different Conv Kernels

Table 6: Ablation studies on Kinetics-400, using I3D-R50 8×8 (subtable (a), the backbone split choices, is shown on the previous slide).

Convolution Designs (Table 6d): the paper evaluates different linear transformations for q and k in SFC; employing 3D temporal kernels only in the temporal dimension (3×1×1) is the best option, which is coherent with SFC's goal of temporal compression.

• v = Fhead gives the higher accuracy
• the Abstraction Layer changes the meaning of the features
• so a compressed v would no longer be compatible with the tail
• and the backbone would have to be retrained
• Accuracy also drops when k and q are taken from Fhead (no Abstraction Layer)
• for q and k, the Abstraction Layer is effective

Notation
• Fhead: the feature taken from the head
• Fabs: the feature after applying the Abstraction Layer
A configuration sketch of the KQV choices follows this slide.
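The KQV ablation in Table 6c amounts to choosing whether q, k, and v come from F_head or F_abs. A hedged configuration sketch (extending the SFC sketch shown earlier; the function and variant names are illustrative, not the authors' code):

```python
def select_qkv_sources(f_head, f_abs, variant="k=q=Fabs, v=Fhead"):
    """Pick the source tensors for q, k, v as in the Table 6c ablation."""
    if variant == "k=q=v=Fhead":      # no Abstraction Layer at all (72.7 Top1)
        return f_head, f_head, f_head
    if variant == "k=q=v=Fabs":       # v also abstracted; incompatible with the tail (63.2)
        return f_abs, f_abs, f_abs
    # Best variant: abstract only q and k, keep v = F_head (74.6)
    return f_abs, f_abs, f_head
```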
