This document summarizes recent developments in action recognition using deep learning techniques. It discusses early approaches using improved dense trajectories and two-stream convolutional neural networks. It then focuses on advances using 3D convolutional networks, enabled by large video datasets like Kinetics. State-of-the-art results are achieved using inflated 3D convolutional networks and temporal aggregation methods like temporal linear encoding. The document provides an overview of popular datasets and challenges and concludes with tips on training models at scale.
The History and Latest Trends of Action Recognition
1. Copyright (C) 2018 DeNA Co.,Ltd. All Rights Reserved.
Action Recognition
September 3, 2018
Katsunori Ohnishi
DeNA Co., Ltd.
2.
■ Action recognition
Deep
Deep
Temporal Aggregation
■ Tips
3. Katsunori Ohnishi
Twitter: @ohnishi_ka
■ April 2014 to September 2017: B4~M2.5, Computer Vision
• http://katsunoriohnishi.github.io/
CVPR2016 (spotlight oral, acceptance rate=9.7%): egocentric vision (wrist-mounted camera)
ACMMM2016 (poster, acceptance rate=30%): action recognition (state-of-the-art)
AAAI2018 (oral, acceptance rate=10.9%): video generation (FTGAN)
■ October 2017 onward: DeNA AI
• DeNA
→ https://www.wantedly.com/projects/209980
4. Action Recognition
Image classification
action recognition = human action recognition
• fine-grained egocentric
[Figure panels: Fine-grained, egocentric, Dog-centric, Action recognition, RGBD]
Evaluation of video activity localizations integrating quality and quantity measurements [C. Wolf+, CVIU14]
Recognizing Activities of Daily Living with a Wrist-mounted Camera [K. Ohnishi+, CVPR16]
A Database for Fine Grained Activity Detection of Cooking Activities [M. Rohrbach+, CVPR12]
First-Person Animal Activity Recognition from Egocentric Videos [Y. Iwashita+, ICPR14]
Recognizing Human Actions: A Local SVM Approach [C. Schuldt+, ICPR04]
HMDB: A Large Video Database for Human Motion Recognition [H. Kuehne+, ICCV11]
Ucf101: A dataset of 101 human actions classes from videos in the wild [K. Soomro+, arXiv2012]
5.
■ KTH, UCF101, HMDB51
• UCF101: 101 classes, 13,320 videos, …
■ ActivityNet, Kinetics, YouTube-8M
■ AVA, Moments in Time, SLAC
[Figure: UCF101]
6.
■ YouTube-8M Video Understanding Challenge
https://www.kaggle.com/c/youtube8m
CVPR17, ECCV18 workshop
Kaggle
frame-level
test
• Kaggle, action recognition
■ ActivityNet Challenge
http://activity-net.org/challenges/2018/
ActivityNet 3
• Temporal Proposal (T)
• Temporal localization (T)
• Video Captioning
• Kinetics: classification (human action)
• AVA: Spatio-temporal localization (XYT)
• Moments-in-time: classification (event)
7. Before CNN
■ 2000s
SIFT
local descriptor → coding → global feature → classifier
STIP [I. Laptev, IJCV04]
Dense Trajectory [H. Wang+, ICCV11]
Improved Dense Trajectory [H. Wang+, ICCV13]
• http://hirokatsukataoka.net/temp/presen/170121STAIRLab_slideshare.pdf
• https://arxiv.org/pdf/1605.04988.pdf
On space-time interest points [I. Laptev, IJCV04]
Action Recognition by Dense Trajectories [H. Wang+, ICCV11]
Action Recognition with Improved Trajectories [H. Wang+, ICCV13]
8. Before CNN
■ Improved Dense Trajectories (iDT) [H. Wang+, ICCV13]
Dense Trajectories [H. Wang+, ICCV11]
2
optical flow
foreground
optical flow
[Figure: improved dense trajectories (green), background dense trajectories (white)]
9. Before CNN
SIFT Fisher Vector
Fisher vector
http://www.isi.imi.i.u-tokyo.ac.jp/~harada/pdf/SSII_harada20120608.pdf
https://www.slideshare.net/takao-y/fisher-vector
Pipeline: input → Local descriptor: iDT → Video descriptor: Fisher Vector [F. Perronnin+, CVPR07] → Classifier: SVM [F. Pedregosa+, JMLR11]
Fisher kernels on visual vocabularies for image categorization [F. Perronnin, CVPR07]
10. CNN action recognition
CNN
Two-stream
• Hand-crafted feature
3D Convolution
• C3D
• C3D Two-stream
• 3D conv
Optical flow
11. CNN action recognition: CNN
■ Spatio-temporal ConvNet [A. Karpathy+, CVPR14]
CNN
AlexNet: RGB input channels → 10 frames as channels (gray)
multi-scale, Fusion
Sports1M pre-training, UCF101: 65.4% (iDT: 85.9%)
Large-scale Video Classification with Convolutional Neural Networks [A. Karpathy+, CVPR14]
• 10 frames as conv1 channels
• RGB gray frame-by-frame
score
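The idea of feeding 10 grayscale frames to a 2D CNN as input channels can be sketched with NumPy. This is only an illustration: the helper name `stack_frames_as_channels`, the 224x224 frame size, and the random pixel values are assumptions, not details taken from the paper or the slide.

```python
import numpy as np

def stack_frames_as_channels(frames):
    """Stack T grayscale frames of shape (H, W) into one (T, H, W) array,
    so that a 2D CNN sees the T frames as its input channels."""
    frames = [np.asarray(f, dtype=np.float32) for f in frames]
    return np.stack(frames, axis=0)

# Hypothetical clip: 10 grayscale frames of 224x224 pixels (random values).
rng = np.random.default_rng(0)
clip = [rng.random((224, 224)) for _ in range(10)]
x = stack_frames_as_channels(clip)
print(x.shape)  # (10, 224, 224)
```

With this layout only the first convolution layer mixes information across time, since every later layer operates on 2D feature maps.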
13. CNN action recognition: 3D convolution
■ C3D [D. Tran+, ICCV15]
CNN with 3D convolution over 16 frames
• XYT 3D convolution
UCF101 pre-training
ICCV15 arxiv 2 reject
Learning Spatiotemporal Features with 3D Convolutional Networks [D. Tran+, ICCV15]
            UCF101  HMDB51
iDT         85.9%   57.2%
Two-stream  88.0%   59.4%
C3D (1net)  82.3%   -
3D conv
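A 3D convolution slides one kernel along X, Y and T at the same time. The sketch below is a deliberately naive single-channel version (no padding, stride 1) meant only to show the indexing and the output shape. The function name `conv3d_valid`, the toy 8x8 frame size, and the all-ones values are assumptions for illustration, not the C3D implementation.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution (cross-correlation form),
    no padding, stride 1. clip: (T, H, W), kernel: (kt, kh, kw)."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1), dtype=np.float32)
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # One output value sees a kt x kh x kw block of the clip.
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

clip = np.ones((16, 8, 8), dtype=np.float32)   # toy 16-frame clip
kernel = np.ones((3, 3, 3), dtype=np.float32)  # 3x3x3 kernel
out = conv3d_valid(clip, kernel)
print(out.shape)  # (14, 6, 6)
```

Each output value depends on three consecutive frames, so motion between neighbouring frames can influence the feature directly.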
15. CNN action recognition: 3D convolution
■ P3D [Z. Qiu+, ICCV17]
C3D
3D conv → 2D conv (XY) + 1D conv (T)
pre-training
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks [Z. Qiu+, ICCV17]
                        UCF101  HMDB51
iDT                     85.9%   57.2%
Two-stream (AlexNet)    88.0%   59.4%
P3D (ResNet)            88.6%   -
Two-stream (ResNet152)  91.8%
Spatial 2D conv
Temporal 1D conv
3D conv
again
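Splitting a 3D convolution into a 2D convolution (XY) followed by a 1D convolution (T) reduces the number of weights per layer. A rough count, assuming 3x3x3 kernels and a hypothetical channel width of 64 (neither number is taken from the slide, and biases are ignored):

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution layer (bias ignored)."""
    return c_in * c_out * kt * kh * kw

c = 64  # hypothetical channel width, same for input and output

full_3d = conv3d_params(c, c, 3, 3, 3)      # one 3x3x3 convolution
spatial_2d = conv3d_params(c, c, 1, 3, 3)   # 1x3x3: 2D conv over XY
temporal_1d = conv3d_params(c, c, 3, 1, 1)  # 3x1x1: 1D conv over T
factorized = spatial_2d + temporal_1d

print(full_3d, factorized)  # 110592 49152
```

Under these assumptions the factorized pair needs less than half the weights of the full 3D kernel, and the 1x3x3 part has the same shape as an ordinary 2D filter.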
17. CNN action recognition: 3D convolution
■ C3D, P3D
3D conv
3D conv [K. Hara+, CVPR18]
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? [K. Hara+, CVPR18]
[Figure labels: 2012, 2011, 2015, 2017, 2017]
Kinetics!
18. CNN action recognition: 3D convolution
■ Kinetics
human action dataset!
3D conv
• Pre-train UCF101
The Kinetics human action video dataset [W. Kay+, arXiv17]
• YouTube-8M
19. CNN action recognition: 3D convolution
■ I3D [J. Carreira+, ICCV17]
Kinetics dataset DeepMind
3D conv Inception
64 GPUs for training, 16 GPUs for prediction
state-of-the-art
• RGB
• Two-stream optical flow
score
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J. Carreira+, ICCV17]
                UCF101  HMDB51
RGB-I3D         95.6%   74.8%
Flow-I3D        96.7%   77.1%
Two-stream I3D  98.0%   80.7%
…
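The two-stream rows in the table combine the scores of an RGB network and an optical-flow network. A minimal sketch of such late fusion, assuming each stream outputs class logits and that the two streams are weighted equally; the function names, the weight, and the toy logits are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of the class probabilities of the two streams."""
    p_rgb = softmax(np.asarray(rgb_logits, dtype=np.float64))
    p_flow = softmax(np.asarray(flow_logits, dtype=np.float64))
    return (1.0 - flow_weight) * p_rgb + flow_weight * p_flow

# Toy logits for 3 classes: the RGB stream prefers class 0,
# the flow stream strongly prefers class 1.
rgb_logits = np.array([2.0, 1.0, 0.0])
flow_logits = np.array([0.0, 3.0, 0.0])
fused = fuse_two_stream(rgb_logits, flow_logits)
print(fused.argmax())  # 1
```

Because fusion happens on the final scores, the two networks can be trained and run independently, which is also why the optical flow stream adds a separate computation cost.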
21. CNN action recognition: 3D convolution
■ I3D Two-stream
3D convolution
3D conv XY T
• XY T
3D conv
[Figure axis: time]
22. CNN action recognition: 3D convolution
■ 3D convolution [D.A. Huang+, CVPR18]
• 3D CNN
• Two-stream I3D Optical flow 3D conv
What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets [D.A. Huang+, CVPR18]
23. CNN action recognition: 3D convolution
■ 3D conv
CVPR18
CVPR/ICCV/ECCV
3D conv 3D conv
• GPU
24. CNN action recognition: Optical flow
■ Optical flow [L. Sevilla-Lara+, CVPR18]
• Optical flow
• Optical flow accuracy (EPE, end-point error) is only weakly correlated with action recognition accuracy
• flow action recognition
• Optical flow is invariant to appearance
• Optical flow
On the Integration of Optical Flow and Action Recognition [L Sevilla-Lara+, CVPR18]
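EPE (end-point error), the optical flow accuracy measure named on this slide, is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal NumPy sketch, assuming flow fields stored as (H, W, 2) arrays of (u, v) displacements; the function name and the toy values are illustrative:

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth
    flow vectors. Both arrays have shape (H, W, 2) holding (u, v)."""
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=-1))))

# Toy example: ground truth is zero motion, prediction is (3, 4) everywhere.
gt = np.zeros((4, 4, 2))
pred = np.zeros((4, 4, 2))
pred[..., 0] = 3.0
pred[..., 1] = 4.0
epe = end_point_error(pred, gt)
print(epe)  # 5.0
```

EPE averages over all pixels equally, so it does not single out the regions, such as motion boundaries, that may matter most for recognizing an action.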
25.
AVA [C. Gu+, CVPR18]: XYT bounding boxes, human action localization
Moments-in-time [M. Monfort+, arXiv2018]: 3-second videos
Kinetics-600 [W. Kay+, arXiv2017]: Kinetics extended from 400 to 600 classes
26. Temporal Aggregation
2D conv frame-by-frame 3D conv
(100 frames, 232 frames, 50 frames)
27. Temporal Aggregation
Score
→
LSTM
→
• FC
?
• fencing → fencing
→…
[Figure: per-frame CNN → LSTM → FC, repeated along the time axis]
CVPR ACMMM AAAI
28. Temporal Aggregation
Pipeline: input → Local descriptor: iDT → Video descriptor: Fisher Vector [F. Perronnin+, CVPR07] → Classifier: SVM [F. Pedregosa+, JMLR11]
→ …!
Fisher Vector
• CNN SIFT GMM
• FV VLAD [H. Jegou+, CVPR10]
Aggregating local descriptors into a compact image representation [H. Jegou+, CVPR10]
29. Temporal Aggregation
■ LCD [Z. Xu+, CVPR15]
Treat each XY position of the VGG16 pool5 output as a 512-dim feature
• A 224x224 input gives 7x7=49 features
• Encode them with VLAD into a global feature
A discriminative CNN video representation for event detection [Z. Xu+, CVPR15]
Pipeline: input → CNN → Pool5 (e.g. 2x2x512) → Local descriptors → VLAD → global feature → SVM
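The LCD idea of turning a pool5 map into local descriptors and encoding them with VLAD can be sketched as follows. This is a minimal illustration with stated assumptions: the pool5 map is random data standing in for real VGG16 output, the codebook of K=8 centers is random (in practice it would come from k-means on training descriptors), and `vlad_encode` is a hypothetical helper name.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD: sum the residuals of each descriptor to its nearest center.
    descriptors: (N, D), centers: (K, D). Returns an L2-normalized (K*D,) vector."""
    # Squared distances between every descriptor and every center: (N, K).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # hard assignment
    K, D = centers.shape
    vlad = np.zeros((K, D), dtype=np.float64)
    for k in range(K):
        members = descriptors[nearest == k]
        if len(members) > 0:
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(0)
pool5 = rng.standard_normal((7, 7, 512))    # stand-in for one frame's pool5 map
local_descriptors = pool5.reshape(-1, 512)  # 7x7=49 descriptors of 512 dims
centers = rng.standard_normal((8, 512))     # hypothetical codebook, K=8
v = vlad_encode(local_descriptors, centers)
print(local_descriptors.shape, v.shape)  # (49, 512) (4096,)
```

For a video, the descriptors of all sampled frames would be pooled into the same call, giving one fixed-length vector regardless of the number of frames.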
30. Temporal Aggregation
■ ActionVLAD [R. Girdhar+, CVPR17]
NetVLAD [R. Arandjelović+, CVPR16]
• NetVLAD: VLAD as a NN layer; cluster assignment becomes a softmax (soft assignment)
• VLAD LCD VLAD
• End2end CNN!
ActionVLAD: Learning spatio-temporal aggregation for action classification [R. Girdhar+, CVPR17]
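The difference between VLAD and NetVLAD is the assignment step: hard nearest-center assignment is replaced by a softmax, which makes the layer differentiable. Below is a forward-pass-only sketch in NumPy. The function name, the random parameters, and the small sizes (49 descriptors, D=16, K=4) are assumptions for illustration; a real layer would learn `centers`, `w` and `b` by backpropagation.

```python
import numpy as np

def netvlad_forward(x, centers, w, b):
    """Soft-assignment VLAD. x: (N, D) descriptors, centers: (K, D),
    w: (D, K) and b: (K,) are the assignment parameters.
    Returns the normalized (K*D,) vector and the (N, K) assignments."""
    logits = x @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)             # softmax over clusters
    residuals = x[:, None, :] - centers[None, :, :]  # (N, K, D)
    v = (a[:, :, None] * residuals).sum(axis=0)      # weighted residual sums, (K, D)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)  # per-cluster norm
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12), a        # global L2 norm

rng = np.random.default_rng(0)
x = rng.standard_normal((49, 16))
centers = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 4))
b = np.zeros(4)
v, a = netvlad_forward(x, centers, w, b)
print(v.shape, a.shape)  # (64,) (49, 4)
```

Since every step is a differentiable array operation, the aggregation can sit on top of a CNN and be trained end to end.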
31. Temporal Aggregation
■ TLE [A. Diba+, CVPR17]
VLAD Compact Bilinear Pooling [Y. Gao+, CVPR16]
Temporal Aggregation
VLAD
• SVM VLAD NN
Deep Temporal Linear Encoding Networks [A. Diba+, CVPR17]
32. Tips
Two-stream (ResNet) 2D conv Optical flow
■ Single model State-of-the-art
I3D + TLE BA
64 GPUs
Two-stream optical flow GPU
• optical flow stream
• RGB-stream
Optical flow
33. Tips
CNN TLE coding
• TLE ActionVLAD
iDT
• CNN
• FisherVector iDT
Tips: reduce descriptors with PCA (dim=64), use K=256 GMM components, and apply power normalization to the FV
• CPU
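The Fisher Vector tips (64-dim descriptors after PCA, K=256, power normalization) can be put together in a short NumPy sketch. Assumptions to note: the descriptors are random data standing in for PCA-reduced iDT descriptors, the diagonal GMM parameters are random placeholders (in practice they are fitted with EM on training descriptors), and `fisher_vector` is a hypothetical helper, not code from the talk.

```python
import numpy as np

def fisher_vector(x, weights, means, variances):
    """Fisher Vector for a diagonal-covariance GMM, with power and L2
    normalization. x: (N, D); weights: (K,); means, variances: (K, D).
    Returns a vector of length 2*K*D."""
    n = x.shape[0]
    # log N(x_t | mu_k, diag(var_k)) for all pairs, shape (N, K).
    log_prob = -0.5 * (
        (x ** 2) @ (1.0 / variances).T
        - 2.0 * x @ (means / variances).T
        + np.sum(means ** 2 / variances, axis=1)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_post = np.log(weights) + log_prob
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma = gamma / gamma.sum(axis=1, keepdims=True)  # posteriors, (N, K)

    s0 = gamma.sum(axis=0)[:, None]  # (K, 1)
    s1 = gamma.T @ x                 # (K, D)
    s2 = gamma.T @ (x ** 2)          # (K, D)
    # Gradients with respect to the means and the variances.
    g_mu = (s1 - s0 * means) / np.sqrt(variances)
    g_mu = g_mu / (n * np.sqrt(weights)[:, None])
    g_var = (s2 - 2.0 * means * s1 + s0 * means ** 2) / variances - s0
    g_var = g_var / (n * np.sqrt(2.0 * weights)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

rng = np.random.default_rng(0)
K, D = 256, 64                           # K=256 components, dim=64 after PCA
x = rng.standard_normal((1000, D))       # stand-in for PCA-reduced descriptors
weights = np.full(K, 1.0 / K)            # placeholder GMM parameters
means = rng.standard_normal((K, D))
variances = np.ones((K, D))
fv = fisher_vector(x, weights, means, variances)
print(fv.shape)  # (32768,)
```

With these settings each video is described by a 2 x 256 x 64 = 32768-dimensional vector, which is then fed to a linear SVM; everything in this pipeline runs on CPU.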
34. Temporal Aggregation
Score
→
LSTM
→
• FC
?
• fencing → fencing
→…
[Figure: per-frame CNN → LSTM → FC, repeated along the time axis]
CVPR ACMMM AAAI
input
↓
Two-stream
35.
LSTM
3D conv
Optical flow
• [L. Sevilla-Lara+, CVPR18]
[Figure: per-frame CNN → LSTM → FC, repeated along the time axis]
36.
2D conv + LSTM 3D conv 3D conv
Two-stream
Optical flow
MoCoGAN [S. Tulyakov+, CVPR18]
VGAN [C. Vondrick+, NIPS16]
TGAN [M. Saito+, ICCV17]
FTGAN [K. Ohnishi+, AAAI18]
LRCN [J. Donahue+, CVPR15]
C3D [D. Tran+, ICCV15]
P3D [Z. Qiu+, ICCV17]
Two-stream [K. Simonyan+, NIPS14]
I3D [J. Carreira+, ICCV17]
( )VGAN
38.
Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture
K. Ohnishi+, AAAI 2018 (oral presentation)
https://arxiv.org/abs/1711.09618
Optical flow
39.
Action classification
• Temporal action localization, Spatio-temporal localization
3D conv
Augmentation
■ Pose
Pose
• pose
• data distillation
■ Tips
& optical flow
Kinetics YouTube
40.
XY → XYT: O(n^2) → O(n^3)